Fixing faulty unicode strings

A faulty unicode string is one whose characters are really encoded bytes that ended up in a unicode object by accident. For example:

Text: שלום, Windows-1255 encoded: \x99\x8c\x85\x8d, Unicode: u'\u05e9\u05dc\u05d5\u05dd', faulty Unicode: u'\x99\x8c\x85\x8d'

I sometimes bump into such strings when parsing ID3 tags in MP3 files. How can I fix these strings? (e.g. convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd')


2012-12-29 iTayb

You could convert u'\x99\x8c\x85\x8d' to '\x99\x8c\x85\x8d' using the latin-1 encoding:

In [9]: x = u'\x99\x8c\x85\x8d' 

In [10]: x.encode('latin-1') 
Out[10]: '\x99\x8c\x85\x8d'
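This works because Latin-1 maps each code point from 0 to 255 to the byte with the same value, so encoding with it recovers the raw bytes unchanged. A minimal sketch of that property, written so that it also runs on Python 3 (the variable names are only illustrative):

```python
# Latin-1 maps code points 0..255 to the identical byte values,
# so encoding a "faulty" unicode string with it recovers the raw bytes.
faulty = u'\x99\x8c\x85\x8d'
raw = faulty.encode('latin-1')

# Each byte equals the code point of the corresponding character.
code_points = [ord(ch) for ch in faulty]
byte_values = list(bytearray(raw))

print(code_points == byte_values)
```

Note that this only works when every character is at or below U+00FF; anything higher raises UnicodeEncodeError, which is a sign the string was mangled in some other way.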

However, it seems like this is not a valid Windows-1255 encoded string. Did you perhaps mean '\xf9\xec\xe5\xed'? If so, then

In [22]: x = u'\xf9\xec\xe5\xed' 

In [23]: x.encode('latin-1').decode('cp1255') 
Out[23]: u'\u05e9\u05dc\u05d5\u05dd'

converts u'\xf9\xec\xe5\xed' to u'\u05e9\u05dc\u05d5\u05dd', which matches the desired unicode you posted.
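Whether a byte string is valid in a given encoding can be checked by simply trying to decode it. The sketch below assumes CPython's bundled cp1255 codec; `decodes_as` is a hypothetical helper name, not something from a library:

```python
def decodes_as(raw, encoding):
    """Return True if the byte string `raw` decodes cleanly in `encoding`."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Recover the raw bytes from both unicode strings via Latin-1.
original = u'\x99\x8c\x85\x8d'.encode('latin-1')
corrected = u'\xf9\xec\xe5\xed'.encode('latin-1')

print(decodes_as(original, 'cp1255'))
print(decodes_as(corrected, 'cp1255'))
```

With that codec I would expect the first check to fail, since bytes such as 0x8c are unassigned in Windows-1255, while the second succeeds.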

If you really want to convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd', then this happens to work:

In [27]: u'\x99\x8c\x85\x8d'.encode('latin-1').decode('cp862')
Out[27]: u'\u05e9\u05dc\u05d5\u05dd'

The above encoding/decoding chain was found using this script, guess_chain_encodings.py:

"""
Usage example: guess_chain_encodings.py "u'баба'" "u'\xe1\xe0\xe1\xe0'"
"""
import six
import argparse
import binascii
import zlib
import utils_string as us  # the answerer's own helper module, not part of the standard library
import ast
import collections
import itertools
import random

# Names of every codec to try, as returned by the helper module.
encodings = us.all_encodings()

# Exceptions a codec may raise when it cannot handle the given text.
Errors = (IOError, UnicodeEncodeError, UnicodeError, LookupError,
          TypeError, ValueError, binascii.Error, zlib.error)

def breadth_first_search(text, all = False):
    # Yield (codec chain, resulting text) pairs, shortest chains first.
    seen = set()
    tasks = collections.deque()
    tasks.append(([], text))
    while tasks:
        encs, text = tasks.popleft()
        for enc, newtext in candidates(text):
            if repr(newtext) not in seen:
                # With all=True nothing is marked as seen, so every
                # chain leading to the same text is reported.
                if not all:
                    seen.add(repr(newtext))
                newtask = encs+[enc], newtext
                tasks.append(newtask)
                yield newtask

def candidates(text):
    # Unicode text can only be encoded; byte strings can only be decoded.
    f = text.encode if isinstance(text, six.text_type) else text.decode
    results = []
    for enc in encodings:
        try:
            results.append((enc, f(enc)))
        except Errors:
            pass
    random.shuffle(results)
    for r in results:
        yield r

def fmt(encs, text):
    # Render the chain as an expression, alternating encode and decode.
    encode_decode = itertools.cycle(['encode', 'decode'])
    if not isinstance(text, six.text_type):
        next(encode_decode)
    chain = '.'.join("{f}('{e}')".format(f = func, e = enc)
                     for enc, func in zip(encs, encode_decode))
    return '{t!r}.{c}'.format(t = text, c = chain)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('start', type = ast.literal_eval, help = 'starting unicode')
    parser.add_argument('stop', type = ast.literal_eval, help = 'ending unicode')
    parser.add_argument('--all', '-a', action = 'store_true')
    args = parser.parse_args()
    min_len = None
    for encs, text in breadth_first_search(args.start, args.all):
        # Stop once the chains get longer than the shortest match found.
        if min_len is not None and len(encs) > min_len:
            break
        if type(text) == type(args.stop) and text == args.stop:
            print(fmt(encs, args.start))
            min_len = len(encs)

if __name__ == '__main__':
    main()
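The script depends on `utils_string`, which is not a published package, so `us.all_encodings()` has to be supplied by the reader. One possible stand-in, assuming it only needs to return the names of the codecs Python knows about, builds the list from the standard library's alias table. This is a guess at the helper's behavior, not the original implementation:

```python
import codecs
import encodings.aliases

def all_encodings():
    """Hypothetical stand-in for utils_string.all_encodings()."""
    # The alias table maps alternative names to canonical codec names.
    names = set(encodings.aliases.aliases.values())
    usable = []
    for name in sorted(names):
        try:
            codecs.lookup(name)  # skip codecs unavailable on this platform
        except LookupError:
            continue
        usable.append(name)
    return usable

print(len(all_encodings()))
```

Codecs that have no alias at all would be missing from this list, so it is a lower bound rather than a complete inventory.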

Running

% guess_chain_encodings.py "u'\x99\x8c\x85\x8d'" "u'\u05e9\u05dc\u05d5\u05dd'" --all

yields

u'\x99\x8c\x85\x8d'.encode('latin_1').decode('cp862') 
u'\x99\x8c\x85\x8d'.encode('charmap').decode('cp862') 
u'\x99\x8c\x85\x8d'.encode('rot_13').decode('cp856')

etc.
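For the common case of a single encode followed by a single decode, a much smaller brute-force search is often enough. The following is a simplified sketch, not the script above: `two_step_chains` and the short `CANDIDATES` list are illustrative assumptions, and a real run would use a longer list of codecs.

```python
import itertools

# Illustrative shortlist; extend with whatever codecs are plausible.
CANDIDATES = ['latin_1', 'cp1252', 'cp1255', 'cp862', 'utf_8']

def two_step_chains(start, stop, codecs_to_try=CANDIDATES):
    """Return (enc, dec) pairs where start.encode(enc).decode(dec) == stop."""
    found = []
    for enc, dec in itertools.product(codecs_to_try, repeat=2):
        try:
            if start.encode(enc).decode(dec) == stop:
                found.append((enc, dec))
        except (UnicodeError, LookupError):
            # This pair cannot handle the text; try the next one.
            continue
    return found

for enc, dec in two_step_chains(u'\x99\x8c\x85\x8d', u'\u05e9\u05dc\u05d5\u05dd'):
    print(".encode(%r).decode(%r)" % (enc, dec))
```

Unlike the breadth-first script, this never explores chains longer than two steps, which keeps it fast but means it can miss text that was mangled more than once.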


2012-12-29 14:12:41 unutbu

lol'd at rot_13, haha. I took this value from the Python interpreter and was positive that it was Windows-1255. Oh well. – iTayb
