Regular expression parsing a binary file?

I have a file that mixes binary data and text data. I want to parse it with a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I'm guessing the message means that Python doesn't want to parse binary files. I'm opening the file with the "rb" flag.

How can I parse binary files with regular expressions in Python?

EDIT: I'm using Python 3.2.0

2011-04-11 DonkeyMaster

I'm guessing from the reference to a bytes-like object that you're using Python 3, is that correct? –

I think you are using Python 3.

1.Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character.

........

4.Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do.

http://www.diveintopython3.net/files.html#read

So, in Python 3, since a binary stream from a file is a stream of bytes, a regex used to analyse a stream from a file must be defined with a sequence of bytes, not a sequence of characters.
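A minimal sketch of that difference, assuming a made-up `data` value that mixes raw bytes with ASCII text (the sample bytes and variable names are illustrative, not from the question):

```python
import re

# Hypothetical sample: binary bytes surrounding some ASCII text.
data = b"\x00\x01abc123\xff"

# A str pattern applied to a bytes object raises TypeError in Python 3.
try:
    re.search(r"[a-f]+\d+", data)
    str_pattern_error = None
except TypeError as exc:
    str_pattern_error = str(exc)

# A bytes pattern (note the b prefix) works on bytes data.
match = re.search(br"[a-f]+\d+", data)
found = match.group(0)  # the matched span is itself a bytes object

print(str_pattern_error)
print(found)
```

The `br` prefix combines a bytes literal with a raw literal, so `\d` reaches the regex engine unchanged.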

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff
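To illustrate the quote, here is a small sketch of turning one string into bytes under two different encodings and back again. The sample word is my own choice, picked only because it contains a non-ASCII character:

```python
text = "café"  # a str: four Unicode characters

# The same string gives different byte sequences under different encodings.
utf8_bytes = text.encode("utf-8")      # é becomes two bytes
cp1252_bytes = text.encode("cp1252")   # é becomes one byte

# Decoding with the matching encoding recovers the original string.
round_trip = utf8_bytes.decode("utf-8")

print(len(text), len(utf8_bytes), len(cp1252_bytes))
print(round_trip == text)
```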

and

4.6. Strings vs. Bytes

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

....

1.To define a bytes object, use the b' ' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from \x00 to \xff (0–255).

http://www.diveintopython3.net/strings.html#boring-stuff
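A short sketch of the byte literal syntax described above. The four sample bytes are the ELF magic number, chosen only as a familiar example:

```python
# A bytes literal can mix hex escapes and ASCII characters.
header = b"\x7fELF"

# Indexing a bytes object yields an int between 0 and 255, not a character.
first = header[0]

# Slicing yields another bytes object.
rest = header[1:]

# The same object built from a list of integers.
same = bytes([0x7F, 0x45, 0x4C, 0x46])

print(first, rest, same == header)
```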

So you would define your regex as

pat = re.compile(b'[a-f]+\d+')

and not as

pat = re.compile('[a-f]+\d+')

More explanation here:

15.6.4. Can’t use a string pattern on a bytes-like object

2011-04-11 10:35:37 eyquem

Upvoted because it explains _why_, for future reference. I know what an encoding is, and your post was very verbose, IMO, though in the end you do give me the answer I wanted. – DonkeyMaster

Take a hint! -) –

@John Machin What do you mean, please? – eyquem

-2

This works for me in Python 2.6:

>>> import re 
>>> r = re.compile(".*(ELF).*") 
>>> f = open("/bin/ls") 
>>> x = f.readline() 
>>> r.match(x).groups() 
('ELF',)

2011-04-11 09:13:18

This code 'import re; r = re.compile("(This)"); f = open(r"C:\WINDOWS\system32\mspaint.exe", "rb"); x = f.readline(); r.match(x).groups()' gives the same error as in my original post – DonkeyMaster

In your re.compile you need to use a bytes object, signified by an initial b:

r = re.compile(b"(This)")

This is because Python 3 is strict about the difference between strings and bytes.
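Putting it together, here is a rough sketch of the whole round trip: read the file in "rb" mode, search with a bytes pattern, then decode only the matched text. The file contents, the `name=` record format and the variable names are all invented for illustration, and a temporary file stands in for the real one:

```python
import os
import re
import tempfile

# Hypothetical payload: binary header, one text record, binary trailer.
payload = b"\x00\x01\x02name=example\n\xff\xfe"

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # "rb" mode returns bytes, so no decoding happens on read.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

# The pattern is a bytes object too; \w matches ASCII word bytes here.
pattern = re.compile(br"name=(\w+)")
match = pattern.search(data)

# Decode only the captured group, which is known to be text.
name = match.group(1).decode("ascii")
print(name)
```

Decoding just the matched group avoids having to pick an encoding for the binary parts of the file, which may not be valid in any text encoding.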

2011-04-11 10:19:46

This answer put me on the right track, thank you very much. – DonkeyMaster
