Python pandas Series.str.contains whole word

df (a pandas DataFrame) has three rows.

col_name 
"This is Donald." 
"His hands are so small" 
"Why are his fingers so short?"

I want to extract the row that contains "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

then it also catches "His", which I don't want.

Is the query below the right way to match a whole word in a Series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

2016-09-07 Aaron

No, the regex /bis/b|/bsmall/b will fail because you are using /b, not \b, which means "word boundary".

Change that and you will get a match. I would recommend using

\b(is|small)\b

That regex is a little faster and a little more readable, at least to me.
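As a minimal sketch of the difference, the snippet below rebuilds the three-row frame from the question and compares the plain alternation with the word-boundary pattern. The variable names `substring_hits` and `whole_word_hits` are illustrative, and the non-capturing group `(?:...)` is swapped in for the plain parentheses only because pandas may warn about match groups in `str.contains`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Plain alternation: "is" also matches inside "His" and "his".
substring_hits = df["col_name"].str.contains("is|small", case=False)

# Word-boundary version; the raw string keeps \b as a regex token.
whole_word_hits = df["col_name"].str.contains(r"\b(?:is|small)\b", case=False)

print(substring_hits.tolist())   # [True, True, True]
print(whole_word_hits.tolist())  # [True, True, False]
```

Note that this pattern is an OR: a row matches if it contains either word as a whole word.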

2016-09-07 00:54:05 Laurel

Thanks. I have applied your point (/b -> \b). I would still like to wait a few more days to see whether there is another way to match whole words. – Aaron

Incidentally, I had to add an 'r' before the string to make this work: does anyone know why? I could not find any reference to it.. – mccc

Well, apparently the '|' char explicitly makes it a regex, while '\b' does not.. – mccc
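On the 'r' prefix, a likely explanation is Python's string-literal escaping rather than anything pandas-specific: in an ordinary literal, "\b" is the backspace character, so the regex engine never sees a word-boundary token. A small sketch using only the standard `re` module (the names `plain` and `raw` are illustrative):

```python
import re

plain = "\bis\b"   # "\b" is the backspace character (0x08) in a normal literal
raw = r"\bis\b"    # raw literal: backslash and "b" reach the regex engine intact

print(len(plain), len(raw))                 # 4 6
print(plain[0] == "\x08")                   # True
print(re.search(plain, "This is Donald."))  # None
print(re.search(raw, "This is Donald."))    # matches "is" at span (5, 7)
```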

Your way (with /b) doesn't work for me. I'm not sure why you can't use the logical operator and, since I think that is what you actually want.

This is a silly way to do it, but it works:

mask = lambda x: ("is" in x) & ("small" in x) 
series_name.apply(mask)
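One caveat, sketched below on the question's sample data: the `in` operator tests substrings, so "is" is also found inside "His". If whole words are required, one option (an assumption about intent, not part of the answer above) is to AND two word-boundary masks. The names `substring_mask`, `has_is`, `has_small` and `both_words` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Substring test, as in the lambda above: "is" is found inside "His".
substring_mask = df["col_name"].apply(lambda x: ("is" in x) and ("small" in x))

# Whole-word test: one mask per word, combined with a logical AND.
has_is = df["col_name"].str.contains(r"\bis\b", case=False)
has_small = df["col_name"].str.contains(r"\bsmall\b", case=False)
both_words = has_is & has_small

print(substring_mask.tolist())  # [False, True, False]
print(both_words.tolist())      # [False, False, False]
```

No row in the sample contains both "is" and "small" as whole words, so the second mask is all False.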

2016-09-07 00:43:55 szeitlin

The point is to match the whole word:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To do it all in a single statement using boolean indexing: (o) is, (x) his – Aaron

The example you gave is confusing in that regard, although I think you have since rewritten it to make it a bit clearer. This solves what you originally said the problem was: "I want to extract the row that contains "is" and "small"." – szeitlin

First, you want to convert everything to lowercase, remove punctuation and surrounding whitespace, and then convert the result into a set of words.

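import string

# Lowercase, strip punctuation and surrounding whitespace, then split each
# string into a set of words.
df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
    .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
    .str.strip()
    .str.split()
]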

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

Now you can use boolean indexing to check whether all of your target words are in these new word sets.

target_words = ['is', 'small'] 
# Convert target words to lower case just to be safe. 
target_words = [word.lower() for word in target_words] 

df['match'] = df.words.apply(lambda words: all(target_word in words 
               for target_word in target_words)) 


print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

target_words = ['so', 'small'] 
target_words = [word.lower() for word in target_words] 

df['match'] = df.words.apply(lambda words: all(target_word in words 
               for target_word in target_words)) 

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

To extract the matching rows:

df.loc[[all(target_word in word_set for target_word in target_words)
     for word_set in (set(words) for words in
         df['col_name']
         .str.lower()
         # regex=True is needed on pandas >= 2.0, where str.replace matches literally by default.
         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
         .str.strip()
         .str.split())], :]
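For larger tables, a vectorized alternative might look like the sketch below. This is not the method from the answer above: `contains_all_words` is a hypothetical helper that builds one word-boundary mask per target word with `str.contains` and ANDs them together, with `re.escape` guarding against regex metacharacters in the words.

```python
import re

import pandas as pd


def contains_all_words(series, words):
    """Hypothetical helper: True where every word occurs as a whole word."""
    mask = pd.Series(True, index=series.index)
    for word in words:
        pattern = r"\b" + re.escape(word) + r"\b"
        # na=False treats missing values as non-matches.
        mask = mask & series.str.contains(pattern, case=False, regex=True, na=False)
    return mask


df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

so_small = contains_all_words(df["col_name"], ["so", "small"])
is_small = contains_all_words(df["col_name"], ["is", "small"])

print(df.loc[so_small, "col_name"].tolist())  # ['His hands are so small']
print(is_small.tolist())                      # [False, False, False]
```

The trade-off is one regex pass per target word, which keeps the patterns simple and avoids materializing a set of words for each row.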

2016-09-07 01:05:29 Alexander

Thanks for the answer.. I am trying to use pandas' built-in indexing (because my table has about 500k rows), but it looks like you are indexing it yourself...? – Aaron

Not sure what you mean. This uses pandas indexing. – Alexander

This will return a match, but not a whole-string match! –
