Python nltk: Find collocations without dot-separated words

I am trying to find collocations in a text with NLTK, using its built-in method.

I have the following example text (test and foo follow each other, but there is a sentence boundary between them):

content_part = """test. foo 0 test. foo 1 test. 
       foo 2 test. foo 3 test. foo 4 test. foo 5"""

The result of tokenization and collocations() is as follows:

import nltk

print(nltk.word_tokenize(content_part))
# ['test.', 'foo', '0', 'test.', 'foo', '1', 'test.',
# 'foo', '2', 'test.', 'foo', '3', 'test.', 'foo', '4', 'test.', 'foo', '5']

# collocations() prints its result itself, so no print() is needed
nltk.Text(nltk.word_tokenize(content_part)).collocations()
# test. foo

How can I prevent NLTK from including the dot in my tokens, and from finding collocations across sentence boundaries with collocations()?

So in this example it should not print any collocation at all, but you can imagine more complex texts that also contain collocations within sentences.

I guess I need to use the Punkt sentence segmenter, but then I do not know how to put the sentences back together to find collocations with nltk (collocations() seems to be more powerful than just counting things myself).

2012-02-05 Aufziehvogel

You can use WordPunctTokenizer to separate punctuation from words, and afterwards filter out the bigrams that contain punctuation with apply_word_filter().

The same approach can be used for trigrams, so that no collocations are found across sentence boundaries.

from nltk import FreqDist, bigrams
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
       foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# WordPunctTokenizer emits punctuation as separate tokens
tokens = WordPunctTokenizer().tokenize(content_part)

bigram_measures = BigramAssocMeasures()
word_fd = FreqDist(tokens)
bigram_fd = FreqDist(bigrams(tokens))
finder = BigramCollocationFinder(word_fd, bigram_fd)

# Discard every bigram that contains a punctuation token
finder.apply_word_filter(lambda w: w in ('.', ','))

scored = finder.score_ngrams(bigram_measures.raw_freq)

print(tokens)
print(sorted(finder.nbest(bigram_measures.raw_freq, 2)))

Output:

['test', '.', 'foo', '0', 'test', '.', 'foo', '1', 'test', '.', 'foo', '2', 'test', '.', 'foo', '3', 'test', '.', 'foo', '4', 'test', ',', 'foo', '4', 'test', '.'] 
[('4', 'test'), ('foo', '4')]
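
The trigram variant mentioned above can be sketched in the same way. This is only an illustration, not part of the original answer: the sample text is made up, and it assumes an NLTK 3.x release that provides TrigramCollocationFinder.from_words and TrigramAssocMeasures.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

# Hypothetical sample text: "new york city" recurs inside sentences,
# while sequences such as "city . i" only exist across a sentence boundary.
text = "i like new york city. i visit new york city. new york city, it is big."

# Punctuation becomes separate tokens, e.g. 'city', '.'
tokens = WordPunctTokenizer().tokenize(text)

finder = TrigramCollocationFinder.from_words(tokens)
# Drop every trigram that contains a punctuation token, which removes
# all trigrams that would span a sentence boundary.
finder.apply_word_filter(lambda w: w in (".", ","))

trigram_measures = TrigramAssocMeasures()
top_trigrams = finder.nbest(trigram_measures.raw_freq, 3)
print(top_trigrams)
```

With this text the highest-ranked trigram is ('new', 'york', 'city'), since it is the only one that occurs more than once after filtering.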

2012-02-07 20:39:00 wishiknew
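
The route suggested in the question, splitting the text into sentences first, can also be sketched. The assumptions here are not from the thread: a naive regular-expression split stands in for the Punkt segmenter (nltk.sent_tokenize needs downloaded model data), and BigramCollocationFinder.from_documents is assumed to be available, as it is in NLTK 3.x releases.

```python
import re

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import WordPunctTokenizer

content_part = """test. foo 0 test. foo 1 test.
       foo 2 test. foo 3 test. foo 4 test, foo 4 test."""

# Assumption: a naive split after '.', '!' or '?' stands in for the
# Punkt sentence segmenter, which would be the more robust choice.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", content_part) if s]

tokenizer = WordPunctTokenizer()
tokenized_sentences = [tokenizer.tokenize(s) for s in sentences]

# from_documents treats each token list as its own document, so no bigram
# joins the end of one sentence to the start of the next.
finder = BigramCollocationFinder.from_documents(tokenized_sentences)

# Bigrams that start with '.' would have to cross a sentence boundary.
boundary_pairs = [bg for bg in finder.ngram_fd if bg[0] == "."]

# Punctuation inside a sentence (such as the comma) still needs filtering.
finder.apply_word_filter(lambda w: w in (".", ","))

best = finder.nbest(BigramAssocMeasures().raw_freq, 2)
print(boundary_pairs)
print(best)
```

Here boundary_pairs stays empty, and the two best bigrams are ('foo', '4') and ('4', 'test'), the same pair the filtering approach finds. Splitting by sentence mainly pays off when sentence ends are not reliably marked by a token that can be filtered out.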
