Text Classification for Sentiment Analysis – Stopwords and Collocations | StreamHacker
24 May 2010



Improving feature extraction can often have a significant positive impact on classifier accuracy (and precision and recall). In this article, I'll be evaluating two modifications of the word_feats feature extraction method:

  1. filter out stopwords
  2. include bigram collocations

To do this effectively, we'll modify the previous code so that we can use an arbitrary feature extractor function that takes the words in a file and returns the feature dictionary. As before, we'll use these features to train a Naive Bayes Classifier.

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    # one (feature dict, label) pair per review file
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # first 3/4 of each class for training, the rest for testing
    # (integer division so the cutoffs can be used as slice indices)
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    # collect test indices by true label (refsets) and by predicted label (testsets)
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
    print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
    print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
    print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
    classifier.show_most_informative_features()
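To make the refsets and testsets bookkeeping concrete, here is a minimal pure-Python sketch of set-based precision and recall. The labels and predictions are made up for illustration, and the two helper functions are my own stand-ins that follow the usual definitions; they are not NLTK's code.

```python
import collections

def precision(reference, test):
    """Fraction of predicted items (test) that are truly in the class (reference)."""
    if not test:
        return None
    return len(reference & test) / len(test)

def recall(reference, test):
    """Fraction of true class items (reference) that were predicted (test)."""
    if not reference:
        return None
    return len(reference & test) / len(reference)

# Hypothetical gold labels and classifier output for six test documents.
gold = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
predicted = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg']

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (label, observed) in enumerate(zip(gold, predicted)):
    refsets[label].add(i)       # indices grouped by true label
    testsets[observed].add(i)   # indices grouped by predicted label

pos_precision = precision(refsets['pos'], testsets['pos'])
pos_recall = recall(refsets['pos'], testsets['pos'])
neg_precision = precision(refsets['neg'], testsets['neg'])
neg_recall = recall(refsets['neg'], testsets['neg'])

print('pos precision:', pos_precision)
print('pos recall:', pos_recall)
print('neg precision:', neg_precision)
print('neg recall:', neg_recall)
```

In this toy data one negative document is misclassified as positive, so pos recall stays perfect while pos precision and neg recall drop.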

Baseline Bag of Words Feature Extraction

Here's the baseline feature extractor for bag of words feature selection.

def word_feats(words):
    return dict([(word, True) for word in words])

evaluate_classifier(word_feats)

The results are the same as in the previous articles, but I've included them here for reference:

accuracy: 0.728
pos precision: 0.651595744681
pos recall: 0.98
neg precision: 0.959677419355
neg recall: 0.476
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
          astounding = True              pos : neg    =     10.3 : 1.0
         fascination = True              pos : neg    =     10.3 : 1.0
             idiotic = True              neg : pos    =      9.8 : 1.0

Stopword Filtering

Stopwords are words that are generally considered useless. Most search engines ignore these words because they are so common that including them would greatly increase the size of the index without improving precision or recall. NLTK comes with a stopwords corpus that includes a list of 128 English stopwords. Let's see what happens when we filter out these words.

from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])

evaluate_classifier(stopword_filtered_word_feats)

And the results for a stopword filtered bag of words are:

accuracy: 0.726
pos precision: 0.649867374005
pos recall: 0.98
neg precision: 0.959349593496
neg recall: 0.472

Accuracy went down 0.2%, and pos precision and neg recall dropped as well! Apparently stopwords add information to sentiment analysis classification. I did not include the most informative features since they did not change.

Bigram Collocations

As mentioned at the end of the article on precision and recall, it's possible that including bigrams will improve classification accuracy. The hypothesis is that people say things like "not great", which is a negative expression that the bag of words model could interpret as positive since it sees "great" as a separate word.
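Here is a small sketch of that hypothesis using two made-up sentences. Note that all_bigram_feats is a hypothetical helper that adds every adjacent word pair, not just significant ones, so it only illustrates why a bigram feature can carry information that single words lose.

```python
def word_feats(words):
    # bag of words: each word becomes a boolean feature
    return {word: True for word in words}

def all_bigram_feats(words):
    # hypothetical helper: single words plus every adjacent word pair
    feats = word_feats(words)
    feats.update({pair: True for pair in zip(words, words[1:])})
    return feats

great = ['this', 'movie', 'is', 'great']
not_great = ['this', 'movie', 'is', 'not', 'great']

# With single words only, both sentences share the feature 'great'.
shared_word = 'great' in word_feats(great) and 'great' in word_feats(not_great)

# With bigrams, only the negative sentence has ('not', 'great').
in_negative = ('not', 'great') in all_bigram_feats(not_great)
in_positive = ('not', 'great') in all_bigram_feats(great)

print(shared_word, in_negative, in_positive)
```

The bag of words model can only tell the two sentences apart through the lone word "not", while the bigram features give the classifier "not great" as a unit.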

To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains two internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of two words, basically whether the bigram occurs about as frequently as each individual word.
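To give a feel for what a chi-square score measures, here is a rough pure-Python sketch of the standard Pearson chi-square statistic on a 2x2 table of bigram counts. It is not NLTK's implementation: the function names and the toy sentence are my own, and the marginals are counted from bigram positions to keep the table exact.

```python
from collections import Counter

def chi_square(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a 2x2 bigram contingency table.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams starting with w1
    n_xi: count of bigrams ending with w2
    n_xx: total number of bigrams
    """
    n_io = n_ix - n_ii                 # w1 followed by some other word
    n_oi = n_xi - n_ii                 # w2 preceded by some other word
    n_oo = n_xx - n_ii - n_io - n_oi   # bigrams involving neither
    numerator = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return numerator / denominator if denominator else 0.0

def score_bigrams(words):
    """Score every distinct adjacent word pair in a list of words."""
    bigrams = list(zip(words, words[1:]))
    bigram_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)
    return {
        (w1, w2): chi_square(count, first_counts[w1], second_counts[w2], total)
        for (w1, w2), count in bigram_counts.items()
    }

# Made-up text for illustration only.
words = 'matt damon is great and matt damon is funny but the plot is thin'.split()
scores = score_bigrams(words)

print(scores[('matt', 'damon')])
print(scores[('is', 'great')])
```

Here "matt" is always followed by "damon", so that pair scores much higher than ("is", "great"), where "is" also appears before other words.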

import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bigram_word_feats(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

evaluate_classifier(bigram_word_feats)

After some experimentation, I found that using the 200 best bigrams from each file produced great results:

accuracy: 0.816
pos precision: 0.753205128205
pos recall: 0.94
neg precision: 0.920212765957
neg recall: 0.692
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
   ('matt', 'damon') = True              pos : neg    =     12.3 : 1.0
      ('give', 'us') = True              neg : pos    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0
         uninvolving = True              neg : pos    =     11.7 : 1.0
              avoids = True              pos : neg    =     11.7 : 1.0
('absolutely', 'no') = True              neg : pos    =     10.6 : 1.0

Yes, you read that right: Matt Damon is apparently one of the best predictors for positive sentiment in movie reviews. But despite this chuckle-worthy result:

  • accuracy is up almost 9%
  • pos precision has increased over 10% with only 4% drop in recall
  • neg recall has increased over 21% with just under 4% drop in precision
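These changes can be checked directly from the two result listings. The short sketch below just subtracts the baseline numbers from the bigram numbers and expresses the difference in percentage points; the dictionary names are my own.

```python
# Metrics copied from the baseline and bigram result listings in this article.
baseline = {
    'accuracy': 0.728,
    'pos precision': 0.651595744681,
    'pos recall': 0.98,
    'neg precision': 0.959677419355,
    'neg recall': 0.476,
}
with_bigrams = {
    'accuracy': 0.816,
    'pos precision': 0.753205128205,
    'pos recall': 0.94,
    'neg precision': 0.920212765957,
    'neg recall': 0.692,
}

# Change in percentage points, rounded to one decimal place.
delta = {
    name: round((with_bigrams[name] - baseline[name]) * 100, 1)
    for name in baseline
}

for name, change in delta.items():
    print('%s: %+.1f points' % (name, change))
```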

So it appears that the bigram hypothesis is correct, and including significant bigrams can increase classifier effectiveness. Note that it's significant bigrams that enhance effectiveness. I tried using nltk.util.bigrams to include all bigrams, and the results were only a few points above baseline. This points to the idea that including only significant features can improve accuracy compared to using all features. In a future article, I'll try trimming down the single word features to only include significant words.
