Sentiment Analysis of Movie Reviews with NLTK in Python
A gentle introduction to sentiment analysis
Sentiment Analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.
This is a simple project that classifies movie reviews as either positive or negative. We will be working with the ‘movie_reviews’ dataset in the nltk.corpus package.
The necessary packages are imported.
import nltk
from nltk.corpus import movie_reviews
Let us explore the package ‘movie_reviews’.
# A list of all the words in 'movie_reviews'
movie_reviews.words()
The output is [‘plot’, ‘:’, ‘two’, ‘teen’, ‘couples’, ‘go’, ‘to’, …]
# Prints total number of words in 'movie_reviews'
len(movie_reviews.words())
The total number of words is 1583820.
movie_reviews.categories()
The output is [‘neg’, ‘pos’]. There are two categories, ‘neg’ and ‘pos’, which denote negative and positive reviews respectively.
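Each category contains 1000 reviews, which we can verify directly (these counts are a property of the corpus itself):

# Counts the number of reviews in each category
len(movie_reviews.fileids('pos'))   # 1000
len(movie_reviews.fileids('neg'))   # 1000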
# Displays frequency of words in 'movie_reviews'
nltk.FreqDist(movie_reviews.words())
# Prints frequency of the word 'happy'
nltk.FreqDist(movie_reviews.words())['happy']
The frequency of the word ‘happy’ is 15.
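We can also compute a word’s frequency within each category separately, which hints at how informative the word is for classification. A quick sketch (the corpus reader accepts a categories argument):

# Frequency of 'happy' within each category
nltk.FreqDist(movie_reviews.words(categories='pos'))['happy']
nltk.FreqDist(movie_reviews.words(categories='neg'))['happy']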
# Displays frequency of 15 most common words in 'movie_reviews'
nltk.FreqDist(movie_reviews.words()).most_common(15)
Note that punctuation characters such as ‘,’, ‘.’ and ‘-’ are treated as separate words.
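If we wanted a cleaner frequency distribution, one option (not used in the rest of this project) is to keep only alphabetic tokens and drop common English stopwords. A minimal sketch, assuming the NLTK ‘stopwords’ corpus has been downloaded via nltk.download('stopwords'):

from nltk.corpus import stopwords

# Keep only alphabetic tokens that are not English stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in movie_reviews.words()
                  if w.isalpha() and w.lower() not in stop_words]
nltk.FreqDist(filtered_words).most_common(15)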
Each movie review has a file id associated with it; the file id uniquely identifies the review.
# Prints all file ids
movie_reviews.fileids()

# Prints file ids of positive reviews
movie_reviews.fileids('pos')

# Prints file ids of negative reviews
movie_reviews.fileids('neg')
The words in a particular movie review can be printed if the file id is known.
# Prints all words in the movie review with file id 'neg/cv001_19502.txt'
movie_reviews.words('neg/cv001_19502.txt')
The output is [‘the’, ‘happy’, ‘bastard’, “'”, ‘s’, ‘quick’, ‘movie’, …].
Now that we have explored the corpus, let us get into the project.
# all_words is a dictionary which contains the frequency of each word in 'movie_reviews'
all_words = nltk.FreqDist(movie_reviews.words())
len(all_words) returns 39768, which is the total number of distinct words in ‘movie_reviews’.
We define a ‘feature_vector’ which contains the first 4000 words of ‘all_words’. We don’t include all the words in ‘all_words’, in order to save computational power and time, at the cost of some accuracy.
# Defining the feature_vector
feature_vector = list(all_words)[:4000]
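Note that list(all_words)[:4000] takes the first 4000 distinct words in the order they appear in the corpus, not the 4000 most frequent ones. If the intent is to use the most frequent words as features, FreqDist.most_common gives the frequency-ranked list; a possible alternative:

# Alternative: use the 4000 most frequent words as features
feature_vector = [word for (word, count) in all_words.most_common(4000)]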
Let us try to manually analyze the sentiment of one movie review.
# Initialization
feature = {}

# One movie review is chosen
review = movie_reviews.words('neg/cv954_19932.txt')

# 'True' is assigned if a word in feature_vector can also be found in the review; otherwise 'False'
for x in range(len(feature_vector)):
    feature[feature_vector[x]] = feature_vector[x] in review

# The words which are assigned 'True' are printed
[x for x in feature_vector if feature[x]]
Negative words like ‘unfortunately’ and ‘accidentally’ can be seen in the output, which suggests the sentiment is negative. Better results could be achieved if the 4000 words in ‘feature_vector’ were strongly polarized (clearly positive or clearly negative).
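As one possible refinement along these lines (not pursued further here), NLTK ships the Hu and Liu opinion lexicon, which lists strongly positive and negative words. A minimal sketch, assuming nltk.download('opinion_lexicon') has been run:

from nltk.corpus import opinion_lexicon

# Restrict the candidate features to strongly polarized words
polar_words = set(opinion_lexicon.positive()) | set(opinion_lexicon.negative())
polar_feature_vector = [w for w in all_words if w in polar_words][:4000]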
Now let us try to analyze the sentiments of movie reviews using Machine Learning.
# document is a list of (words of review, category of review) pairs
document = [(movie_reviews.words(file_id), category)
            for file_id in movie_reviews.fileids()
            for category in movie_reviews.categories(file_id)]
document
# We define a function that finds the features
def find_feature(word_list):
    # Initialization
    feature = {}
    # 'True' is assigned if a word in feature_vector is also found in word_list; otherwise 'False'
    for x in feature_vector:
        feature[x] = x in word_list
    return feature
Let us check the function ‘find_feature’.
# Checking the function 'find_feature'
find_feature(document[0][0])
document[0][0] is the list of words of the first movie review.
The ‘find_feature’ function works fine.
# feature_sets stores the feature dictionary and category of every review
feature_sets = [(find_feature(word_list), category) for (word_list, category) in document]
Now we create machine learning models.
# The necessary packages and classifiers are imported
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn import model_selection
The data set is split into Training and Test sets.
# Splitting into training and testing sets
train_set, test_set = model_selection.train_test_split(feature_sets, test_size=0.25)
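train_test_split shuffles the data before splitting, so the exact split (and hence the accuracy reported below) changes from run to run. Passing a fixed random_state makes the split reproducible; the value 42 here is arbitrary:

# Optional: fix random_state for a reproducible split
train_set, test_set = model_selection.train_test_split(feature_sets, test_size=0.25, random_state=42)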
We check the sizes of the training and test sets.
print(len(train_set))
print(len(test_set))
train_set is of size 1500 and test_set is of size 500.
We train the model on the training set.
# The model is trained on the training data
model = SklearnClassifier(SVC(kernel='linear'))
model.train(train_set)
We test the trained model on the test set.
# The trained model is tested on the test data and accuracy is calculated
accuracy = nltk.classify.accuracy(model, test_set)
print('SVC Accuracy : {}'.format(accuracy))
The test set accuracy is 63%. It can be improved further by choosing a more appropriate ‘feature_vector’, increasing its size (only 4000 words were included to save computational power and time), tuning the SVM hyperparameters, and combining multiple classification algorithms, as sketched below.
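As a sketch of the last idea, other scikit-learn classifiers can be wrapped by SklearnClassifier and evaluated on the same split in exactly the same way. The classifiers chosen here are illustrative, and the resulting accuracies will vary:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Train and evaluate a few alternative classifiers on the same split
for name, clf in [('MultinomialNB', MultinomialNB()),
                  ('LogisticRegression', LogisticRegression(max_iter=1000))]:
    alt_model = SklearnClassifier(clf)
    alt_model.train(train_set)
    print('{} Accuracy : {}'.format(name, nltk.classify.accuracy(alt_model, test_set)))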
Happy Learning :)