Sentiment Analysis of Movie Reviews in NLTK Python

A gentle introduction to sentiment analysis

S Joel Franklin
4 min read · Jan 1, 2020
Image by tookapic from Pixabay

Sentiment Analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.

This is a simple project that classifies movie reviews as either positive or negative. We will work with the ‘movie_reviews’ dataset in the nltk.corpus package.

The necessary packages are imported.

import nltk
from nltk.corpus import movie_reviews

Let us explore the package ‘movie_reviews’.

# A list of all the words in 'movie_reviews'
movie_reviews.words()

The output is [‘plot’, ‘:’, ‘two’, ‘teen’, ‘couples’, ‘go’, ‘to’, …]

# Prints the total number of words in 'movie_reviews'
len(movie_reviews.words())

Total number of words is 1583820.

movie_reviews.categories()

The output is [‘neg’, ‘pos’]. There are two categories, ‘neg’ and ‘pos’, which denote negative and positive reviews respectively.

# Displays the frequency of words in 'movie_reviews'
nltk.FreqDist(movie_reviews.words())

The output shows the frequency of each word in ‘movie_reviews’.

# Prints the frequency of the word 'happy'
nltk.FreqDist(movie_reviews.words())['happy']

The frequency of the word ‘happy’ is 15.

# Displays the 15 most common words in 'movie_reviews' with their frequencies
nltk.FreqDist(movie_reviews.words()).most_common(15)

The punctuation characters [‘,’ , ‘.’ , ‘-’] are treated as separate words.
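Because punctuation and common stopwords occur so often, they dominate the top of the frequency distribution. The effect can be illustrated with a small sketch; `collections.Counter` is used here as a stand-in for `nltk.FreqDist` (which behaves like a Counter), and the token list is a toy example, not the actual corpus.

```python
from collections import Counter

# A toy token list standing in for movie_reviews.words()
tokens = ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'the',
          'movie', ',', 'the', 'movie', 'is', 'good', '.', 'the', 'end', '.']

freq = Counter(tokens)
print(freq.most_common(3))   # punctuation appears among the top entries

# Filtering out non-alphabetic tokens removes punctuation from the counts
word_freq = Counter(t for t in tokens if t.isalpha())
print(word_freq.most_common(3))
```

The same `isalpha()` filter could be applied to `movie_reviews.words()` before building the frequency distribution.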

Each movie review has a file id associated with it; the file id uniquely identifies the review.

# Prints all file ids
movie_reviews.fileids()

# Prints file ids of positive reviews
movie_reviews.fileids('pos')

# Prints file ids of negative reviews
movie_reviews.fileids('neg')

The words in a particular movie review can be printed if the file id is known.

# Prints all words in the movie review with file id 'neg/cv001_19502.txt'
movie_reviews.words('neg/cv001_19502.txt')

The output is [‘the’, ‘happy’, ‘bastard’, “‘“, ‘s’, ‘quick’, ‘movie’, …].

Now that we have explored the dataset, let us get into the project.

# all_words is a dict-like frequency distribution of the words in 'movie_reviews'
all_words = nltk.FreqDist(movie_reviews.words())

len(all_words) returns 39768, the total number of distinct words in ‘movie_reviews’.

We define a ‘feature_vector’ containing the first 4000 words of ‘all_words’. We don’t include all the words, in order to save computation time, at the cost of some accuracy. Note that list(all_words) returns words in insertion order rather than frequency order; using all_words.most_common(4000) would select the 4000 most frequent words instead.

# Defining the feature_vector
feature_vector = list(all_words)[:4000]

Let us try to manually analyze the sentiment of one movie review.

# Initialization
feature = {}

# One movie review is chosen
review = movie_reviews.words('neg/cv954_19932.txt')

# 'True' is assigned if a word in feature_vector is also found in the review, otherwise 'False'
for x in range(len(feature_vector)):
    feature[feature_vector[x]] = feature_vector[x] in review

# The words which are assigned 'True' are printed
[x for x in feature_vector if feature[x] == True]

(Only part of the word list is shown.)

Negative words like ‘unfortunately’ and ‘accidentally’ can be seen in the output, which suggests the sentiment is negative. Better results can be achieved if the 4000 words in ‘feature_vector’ lie on the extreme ends of sentiment (strongly positive or strongly negative).
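One simple way to pick words on the extreme ends is to score each word by how much more often it appears in positive reviews than in negative ones. The sketch below uses a hypothetical toy corpus; in the article’s setting the counts would come from the ‘pos’ and ‘neg’ files of movie_reviews.

```python
from collections import Counter

# Toy word lists standing in for the words of positive and negative reviews
pos_words = ['good', 'great', 'fun', 'good', 'plot', 'the', 'the']
neg_words = ['bad', 'boring', 'unfortunately', 'bad', 'plot', 'the', 'the']

pos_freq = Counter(pos_words)
neg_freq = Counter(neg_words)

# Score each word by how strongly it leans positive or negative; words used
# equally often in both classes (like 'the' or 'plot') score near zero.
vocab = set(pos_freq) | set(neg_freq)
score = {w: pos_freq[w] - neg_freq[w] for w in vocab}

# Keep only the most polarised words as the feature vector
feature_vector = sorted(vocab, key=lambda w: abs(score[w]), reverse=True)[:4]
```

With real corpus counts, the same scoring would keep sentiment-laden words and drop neutral ones.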

Now let us try to analyze the sentiments of movie reviews using Machine Learning.

# document is a list of (words of review, category of review) pairs
document = [(movie_reviews.words(file_id), category)
            for file_id in movie_reviews.fileids()
            for category in movie_reviews.categories(file_id)]
document

(Only part of the list is shown.)

# We define a function that finds the features
def find_feature(word_list):
    # Initialization
    feature = {}
    # 'True' is assigned if a word in feature_vector is also found in word_list, otherwise 'False'
    for x in feature_vector:
        feature[x] = x in word_list
    return feature
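A small performance note: each `x in word_list` test scans the whole review, so the loop is quadratic. Converting the review to a set once makes each membership test constant-time. Below is a hypothetical variant with a toy feature vector and review standing in for the real ones.

```python
# Toy stand-ins for the article's feature_vector and a real review
feature_vector = ['happy', 'boring', 'great', 'unfortunately']

def find_feature(word_list):
    # Build the set once so each 'in' test is O(1) instead of a full scan
    words = set(word_list)
    return {x: x in words for x in feature_vector}

review = ['the', 'movie', 'was', 'boring', 'and', 'unfortunately', 'too', 'long']
feature = find_feature(review)
```

The returned dictionary has exactly one boolean entry per word of the feature vector, just like the original function.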

Let us check the function ‘find_feature’.

# Checking the function 'find_feature'
find_feature(document[0][0])

document[0][0] is the list of words of first movie review.


The ‘find_feature’ function works fine.

# feature_sets stores the feature dictionary and category of every review
feature_sets = [(find_feature(word_list), category) for (word_list, category) in document]

Now we create machine learning models.

# The necessary packages and classifiers are imported
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn import model_selection

The data set is split into Training and Test sets.

# Splitting into training and testing sets
train_set, test_set = model_selection.train_test_split(feature_sets, test_size=0.25)

We check the size of Train and Test sets.

print(len(train_set))
print(len(test_set))

train_set is of size 1500 and test_set is of size 500.
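Under the hood, `train_test_split` with `test_size=0.25` amounts to shuffling the data and slicing off the last quarter. A stdlib sketch of the same idea, on a synthetic feature set of the same size as ours (2000 reviews):

```python
import random

# Synthetic stand-in for feature_sets: 2000 (feature dict, label) pairs
feature_sets = [({'word%d' % i: True}, 'pos' if i % 2 else 'neg')
                for i in range(2000)]

random.seed(0)               # fixed seed so the split is reproducible
shuffled = feature_sets[:]   # copy before shuffling
random.shuffle(shuffled)

# 75% for training, 25% for testing
split = int(len(shuffled) * 0.75)
train_set, test_set = shuffled[:split], shuffled[split:]
```

Shuffling before slicing matters here because `feature_sets` lists all negative reviews before all positive ones; without it, the test set would contain only one class.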

We train the model on Training set.

# The model is trained on the Training data
model = SklearnClassifier(SVC(kernel='linear'))
model.train(train_set)

We test the trained model on Test set.

# The trained model is tested on the Test data and accuracy is calculated
accuracy = nltk.classify.accuracy(model, test_set)
print('SVC Accuracy : {}'.format(accuracy))

The Test set accuracy is 63%.

The test set accuracy can be further improved by choosing a more appropriate ‘feature_vector’, increasing its size (only 4000 words were included to save computation time), tuning the SVM hyperparameters, and combining multiple classification algorithms.
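The last idea, combining multiple classifiers, is often done by majority vote: each model predicts a label and the most common label wins. A minimal sketch, where the three “classifiers” are hypothetical stand-in functions rather than trained models:

```python
from collections import Counter

# Stand-in classifiers; in practice these would be trained models such as
# SVC, Naive Bayes and Logistic Regression wrapped in SklearnClassifier
def classifier_a(feature): return 'pos'
def classifier_b(feature): return 'neg'
def classifier_c(feature): return 'pos'

def vote(feature, classifiers):
    # Each classifier casts one vote; the most common label wins
    votes = [clf(feature) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

label = vote({}, [classifier_a, classifier_b, classifier_c])
```

With an odd number of classifiers there is always a clear winner, which is why ensembles of three or five models are common.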

Happy Learning :)
