Sentiment Analysis of Movie Reviews

Siddhi Thakur
Jan 14, 2019


Text Analysis and Review classification in Python

This is a short tutorial presenting a minimal text analysis and classification of movie reviews, in which each review is labelled as either positive or negative. The link below downloads the movie review dataset.

Data

http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

Let’s begin the simple classification model using Python’s Pandas, NLTK and Scikit-learn libraries.

Unpack the downloaded files and load them in our code. The folder structure will be txt_sentoken>pos and txt_sentoken>neg, where pos and neg are folders containing text files of positive and negative reviews.

import numpy as np
import re
import nltk
from sklearn.datasets import load_files

movie_data = load_files(r"C:\Users\user\Desktop\kaggle practice\txt_sentoken")
X, y = movie_data.data, movie_data.target

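This loads the reviews and their labels into X and y respectively. X is a Python list of raw byte strings, one per review, and y is a NumPy array of integer labels; load_files numbers the class folders in alphabetical order, so neg is 0 and pos is 1. For a better understanding, see the Scikit-learn documentation for sklearn.datasets.load_files.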
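To see how load_files turns folders into labels, here is a minimal sketch that builds a tiny stand-in for txt_sentoken in a temporary directory. The two review texts and the file name review_0.txt are made up for illustration and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny, hypothetical stand-in for txt_sentoken: one review per class.
root = Path(tempfile.mkdtemp())
samples = [("neg", "a dull , lifeless film ."), ("pos", "a sharp , funny film .")]
for label, text in samples:
    (root / label).mkdir()
    (root / label / "review_0.txt").write_text(text, encoding="utf-8")

toy = load_files(str(root))

print(toy.target_names)    # class folder names, in alphabetical order
print(type(toy.data[0]))   # bytes, because no encoding was passed
print(toy.target)          # one integer per file, indexing into target_names
```

Note that load_files shuffles the files by default, so the order of toy.data is not the order on disk, but toy.data[i] always lines up with toy.target[i].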

Now that we have the data loaded, we need to clean and preprocess the data.
As we can see in this example, the raw text of a review is quite messy.

[b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is that weak , but it's better than the other blockbuster right now ( sleepy hollow ) , but it makes the world is not enough look like a 4 star film . \nanyway , this definitely doesn't seem like an arnold movie . \nit just wasn't the type of film you can see him doing . \nsure he gave us a few chuckles with his well known one-liners , but he seemed confused as to where his character and the film was going . \nit's understandable , especially when the ending had to be changed according to some sources . \naside form that , he still walked through it , much like he has in the past few films . \ni'm sorry to say this arnold but maybe these are the end of your action days . \nspeaking of action , where was it in this film ? \nthere was hardly any explosions or fights . \nthe devil made a few places explode , but arnold wasn't kicking some devil butt . \nthe ending was changed to make it more spiritual , which undoubtedly ruined the film . 
\ni was at least hoping for a cool ending if nothing else occurred , but once again i was let down . \ni also don't know why the film took so long and cost so much . \nthere was really no super affects at all , unless you consider an invisible devil , who was in it for 5 minutes tops , worth the overpriced budget . \nthe budget should have gone into a better script , where at least audiences could be somewhat entertained instead of facing boredom . \nit's pitiful to see how scripts like these get bought and made into a movie . \ndo they even read these things anymore ? \nit sure doesn't seem like it . \nthankfully gabriel's performance gave some light to this poor film . \nwhen he walks down the street searching for robin tunney , you can't help but feel that he looked like a devil . \nthe guy is creepy looking anyway ! \nwhen it's all over , you're just glad it's the end of the movie . \ndon't bother to see this , if you're expecting a solid action flick , because it's neither solid nor does it have action . \nit's just another movie that we are suckered in to seeing , due to a strategic marketing campaign . \nsave your money and see the world is not enough for an entertaining experience . \n"]

Data Preprocessing

With the help of regex and the nltk library we can turn this into clean, useful text. Here is how we do it.

We use the Porter stemmer to remove morphological affixes from words, leaving only the word stem. It can be imported as follows:

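from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Run nltk.download('stopwords') once if the stopword list is not installed.
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(0, len(X)):
    # Replace everything that is not a letter with a space
    review = re.sub(r'[^a-zA-Z]', ' ', str(X[i]))
    # Drop the leading "b" left over from the bytes literal
    review = re.sub(r'^b\s+', '', review)
    # Collapse runs of whitespace into a single space
    review = re.sub(r'\s+', ' ', review)
    review = review.lower()
    review = review.split()
    # Remove stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)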

For each review, we first remove everything that is not a letter, such as digits and punctuation. Secondly, converting each review with str() leaves a 'b' at the start of the text, because load_files returns the reviews as bytes. When we clean the data we get rid of these characters, as they are of little use.
Along with this cleaning, we remove the stopwords from the list of words in each review and reduce the remaining words to their stems.
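As a side note, here is a small sketch comparing str() with decode() on a bytes object. The sample bytes are made up, and UTF-8 is an assumption about the encoding. This is an alternative to consider, not what the code above does: switching to decode() would change the tokens and possibly the accuracy figures reported later.

```python
import re

raw = b"it's a good film . \nsee it ."   # hypothetical review bytes

# str() returns the text of the bytes literal itself: a leading b, the quotes,
# and the newline written out as a backslash followed by the letter n.
as_str = re.sub(r'[^a-zA-Z]', ' ', str(raw)).split()

# decode() returns the actual text, so the newline is real whitespace.
decoded = re.sub(r'[^a-zA-Z]', ' ', raw.decode('utf-8')).split()

print(as_str)
print(decoded)
```

With str(), the letter n from the escaped newline sticks to the next word, which is why the stray prefix has to be cleaned up separately.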

Stemming is a kind of normalization. Many variations of a word carry the same meaning and differ only in tense or other inflection.

We stem to shrink the vocabulary the model has to look up and to normalize the sentences.
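A quick illustration of that normalization, using the same PorterStemmer class. The word list is just an example chosen for this sketch.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Several inflected forms of the same word
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [ps.stem(w) for w in words]

print(stems)             # every form is reduced to the same stem
print(len(set(stems)))   # number of distinct stems left
```

Five surface forms collapse into a single feature, which is exactly the shortening of the vocabulary described above.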

Finally, we have a list of reviews, the corpus, which holds the stemmed, non-stopword words from each review.

Vectorisation

Before we begin with the vectorization, let’s understand the need.

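At the moment, each review is a string of stemmed tokens. To enable Scikit-learn algorithms to work on our text, we need to convert each review into a vector. We can use Scikit-learn's CountVectorizer to convert the text collection into a matrix of token counts. Here max_features=1500 keeps only the 1500 most frequent terms, min_df=5 drops terms that appear in fewer than 5 reviews, and max_df=0.7 drops terms that appear in more than 70% of the reviews.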
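To make the idea of a matrix of token counts concrete, here is a minimal sketch on a made-up three-review corpus of already stemmed text, using CountVectorizer with its default settings rather than the parameters used in this tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-stemmed mini corpus
toy_corpus = ["good movi", "bad movi", "good good plot"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus).toarray()

# Column order of the matrix: vocabulary terms sorted by their column index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(columns)      # ['bad', 'good', 'movi', 'plot']
print(toy_counts)   # one row per review, one column per term
```

Each row is one review and each column is one term, so the third row has a 2 in the "good" column and a 1 in the "plot" column.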

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(corpus).toarray()

from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

The TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation. So while CountVectorizer produces raw term counts, TfidfTransformer reweights them by inverse document frequency, so that terms appearing in many reviews count for less, and then normalizes each review's vector.
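For the curious, this sketch recomputes tf-idf by hand on a small made-up count matrix and compares it with TfidfTransformer. It assumes the transformer's default settings (smoothed idf and L2 normalization of each row); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 reviews (rows) x 4 terms (columns)
counts = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 0],
                   [0, 2, 0, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of reviews containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                         # term count times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()

print(np.allclose(manual, from_sklearn))
```

The term that occurs in two of the three reviews gets a smaller idf than the terms that occur in only one, which is the "counts for less" effect.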

Splitting the dataset

Now that we have everything ready, let's split our data into train and test sets using Scikit-learn's train_test_split, holding out 20% of the reviews for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
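A small sketch of what test_size and random_state do, on made-up data of 10 samples (the names X_toy and y_toy are only for this illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)   # 8 training rows and 2 test rows

# The same random_state reproduces exactly the same split
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Fixing random_state matters when comparing models, since every model is then scored on the same held-out reviews.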

Training and testing the model

from sklearn.naive_bayes import MultinomialNB
classifier= MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

Try different models and check the accuracy for each.
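One way to try several models is to loop over them with the same train and test data. The sketch below does this on synthetic count-like data, not on the movie reviews; the choice of LogisticRegression as the second model and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative "count" features with a simple made-up labelling rule
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(200, 20)).astype(float)
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(Xa, ya)
    scores[name] = accuracy_score(yb, model.predict(Xb))

print(scores)
```

On the real data, X_train, X_test, y_train and y_test from the split above would take the place of the synthetic arrays.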

To evaluate our predictions against the actual labels (stored in y_test), we use confusion_matrix, classification_report and accuracy_score from Scikit-learn.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

Figure: evaluation output for the MultinomialNB classifier

Looks like our model has achieved 78% accuracy.
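If the confusion matrix is new to you, this sketch shows how accuracy follows from it, using eight made-up labels and predictions rather than the real results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (0 = negative, 1 = positive)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_hat)
# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

manual_accuracy = (tn + tp) / cm.sum()   # correct predictions over all predictions

print(cm)
print(manual_accuracy, accuracy_score(y_true, y_hat))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total.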

Using a Random Forest classifier gives us an accuracy of 84%.

from sklearn.ensemble import RandomForestClassifier

classifier1 = RandomForestClassifier(n_estimators=220, random_state=0)
classifier1.fit(X_train, y_train)

y_pred1 = classifier1.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred1))
print(classification_report(y_test,y_pred1))
print(accuracy_score(y_test, y_pred1))

Figure: evaluation output for the Random Forest classifier

Conclusion

Thus, as you can see, it takes only a short and simple piece of code to get a basic sentiment analysis of people's reviews. This kind of algorithm can be applied to almost any dataset of reviews, whether for an app, a game, a series or any commercial product.
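As a rough sketch of how the same steps could be reused on other review data, the vectorizer, tf-idf step and classifier can be chained with Scikit-learn's make_pipeline so that raw text goes in and a label comes out. The four training reviews here are made up, and the regex cleaning and stemming steps are left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training reviews: 1 = positive, 0 = negative
train_texts = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible boring film",
    "awful plot bad acting",
]
train_labels = [1, 1, 0, 0]

# Chain counting, tf-idf weighting and the classifier into one estimator
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_reviews = ["great wonderful film", "boring awful"]
preds = model.predict(new_reviews)
print(preds)
```

Because the pipeline stores the fitted vocabulary, new reviews are vectorized in exactly the same way as the training reviews.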

There are also more complex algorithms for cases where the dataset is not so clean and preprocessing is more laborious. We will come to all that in the future.

Till then, keep working, stay passionate.

Get in touch for recommendations or help Siddhi Thakur
