Sentiment Analysis — How to estimate the rating from a movie review

The process of computationally identifying and categorizing opinions expressed in a piece of text

Darshan Adakane
Analytics Vidhya
5 min read · May 28, 2021


Examples of ratings predicted from submitted movie reviews

Introduction:

When was the last time you watched a movie and submitted a review? Or had a nice dinner at a good restaurant and left feedback? Now imagine this: you submit your review, and the rating is generated automatically. Cool, right?

Welcome to the field of Natural Language Processing (NLP) and, more specifically, ‘Sentiment Analysis’. Also known as opinion mining or emotion AI, sentiment analysis refers to the use of natural language processing and text analysis to systematically identify, extract, and quantify subjective information [1]. In this article, we will try to achieve exactly that: generating a movie rating from a submitted review.

Pre-requisites:

  1. Theoretical knowledge of sentiment analysis and of concepts like RNNs, LSTMs, GRUs, word embeddings, etc. All of these are covered in Andrew Ng’s fantastic Sequence Models course
  2. Knowledge of Python + any code editor (here we will use a Jupyter notebook)

Coding:

Assuming prerequisite knowledge of the topic, our plan is first to train a model on a dictionary of phrases and individual words (about 100,000 entries) with their corresponding sentiment (rating). After training, we will test it on new, unseen phrases or words from a submitted review and observe what rating it gives. As simple as that. I will describe the coding step by step.
First, let’s create a new Jupyter notebook file. In the root folder next to this file, create a folder named ‘input’, then download the training and test data (train.tsv, test.tsv) into it. sampleSubmission.csv shows the format of our output file. Our folder structure looks like this:
-sentimentanalysis.py
-input
  -train.tsv
  -test.tsv
  -sampleSubmission.csv

Let’s begin coding. First, we will import the libraries needed.

The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in Python. tqdm is a progress-bar library with good support for nested loops and Jupyter/IPython notebooks.
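A typical import cell for this pipeline might look like the following (a sketch; the notebook’s exact imports and any Keras imports for the model come later and may differ):

```python
import re

import numpy as np
import pandas as pd
import nltk
from tqdm import tqdm
```

With these in place, the training data can be read with `pd.read_csv("input/train.tsv", sep="\t")`, since the files are tab-separated.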

If you observe the output here (which is our training data), each review sentence is broken down into individual phrases, each represented by a PhraseId and assigned a sentiment. This is done for every review sentence in the training data, and together these entries form our dictionary. The following function takes each phrase iteratively, removes HTML content, removes non-alphabetic characters, tokenizes the sentence, lemmatizes each word to its lemma, and returns the results in a list named ‘reviews’.
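A minimal sketch of that cleaning function: a regex stands in for the HTML stripper, tokenization is plain whitespace splitting, and the NLTK lemmatization step is noted in a comment rather than executed, to keep the sketch dependency-free.

```python
import re

def clean_review(raw):
    # Strip HTML tags (a regex stands in for an HTML parser here).
    text = re.sub(r"<[^>]+>", " ", raw)
    # Keep alphabetic characters only and lowercase everything.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Tokenize on whitespace; the notebook additionally lemmatizes each
    # token with NLTK's WordNetLemmatizer before collecting it.
    return text.split()
```

Applying this over every phrase in the train and test frames (with tqdm showing progress) produces the cleaned review lists used in the next steps.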

Next, we retrieve the cleaned reviews for both the train and test sets.

With the following code, we convert the output values into categorical (one-hot) form across the sentiment classes.
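The sentiment column takes integer values 0 (negative) through 4 (positive). Keras’ `to_categorical` handles this conversion; a NumPy equivalent for illustration, assuming those five classes:

```python
import numpy as np

def to_one_hot(labels, num_classes=5):
    # Equivalent of keras.utils.to_categorical for integer labels 0..4:
    # each label becomes a row with a 1 in its class column.
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, num_classes))
    one_hot[np.arange(labels.size), labels] = 1
    return one_hot
```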

We have all the data ready for training. Let’s split it into training and validation sets.
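A sketch using scikit-learn’s `train_test_split` on stand-in arrays; the real call operates on the padded sequences and one-hot targets, and the 80/20 split and random seed here are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in for the input sequences
y = np.arange(10) % 5             # stand-in for the sentiment targets

# Hold out 20% of the data for validation during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```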

Next, we get the number of unique words and the maximum review length in the list of cleaned reviews. Both are needed to initialize the Keras tokenizer and for the subsequent padding.
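A minimal helper computing both quantities, assuming `reviews` is the list of token lists produced by the cleaning step:

```python
def vocab_and_max_len(reviews):
    # reviews: list of token lists from the cleaning step.
    unique_words = {word for review in reviews for word in review}
    max_len = max(len(review) for review in reviews)
    return len(unique_words), max_len
```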

Next, we fit the actual Keras tokenizer and convert the texts to sequences. `texts_to_sequences` takes a list of texts and returns a list of sequences (one per input text).
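The notebook uses `keras.preprocessing.text.Tokenizer`; its behavior (`fit_on_texts` builds a frequency-ranked word index, `texts_to_sequences` maps words to those indices) can be illustrated with this dependency-free miniature, which operates on already-tokenized reviews:

```python
from collections import Counter

class MiniTokenizer:
    """Miniature stand-in for keras.preprocessing.text.Tokenizer."""

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        # Build a frequency-ranked word index; the most frequent word
        # gets index 1, as Keras does (0 is reserved for padding).
        counts = Counter(word for text in texts for word in text)
        for rank, (word, _) in enumerate(counts.most_common(), start=1):
            self.word_index[word] = rank

    def texts_to_sequences(self, texts):
        # One sequence of indices per input text; unseen words are dropped.
        return [[self.word_index[w] for w in text if w in self.word_index]
                for text in texts]
```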

Next, padding is applied to equalize the lengths of all input reviews. LSTM networks require all inputs in a batch to have the same length, so reviews shorter than the maximum length are extended with zeros at the end. This is padding.
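A dependency-free sketch of this post-padding; `keras.preprocessing.sequence.pad_sequences` with `padding="post"` does the same job in the notebook:

```python
def pad_post(sequences, max_len):
    # Append zeros so every sequence reaches max_len, mirroring
    # pad_sequences(..., maxlen=max_len, padding="post").
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]
```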

To prevent overfitting, we will use early stopping as a callback in the model fitting function call. Let’s define it.
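The rule that Keras’ `EarlyStopping` callback implements can be sketched in plain Python; the patience value here is an assumption, and the notebook monitors validation loss:

```python
def should_stop(val_losses, patience=3):
    # Stop once the validation loss has failed to improve on the best
    # earlier value for `patience` consecutive epochs.
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far
```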

Let’s define the Keras model using an LSTM, topped with a dense softmax layer for multi-class classification. This architecture is specially designed to work on sequence data, so it fits many NLP tasks like tagging and text classification: it treats the text as a sequence rather than as a bag of words or n-grams.
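A sketch of such a model, assuming the hyperparameters reported in the conclusion (128 LSTM units, 50% dropout, Adam) and a hypothetical vocabulary size; the exact layer stack in the notebook may differ:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

num_unique_words = 15000  # assumption: vocabulary size from the cleaned reviews

model = Sequential([
    # Learn a 128-dimensional embedding for every word index.
    Embedding(num_unique_words, 128),
    # 128-unit LSTM with 50% dropout, per the best run in the conclusion.
    LSTM(128, dropout=0.5),
    # One softmax unit per sentiment class (0-4).
    Dense(5, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```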

Let’s make predictions with the trained model and prepare the output for submission.
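`model.predict` returns one row of five softmax probabilities per test phrase; the predicted sentiment is the argmax of each row. A stand-in with hypothetical probabilities:

```python
import numpy as np

# Each row is a softmax distribution over the five sentiment classes
# (hypothetical values standing in for model.predict output).
pred_probs = np.array([[0.1, 0.2, 0.4, 0.2, 0.1],
                       [0.7, 0.1, 0.1, 0.05, 0.05]])
# The prediction for each phrase is its most probable class.
pred_sentiment = pred_probs.argmax(axis=1)
```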

Here we create a sample output frame to hold our predictions on the test data. This is what sampleSubmission.csv looks like; it shows the format our output should follow.

Finally, we will write our actual output to the output.csv file.
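A sketch of that final write, with hypothetical PhraseIds and predictions; the real values come from test.tsv and the model’s argmax output, and the notebook passes the filename "output.csv" instead of the in-memory buffer used here to keep the sketch self-contained:

```python
import io

import pandas as pd

# Hypothetical ids/predictions standing in for the real test-set output.
submission = pd.DataFrame({"PhraseId": [1, 2], "Sentiment": [2, 3]})

buf = io.StringIO()                 # the notebook writes "output.csv"
submission.to_csv(buf, index=False)
```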

Conclusion:

We have observed that with these NLP methods the sentiment prediction is quite accurate. It depends on the dictionary we use for model training, and the accuracy further depends on the number of epochs, dropout rate, batch size, number of layers, optimizer, and input length. The best accuracy comes from a particular combination of these, so one should experiment to find it. In our case, the best accuracy was obtained with a 128-unit LSTM model, 50% dropout, and the Adam optimizer.

Thanks for reading the article. I hope it helps.
You can find the GitHub repo at this link.

References:

[1] Sentiment analysis, Wikipedia

[2] RNN model, Andrew Ng

[3] RNN Architecture, Andrew Ng

[4] GRU
