Sentiment Analysis — using LSTM & GloVe Embeddings

Skillcate AI
8 min read · Jul 28, 2022


On the popular IMDb movie reviews dataset

As a quick summary, in this article we shall train three separate neural networks, namely a Simple Neural Net, a Convolutional Neural Net (CNN), and a Long Short-Term Memory (LSTM) Neural Net.

LSTM networks are actually considered to be quite suitable for handling NLP problems, and towards the end of this article, you will understand why.

Our Sentiment Classification LSTM Model reaches a strong ~86% accuracy. The model not only does a fantastic job of classifying sentiments as positive / negative, but its output also tracks the IMDb ratings of the reviews surprisingly closely. Take a look…

For these fresh reviews taken from imdb.com, along with their IMDb ratings, the model predictions (adjusted to a 0–10 scale) are in remarkable coherence with the actual ratings

Watch the video tutorial instead

We have also done a YouTube video tutorial on this project. Do check that out, if you are a video person! All project related files are kept here: Skillcate Project Toolkit.

YouTube tutorial on Sentiment Classification with Keras: link

Our plan of action is this: first, we set up the environment by loading essential libraries and functions, and loading the dataset. Second, we pre-process user reviews to filter out the non-value-adding parts. Third, we tokenise reviews and prepare an embedding matrix (we shall talk more on this) for our corpus using GloVe. Fourth, we train three separate Deep Learning models as discussed earlier. And finally, we make predictions on live IMDb data to gauge (and appreciate) how our model performs.

Setting the environment

This step is a really straightforward one. First, we import the essential libraries and functions using this script:
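The original import script isn't reproduced inline, so here's a minimal sketch of what it might contain, assuming a TensorFlow 2.x / Keras environment with NumPy, pandas, Matplotlib and scikit-learn available:

```python
# Minimal sketch of the imports used throughout this project
# (assumes a TensorFlow 2.x / Keras environment).
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Flatten, Dense,
                                     Conv1D, GlobalMaxPooling1D, LSTM)
```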

Our dataset, IMDb Movie Reviews, can be downloaded from this source: link. This dataset has 50k movie reviews along with positive / negative sentiment labels. If you are using Google Colab, you may use the following code as is; otherwise make appropriate changes:
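A loading snippet along these lines should work. The file name IMDB Dataset.csv and the column names review / sentiment follow the common Kaggle release of this dataset, so adjust them if your copy differs:

```python
# Hypothetical path: point this at wherever you keep the downloaded CSV.
movie_reviews = pd.read_csv("IMDB Dataset.csv")

print(movie_reviews.shape)    # expecting (50000, 2)
print(movie_reviews.head())   # columns: 'review' and 'sentiment'
```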

Data preprocessing

In this section, we analyse our reviews to check for non-value-adding information. Below is a user movie review I randomly picked from our dataset. Here you can see we have punctuation marks, special characters, HTML tags, numbers, stop-words, etc.

I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I’d laughed at one of Woody’s comedies in years (dare I say a decade?). While I’ve never been impressed with Scarlet Johanson, in this she managed to tone down her “sexy” image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than “Devil Wears Prada” and more interesting than “Superman” a great comedy to go see with friends.

And of course, these add little to no value in telling whether a review is positive or negative. So in this next step, we filter out these non-value-adding parts from our reviews. In the later part of this script, we also convert our labels from positive / negative to 1's / 0's, respectively. The following script gets all of this done:
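The cleaning script isn't shown inline, but a sketch of it could look like the following. It assumes NLTK's English stop-word list, and the regex choices (strip HTML tags, keep only letters, drop stray single characters) mirror the before/after example shown here:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)     # drop punctuation, digits, special chars
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop stray single characters
    text = re.sub(r"\s+", " ", text).strip().lower()
    return " ".join(word for word in text.split() if word not in stop_words)

X = movie_reviews["review"].apply(preprocess_text).values
y = np.where(movie_reviews["sentiment"] == "positive", 1, 0)  # positive -> 1, negative -> 0
```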

This is how the review looks after pre-processing:

thought wonderful way spend time hot summer weekend sitting air conditioned theater watching light hearted comedy plot simplistic dialogue witty characters likable even well bread suspected serial killer may disappointed realize match point risk addiction thought proof woody allen still fully control style many us grown love laughed one woody comedies years dare say decade never impressed scarlet johanson managed tone sexy image jumped right average spirited young woman may crown jewel career wittier devil wears prada interesting superman great comedy go see friends

Cool, right? Then, we split our dataset into an 80:20 train:test split using this code snippet:
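For instance, with scikit-learn's train_test_split (the random_state value is an arbitrary choice for reproducibility):

```python
# 80:20 train:test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
```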

Preparing the Embedding layer

To use text data as input to a neural network model, we need to convert text to numbers. Popular techniques like one-hot encoding do a fair job of getting this done. However, passing huge sparse vectors into a neural network can greatly hurt its performance. Therefore, we need to figure out a way to convert our text into small dense vectors instead. And this is where Word Embeddings come in.

Source: coderzcolumn.com

As a first step, we use the Tokenizer class from the keras.preprocessing.text module to create a word-to-index dictionary: each word in the corpus is a key, and a corresponding unique index is its value. We then compute the vocabulary size, which tells us our corpus has ~92,394 unique words.

Next up, we perform padding to set the length of all reviews to exactly 100 words (you may also try a different size). Sequences longer than 100 are truncated to 100; for sequences shorter than 100, we append 0s at the end until they reach the maximum length. Here's the script to get things done till here:
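A sketch of the tokenization and padding steps. Fitting the tokenizer on the training set only is an assumption on our part; it avoids leaking test-set vocabulary into training:

```python
# Build the word-to-index dictionary from the training reviews
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved padding index 0

# Pad / truncate every review to exactly 100 tokens
maxlen = 100
X_train_pad = pad_sequences(X_train_seq, padding="post", maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, padding="post", maxlen=maxlen)
```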

Next up, we use GloVe embeddings to create our feature matrix: link. For this, we first load the GloVe word embeddings and create a dictionary with words as keys and their embedding vectors as values.
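Assuming the 100-dimensional glove.6B.100d.txt file from the Stanford GloVe release, the dictionary can be built like this:

```python
# word -> 100-dimensional GloVe vector
embeddings_dictionary = {}
with open("glove.6B.100d.txt", encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        embeddings_dictionary[word] = np.asarray(records[1:], dtype="float32")
```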

And finally, we create an embedding matrix in which each row number corresponds to a word's index in the corpus. The matrix has 100 columns, one for each dimension of a word's GloVe embedding.

Checking the shape of embedding_matrix confirms this: you should see 92,394 rows (our vocabulary size) and 100 columns (the GloVe dimensions).
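Putting the last two paragraphs together, here is a sketch of the embedding-matrix construction; words missing from GloVe simply keep a zero row:

```python
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    vector = embeddings_dictionary.get(word)
    if vector is not None:
        embedding_matrix[index] = vector

print(embedding_matrix.shape)  # (vocab_size, 100), i.e. 92,394 x 100 here
```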

Train a Simple Neural Network

Alright, now we are at the model training step.

Here, we create a Sequential() model, and then our embedding layer. The embedding layer has an input length of 100, and the output vector dimension is also 100. The vocabulary size is 92,394 words. Since we are using the pre-trained GloVe embeddings rather than training our own, we set trainable to False and pass our embedding matrix in the weights attribute.

The embedding layer is then added to our model. Next, since we are connecting our embedding layer directly to a densely connected layer, we flatten the embedding layer first. Finally, we add a dense layer with a sigmoid activation function.

Then we compile our model and start its training with this script:
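A sketch of the full model definition and training call. The optimizer, loss, batch size, epoch count and validation split below are assumptions on our part, so tune them as you see fit:

```python
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=maxlen, trainable=False))  # frozen GloVe embeddings
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(X_train_pad, y_train, batch_size=128,
                    epochs=6, validation_split=0.2, verbose=1)
```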

Then we compute predictions on the test set, and analyse the performance metrics and charts.
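Something along these lines; the plotted keys acc / val_acc match the metrics=["acc"] setting used above:

```python
# Test-set performance
score = model.evaluate(X_test_pad, y_test, verbose=1)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# Training vs. validation accuracy, to eyeball overfitting
plt.plot(history.history["acc"], label="train")
plt.plot(history.history["val_acc"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```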

We get a test accuracy of ~75%, while our training accuracy was ~84%. This means our model is overfitting on the training set. Overfitting occurs when the model performs better on the training set than on the test set; ideally, the performance difference between the two should be minimal.

Next up, we move to training a CNN Model.

Train a Convolutional Neural Network

CNN is a type of network primarily used for classifying 2D data, such as images. A convolutional network tries to find specific low-level features in an image in its first layer; in the next layers, the initially detected features are joined together to form bigger features. In this way, the whole image is recognised.

Convolutional neural networks have been found to do a fair job with text data as well. Though text data is one-dimensional, we can use 1D CNNs to extract features from it.

Here, we create a simple convolutional neural network with one convolutional layer and one pooling layer. The code up to the creation of the embedding layer remains the same. Then we compile the model and start training.
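A sketch of the CNN variant. The choice of 128 filters with a kernel size of 5 is an assumption, and the pooling layer here is a global max-pool:

```python
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=maxlen, trainable=False))
model.add(Conv1D(128, 5, activation="relu"))   # 1D convolution over word positions
model.add(GlobalMaxPooling1D())                # keep the strongest signal per filter
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(X_train_pad, y_train, batch_size=128,
                    epochs=6, validation_split=0.2, verbose=1)
```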

Then we compute predictions on the test set; print and plot performance results.

Our test accuracy comes in at a good ~85%, which is much better than the Simple Neural Net result. However, our CNN model is still overfitting, as there is a sizeable difference between the training and test accuracy.

Alright, finally we come to the LSTM Model Training step.

Train an LSTM Neural Network

A Recurrent Neural Network (RNN) is a type of neural network proven to work well with sequence data. And since text is a sequence of words, a recurrent neural network is a natural choice for text-related problems. In this section, we use an LSTM, which is a variant of the RNN.

Here, after the same embedding layer we have been using, we insert an LSTM layer with 128 neurons (you can play around with this number). The rest of the code remains the same. Then we compile and start model training.
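The LSTM variant then differs from the previous models by a single layer; everything else is as before:

```python
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=maxlen, trainable=False))
model.add(LSTM(128))   # 128 units; feel free to experiment with this number
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(X_train_pad, y_train, batch_size=128,
                    epochs=6, validation_split=0.2, verbose=1)
```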

And once training is over, we compute model predictions on the test set and plot charts, the same way as before.

Here, we get the highest test accuracy of ~87%. Even the performance plots show a minuscule difference between the training and test accuracy values.

So, with this we may conclude that for our problem, LSTM is the best-suited model. Guys, congratulations on making it to this point and training your sentiment classification models using neural networks.

To finish things up, we have this last section left here, where we make predictions on the fresh IMDb reviews.

Performing live queries

For this, we first load the test-cases file from the Skillcate Project Toolkit, called IMDB_Unseen_Reviews. It contains the review texts and the real IMDb ratings, taken directly from IMDb. Then, as usual, we preprocess the review texts, followed by tokenization and padding.
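A sketch of that pipeline, reusing preprocess_text, the fitted tokenizer and maxlen from earlier. The .csv extension and the review column name are assumptions about the toolkit file:

```python
unseen = pd.read_csv("IMDB_Unseen_Reviews.csv")

unseen_clean = unseen["review"].apply(preprocess_text)
unseen_seq = tokenizer.texts_to_sequences(unseen_clean)
unseen_pad = pad_sequences(unseen_seq, padding="post", maxlen=maxlen)
```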

And then we call our LSTM model on these reviews for predictions. The output of the LSTM model is a number between 0 and 1, where 1 is positive sentiment. Then we print the model results vis-a-vis our test file data. FYI, we also multiply our model results by 10, so as to bring them onto a scale of 0 to 10 (for the sake of comparison with the real IMDb ratings, which are also on a 0–10 scale).
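For example (the imdb_rating column name is hypothetical; match it to your test file):

```python
predictions = model.predict(unseen_pad).flatten()           # sigmoid outputs in [0, 1]
unseen["predicted_rating"] = np.round(predictions * 10, 1)  # rescale to 0-10
print(unseen[["review", "imdb_rating", "predicted_rating"]])
```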

And, as you may see here, not only does our model classify positive and negative reviews correctly, its scaled output also lands remarkably close to the actual IMDb rating itself, which is quite impressive.

For these fresh reviews taken from imdb.com, along with their IMDb ratings, the model predictions (adjusted to a 0–10 scale) are in close coherence with the actual ratings

I would encourage you to prepare your own such test file and run predictions using your trained LSTM model.

Brief about Skillcate

At Skillcate, we are on a mission to bring you application-based machine learning education. We launch new machine learning projects every week. So, make sure to subscribe to our YouTube channel and hit that bell icon, so you get notified when our new ML projects go live.

And, we are always game to talk Machine Learning ❤️. In case you need any ML project related help or want to discuss your crazy ideas, you can book a 1:1 online session with us by filling out a brief registration form on our website homepage.

We are Skillcate

Shall be back soon with a new ML project. Until then, happy learning 🤗!!


