Movie Review Sentiment Analysis w/ RNNs

Jack Harding
Published in Analytics Vidhya · Mar 28, 2021

According to IMDb, the site hosts over half a million movie titles and almost 5.5 million user reviews, each made up of a 1–10 rating and a text review; I want to use these reviews to train a model that can guess the sentiment of future reviews. This type of natural language processing is called text classification. Much of the work here is from Lesson 8 of the FastAI course, which I cannot recommend enough to anyone interested in deep learning. Follow along with the notebook used to train the model.

Preprocessing

Text is a pretty abstract concept to a computer that only understands ones and zeros; it takes some work before a model can make sense of human language. Each word in a corpus is like a categorical variable with a finite number of levels; the set of levels for the corpus is called the vocab.

There are two steps:

  1. Tokenisation: create a list of words/subwords for the corpus.
  2. Numericalisation: convert each token into a number by replacing it with its index in the vocab.

Tokenisation

Seems pretty straightforward: pass the text to a tokeniser and start training? Unfortunately, we first need to handle things like punctuation and hyphenation. There are different ways of splitting text into tokens, but the one I'll focus on here is word-based. Luckily, there are ready-made word tokenisers; one of them is spaCy, which fastai wraps in its WordTokenizer class. It knows how to process words like "it's".

from fastai.text.all import *
spacy = WordTokenizer()  # fastai's wrapper around the spaCy tokeniser
first(spacy(["It's really sunny on Monday"]))
>> ['It', "'s", 'really', 'sunny', 'on', 'Monday']

FastAI adds its own special token rules to give each word context, such as marking the beginning of a review or a capitalised word.

tok = first(Tokenizer(spacy)(["It's really sunny on Monday"])); tok
>> ['xxbos', 'xxmaj', 'it', "'s", 'really', 'sunny', 'on', 'xxmaj', 'monday']

xxbos signifies the beginning of the stream (bos) and xxmaj a capital letter. Words like "It" and "it" are the same word to us, but to the model they would be distinct. These special tokens let both be parsed as the same word, with the capitalisation token preserving the distinction.

Numericalisation

The model now knows how to split each review into words, but we still need to convert them into something numerical the computer understands. Numericalisation works like one-hot encoded variables: Numericalize makes a list of all possible levels of the categorical variable (the vocab) and replaces each token with its index in that list.

num = Numericalize(min_freq=3, max_vocab=60000)
num.setup(tok)   # build the vocab from the tokenised text
num(tok)         # replace each token with its index in the vocab
>> tensor([ 4, 11, 435, 434, ... ])

Storing every word is not practical because the vocab would get too big: max_vocab caps its size, while min_freq requires a word to appear several times before it is added. The resulting tensor can now be fed directly into a network's embedding layer.
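As a quick illustration (my own, not from the article's notebook), the indices produced by Numericalize can be fed straight into an embedding layer; the embedding size of 64 is simply the value used later in this article.

import torch.nn as nn

emb = nn.Embedding(num_embeddings=60000, embedding_dim=64)  # one learned vector per vocab entry
ids = num(tok)         # tensor of token indices from above
vectors = emb(ids)     # shape: (number of tokens, 64)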

RNNs

Recurrent Neural Networks are used in NLP because they take the order of words in a sequence into account. The basic idea is that the first layer of the network uses the first word's embedding; the second uses the second word's embedding combined with the first layer's output activations; the third uses the third word's embedding combined with the second layer's activations. The activations carried over from the previous word are called the hidden state.

RNN from FastAI

Instead of explicitly declaring each layer, they are refactored into a for loop; this looping is what makes the network recurrent. The for loop's limit is the number of layers.

The first class above, LanguageModel, explicitly feeds each layer the previous layer's activations alongside the next word's embedding; LanguageModelRecurrent uses a loop to implement the same thing. This model yields about 50% accuracy.
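Since the original classes appear only as an image above, here is a minimal sketch of the recurrent version following the structure of the FastAI lesson; the names and sizes are illustrative rather than the author's exact code.

import torch.nn as nn
import torch.nn.functional as F

class LanguageModelRecurrent(nn.Module):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output

    def forward(self, x):
        h = 0
        for i in range(3):                 # one "layer" per word in the sample
            h = h + self.i_h(x[:, i])      # add the next word's embedding
            h = F.relu(self.h_h(h))        # combine with the previous activations
        return self.h_o(h)                 # predict the word that follows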

State & Signal

In forward(), the hidden state, h, stores the previous activations; it is then reset to zero for every sample, discarding potentially important information about earlier words that give the review context. Moving the reset into init() (with a reset() method) solves this problem by only zeroing the state when training starts. Maintaining the state opens another problem, though: the model is now effectively as deep as the whole token stream, which could be 60,000 tokens, and computing gradients through every one of those layers makes training slow and prone to exploding gradients. Keeping only the gradients for the most recent three tokens and detaching the rest solves this; this is called truncated backpropagation through time (tBPTT).
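Continuing the sketch above (again illustrative, not the author's exact code), keeping the hidden state between batches and truncating the gradient history looks roughly like this:

class LanguageModelStateful(LanguageModelRecurrent):
    def __init__(self, vocab_sz, n_hidden):
        super().__init__(vocab_sz, n_hidden)
        self.h = 0                         # hidden state kept between batches

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()           # keep the state, drop the gradient history (tBPTT)
        return out

    def reset(self):
        self.h = 0                         # zero the state only when training (re)starts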

The second RNN implementation above adds signal by looping over the full sequence length, sl, and predicting the next word at every position instead of only after three words. More context per sequence potentially improves accuracy; this is what adding signal means here. Introducing state increased the accuracy to 57%, and adding more signal brought it to 64%.
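Building on the stateful sketch above (same caveat: illustrative names, not the author's code), predicting after every token rather than only at the end might look like this:

import torch

class LanguageModelMoreSignal(LanguageModelStateful):
    def forward(self, x):
        outs = []
        for i in range(x.shape[1]):             # loop over the sequence length, sl
            self.h = self.h + self.i_h(x[:, i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))       # predict the next word at every position
        self.h = self.h.detach()                # tBPTT, as before
        return torch.stack(outs, dim=1)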

Multilayer & LSTMs

Multilayer RNNs take the first layer's output and feed it into the next, giving the model a longer time horizon to learn from and a better understanding of the text. Given this, the accuracy should improve by a lot, but it instead decreased to 48%, down 16 points from the last single-layer model. The reason for this drop is a pair of phenomena called exploding and vanishing gradients. Repeatedly multiplying matrices causes their precision to degrade because each floating-point number only has 32 bits; as floats stray from 1, they lose exactness because more bits are needed to store their magnitude. Every extra layer a matrix passes through pushes its numbers further from their true values. If gradients are too small, the algorithm does not update; if they are too big, the updates are too drastic.
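A quick numerical illustration (my own, not from the article) of why repeatedly multiplying by values away from 1 blows up or dies out:

x = 1.1
print(x ** 100)   # ~13780.6   -> gradients at this scale "explode"
y = 0.9
print(y ** 100)   # ~2.66e-05  -> gradients at this scale "vanish"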

The long short-term memory (LSTM) architecture addresses this by introducing a second hidden state, the cell state, which retains more memory of the sentence. The hidden state focuses on the current token, while the cell state keeps track of activations from words earlier in the sequence.

LSTM Cell from LARNN

The orange boxes in the diagram above represent layers in the network, tanh being the hyperbolic tangent and the other being the sigmoid. Squashing outputs to 0–1 for sigmoid and -1 to +1 for tanh keeps the activations bounded, which helps with the exploding gradients issue. The cell state, which controls the LSTM, is updated via the diagram's yellow circles (element-wise operations); the product of their inputs decides what happens to the state: values closer to 1 are kept, values closer to 0 are discarded. The network can now maintain a long-term memory of words, making longer sentences easier to understand.

init() defines each gate used in the LSTM, while forward() implements them. Conveniently, PyTorch has a built-in class, so there is no need to write all of that by hand. With LSTM, the multilayered model's accuracy went from 48% to 81%, quite a big jump! The model's validation loss was far higher than its training loss, suggesting it was overfitting the data.
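The built-in class is nn.LSTM. The LanguageModelLSTM used further down is defined in the article's notebook, so treat the following as a rough sketch of what it might look like, matching the (vocab size, hidden size, number of layers, dropout) signature used later:

import torch
import torch.nn as nn

class LanguageModelLSTM(nn.Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)                  # dropout, covered in the next section
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight          # weight tying, covered below
        self.h = None                              # (hidden state, cell state)

    def forward(self, x):
        if self.h is None or self.h[0].shape[1] != x.shape[0]:
            shape = (self.rnn.num_layers, x.shape[0], self.rnn.hidden_size)
            self.h = (torch.zeros(shape, device=x.device), torch.zeros(shape, device=x.device))
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = (h[0].detach(), h[1].detach())    # keep the state, truncate the gradients
        return self.h_o(self.drop(res))

    def reset(self):
        self.h = None                              # called by fastai before each epoch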

Regularisation

In traditional ML models, this step is reasonably straightforward with just a regularisation term appended to the loss function:

Weight Decay Parameter
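The image above shows the weight-decay term; written out in code (a standard formulation rather than something from the article), it amounts to adding the sum of the squared weights, scaled by wd, to the loss:

loss_with_wd = loss + wd * (parameters ** 2).sum()   # wd controls the strength of the penalty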

In NLP, the process is a little more complicated. Some approaches use data augmentation, for example translating the text into another language and back again to phrase the sentences in a different way; this is an open area of research and beyond this article's scope.

Dropout

The idea behind dropout is to randomly drop neurons within the network during training so that the remaining neurons have to cooperate rather than rely on any single activation. The exclusion of these neurons introduces noise into the network, making the model more robust and less prone to overfitting.

Dropout by KDNuggets

Whether a neuron is dropped is controlled by a probability, p, which can vary from layer to layer, with heavier dropout applied to the more complicated layers of the network. Each neuron's mask follows a Bernoulli distribution, defined below.

Bernoulli Distribution

Dropout implemented in PyTorch:

class Dropout(Module):
    def __init__(self, p):
        self.p = p

    def forward(self, x):
        if not self.training:
            return x                                   # no dropout at inference time
        mask = x.new(*x.shape).bernoulli_(1 - self.p)  # keep each activation with probability 1-p
        return x * mask.div_(1 - self.p)               # rescale so the expected activation is unchanged

Weight Tying

A language model's input embedding (the first layer) maps English words to activations, and its output layer maps activations back to English words. Setting these two weight matrices to be the same can improve accuracy. Here's the paper.

self.h_o.weight = self.i_h.weight

FastAI supplies a TextLearner class to do most of the work for us:

learn = TextLearner(dls, LanguageModelLSTM(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(15, 1e-2, wd=0.1)

The accuracy went from 81% to 87%; that would have been the best in the world five years ago! FastAI's creators trained a model using the same techniques as above to achieve an accuracy of 94%, which has only recently been beaten.

Prediction

FastAI provides an IMDb dataset with 25,000 polarised reviews, accessed below with pretty simple syntax:

path = untar_data(URLs.IMDB)

DataBlock uses path to load the examples into the model.

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

Creating the model uses the DataBlock, AWD_LSTM (the regularised LSTM architecture) and drop_mult (a global multiplier for the dropout probabilities). to_fp16() converts all 32-bit floats to 16-bit, making for faster training.

l3 = text_classifier_learner(
    dls_clas,
    AWD_LSTM,
    drop_mult=0.5,
    metrics=accuracy
).to_fp16()

I recommend saving the trained model and vocab in pickle files, as it took over two hours to train it on Colab GPUs. Loading the model and making predictions is done below:

l3 = l3.load('/path/to/my_saved_model')
l3.predict('That was terrible!')
>> ('neg', tensor(0), tensor([0.8067, 0.1933]))
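For the saving step mentioned above, fastai's Learner.save is one option; the call below is a hypothetical counterpart to the load call above, with an illustrative filename:

l3.save('my_saved_model')   # writes the weights to <learner path>/models/my_saved_model.pth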

Hosting

I would usually put this model in a Streamlit web app for inference, as I did in my previous projects, but because the saved model is so large, I will need to use cloud hosting.
