A comparison of sentiment analysis models using NLP on movie reviews

Sridhar G Kumar
Published in The Startup
7 min read, Sep 1, 2022

For a while now, machine-learning approaches to NLP tasks have relied on BERT (Bidirectional Encoder Representations from Transformers) models, which are widely considered the current gold standard. These models are used in many of our day-to-day language processing tasks, including Google search auto-complete. However, we have to wonder whether a BERT model is the best option for every language processing task.

This article is the result of a one-week collaboration (Sridhar G Kumar, Federico Griggio, Helyne Adamson, Melissa Siddle). We work with the IMDb movie reviews dataset and perform sentiment analysis to determine a corresponding rating for each movie review. We make an exploratory estimate of the accuracies achievable with different models and explore the tradeoff between model size, training time and accuracy. The aim is to achieve comparable results with less complex models, at much faster speeds, for simpler tasks.

BERT Model

BERT is currently the most popular approach to NLP and can achieve some truly impressive accuracies on tasks such as sentence prediction, sentiment analysis, chatbot replies and text summarisation. At its core, BERT is a transformer language model with a variable number of encoder layers and self-attention heads. This enables the model to produce contextualised word embeddings, which preserve more of each word's meaning. The two primary BERT models are BERT Base and BERT Large; their architectures are shown in the figure below [BERT 101].

As the figure shows, even the base model has a very large number of trainable parameters and is resource intensive to train, despite the high accuracies it can reach. For this reason, we use BERT small in our experiments: its training time is closer to that of the other models we want to compare it with. Our target accuracy is 85%. We reached it with a relatively small BERT model with 4 transformer blocks, a hidden size of 512 and 8 attention heads. The accuracy on the test set was 85.5%, with a moderate training time of approx. 30 mins. We then try to match this accuracy using a few different models and approaches. Note that the other models do not use contextualised embeddings. This is not of great importance for our particular use case, since the key words that determine the quality of the movie depend only weakly on context.
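To get a rough feel for the size gap, the sketch below estimates the number of trainable parameters of a BERT-style encoder from its layer count and hidden size. The vocabulary size (30522), maximum sequence length (512) and feed-forward width (4 times the hidden size) are assumptions taken from the standard BERT configuration, not values reported in this article, and the function name is hypothetical.

```python
def encoder_parameter_count(num_layers, hidden_size,
                            vocab_size=30522, max_positions=512,
                            type_vocab_size=2, ffn_multiplier=4):
    """Approximate trainable parameters of a BERT-style encoder.

    The defaults are assumptions based on the standard BERT configuration.
    The number of attention heads does not appear here because it only
    changes how the hidden size is split, not the number of weights.
    """
    h = hidden_size
    layer_norm = 2 * h  # one scale and one bias vector

    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + layer_norm

    # Query, key, value and output projections (weights plus biases).
    attention = 4 * (h * h + h)

    # Two dense layers: hidden -> ffn and ffn -> hidden.
    ffn_size = ffn_multiplier * h
    feed_forward = (h * ffn_size + ffn_size) + (ffn_size * h + h)

    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = h * h + h
    return embeddings + num_layers * per_layer + pooler


small = encoder_parameter_count(num_layers=4, hidden_size=512)
base = encoder_parameter_count(num_layers=12, hidden_size=768)
print(f"small: {small / 1e6:.1f}M, base: {base / 1e6:.1f}M")
# small: 28.8M, base: 109.5M
```

Under these assumptions the small configuration has roughly a quarter of the parameters of the base model, and more than half of those sit in the embedding table rather than in the transformer blocks.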

Naive Bayes Model

Simple models save time and computational resources, which, depending on your project, you may not have a lot of. Thus, we begin our comparison with a Naive Bayes model, which we used to classify the texts into either positive or negative reviews. Naive Bayes is one of the most straightforward and fastest classification algorithms and has been used successfully in a variety of NLP tasks, notably spam filtering and text classification. It is a supervised learning algorithm that uses Bayes' theorem to predict unknown classes. It’s “naive” because it assumes conditional independence between every pair of features (words, in our case), given the class.
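Written out, the independence assumption turns Bayes' theorem into a simple product. The notation here is ours rather than the article's: c is a class (positive or negative) and w_1, ..., w_n are the word features of a review.

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The classifier picks the class with the largest product, so training only requires counting how often each word occurs in each class.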

In natural human language, the set, frequency, and especially order of words convey contextual information (e.g. the difference in meaning between “good” and “not good”). Naive Bayes ignores much of this because of its assumption of conditional independence, yet it can often reach a high degree of accuracy. In our case, the positive or negative affect of a movie review tends to rely on the semantic content of a few key words (amazing, awful, excellent, etc.) rather than on contextual word order, making this a suitable task for the Naive Bayes model.

Because the choice of text preprocessing strongly influences NLP results, we applied standard text cleaning (keeping only words), word tokenization with NLTK, stop-word removal (including custom stop-words specific to the dataset, like “film” and “movie”), and lemmatization of nouns and verbs. Here, we used a Bag-of-Words vectorization approach, which transforms the text into word frequencies. After creating a list of the most commonly used words in the entire corpus, we created feature-sets for each review that indicated whether each of the most common words was present in the review or not. These feature-sets were then fed to the NLTK Naive Bayes classifier (train time < 5 seconds). Using this input, our model reached an accuracy of 83%, with precision of 87% and 90% for positive and negative classification respectively. The NLTK Naive Bayes classifier additionally offers a method to view the most informative words for the model, which for our task were “underrate”, “ridiculous”, and “unfunny”.
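The feature-set idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not the code used for the article: the toy reviews, function names and add-one smoothing are all assumptions made for illustration, and NLTK's classifier uses its own probability estimates. The `(features, label)` pairs built here have the same shape as those accepted by `nltk.NaiveBayesClassifier.train`.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_reviews, size):
    """Return the `size` most common words across all reviews."""
    counts = Counter(word for tokens in tokenized_reviews for word in tokens)
    return [word for word, _ in counts.most_common(size)]


def review_features(tokens, vocabulary):
    """Mark, for each vocabulary word, whether it is present in the review."""
    present = set(tokens)
    return {f"contains({word})": word in present for word in vocabulary}


def train_naive_bayes(labeled_featuresets):
    """Count how often each feature is True per label."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    true_counts = {label: Counter() for label in label_counts}
    for features, label in labeled_featuresets:
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts


def classify(features, model):
    """Pick the label with the highest log-probability."""
    label_counts, true_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            # Add-one smoothing (an assumption, not NLTK's estimator).
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if value else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, already preprocessed toy reviews.
train = [
    (["amazing", "excellent", "plot"], "pos"),
    (["excellent", "cast"], "pos"),
    (["awful", "plot"], "neg"),
    (["awful", "unfunny", "cast"], "neg"),
]
vocabulary = build_vocabulary([tokens for tokens, _ in train], size=6)
model = train_naive_bayes(
    [(review_features(tokens, vocabulary), label) for tokens, label in train]
)
prediction = classify(review_features(["excellent", "story"], vocabulary), model)
print(prediction)  # pos
```

Words outside the vocabulary, such as "story" here, are simply ignored, which is the limitation discussed in the next paragraph.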

While these results are quite good, this approach is limited by the words in the corpus it is trained on. Nevertheless, for a basic binary classification of texts, the NLTK Naive Bayes classifier may be a good starting point.

Word2Vec Embedding with LSTM

The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.
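As a minimal illustration of the similarity measure, the sketch below computes the cosine similarity between small hand-made vectors. The three-dimensional vectors are invented for the example; real Word2vec embeddings are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, not taken from any trained model.
toy_vectors = {
    "great": [0.9, 0.1, 0.3],
    "excellent": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.6, 0.1],
}

similar = cosine_similarity(toy_vectors["great"], toy_vectors["excellent"])
dissimilar = cosine_similarity(toy_vectors["great"], toy_vectors["awful"])
print(similar > dissimilar)  # True
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values near 0 or below indicate unrelated or opposed directions.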

We use a pre-trained Word2vec model alongside bidirectional recurrent neural networks, with the pre-trained model providing the word embeddings for our deep learning model. Such pre-trained models are available through the popular gensim package in several embedding lengths and are differentiated by the dataset they were trained on. For our use case we use the Wikipedia pre-trained model with an embedding length of 100.

Prior to vectorisation, the dataset is preprocessed (stop-word removal, lemmatisation, etc.) using NLTK, as for the Naive Bayes method. Each review is then padded to a length of 200 words, which is sufficient to capture the overall gist of a movie review. The neural network consists of bidirectional LSTM layers followed by a few dense layers. We used the Adam optimiser and tested a few learning rates, obtaining an accuracy of approx. 80%. The model could be improved further, but we chose instead to try cyclic learning rates as a more direct comparison, which is detailed in the sections below.

Word2Vec Embedding with CNN

This is a variation of the previous model that uses a one-dimensional convolutional neural network alongside the same pre-trained Word2vec model. The embedding length and the padding length are the same as before, 100 and 200 respectively. Conv1D models have fewer trainable parameters than their RNN counterparts and are therefore much faster to train. This model, however, had a tendency to overfit during training. We addressed this by tuning the learning rate and adding dropout layers. Regularisers were also used in the CNN layers to avoid overfitting. This model reached an accuracy of 85.6%, similar to the BERT model, in a fraction of the training time.
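The difference in parameter counts can be checked with a little arithmetic. The sketch below uses the standard formulas for a 1D convolution and an LSTM layer; the embedding length of 100 comes from the article, while the number of filters, the kernel size and the number of LSTM units are hypothetical values chosen only for illustration.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    """A bidirectional layer holds one LSTM per direction."""
    return 2 * lstm_params(input_dim, units)


EMBEDDING_DIM = 100  # embedding length used in the article

# Hypothetical layer sizes; the article does not report the actual ones.
cnn_layer = conv1d_params(kernel_size=5, in_channels=EMBEDDING_DIM, filters=64)
rnn_layer = bidirectional_lstm_params(input_dim=EMBEDDING_DIM, units=64)
print(cnn_layer, rnn_layer)  # 32064 84480
```

With these assumed sizes the recurrent layer has more than twice as many weights. A convolution can also process all positions of a sequence in parallel, whereas an LSTM has to step through them one at a time.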

Tokenizer Embedding with CNN

For a further gain in speed, we then replaced the Word2vec model with a tokenizer-based embedding. A tokenizer vectorizes a text corpus by turning each text either into a sequence of integers (each integer being the index of a token in a dictionary) or into a vector in which the coefficient for each token can be binary, based on word count, etc. The tokenized sequences are padded to the same length of 200 words to keep the results comparable with our previous models. The CNN model itself was the same as the previous one, with regularisers and dropout layers.
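The integer-sequence variant can be sketched in plain Python as follows. This is a simplified stand-in for a library tokenizer, not the code used in the article: the function names and toy texts are hypothetical, and padding at the end of the sequence is a choice made here for readability.

```python
from collections import Counter

MAX_LEN = 200  # padding length used in the article


def fit_word_index(texts):
    """Map each word to an integer; frequent words get small indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


def pad_sequence(sequence, length, pad_value=0):
    """Truncate or right-pad a sequence to exactly `length` items."""
    sequence = sequence[:length]
    return sequence + [pad_value] * (length - len(sequence))


# Hypothetical toy corpus.
texts = ["a great great film", "an awful film"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_sequence(seq, MAX_LEN) for seq in sequences]
print(len(padded[0]), len(padded[1]))  # 200 200
```

The padded integer sequences can then be passed to a trainable embedding layer, which learns its own word vectors during training instead of relying on a pre-trained model.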

This gave an accuracy of 86% on our test set, and the model was also the fastest to train. It goes to show that the simplest and fastest model can accomplish the same task with an accuracy comparable to that of a more complex pre-trained model, provided it is well suited to the task.

Optimizer with Cyclic Learning Rates

We found an interesting improvement by implementing CLR [Cyclical Learning Rates, Leslie N. Smith 2017] on the LSTM model. Adam is quite a sophisticated algorithm but, like many other adaptive optimizers, it has some weaknesses. During training, when Adam encounters a saddle point, learning becomes harder because the gradient of the loss tends to zero, which slows down the updates of the model parameters. CLR provides a solution to this problem. Cyclically increasing the learning rate in the proximity of a saddle point speeds up learning by jumping away from the saddle point in the direction determined by the adaptive optimizer. We also tested CLR with RMSprop and SGD with Nesterov momentum, and observed the best performance with Adam.
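A cyclical schedule is easy to express as a function of the training iteration. The sketch below implements the triangular policy from the cited paper; the article does not state which CLR policy or which bounds were used, so the policy choice, the learning-rate range and the step size are all assumptions.

```python
import math


def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over `step_size`
    iterations, falls back over the next `step_size`, then repeats.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical bounds and step size, chosen only for illustration.
schedule = [
    triangular_clr(i, base_lr=1e-4, max_lr=1e-3, step_size=100)
    for i in (0, 50, 100, 150, 200)
]
print(schedule)
```

The rate starts at the lower bound, peaks at iteration 100 and returns to the lower bound at iteration 200. In practice, a function like this would be called once per batch to set the optimizer's learning rate.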

Using the same LSTM model mentioned before, with CLR we achieved an accuracy of 87%. Due to the faster training, we were able to reach this accuracy on the entire dataset in less than a minute.

Conclusions

Our comparisons show that while complex, established models do provide high levels of accuracy and can be used for their specified tasks, doing so is not always in our best interest. The machine learning model must be chosen carefully for the task so that it offers the best tradeoff between resource usage and accuracy; seeking the highest accuracy is not always the ideal option.

In today’s world, machine learning finds its uses in almost all aspects of our lives. ML tasks that appear similar on the surface might be inherently different in their features and finer details. As data scientists we must be careful when choosing the right model for a task; it is not sufficient to simply apply an established model. We must be rigorous in our approach to designing and training the best model for the specific task at hand.

Links

While this article provides an overview of the models used and our conclusions, the actual code and the notebooks used to carry out the analysis can be found on our GitHub page.

Our fastest model was further developed into a web-app and can be accessed at:

https://nlp-movie-review-anffoy276a-ey.a.run.app
