Sentiment Analysis, Part 4 — A survey of Sentiment Analysis methods

Published in Besedo Engineering Blog · 7 min read · Mar 7, 2022

This is the fourth (and final) post in our Sentiment Analysis series. The earlier posts are: Sentiment Analysis, Part 1 — A friendly guide to Sentiment Analysis; Sentiment Analysis, Part 2 — How to choose pre-annotated datasets for Sentiment Analysis?; and Sentiment Analysis, Part 3 — Data annotation. This post surveys the different methods you can use in a Sentiment Analysis study. We tried several of them ourselves, and we share what we learned.

As a reminder, the objective of any Sentiment Analysis study is to obtain a model that can predict the sentiment of a text. At Besedo, we use this information for two purposes: content moderation and insights for our clients. There are two main ways to predict a sentiment: a linguistic approach based on rules and patterns, and a data science approach. This blog post only covers the data science approach, through Machine Learning and Deep Learning methods, including Transformers.

Our different approaches

We started our study with older approaches and moved step by step toward the methods that are currently State Of The Art.

Preprocessing

If you are using an automatic approach, the first step is to preprocess your data. Preprocessing is one of the most important steps, as it can strongly influence the performance of a model. It can mean many things, and it is up to you to decide which kind to perform: removing punctuation, normalizing words, lowercasing the text, removing emojis, or removing URLs.

Any preprocessing may lead to some information loss. For example, if we lowercase the text, we lose the information that a person was mentioning a proper name (“paris” instead of “Paris”), and if we remove punctuation, we may lose part of the emotional tone of the text (“I can’t wait for tomorrow !!!!!!!!!!” becomes “I can’t wait for tomorrow”).

Despite that, preprocessing remains an integral part of the process, because some algorithms and models do not work well with a large feature space, especially when there is not a lot of data. That is why we try to condense the information and help the models learn better.
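The steps listed above can be sketched as a small function with one switch per step, which makes it easy to compare the effect of each choice. This is only a minimal illustration using Python's standard library: the function name, the flags, and the rough emoji ranges are our own assumptions, not the exact preprocessing used in the study.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges, assumed for illustration only (not a complete list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Everything that is not a word character, whitespace, or an apostrophe.
PUNCT_RE = re.compile(r"[^\w\s'’]")


def preprocess(text, lowercase=True, strip_urls=True,
               strip_emojis=True, strip_punctuation=True):
    """Apply the selected preprocessing steps, then collapse whitespace."""
    if strip_urls:
        text = URL_RE.sub(" ", text)
    if strip_emojis:
        text = EMOJI_RE.sub(" ", text)
    if strip_punctuation:
        text = PUNCT_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    return " ".join(text.split())


raw = "I can't wait for Paris tomorrow !!!! https://example.com"
print(preprocess(raw))
# i can't wait for paris tomorrow
print(preprocess(raw, lowercase=False, strip_punctuation=False))
# I can't wait for Paris tomorrow !!!!
```

The second call keeps the capital letter of “Paris” and the exclamation marks, which is the kind of information the first call throws away.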

Machine Learning approach

For Machine Learning (ML), we first need to transform the data so that each text is represented by a vector. This transformation can be done with different methods, and we give a non-exhaustive list of them in the table below. We also linked some blog posts about the different vectorizers, so take a look if you are curious about any of them!

Among the vectorizing methods, you can use Word Embeddings. Word embeddings are a statistical approach that represents a word according to the contexts in which it frequently appears. For example, “cat” and “dog” appear in similar contexts (“I fed my cat/dog”), whereas “cat” and “hat” do not (“I fed my cat/hat”). Because words that appear in similar contexts tend to be semantically similar, these vectors capture a good part of the meaning of words. Here is a small introduction to some of the libraries based on Word Embeddings:

GloVe: Stanford’s solution. It calculates a vector for each word using a count-based model built on global word co-occurrence statistics.
Word2Vec: Google’s solution. It calculates a vector for each word using a predictive model that learns to predict a word from its context (or the context from the word).
FastText: Facebook’s solution. It calculates a vector for each character n-gram and sums them to obtain the word vector. This means that this solution can give a vector even for out-of-vocabulary words (words that are not in the embedding model).
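To make the out-of-vocabulary point more concrete, here is a toy sketch of the character n-gram idea. It is not the fastText implementation: the n-gram sizes, the `<` and `>` boundary markers as used here, and the hash-based stand-in for learned n-gram vectors (`toy_ngram_vector`) are assumptions chosen only so the example runs without a trained model.

```python
import hashlib


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def toy_ngram_vector(ngram, dim=8):
    """Deterministic pseudo-vector standing in for a learned n-gram embedding."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def word_vector(word, dim=8):
    """Sum the n-gram vectors, so an unseen word still gets a vector."""
    total = [0.0] * dim
    for ngram in char_ngrams(word):
        for k, value in enumerate(toy_ngram_vector(ngram, dim)):
            total[k] += value
    return total


print(char_ngrams("cat", 3, 3))   # ['<ca', 'cat', 'at>']
print(len(word_vector("catz")))   # 8
```

Even a misspelled word such as “catz” shares n-grams like `<ca` and `cat` with “cat”, so part of its vector comes from the same pieces.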

Vectorizer name    Blog post                                 Library
-----------------  ----------------------------------------  ---------
CountVectorizer    Basics of CountVectorizer                 Sklearn
TF-IDF             (...) TF-IDF explained                    Sklearn
GloVe              Intuitive Guide (...) GloVe Embeddings    GloVe
Word2Vec           Introduction to (...) Word2Vec            Gensim
FastText           A Visual Guide to FastText (...)          FastText

After transforming the data, we can train models. Here, we present the Machine Learning models we used during our study. This is a non-exhaustive list, but these are among the most commonly used models. You can find more information about each method in the linked blog posts. We start our list with three ML models you can use with the sklearn library.

Naive Bayes: a probabilistic model based on Bayes’ theorem (the probability of A happening, given that B has occurred), which “naively” assumes that the features are independent of each other given the class (Naive Bayes Classifier).
Logistic Regression: a linear classification model that uses the logistic function to turn a weighted sum of the features into a class probability (Logistic Regression — Detailed Overview).
Random Forest: an ensemble method that collects the predictions of n different decision trees; the most frequent prediction becomes the final prediction. The idea is that a large ensemble of models will outperform individual ones (Understanding Random Forest).
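A minimal sketch of how these three models can be trained with sklearn on top of a TF-IDF vectorizer is shown below. The tiny training set, the labels, and the hyperparameter values are assumptions made for the example; they are not the data or settings of our study, and the predictions on such a small sample are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for this example.
train_texts = [
    "i love this phone", "great product, works well", "really happy with it",
    "i hate this phone", "terrible product, broke fast", "really unhappy with it",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]
test_texts = ["i love it", "terrible phone"]

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, clf in models.items():
    # Vectorizer and classifier are chained so the same transformation
    # is applied at training and prediction time.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    predictions[name] = list(pipeline.predict(test_texts))
    print(name, predictions[name])
```

Swapping `TfidfVectorizer()` for another vectorizer in the pipeline is enough to compare different data transformations with the same classifier.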

We also used a boosting method called LightGBM (Light Gradient Boosting Machine). Boosting methods are also ensemble methods built from decision trees, like the Random Forest model. The difference is that boosting trains the trees one after another, with each new tree focusing on the errors made by the previous ones, whereas Random Forest trains its trees independently (What is LightGBM, How to implement it? How to fine-tune the parameters?).

The last method we present here is FastText, which we already mentioned as a way to calculate embeddings. FastText also provides a text classifier (fastText for Text Classification).

Deep Learning approach

Deep Learning methods are more recent in the State Of The Art and, in recent years, have often surpassed the results obtained with Machine Learning. These models are more sophisticated: they do not require the data transformation described above, only tokenization (the process of splitting a text into tokens, such as words or subwords).

During our study, we looked for the methods that are commonly used for Sentiment Analysis and perform the best. That led us to use the following methods:

CNN (Convolutional Neural Network) is a neural network model often used in image recognition, but it also performs well on Sentiment Analysis. You can find more information about it in the following blog post: Convolutional Neural Networks, Explained
LSTM (Long Short-Term Memory) is an improved type of recurrent neural network (RNN). It can retain information about earlier parts of a sequence, which makes it particularly good at processing text (LSTM Networks | A Detailed Explanation)
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based method that was created in 2018 and revolutionized the field of NLP, as it achieved State Of The Art results on many tasks. This method applies the bidirectional training of Transformers to language modeling. If you want to know more about it, we suggest you look at this blog post: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)

Machine Learning vs. Deep Learning

After listing the different approaches, we would like to present their possible advantages and disadvantages.

[Figure: Advantages and disadvantages of Machine Learning and Deep Learning]

Results

We trained the methods of both approaches on the same data to compare the results and decide on the best approach for our study. Instead of using the default hyperparameters, we tuned them for each method to obtain the best results. For example, for the Machine Learning approaches, we used the Optuna library, which searches for an optimal set of hyperparameters. The choice of hyperparameters is essential, especially with BERT, as Transformers are sensitive to them. Poorly chosen hyperparameters can lead to overfitting (when the model fits the training data too closely and generalizes poorly to new data) or to a phenomenon called catastrophic forgetting (when the model abruptly forgets what it had previously learned while being trained on new data).

We present the results using the macro Fscore (the unweighted mean of the per-class Fscores) on two pre-annotated datasets for Sentiment Analysis: Amazon Fine Food Reviews and Apple Twitter Sentiment.
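For readers unfamiliar with the macro Fscore, the sketch below computes it by hand: one Fscore per class, then a plain average, so that each class counts equally regardless of its size. The labels and predictions are made up for the example and are not results from our study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels, invented for this example.
y_true = ["pos", "pos", "pos", "neg", "neu", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neu", "pos"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.6667
```

Here each of the three classes happens to have an F1 of 2/3, so the macro average is 2/3 as well. The same value can be obtained with `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.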

[Table: Results (Fscore macro) for the models trained]

As the table shows, we obtained our best results with BERT. This makes sense because our datasets were small, and BERT is already pre-trained on a large amount of data. The best Fscores are 76.04 and 73.31, which is 8 and 6 points above the second-best Fscore on each dataset.

We can also see that, even though Deep Learning methods are supposed to perform better than Machine Learning ones, the difference is not apparent in this study. For example, on Apple Twitter Sentiment, the Deep Learning methods only reach Fscores of 45 (CNN) and 61.69 (LSTM), whereas NaiveBayes has an Fscore of 64.54. This means that we do not always have to use big models to achieve good results.

This table also shows that the results vary widely depending on the transformation applied to the data. For example, when we combined NaiveBayes with TF-IDF on the one hand and with FastText embeddings on the other, we observed a difference of 13 points between the Fscores.

This blog post presented the different methods we used in our Sentiment Analysis study and the results we obtained. We can see that Transformers, and specifically BERT, work the best for us. That is why we continued our benchmarking with other Transformers, such as RoBERTa and DistilBERT.

Throughout this series on Sentiment Analysis, including this blog post, we explained the steps we took during our study and shared the tips and lessons we learned along the way. We hope that these blog posts were helpful to you! :)