Best open-source models for sentiment analysis — Part 2: neural networks

Use neural networks for higher accuracies if you have a GPU

Pavlo Fesenko, PhD
22 min read · Oct 5, 2023

Recap

In Part 1 of this article series, I introduced my methodology for comparing models for sentiment analysis and applied it to 4 popular open-source dictionary models: TextBlob Pattern, TextBlob Naive Bayes, NLTK VADER, and Sentimentr. In this article, I will additionally test 9 neural network models that are known to be great candidates for various NLP tasks, including sentiment analysis. If you haven’t read Part 1, I strongly encourage you to check it out so that you understand the methodology and can appreciate the leap in performance that neural networks achieved over dictionary models.

Computations

The code for this article was executed in the Google Colab notebook. For the neural network models, I used 1 GPU (NVIDIA Tesla T4 16GB), 13 GB of RAM, and an inference batch size of 8. Calculation times are reported as approximate values averaged over 3 rounds and can increase further, especially as the number of tokens per text grows.

📜 Stanza

Stanza is a Python package from the Stanford NLP group behind the famous Stanford CoreNLP toolkit that was initially written in Java more than 10 years ago. Unlike traditional NLP models, Stanza uses a convolutional neural network (CNN) for sentiment classification. CNNs are more often encountered in computer vision, but they turned out to be a valid solution for NLP as well. For those who are curious to know more, I refer you to the original paper [Kim, 2014], although the model has changed a bit since its first publication [Qi et al., 2020].

Here is a brief description of Stanza:

  • LSTM + CNN architecture (56 million parameters)
  • English, Chinese, German
  • Word embeddings were initialized as non-trainable CoNLL vectors [Zeman et al., 2017] and concatenated with trainable features from forward/backward character-level models
  • The model was trained on 5 datasets: Stanford sentiment treebank (SST), multimodal emotion lines dataset (MELD), sentiment labeled sentences dataset (SLSD), ArguAna TripAdvisor corpus, and Twitter US airline sentiment dataset.
  • Labels are mapped into 3 classes (negative, neutral, positive)
  • Outputs a polarity class
  • 230 seconds for 10 000 texts (1 text = 100 tokens)

Stanza outputs only polarity classes, so I mapped them to the standard polarity range as follows: negative → -1, neutral → 0, positive → 1. The model successfully identified sentiment in the simple examples below; however, amplification would be impossible to detect without class probabilities. I hope that this feature will be added to the package in the future.

The movie was great                 1.0   -> ✅ Simple positive
The movie was really great          1.0   -> ❔ Amplified positive
The movie was not great            -1.0   -> ✅ Simple negative
The movie was really not great     -1.0   -> ❔ Amplified negative
The movie was not that great       -1.0   -> ✅ Longer negative
The movie could have been better   -1.0   -> ✅ More complex negative
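
For reference, here is a minimal sketch of how such scores can be obtained with Stanza. This is not the exact notebook code: the processor setup and the sentence-averaging step are my own assumptions, while the class-to-polarity mapping is the one described above.

import stanza

stanza.download("en")  # downloads the English models, including the sentiment processor
nlp = stanza.Pipeline("en", processors="tokenize,sentiment")

# Stanza's sentiment class indices: 0 = negative, 1 = neutral, 2 = positive
CLASS_TO_POLARITY = {0: -1.0, 1: 0.0, 2: 1.0}

def stanza_polarity(text: str) -> float:
    doc = nlp(text)
    # Average over sentences in case a text contains more than one (my assumption)
    scores = [CLASS_TO_POLARITY[sentence.sentiment] for sentence in doc.sentences]
    return sum(scores) / len(scores)

print(stanza_polarity("The movie was great"))      # expected:  1.0 (see examples above)
print(stanza_polarity("The movie was not great"))  # expected: -1.0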

The classification metrics by Stanza are noticeably better than those of any dictionary model (accuracy 0.82–0.93), and there is no problem with detecting negative sentiment (recall for the negative class 0.91–0.97). There is, however, an issue with low ratios of texts classified as negative/positive for tweets (ratio 0.49) and financial phrases (ratio 0.31). This is a typical result when using a multi-class model as-is for binary classification. If class probabilities were provided, they could be transformed into a polarity, and binary classification could be performed with a polarity threshold of 0, achieving the maximum ratio of 1. In that case I would also expect lower accuracies, as observed on the polarity threshold graphs for other models.

Stanza — yelp:

  • Accuracy 0.93, Ratio 0.98
  • Negative class: Precision 0.9, Recall 0.97, F1 0.94
  • Positive class: Precision 0.97, Recall 0.89, F1 0.93

Stanza — tweet:

  • Accuracy 0.9, Ratio 0.49
  • Negative class: Precision 0.93, Recall 0.91, F1 0.92
  • Positive class: Precision 0.84, Recall 0.87, F1 0.86

Stanza — finance:

  • Accuracy 0.82, Ratio 0.31
  • Negative class: Precision 0.65, Recall 0.91, F1 0.76
  • Positive class: Precision 0.95, Recall 0.78, F1 0.85

🔥 Flair RNN

Flair is another great NLP package; it was released in 2018 by Zalando [Akbik et al., 2019] and has quickly gained popularity since then. For sentiment analysis, Flair offers 2 models: one based on a recurrent neural network (RNN) and one based on a transformer. The RNN is claimed to be faster but less accurate than the transformer, so let’s test both on our datasets.

Here is a brief description of Flair RNN:

  • LSTM architecture (0.66 million parameters)
  • English
  • Word embeddings were initialized as non-trainable fastText vectors [Mikolov et al., 2018] with a reprojection layer
  • Model weights were trained on 4 datasets: 5-core reviews from the Amazon review dataset, Stanford sentiment treebank without neutral phrases and with binary labels (SST-2), IMDB dataset, movie review dataset
  • Labels are mapped into 2 classes (negative, positive)
  • Outputs class probabilities
  • 40 seconds for 10 000 texts (1 text = 100 tokens)

Flair RNN correctly identified the sentiment of the simple examples, but intensifier words didn’t increase the polarity score. This is probably because only 2 classes (negative/positive) were used during training, so the model has never seen different sentiment degrees. Mapping labels into 3 classes (negative/neutral/positive) or 5 classes (negative/somewhat negative/neutral/somewhat positive/positive) could potentially help in this case.

The movie was great                 0.84534   -> ✅ Simple positive
The movie was really great          0.7819    -> ❔ Amplified positive
The movie was not great            -0.99959   -> ✅ Simple negative
The movie was really not great     -0.99843   -> ❔ Amplified negative
The movie was not that great       -0.9996    -> ✅ Longer negative
The movie could have been better   -0.94291   -> ✅ More complex negative
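
For completeness, here is a minimal Flair sketch. I believe 'sentiment-fast' is the identifier of the RNN model (the default 'sentiment' model is the DistilBERT one described in the next section), and the sign-times-probability conversion to a polarity score is my own shorthand rather than the exact notebook code.

from flair.data import Sentence
from flair.models import TextClassifier

# 'sentiment-fast' should load the RNN-based sentiment model
classifier = TextClassifier.load("sentiment-fast")

def flair_polarity(text: str) -> float:
    sentence = Sentence(text)
    classifier.predict(sentence)
    label = sentence.labels[0]                       # e.g. POSITIVE (0.845)
    sign = 1.0 if label.value == "POSITIVE" else -1.0
    return sign * label.score                        # signed class probability as polarity

print(flair_polarity("The movie was great"))         # roughly  0.85, as reported above
print(flair_polarity("The movie was not great"))     # roughly -1.0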

The classification metrics by Flair RNN are comparable to Stanza but only for Yelp reviews (accuracy 0.96). The metrics for tweets (accuracy 0.78) and financial phrases (accuracy 0.6) are worse than Stanza’s, but keep in mind that Flair RNN has a much higher ratio of classified negative/positive texts. If we adjust the ratios to match Stanza (0.49 for tweets and 0.31 for financial phrases) using the polarity threshold graphs below, then Flair RNN accuracies also get close to Stanza. This is a great example of how models can be compared even when one of them doesn’t output class probabilities.

I was quite surprised by the low performance of Flair RNN on financial phrases (accuracy 0.6), especially because of the large number of positive texts misclassified as negative (recall for the positive class 0.48). At the moment I don’t have a good hypothesis to explain this, but it’s an interesting topic for future research.

For Flair RNN, it’s also possible to calculate the classification metrics using the maximum class probability. But since the model outputs probabilities for only 2 classes, this approach is equivalent to using a polarity threshold of 0, and the results are the same. If a model outputs probabilities for more than 2 classes, the classification results differ and are reported separately.
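
To make the evaluation procedure concrete, here is a sketch of how the accuracy and the classification ratio can be computed with a polarity threshold. This is not the exact code from Part 1; in particular, treating scores exactly at the threshold as neutral and excluding them is my assumption.

import numpy as np
from sklearn.metrics import accuracy_score, classification_report

def evaluate_with_threshold(polarities, y_true, threshold=0.0):
    # polarities: signed scores in [-1, 1]; y_true: 0 = negative, 1 = positive
    polarities = np.asarray(polarities, dtype=float)
    y_true = np.asarray(y_true)

    decided = polarities != threshold      # scores at the threshold count as neutral (assumption)
    y_pred = (polarities[decided] > threshold).astype(int)

    ratio = decided.mean()                 # share of texts classified as negative/positive
    accuracy = accuracy_score(y_true[decided], y_pred)
    report = classification_report(y_true[decided], y_pred,
                                   target_names=["negative", "positive"])
    return ratio, accuracy, report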

Flair RNN — yelp (threshold 0):

  • Accuracy 0.96, Ratio 1.0
  • Negative class: Precision 0.96, Recall 0.97, F1 0.96
  • Positive class: Precision 0.97, Recall 0.96, F1 0.96

Flair RNN — tweet (threshold 0):

  • Accuracy 0.78, Ratio 1.0
  • Negative class: Precision 0.89, Recall 0.74, F1 0.81
  • Positive class: Precision 0.66, Recall 0.85, F1 0.74

Flair RNN — finance (threshold 0):

  • Accuracy 0.6, Ratio 1.0
  • Negative class: Precision 0.43, Recall 0.89, F1 0.58
  • Positive class: Precision 0.91, Recall 0.48, F1 0.63

🔥 Flair DistilBERT

Flair’s default model for sentiment analysis is a fine-tuned DistilBERT transformer. The DistilBERT base model, developed by HuggingFace, is a smaller distilled version of BERT [Sanh et al., 2020].

Here is a brief description of Flair DistilBERT:

  • Transformer DistilBERT architecture (67 million parameters)
  • English
  • Both word embeddings and model weights of the base model were pre-trained on English Wikipedia and Toronto book corpus
  • The model head was fine-tuned on 4 datasets: 5-core reviews from the Amazon review dataset, Stanford sentiment treebank without neutral phrases and with binary labels (SST-2), IMDB dataset, movie review dataset
  • Labels are mapped into 2 classes (negative, positive)
  • Outputs class probabilities
  • 60 seconds for 10 000 texts (1 text = 100 tokens)

Like the other neural network models, Flair DistilBERT recognized the sentiment signs of the simple examples correctly. It also detected higher polarity for the amplified positive example, but this could be due to fluctuations in probabilities, since the model was trained on only 2 classes (negative/positive) and didn’t see different sentiment degrees during training.

The movie was great                 0.90476   -> ✅ Simple positive
The movie was really great          0.9369    -> ✅ Amplified positive
The movie was not great            -0.99895   -> ✅ Simple negative
The movie was really not great     -0.99819   -> ❔ Amplified negative
The movie was not that great       -0.99941   -> ✅ Longer negative
The movie could have been better   -0.99972   -> ✅ More complex negative

The classification metrics by Flair DistilBERT are comparable to Flair RNN for Yelp reviews (accuracy 0.97) but noticeably better for tweets (accuracy 0.84) and financial phrases (accuracy 0.75). This time, the misclassification of positive financial phrases (recall for the positive class 0.68) isn’t as bad as in Flair RNN, but it is still not ideal. In any case, if you have enough computing resources, I would definitely recommend Flair DistilBERT over Flair RNN.

Flair DistilBERT — yelp (threshold 0):

  • Accuracy 0.97, Ratio 1.0
  • Negative class: Precision 0.96, Recall 0.98, F1 0.97
  • Positive class: Precision 0.98, Recall 0.96, F1 0.97

Flair DistilBERT — tweet (threshold 0):

  • Accuracy 0.84, Ratio 1.0
  • Negative class: Precision 0.86, Recall 0.89, F1 0.88
  • Positive class: Precision 0.81, Recall 0.76, F1 0.78

Flair DistilBERT — finance (threshold 0):

  • Accuracy 0.75, Ratio 1.0
  • Negative class: Precision 0.56, Recall 0.92, F1 0.7
  • Positive class: Precision 0.95, Recall 0.68, F1 0.79

🤗 HuggingFace DistilBERT

HuggingFace Transformers was one of the first Python libraries to democratize transformer neural networks, and nowadays you can find most of the latest transformer models on their model hub.

HuggingFace also uses DistilBERT as the default model for sentiment analysis. It has the same architecture as Flair DistilBERT except that it was fine-tuned on only 1 dataset and, surprisingly, runs almost 2× faster than Flair DistilBERT.

Here is a brief description of HuggingFace DistilBERT:

  • Transformer DistilBERT architecture (67 million parameters)
  • English
  • Both word embeddings and model weights of the base model were trained on English Wikipedia and Toronto book corpus
  • The model head was fine-tuned on Stanford sentiment treebank without neutral phrases and with binary labels (SST-2)
  • Labels are mapped into 2 classes (negative, positive)
  • Outputs class probabilities
  • 35 seconds for 10 000 texts (1 text = 100 tokens)

HuggingFace DistilBERT performs similarly to Flair DistilBERT on the simple examples, although this time amplification was correctly detected for the negative example instead of the positive one, which again could be due to fluctuations in probabilities.

The movie was great                 0.999742   -> ✅ Simple positive
The movie was really great          0.999736   -> ❔ Amplified positive
The movie was not great            -0.999497   -> ✅ Simple negative
The movie was really not great     -0.999503   -> ✅ Amplified negative
The movie was not that great       -0.999505   -> ✅ Longer negative
The movie could have been better   -0.99741    -> ✅ More complex negative
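
The HuggingFace pipeline API makes this model a one-liner to use. Below is a sketch: the checkpoint name is the SST-2 DistilBERT model I believe is meant here, and the signed-probability polarity is again my own shorthand rather than the notebook code.

from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,            # GPU index; use device=-1 to run on CPU
)

def hf_polarity(text: str) -> float:
    result = sentiment(text)[0]                  # e.g. {'label': 'POSITIVE', 'score': 0.9997}
    sign = 1.0 if result["label"] == "POSITIVE" else -1.0
    return sign * result["score"]

print(hf_polarity("The movie was great"))        # roughly  0.9997, as reported above
print(hf_polarity("The movie was not great"))    # roughly -0.9995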

The classification metrics by HuggingFace DistilBERT are good (accuracy 0.73–0.95) and comparable to Flair DistilBERT. The latter slightly outperforms HuggingFace DistilBERT, which could be related to the larger number of datasets that Flair DistilBERT used for fine-tuning.

HuggingFace DistilBERT — yelp (threshold 0):

  • Accuracy 0.95, Ratio 1.0
  • Negative class: Precision 0.93, Recall 0.96, F1 0.95
  • Positive class: Precision 0.96, Recall 0.93, F1 0.95

HuggingFace DistilBERT — tweet (threshold 0):

  • Accuracy 0.81, Ratio 1.0
  • Negative class: Precision 0.82, Recall 0.89, F1 0.86
  • Positive class: Precision 0.79, Recall 0.68, F1 0.73

HuggingFace DistilBERT — finance (threshold 0):

  • Accuracy 0.73, Ratio 1.0
  • Negative class: Precision 0.53, Recall 0.98, F1 0.69
  • Positive class: Precision 0.99, Recall 0.62, F1 0.76

🤗 HuggingFace GPT-2

HuggingFace has another model for sentiment analysis that was also fine-tuned on the SST-2 dataset but uses the more advanced GPT-2 architecture. The latter was developed in 2019 by OpenAI [Radford & Wu et al., 2019] and was a stepping stone towards the famous ChatGPT. I was curious what the impact of the base model would be, and HuggingFace GPT-2 is a great candidate for this test.

Here is a brief description of HuggingFace GPT-2:

  • Transformer GPT-2 Medium architecture (355 million parameters)
  • English
  • Both word embeddings and model weights of the base model were trained on OpenAI’s internal WebText dataset, consisting of web pages linked from Reddit with at least 3 karma (excluding Wikipedia pages)
  • The model head was fine-tuned on Stanford Sentiment Treebank without neutral phrases and with binary labels (SST-2)
  • Labels are mapped into 2 classes (negative, positive)
  • Outputs class probabilities
  • 280 seconds for 10 000 texts (1 text = 100 tokens)

HuggingFace GPT-2 successfully identified polarity for most of the simple examples, including the amplifications, but I doubt that the model is capable of detecting different sentiment degrees since it was trained on only 2 classes, so this result could be just a coincidence. To my surprise, the model predicted the last, more complex negative example as positive, but as you will see later, many advanced models struggle with this example. Maybe it’s indeed too ambiguous; let me know in the comments.

The movie was great                 0.99959   -> ✅ Simple positive
The movie was really great          0.99963   -> ✅ Amplified positive
The movie was not great            -0.95843   -> ✅ Simple negative
The movie was really not great     -0.97521   -> ✅ Amplified negative
The movie was not that great       -0.97655   -> ✅ Longer negative
The movie could have been better    0.46207   -> ❌ More complex negative

The classification metrics of HuggingFace GPT-2 are similar to HuggingFace DistilBERT for Yelp reviews (accuracy 0.95) and tweets (accuracy 0.83) but much better for financial phrases (accuracy 0.85). This is likely due to the much richer representation of the GPT-2 base model, which is larger both in the number of parameters and in the amount of training data.

HuggingFace GPT-2 — yelp (threshold 0):

  • Accuracy 0.95, Ratio 1.0
  • Negative class: Precision 0.95, Recall 0.95, F1 0.95
  • Positive class: Precision 0.95, Recall 0.95, F1 0.95

HuggingFace GPT-2 — tweet (threshold 0):

  • Accuracy 0.83, Ratio 1.0
  • Negative class: Precision 0.92, Recall 0.8, F1 0.85
  • Positive class: Precision 0.72, Recall 0.88, F1 0.79

HuggingFace GPT-2 — finance (threshold 0):

  • Accuracy 0.85, Ratio 1.0
  • Negative class: Precision 0.69, Recall 0.89, F1 0.78
  • Positive class: Precision 0.94, Recall 0.83, F1 0.88

🤗 HuggingFace BERT

HuggingFace BERT, developed by NLP Town, was one of the first multilingual models for sentiment analysis on the HuggingFace Hub. The base model BERT was originally created by Google, and more details can be found in their paper [Devlin et al., 2019]. HuggingFace BERT outputs sentiment probabilities for 5 classes, so it would be interesting to see if there is any improvement in detecting sentiment degrees.

Here is a brief description of HuggingFace BERT:

  • Transformer BERT architecture (167 million parameters)
  • English, Dutch, German, French, Spanish, Italian
  • Both word embeddings and model weights of the base model were trained on English Wikipedia and Toronto book corpus
  • The model head was fine-tuned on product reviews
  • Labels are mapped into 5 classes (1 to 5 stars)
  • Outputs class probabilities
  • 70 seconds for 10 000 texts (1 text = 100 tokens)

HuggingFace BERT identified the sentiment signs of all simple examples correctly, and the polarity values seem more varied compared to the models that were fine-tuned on only 2 classes. Amplification was detected for the positive example but unfortunately not for the negative one. Nevertheless, I do think that this model has potential for more nuanced polarities because it predicted weaker sentiment for the last two negative examples, which is indeed the case.

The movie was great                 0.66145   -> ✅ Simple positive
The movie was really great          0.75163   -> ✅ Amplified positive
The movie was not great            -0.46273   -> ✅ Simple negative
The movie was really not great     -0.43496   -> ❔ Amplified negative
The movie was not that great       -0.22785   -> ✅ Longer negative
The movie could have been better   -0.15772   -> ✅ More complex negative

The classification metrics by HuggingFace BERT are similar to HuggingFace DistilBERT for Yelp reviews (accuracy 0.96) and tweets (accuracy 0.84) but much worse for financial phrases (accuracy 0.63). This low accuracy is due to the misclassification of positive financial phrases (recall for the positive class 0.5), which is as bad as Flair RNN. This is interesting because the two models have very different architectures and training datasets, yet they fail on the same test dataset.

For the classification metrics using the maximum class probability, I mapped the classes as follows: 1 or 2 stars → negative, 3 stars → neutral, 4 or 5 stars → positive. The results turned out to be very close to the polarity threshold method, which indicates that the neutral class of the fine-tuning dataset must have had a quite narrow polarity range close to 0. This makes sense because splitting into 5 classes leads to a narrower range per class than, for example, splitting into 3 classes.
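
To make the two evaluation modes concrete for a 5-class model (likely nlptown/bert-base-multilingual-uncased-sentiment, although the article doesn’t name the checkpoint), here is a sketch. The maximum-probability mapping is exactly the star-to-class rule stated above, whereas the expectation-over-stars polarity is only an illustrative assumption, since the exact polarity transform comes from Part 1.

import numpy as np

STAR_POLARITIES = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # assumed polarity per star (1..5)

def polarity_from_probs(probs):
    # Expected polarity over the 5 star classes (illustrative assumption)
    return float(np.dot(probs, STAR_POLARITIES))

def label_from_max_probability(probs):
    # 1-2 stars -> negative, 3 stars -> neutral, 4-5 stars -> positive
    star = int(np.argmax(probs)) + 1
    if star <= 2:
        return "negative"
    if star == 3:
        return "neutral"
    return "positive"

probs = np.array([0.02, 0.05, 0.13, 0.35, 0.45])   # made-up probabilities for 1..5 stars
print(polarity_from_probs(probs))                  # 0.58 -> positive with threshold 0
print(label_from_max_probability(probs))           # 'positive'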

HuggingFace BERT — yelp (threshold 0):

  • Accuracy 0.96, Ratio 1.0
  • Negative class: Precision 0.95, Recall 0.96, F1 0.96
  • Positive class: Precision 0.96, Recall 0.95, F1 0.95

HuggingFace BERT — yelp (maximum probability):

  • Accuracy 0.96, Ratio 0.97
  • Negative class: Precision 0.96, Recall 0.97, F1 0.96
  • Positive class: Precision 0.97, Recall 0.96, F1 0.96

HuggingFace BERT — tweet (threshold 0):

  • Accuracy 0.84, Ratio 1.0
  • Negative class: Precision 0.88, Recall 0.87, F1 0.87
  • Positive class: Precision 0.79, Recall 0.79, F1 0.79

HuggingFace BERT — tweet (maximum probability):

  • Accuracy 0.85, Ratio 0.94
  • Negative class: Precision 0.89, Recall 0.88, F1 0.88
  • Positive class: Precision 0.8, Recall 0.81, F1 0.8

HuggingFace BERT — finance (threshold 0):

  • Accuracy 0.63, Ratio 1.0
  • Negative class: Precision 0.45, Recall 0.94, F1 0.61
  • Positive class: Precision 0.95, Recall 0.5, F1 0.65

HuggingFace BERT — finance (maximum probability):

  • Accuracy 0.64, Ratio 0.91
  • Negative class: Precision 0.46, Recall 0.96, F1 0.62
  • Positive class: Precision 0.96, Recall 0.5, F1 0.66

🐦 TweetNLP RoBERTa

TweetNLP is a Python package that was created specifically for analyzing social media by the NLP group at Cardiff University. It has many models for different NLP tasks, among which are 2 models for sentiment analysis: RoBERTa (English) [Loureiro et al., 2022] and RoBERTa XLM (multilingual) [Barbieri et al., 2022]. Their distinct feature is that their base models were additionally trained on tweets before being fine-tuned for sentiment analysis. Credits for the original base models RoBERTa [Liu et al., 2019] and RoBERTa XLM [Conneau & Khandelwal et al., 2020] go to Meta, who started from the underlying BERT model but optimized its training procedure and used more training data. I will test both models, starting with the English one.

Here is a brief description of TweetNLP RoBERTa:

  • Transformer RoBERTa architecture (125 million parameters)
  • English
  • Both word embeddings and model weights of the base model were trained on 5 datasets: English Wikipedia, Toronto book corpus, CommonCrawl news, OpenWebText, stories dataset
  • The base model was further trained on 124 million tweets
  • The model head was fine-tuned on TweetEval which uses SemEval 2017 as a sentiment dataset
  • Labels are mapped into 3 classes (negative, neutral, positive)
  • Outputs class probabilities
  • 105 seconds for 10 000 texts (1 text = 100 tokens)

TweetNLP RoBERTa managed to detect the sentiment signs of most simple examples as well as the amplifications in both the positive and negative cases. It looks like having at least 3 classes during fine-tuning does help with identifying sentiment degrees. Unfortunately, the last, more complex negative example was incorrectly labeled as positive, but its polarity is close to 0 and its sentiment is indeed quite weak.

The movie was great                 0.96618    -> ✅ Simple positive
The movie was really great          0.97788    -> ✅ Amplified positive
The movie was not great            -0.85609    -> ✅ Simple negative
The movie was really not great     -0.90349    -> ✅ Amplified negative
The movie was not that great       -0.86232    -> ✅ Longer negative
The movie could have been better    0.092792   -> ❌ More complex negative
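
Since the TweetNLP models are also published on the HuggingFace Hub, here is a sketch using the transformers pipeline. The checkpoint name is my best guess for the English model described here, and P(positive) minus P(negative) is an illustrative polarity transform rather than the exact one from Part 1.

from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",   # assumed checkpoint
    top_k=None,                                                 # return scores for all 3 classes
    device=0,
)

def roberta_polarity(text: str) -> float:
    all_scores = clf([text])[0]            # list of {'label', 'score'} dicts for this text
    # Label names for this checkpoint should be 'negative'/'neutral'/'positive' (check the model card)
    scores = {d["label"]: d["score"] for d in all_scores}
    return scores["positive"] - scores["negative"]   # illustrative polarity transform

print(roberta_polarity("The movie was really great"))   # close to +1
print(roberta_polarity("The movie was not great"))      # close to -1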

The classification metrics by TweetNLP RoBERTa are very good for all 3 datasets (accuracy 0.92–0.95). You might have noticed that TweetNLP RoBERTa was fine-tuned on the same tweet dataset that I used for testing in this article; however, I only took the test split, so the model should not have seen this data during fine-tuning. Also, it’s quite remarkable that TweetNLP RoBERTa performs so well on product reviews and financial phrases although it was only fine-tuned on tweets. I think this is because tweets cover very diverse topics, often with specialized terms.

When using the maximum class probability, the classification metrics slightly improve, but at the same time the ratio of texts classified as negative/positive decreases a lot for tweets (ratio 0.8) and for financial phrases (ratio 0.36). This can be explained by the same hypothesis that was used for HuggingFace BERT: if the model was fine-tuned on 3 classes, the neutral class covers a wider polarity range than in the 5-class case. As a result, some negative/positive texts with weaker sentiment are classified as neutral.

TweetNLP RoBERTa — yelp (threshold 0):

  • Accuracy 0.92, Ratio 1.0
  • Negative class: Precision 0.96, Recall 0.88, F1 0.92
  • Positive class: Precision 0.89, Recall 0.96, F1 0.92

TweetNLP RoBERTa — yelp (maximum probability):

  • Accuracy 0.94, Ratio 0.93
  • Negative class: Precision 0.97, Recall 0.9, F1 0.94
  • Positive class: Precision 0.92, Recall 0.97, F1 0.94

TweetNLP RoBERTa — tweet (threshold 0):

  • Accuracy 0.95, Ratio 1.0
  • Negative class: Precision 0.96, Recall 0.95, F1 0.96
  • Positive class: Precision 0.92, Recall 0.94, F1 0.93

TweetNLP RoBERTa — tweet (maximum probability):

  • Accuracy 0.97, Ratio 0.8
  • Negative class: Precision 0.98, Recall 0.98, F1 0.98
  • Positive class: Precision 0.97, Recall 0.96, F1 0.96

TweetNLP RoBERTa — finance (threshold 0):

  • Accuracy 0.94, Ratio 1.0
  • Negative class: Precision 0.88, Recall 0.92, F1 0.9
  • Positive class: Precision 0.96, Recall 0.94, F1 0.95

TweetNLP RoBERTa — finance (maximum probability):

  • Accuracy 0.99, Ratio 0.36
  • Negative class: Precision 0.97, Recall 1.0, F1 0.98
  • Positive class: Precision 1.0, Recall 0.99, F1 0.99

🐦 TweetNLP RoBERTa XLM

In addition to the English model for sentiment analysis, TweetNLP also offers a multilingual version, RoBERTa XLM. Multilingual models typically underperform monolingual ones slightly, so let’s test it out.

Here is a brief description of TweetNLP RoBERTa XLM:

  • Transformer RoBERTa XLM architecture (278 million parameters)
  • English, Arabic, French, German, Hindi, Italian, Portuguese, Spanish
  • Both word embeddings and model weights of the base model were trained on filtered CommonCrawl data in 100 languages
  • The base model was further trained on 198 million tweets
  • The model head was fine-tuned on 8 datasets: English (SemEval 2017), Arabic (SemEval 2017), French (Deft 2017), German (SB-10K), Hindi (SAIL 2015), Italian (Sentipolc 2016), Portuguese (SentiBR), Spanish (InterTASS 2017)
  • Labels are mapped into 3 classes (negative, neutral, positive)
  • Outputs class probabilities
  • 110 seconds for 10 000 texts (1 text = 100 tokens)

TweetNLP RoBERTa XLM got all the simple examples right, including the amplifications and the more complex negative one. This looks very promising, but let’s check the results on the test datasets.

The movie was great                 0.90598   -> ✅ Simple positive
The movie was really great          0.91268   -> ✅ Amplified positive
The movie was not great            -0.89058   -> ✅ Simple negative
The movie was really not great     -0.91186   -> ✅ Amplified negative
The movie was not that great       -0.68526   -> ✅ Longer negative
The movie could have been better   -0.23253   -> ✅ More complex negative

The classification metrics by TweetNLP RoBERTa XLM are also very good (accuracy 0.82–0.93) but slightly lower than those of the English model TweetNLP RoBERTa, as also noted by the model authors. The largest decrease is observed for the financial phrases (accuracy 0.82); however, this seems like a reasonable trade-off for the multilingual capability.

When using the maximum class probability instead of a polarity threshold, the ratio of texts classified as negative/positive predictably drops to values similar to TweetNLP RoBERTa.

TweetNLP RoBERTa XLM — yelp (threshold 0):

  • Accuracy 0.9, Ratio 1.0
  • Negative class: Precision 0.86, Recall 0.96, F1 0.91
  • Positive class: Precision 0.96, Recall 0.84, F1 0.9

TweetNLP RoBERTa XLM — yelp (maximum probability):

  • Accuracy 0.92, Ratio 0.93
  • Negative class: Precision 0.88, Recall 0.97, F1 0.92
  • Positive class: Precision 0.97, Recall 0.87, F1 0.92

TweetNLP RoBERTa XLM — tweet (threshold 0):

  • Accuracy 0.93, Ratio 1.0
  • Negative class: Precision 0.93, Recall 0.95, F1 0.94
  • Positive class: Precision 0.91, Recall 0.89, F1 0.9

TweetNLP RoBERTa XLM — tweet (maximum probability):

  • Accuracy 0.95, Ratio 0.85
  • Negative class: Precision 0.96, Recall 0.97, F1 0.96
  • Positive class: Precision 0.94, Recall 0.91, F1 0.93

TweetNLP RoBERTa XLM — finance (threshold 0):

  • Accuracy 0.82, Ratio 1.0
  • Negative class: Precision 0.73, Recall 0.67, F1 0.7
  • Positive class: Precision 0.86, Recall 0.89, F1 0.88

TweetNLP RoBERTa XLM — finance (maximum probability):

  • Accuracy 0.98, Ratio 0.32
  • Negative class: Precision 0.93, Recall 0.99, F1 0.96
  • Positive class: Precision 1.0, Recall 0.98, F1 0.99

💃 Pysentimiento BERTweet

Pysentimiento is another Python package for sentiment analysis; it offers a fine-tuned version of the BERTweet model for English and different models for several other languages. The base model BERTweet was originally trained by VinAI, and its distinctive feature is a very large training corpus of tweets [Nguyen et al., 2020]. For fine-tuning details, I refer you to the Pysentimiento paper [Pérez et al., 2021]. Let’s see if more tweets during training improve performance even further.

Here is a brief description of Pysentimiento BERTweet:

  • Transformer RoBERTa architecture (135 million parameters)
  • English + Italian, Portuguese, Spanish as separate models
  • Both word embeddings and model weights of the base model were trained on 850 million tweets
  • The model head was fine-tuned on SemEval 2017
  • Labels are mapped into 3 classes (negative, neutral, positive)
  • Outputs class probabilities
  • 85 seconds for 10 000 texts (1 text = 100 tokens)

Pysentimiento BERTweet identified the sentiment signs of most simple examples correctly and detected the sentiment degrees of the amplified examples by a slight margin. Unfortunately, the more complex negative example was incorrectly labeled as positive, similar to TweetNLP RoBERTa. Also, the polarity values are quite close to the range ends, which was more typical of the 2-class models than of the multi-class ones.

The movie was great                 0.99017   -> ✅ Simple positive
The movie was really great          0.99022   -> ✅ Amplified positive
The movie was not great            -0.97241   -> ✅ Simple negative
The movie was really not great     -0.97329   -> ✅ Amplified negative
The movie was not that great       -0.97217   -> ✅ Longer negative
The movie could have been better    0.11472   -> ❌ More complex negative
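
And a minimal sketch with the pysentimiento API. As before, the POS/NEG probability difference is an illustrative polarity transform of my own, not necessarily the one used in this series.

from pysentimiento import create_analyzer

analyzer = create_analyzer(task="sentiment", lang="en")   # English analyzer, BERTweet under the hood

def bertweet_polarity(text: str) -> float:
    result = analyzer.predict(text)
    # result.output is 'POS'/'NEU'/'NEG'; result.probas holds the class probabilities
    return result.probas["POS"] - result.probas["NEG"]

print(bertweet_polarity("The movie was great"))       # close to +1
print(bertweet_polarity("The movie was not great"))   # close to -1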

The classification metrics by Pysentimiento BERTweet are very good (accuracy 0.93–0.94) and are comparable to TweetNLP RoBERTa. This is great performance considering that Pysentimiento BERTweet used only tweets for both training and fine-tuning.

When using a maximum class probability, the ratio of texts classified as negative/positive drops but this is expected for a model with 3 classes.

Pysentimiento BERTweet — yelp (threshold 0):

  • Accuracy 0.93, Ratio 1.0
  • Negative class: Precision 0.97, Recall 0.88, F1 0.93
  • Positive class: Precision 0.89, Recall 0.98, F1 0.93

Pysentimiento BERTweet — yelp (maximum probability):

  • Accuracy 0.96, Ratio 0.92
  • Negative class: Precision 0.98, Recall 0.93, F1 0.95
  • Positive class: Precision 0.94, Recall 0.99, F1 0.96

Pysentimiento BERTweet — tweet (threshold 0):

  • Accuracy 0.94, Ratio 1.0
  • Negative class: Precision 0.97, Recall 0.94, F1 0.95
  • Positive class: Precision 0.9, Recall 0.95, F1 0.92

Pysentimiento BERTweet — tweet (maximum probability):

  • Accuracy 0.98, Ratio 0.79
  • Negative class: Precision 0.98, Recall 0.98, F1 0.98
  • Positive class: Precision 0.96, Recall 0.97, F1 0.97

Pysentimiento BERTweet — finance (threshold 0):

  • Accuracy 0.94, Ratio 1.0
  • Negative class: Precision 0.9, Recall 0.91, F1 0.9
  • Positive class: Precision 0.96, Recall 0.95, F1 0.96

Pysentimiento BERTweet — finance (maximum probability):

  • Accuracy 0.99, Ratio 0.54
  • Negative class: Precision 0.97, Recall 1.0, F1 0.99
  • Positive class: Precision 1.0, Recall 0.99, F1 0.99

Summary

Neural network models, and especially transformers, have advanced sentiment analysis a lot. They achieve accuracies above 0.9 on binary classification (negative/positive) across 3 different test datasets (product reviews, social media posts, financial phrases), which wasn’t possible before with dictionary models. And although a good GPU is required for reasonable calculation times, in my opinion it’s worth it. Finally, I would recommend TweetNLP or Pysentimiento since both packages offer models for English and other languages. Their models are also available on the HuggingFace Hub in case you prefer the HuggingFace API.

In summary, these are the pros and cons of neural network models:

  • ✅ Higher classification accuracy compared to dictionary models
  • ✅ Good sentiment detection for specialized texts
  • ✅ 10 different languages
  • ❌ Slower calculation times compared to dictionary models
  • ❌ Requires a GPU for reasonable speeds

Acknowledgements

Big thanks to Prof. Juan Manuel Pérez for useful remarks about this article.
