Best open-source models for sentiment analysis — Part 1: dictionary models
Dictionary models are really fast but at the price of lower accuracy
Introduction
In this article series, I will try to answer a question inspired by one of my data science colleagues (thanks Rachid 😊): what is the best model for sentiment analysis? For this comparison, I selected 13 popular models that were pre-trained for sentiment analysis and are available as open source. In Part 1, you will find 4 dictionary models (3 for Python and 1 for R), and in Part 2, I additionally review 9 neural network models.
But first, what is sentiment analysis and why is it important?
Sentiment analysis is the process of determining the opinion, judgment, or emotion behind natural language [Source: Qualtrics].
Sentiment analysis can be a very powerful technique for analyzing customer feedback, monitoring social media, and even predicting stock prices! It is, however, a rather complicated task because it deals with unstructured text data as well as language nuances. Let’s be honest, even humans can’t always get the sentiment right, for example, when dealing with sarcasm.
Metrics
The most common metric for sentiment analysis is polarity, which I will be using in this article. In the literature, however, you can also find other metrics, like subjectivity (if you want to analyze biases), emotion (if you want to detect hate speech), etc.
Polarity is typically measured within the range [-1, 1], where -1 corresponds to a strongly negative sentiment, 0 to a neutral sentiment, and +1 to a strongly positive sentiment. Having a polarity value is very useful because it allows you to define your own polarity threshold t that separates the neutral class from the negative/positive classes (see the figure below).
Some models don’t output a polarity value but instead provide probabilities for different sentiment classes: p-- (strongly negative), p- (negative), p0 (neutral), p+ (positive), p++ (strongly positive). The predicted class will be the one with the maximum probability. To compare these models with the rest, I calculate polarity from probabilities as follows:
If a model outputs probabilities only for negative/positive classes, I use them in place of p-- and p++, ignoring the rest of the terms in the equation.
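As a rough illustration, such a weighted conversion could look like the Python sketch below; the specific weights (-1, -0.5, 0, 0.5, 1) are an assumption for demonstration purposes, not necessarily the exact coefficients behind the results in this article:

def polarity_from_probs(p_nn, p_n, p_0, p_p, p_pp):
    """Convert 5-class probabilities into a single polarity-like score.

    The weights (-1, -0.5, 0, 0.5, 1) are an assumed example; with only
    negative/positive probabilities available, those two take the places
    of p_nn and p_pp and the remaining terms are dropped.
    """
    return -1.0 * p_nn - 0.5 * p_n + 0.0 * p_0 + 0.5 * p_p + 1.0 * p_pp

# Example: a model that is fairly confident the text is positive
print(polarity_from_probs(0.02, 0.05, 0.13, 0.50, 0.30))  # 0.505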
In this article series, I will be using models with 2-class, 3-class, and 5-class probabilities. To compare them with each other, I decided to calculate classification metrics only for binary classification (negative/positive) by setting the polarity threshold t = 0. The elements of the confusion matrix are then defined as follows:
- True positives (TP) = the number of positive texts with polarity > 0
- False positives (FP) = the number of negative texts with polarity > 0
- True negatives (TN) = the number of negative texts with polarity < 0
- False negatives (FN) = the number of positive texts with polarity < 0
Here the polarity threshold is strict (strictly greater or strictly less than 0), so if a text's polarity equals 0, I consider the text neutral and exclude it from binary classification. This is especially relevant for dictionary models, which often output polarity 0 due to their limited dictionaries. As a result, I also report the ratio of texts classified as negative/positive to all texts in the dataset. If this ratio is less than 1, the model predicted some of the negative/positive texts as neutral. Typically, this ratio goes down and the accuracy goes up when choosing higher polarity thresholds, so I plot these graphs as well. Their purpose becomes clearer in Part 2, where I test a model that only outputs class labels without class probabilities.
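To make this concrete, here is a minimal Python sketch of how the accuracy and the negative/positive ratio can be computed at a polarity threshold t (the function and the symmetric neutral band |polarity| ≤ t are my own illustration):

def binary_metrics(polarities, labels, t=0.0):
    """Accuracy and classified ratio at polarity threshold t.

    polarities: model polarity per text; labels: 1 = positive, 0 = negative.
    Texts with |polarity| <= t are treated as neutral and excluded.
    """
    tp = sum(1 for p, y in zip(polarities, labels) if p > t and y == 1)
    fp = sum(1 for p, y in zip(polarities, labels) if p > t and y == 0)
    tn = sum(1 for p, y in zip(polarities, labels) if p < -t and y == 0)
    fn = sum(1 for p, y in zip(polarities, labels) if p < -t and y == 1)
    classified = tp + fp + tn + fn
    accuracy = (tp + tn) / classified if classified else float("nan")
    ratio = classified / len(labels)
    return accuracy, ratio

At t = 0 this reproduces the confusion-matrix definitions above; raising t widens the neutral band, which lowers the ratio and typically raises the accuracy.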
❗ Note that a “polarity” calculated from class probabilities isn’t a true polarity and is only used for binary classification with a polarity threshold. This is because class probabilities from machine learning models are often poorly calibrated, so they don’t reflect the true correctness likelihood [Guo et al., 2017]. Therefore, I wouldn’t compare individual polarity values between different models but only use those values for classification metrics.
Datasets
In this article, I will be using 3 different datasets:
- Yelp reviews (test split): 10 000 strongly negative reviews (1 star) and 10 000 strongly positive reviews (5 stars) [Zhang et al., 2016]
- TweetEval (sentiment subset, test split): 3972 negative tweets and 2375 positive tweets from SemEval 2017 [Rosenthal et al., 2017]
- Financial phrasebank (sentences_66agree subset, test split): 514 negative sentences and 1168 positive sentences [Malo et al., 2013]
I chose these datasets to cover diverse categories: customer reviews, social media posts, and highly specialized texts. Another requirement was to avoid the most common sentiment datasets that are used to train open-source models. Finally, only negative/positive texts were selected since testing is done using binary classification.
In addition to the above datasets, I also came up with 6 simple examples (see below) to demonstrate how some of the models work, and I use 3 emojis to grade the polarity values output by each model.
The movie was great -> Simple positive
The movie was really great -> Amplified positive
The movie was not great -> Simple negative
The movie was really not great -> Amplified negative
The movie was not that great -> Longer negative
The movie could have been better -> More complex negative
✅ = Both the polarity sign and the polarity value are correct
❔ = The polarity sign is correct but the polarity value is off
❌ = The polarity sign is wrong
Computations
The code for this article was executed in this Google Colab notebook. For dictionary models, I used 2 virtual CPUs (Intel(R) Xeon(R) CPU @ 2.20GHz) and 13 GB of RAM. The calculation times of the models are approximate values averaged over 3 rounds and can get even higher, especially as the number of tokens per text increases.
💧 TextBlob Pattern
TextBlob is a popular Python NLP package that includes Pattern, a dictionary model for sentiment analysis. The latter also exists as a separate Python package and provides additional multilingual dictionaries. If you want to learn more about the model, I refer you to the original paper [De Smedt & Daelemans, 2012].
Here is a brief description of TextBlob Pattern:
- Dictionary and rules architecture
- English via TextBlob + Dutch, French, Italian via Pattern
- Manual annotation of the gold dataset (about 1000 words, mostly adjectives) and automatic extension of ≈3000 words using semantic relatedness
- Rules for valence shifting (negations, intensifiers) using bigrams
- Outputs polarity
- 10 seconds for 10 000 texts (1 text = 100 tokens)
TextBlob Pattern calculates sentiment by averaging the polarity scores of every word in a sentence that is found in its dictionary, and then applying rules for valence shifting, for example boosting polarity when intensifiers like “very” or “extremely” are present.
Since valence shifting is done using bigrams, a negation word (e.g., “not”) influences polarity only if it comes right before the sentiment word (e.g., “not great”), as shown in the simple examples below. If another word comes between them (e.g., “not that great”), the model fails to recognize the sentiment correctly. More complex negative structures (e.g., “could have been better”) also get the wrong polarity.
On top of that, I noticed that the intensifier word “really” didn’t boost polarity as expected, although some other intensifiers like “very” did. This is strange because “really” is present in the dictionary with an intensity multiplier of 2, so I suspect a bug in the code.
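The polarity values for the examples below can be reproduced with TextBlob’s standard sentiment API (the default analyzer is Pattern):

from textblob import TextBlob

# TextBlob's default analyzer is the Pattern dictionary model;
# .sentiment returns (polarity, subjectivity) with polarity in [-1, 1]
for text in ["The movie was great", "The movie was not great"]:
    print(text, TextBlob(text).sentiment.polarity)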
The movie was great 0.8 -> ✅ Simple positive
The movie was really great 0.8 -> ❔ Amplified positive
The movie was not great -0.4 -> ✅ Simple negative
The movie was really not great -0.4 -> ❔ Amplified negative
The movie was not that great 0.8 -> ❌ Longer negative
The movie could have been better 0.5 -> ❌ More complex negative
The classification results of TextBlob Pattern are decent on all 3 test datasets (accuracy 0.69–0.77), but the model struggles with negative sentiment (recall for the negative class 0.51–0.67). This is also confirmed by the average polarity of the negative class being only slightly below 0 for all 3 datasets instead of a more clearly negative value.
TextBlob Pattern managed to classify almost all Yelp reviews as negative/positive (ratio 0.98) but did much worse with tweets (ratio 0.64) and financial phrases (ratio 0.54). This is also visible on the histograms as large spikes at polarity 0. These low ratios were expected since tweets may contain slang or special characters (for example, emojis) and financial phrases may include highly specialized terms that are not in the dictionary of TextBlob Pattern.
TextBlob Pattern — yelp (threshold 0):
- Accuracy 0.75, Ratio 0.98
- Negative class: Precision 0.97, Recall 0.51, F1 0.67
- Positive class: Precision 0.67, Recall 0.98, F1 0.8
TextBlob Pattern — tweet (threshold 0):
- Accuracy 0.69, Ratio 0.64
- Negative class: Precision 0.9, Recall 0.55, F1 0.68
- Positive class: Precision 0.57, Recall 0.91, F1 0.7
TextBlob Pattern — finance (threshold 0):
- Accuracy 0.77, Ratio 0.54
- Negative class: Precision 0.66, Recall 0.67, F1 0.67
- Positive class: Precision 0.83, Recall 0.83, F1 0.83
💧 TextBlob Naive Bayes
In addition to Pattern, TextBlob also includes a Naive Bayes sentiment classifier. This is a fairly old model that became popular in the past for spam filtering [Sahami et al., 1998]. It doesn’t usually achieve high accuracy, but it is sometimes still used thanks to its simplicity.
Here is a brief description of TextBlob Naive Bayes:
- Conditional probability architecture
- English
- Is trained automatically during every session using the IMDB movie review dataset
- Labels are mapped into 2 classes (negative, positive)
- Outputs class probabilities
- 20 seconds for 10 000 texts (1 text = 100 tokens)
Strictly speaking, Naive Bayes isn’t a dictionary model, but I included it here because of its similar behavior. During training, Naive Bayes generates a separate feature for every word in the training corpus, which can be considered its “dictionary”. If any of the feature words are found in a test sentence, one of the classes gets a higher probability and the sentence polarity is non-zero (negative/positive). But if a test sentence doesn’t contain any of the feature words (which is quite unlikely because of all the stop words), both classes get equal probabilities and the sentence polarity is 0 (neutral).
The simple examples show that TextBlob Naive Bayes poorly recognizes negative sentiment, which is due to its underlying assumption of conditional independence: the probability of finding a certain word depends only on the text’s class and not on the other words in the text. So if the word “not” appears more often in the positive training texts than in the negative ones (which is the case for the IMDB training dataset!), then “not” contributes to positive sentiment, no matter what words come after it. As a result, valence shifting by negations or intensifiers doesn’t exist in Naive Bayes.
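A minimal example of getting the class probabilities from TextBlob’s Naive Bayes analyzer; here I take polarity as p_pos − p_neg, i.e., the 2-class case of the conversion in the Metrics section assuming unit weights:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Requires the NLTK movie_reviews corpus (e.g., python -m textblob.download_corpora);
# the analyzer trains on it at first use, which is why every session includes a training step
analyzer = NaiveBayesAnalyzer()
sentiment = TextBlob("The movie was not great", analyzer=analyzer).sentiment
polarity = sentiment.p_pos - sentiment.p_neg  # assumed 2-class conversion to polarity
print(sentiment.classification, round(polarity, 5))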
The movie was great 0.11474 -> ✅ Simple positive
The movie was really great 0.10906 -> ❔ Amplified positive
The movie was not great 0.1253 -> ❌ Simple negative
The movie was really not great 0.11962 -> ❌ Amplified negative
The movie was not that great 0.1278 -> ❌ Longer negative
The movie could have been better -0.28085 -> ✅ More complex negative
The classification results of TextBlob Naive Bayes are not very good (accuracy 0.48–0.67), largely due to its poor detection of negative sentiment, as explained above. The model recognized some negative reviews from the Yelp dataset (recall for the negative class 0.47) but largely failed with tweets (recall for the negative class 0.34) and financial phrases (recall for the negative class 0.16). This is also visible on the histograms, where the polarities of negative tweets and negative financial phrases are mostly positive.
Since TextBlob Naive Bayes outputs class probabilities, it is also possible to calculate its classification metrics based on the maximum class probability. But since there are only 2 classes, this approach is equivalent to using the polarity threshold 0 reported above. When a model outputs more than 2 classes, the classification results based on the maximum class probability differ, and I will report them for those models as well.
TextBlob Naive Bayes — yelp (threshold 0):
- Accuracy 0.67, Ratio 1.0
- Negative class: Precision 0.78, Recall 0.47, F1 0.58
- Positive class: Precision 0.62, Recall 0.87, F1 0.72
TextBlob Naive Bayes — tweet (threshold 0):
- Accuracy 0.48, Ratio 1.0
- Negative class: Precision 0.68, Recall 0.34, F1 0.45
- Positive class: Precision 0.4, Recall 0.73, F1 0.51
TextBlob Naive Bayes — finance (threshold 0):
- Accuracy 0.66, Ratio 1.0
- Negative class: Precision 0.36, Recall 0.16, F1 0.23
- Positive class: Precision 0.7, Recall 0.87, F1 0.78
🤖 NLTK VADER
NLTK, one of the most popular Python NLP packages, includes a dictionary model called VADER, which stands for “Valence Aware Dictionary and sEntiment Reasoner”. VADER also exists as a separate Python package, but if you have NLTK, you can use this model out of the box. The main selling point of VADER is that it is tuned for social media texts, and it claims to achieve high accuracy on this task [Hutto & Gilbert, 2014]. Thanks to its popularity, VADER was also recreated in many other languages, including R, but unfortunately the R model takes almost 30 minutes to process 10 000 texts, so I would advise against it for R users unless you have a small number of texts.
Here is a brief description of NLTK VADER:
- Dictionary and rules architecture
- English
- Manual annotation of 9000 words, including emoticons, slang, and acronyms typically used in social media (for example, “lol”)
- Rules for valence shifting (negations, intensifiers) using trigrams
- Outputs polarity
- 5 seconds for 10 000 texts (1 text = 100 tokens)
NLTK VADER calculates polarity by adding up sentiment scores in the range [-4, 4] for each sentence word that is found in its dictionary, applying rules, and normalizing the final score to [-1, 1] using the normalize function shown below. This approach shifts polarity values further away from 0 compared to the simple averaging in TextBlob Pattern, which will be visible on the histograms.
import math

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score / math.sqrt((score * score) + alpha)
    if norm_score < -1.0:
        return -1.0
    elif norm_score > 1.0:
        return 1.0
    else:
        return norm_score
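For reference, the compound polarity used in this section can be obtained directly from NLTK:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER dictionary
sia = SentimentIntensityAnalyzer()
# The "compound" value is the normalized polarity in [-1, 1]
print(sia.polarity_scores("The movie was not that great")["compound"])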
Valence shifting with trigrams in NLTK VADER performs better on the simple examples below than TextBlob Pattern’s bigram approach. Unfortunately, it still fails on more complex negative structures. Unlike TextBlob Pattern, NLTK VADER correctly recognized the intensifier word “really”, so I think there is indeed a bug in the former.
The movie was great 0.62 -> ✅ Simple positive
The movie was really great 0.66 -> ✅ Amplified positive
The movie was not great -0.51 -> ✅ Simple negative
The movie was really not great -0.55 -> ✅ Amplified negative
The movie was not that great -0.51 -> ✅ Longer negative
The movie could have been better 0.44 -> ❌ More complex negative
The classification results of NLTK VADER are decent (accuracy 0.76–0.78) and comparable to TextBlob Pattern. NLTK VADER also struggles with negative sentiment (recall for the negative class 0.42–0.68), but unlike TextBlob Pattern, it recognized more texts as negative/positive, especially among tweets (ratio 0.79) and financial phrases (ratio 0.77). This is probably due to the larger dictionary of NLTK VADER, which was designed specifically for social media texts.
VADER — yelp (threshold 0):
- Accuracy 0.78, Ratio 0.99
- Negative class: Precision 0.96, Recall 0.58, F1 0.73
- Positive class: Precision 0.7, Recall 0.98, F1 0.82
VADER — tweet (threshold 0):
- Accuracy 0.76, Ratio 0.79
- Negative class: Precision 0.93, Recall 0.68, F1 0.79
- Positive class: Precision 0.61, Recall 0.91, F1 0.73
VADER — finance (threshold 0):
- Accuracy 0.78, Ratio 0.77
- Negative class: Precision 0.7, Recall 0.42, F1 0.53
- Positive class: Precision 0.8, Recall 0.93, F1 0.86
🎓 Sentimentr (R)
For R enthusiasts, I also added the Sentimentr package to the comparison. According to its own benchmarks, Sentimentr outperforms other dictionary R models such as Syuzhet and Meanr, so I left those out. Moreover, Syuzhet and Meanr don’t do valence shifting, which is an important component of dictionary models.
Here is a brief description of Sentimentr:
- Dictionary and rules architecture
- English
- Combination of 9 lexicons totaling 11710 words that were either manually annotated or automatically extended
- Rules for valence shifting (negations, intensifiers) using 4 words before and 2 words after
- Outputs polarity
- 45 seconds for 10 000 texts (1 text = 100 tokens)
Sentimentr calculates polarity in a more complex way than the other dictionary models, but in the end it sums up the polarity scores of word clusters and divides the sum by the square root of the word count. Since this method produces an unbounded polarity, I additionally clip its values to the range [-1, 1], which may lead to higher counts at the ends of the range on the histograms.
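As a rough illustration of this final step, here is a schematic Python version of the scaling and of my clipping (the actual word-cluster scoring happens inside the R package; the numbers are hypothetical):

import math

def sentimentr_style_polarity(cluster_scores, word_count):
    """Schematic version of Sentimentr's final scaling plus my clipping."""
    # Sum of the word-cluster polarity scores divided by sqrt(word count)
    polarity = sum(cluster_scores) / math.sqrt(word_count)
    # Extra post-processing used in this article: clip to [-1, 1]
    return max(-1.0, min(1.0, polarity))

# Hypothetical cluster score for a 4-word sentence
print(sentimentr_style_polarity([0.5], 4))  # 0.25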
Valence shifting by Sentimentr works quite well on the simple examples below, except for the negative amplification. Still, I was happy to see that polarity signs for all examples were correctly identified, even for a more complex negative structure.
The movie was great 0.25 -> ✅ Simple positive
The movie was really great 0.4 -> ✅ Amplified positive
The movie was not great -0.22 -> ✅ Simple negative
The movie was really not great -0.041 -> ❔ Amplified negative
The movie was not that great -0.2 -> ✅ Longer negative
The movie could have been better -0.1 -> ✅ More complex negative
The classification results of Sentimentr are decent (accuracy 0.74–0.83) and slightly better than those of NLTK VADER, although identifying negative sentiment remains a common challenge for dictionary models (recall for the negative class 0.62–0.69). Interestingly, the package author came to the same conclusion when comparing Sentimentr to other R models (see here). Nevertheless, Sentimentr can be considered a worthy alternative to the Python dictionary models.
Sentimentr — yelp (threshold 0):
- Accuracy 0.79, Ratio 0.99
- Negative class: Precision 0.96, Recall 0.62, F1 0.75
- Positive class: Precision 0.72, Recall 0.97, F1 0.83
Sentimentr — tweet (threshold 0):
- Accuracy 0.74, Ratio 0.9
- Negative class: Precision 0.9, Recall 0.68, F1 0.77
- Positive class: Precision 0.59, Recall 0.87, F1 0.7
Sentimentr — finance (threshold 0):
- Accuracy 0.83, Ratio 0.79
- Negative class: Precision 0.7, Recall 0.69, F1 0.7
- Positive class: Precision 0.88, Recall 0.88, F1 0.88
Summary
Dictionary models are a good choice if you care about fast computation and don’t mind lower accuracy. For Python users, I would recommend NLTK VADER, and for R users, I would go with Sentimentr. Support for languages other than English is quite limited, but some languages are available in the original Pattern package.
In summary, these are the pros and cons of dictionary models:
- ✅ Run fast on a CPU
- ❌ Not very good at detecting negative sentiment
- ❌ Require larger dictionaries for specialized texts
💡 If you want to find out how much better neural networks score on sentiment analysis, don’t miss Part 2 of this article series.
Acknowledgements
Big thanks to Prof. Juan Manuel Pérez for useful remarks about this article.