A simple deep neural network that beats TextBlob and VADER packages at sentiment classification

Kaihua Ding, Ph.D.
9 min read · Nov 9, 2021

The increasing popularity of Python and open-source natural language processing (NLP) packages, such as TextBlob and VADER, has made sentiment analysis easy and widely available. However, these ready-to-use NLP packages come with some caveats. Though packages like TextBlob and VADER are great NLP prototyping tools, they are not terribly accurate even for basic sentence-level sentiment classification tasks.

In this article, we will first investigate some popular models used in NLP. Then, we will use sentiment classification as an example to show how we can create a simple deep neural network (DNN) without any fancy neural network architecture, e.g., without recurrent units or attention. Our simple DNN will achieve a testing accuracy that beats both the TextBlob and VADER packages.

The low accuracy caveat

Before we start building our DNN, let’s examine how ready-to-use sentiment packages like TextBlob and VADER perform on a toy problem, “I don’t think this movie is good”, which clearly expresses a negative sentiment.

First, let’s load both TextBlob and VADER libraries,

!pip3 install -U nltk[twitter]  # install NLTK with its Twitter resources, for fairness of comparison

from textblob import TextBlob
import nltk
nltk.download('vader_lexicon')  # download the VADER sentiment lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Second, we define a helper function for convenience,

def analize_sentiment(sentence, option='VADER'):
    '''
    Utility function to classify the polarity of a sentence
    using VADER or TextBlob.
    '''
    if option == 'VADER':
        analysis = SentimentIntensityAnalyzer().polarity_scores(sentence)
        analysis = analysis['compound']  # take the compound score
    elif option == "TextBlob":
        analysis = TextBlob(sentence)
        analysis = analysis.sentiment.polarity
    if analysis > 0:
        return "positive"
    elif analysis == 0:
        return "neutral"
    else:
        return "negative"

Let’s test our toy example, “I don’t think this movie is good”.

sentence = "I don't think this movie is good"
TextBlob_result = analize_sentiment(sentence, "TextBlob")
VADER_result = analize_sentiment(sentence, "VADER")
print(f"TextBlob's sentiment analysis result for '{sentence}' is {TextBlob_result}.")
print(f"VADER's sentiment analysis result for '{sentence}' is {VADER_result}.")

We shall end up with the following result,

TextBlob's sentiment analysis result for 'I don't think this movie is good' is positive.
VADER's sentiment analysis result for 'I don't think this movie is good' is positive.

The above results are clearly wrong. Why? NLP packages like TextBlob and VADER are bag-of-n-grams models. As the name suggests, a bag-of-n-grams model treats natural language as a bag of n-grams: word order is not considered, and everything is shoved into a mixed bag.

Figure 1. Bag-of-bigrams illustration. VADER and TextBlob use a bag of unigrams by default, though you can tune them to use n-grams of your choice. No matter what the integer n is, the resulting models are all mixed “bags” that do not explicitly consider longer sequence order.

The bag-of-n-grams model shown in Figure 1 can induce a serious accuracy problem, because the semantics of natural language are strongly tied to word order. Consider the two sentences in Figure 2.

Figure 2. Word sequence order determines semantics in natural language.

Figure 2 epitomizes the problem faced by bag-of-n-grams models. Word order is simply too important in natural language: chopping the input into n-grams and shoving them into a bag can yield a low-accuracy model.
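To make the failure concrete, here is a minimal sketch (an illustration of the bag-of-unigrams representation itself, not of TextBlob’s or VADER’s internals): two sentences with opposite meanings collapse to exactly the same bag,

from collections import Counter

# A bag of unigrams reduces a sentence to word counts, so word order
# (and therefore meaning) is discarded entirely.
sent_a = "this movie was good , not bad"
sent_b = "this movie was bad , not good"
print(Counter(sent_a.split()) == Counter(sent_b.split()))  # True: same bag, opposite sentiments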

Besides accuracy concerns, bag-of-n-grams models are not practical for large NLP tasks either. Models that use n-grams, e.g., naive Bayes or bag-of-n-grams classifiers, must store a dictionary of n-grams in order to make predictions. This extra memory burden can rule out n-gram models for NLP tasks that need a large vocabulary. In addition, natural language is highly flexible in its sequence order, but an n-gram dictionary is not: new combinations of n-grams are generated every day and will not be found in the dictionary at prediction time. n-gram-based language models therefore do not generalize well to previously unseen data.
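The dictionary-size problem is easy to see in a few lines. On a hypothetical three-sentence corpus, the number of distinct n-grams a model must store grows quickly with n,

# Hypothetical mini-corpus; count the distinct n-grams an n-gram model must store.
corpus = [
    "I don't think this movie is good",
    "this movie is not bad at all",
    "what a fantastic movie",
]

def ngrams(tokens, n):
    # all consecutive n-word windows in the token list
    return list(zip(*(tokens[i:] for i in range(n))))

for n in (1, 2, 3):
    vocab = {g for sent in corpus for g in ngrams(sent.split(), n)}
    print(f"distinct {n}-grams: {len(vocab)}")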

Then, the question becomes how to construct a model that can process the sequential order information of natural language. There are many possible solutions!

I will mention three useful natural language models that work better than n-gram models (naive Bayes, bag-of-n-grams, etc.).

First solution — a simple deep neural network (DNN). A DNN is a crude mimic of how the human brain functions, with interconnected neurons. Unlike an n-grams model, where the sequence information of natural language is chopped up and lost, a DNN takes the natural language sequence directly as input and thus considers word order information. The downsides of using a DNN to model natural language are that DNNs are not great at modeling very long sequences, and that DNN inputs must be padded during pre-processing to make them of equal length. For the purpose of classifying the NLTK Twitter sentiment data, a DNN suffices. A typical DNN is drawn in Figure 3.

Figure 3. A typical deep neural network, made up of fully connected layers of neurons with activation functions.
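As a concrete example of the padding requirement mentioned above, here is a minimal, hypothetical pre-processing helper that pads a batch of token-ID sequences to a common length,

def pad_sequences(batch, pad_id=0):
    # pad every sequence in the batch to the length of the longest one
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batch = [[12, 7, 95], [4, 18], [3, 1, 41, 8]]
print(pad_sequences(batch))  # [[12, 7, 95, 0], [4, 18, 0, 0], [3, 1, 41, 8]]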

Second solution — a more sophisticated neural network architecture, the recurrent neural network (RNN), is specifically designed to model sequential data. An RNN itself behaves like a sequence; Figure 4 shows a typical RNN architecture. As a result, an RNN takes sequence information into modeling consideration by design.

Figure 4. Recurrent neural networks (RNN). The recurrent units are chained together into a sequence.
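For intuition, a single recurrent step can be sketched in a few lines of NumPy (the dimensions and weights here are arbitrary, not any particular RNN variant): the hidden state h is updated word by word, carrying sequence information forward,

import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # one recurrent update: mix the previous hidden state with the current word
    return np.tanh(W_h @ h + W_x @ x + b)

rng = np.random.default_rng(1)
W_h, W_x, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), np.zeros(8)
h = np.zeros(8)
for x in rng.normal(size=(6, 4)):  # six word embeddings, processed in order
    h = rnn_step(h, x, W_h, W_x, b)
print(h.round(3))  # final hidden state summarizes the whole sequence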

Third solution — the state-of-the-art natural language models are almost always attention models. Attention models not only consider the sequence information of natural language; they can also focus on specific parts of a sequence by assigning a different attention weight to each word. Some attention models are built on top of RNNs; others construct sequence information independently. The attention mechanism is still an active research area. Figure 5 illustrates how a typical attention mechanism works.

Figure 5. A typical attention mechanism.
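A toy version of this weighting fits in a few lines (a generic illustration of the idea, not any particular published attention variant): a query scores each word vector, a softmax turns the scores into attention weights, and the output is the weighted average,

import numpy as np

def attention(query, word_vectors):
    scores = word_vectors @ query                    # one relevance score per word
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
    return weights, weights @ word_vectors           # attention-weighted average

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5, 4))  # five words, 4-dim embeddings
query = rng.normal(size=4)
weights, context = attention(query, word_vectors)
print(weights.round(3))  # larger weight = more attention on that word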

Our simple DNN

Now that we have talked about various NLP models, I want to demonstrate that even a simple DNN model, without fancy recurrent units or an attention mechanism, can still produce a more accurate NLP model than the n-gram models (TextBlob and VADER) for Twitter sentiment classification. The following model will serve as an example of just what a difference considering sequence information can make to a model’s accuracy.

I will explain the important bits of this model in the following sections, and you can find the IPython notebook accompanying this article in this GitHub repository.

Training data and labels

Training data and training labels for this DNN model are downloaded from NLTK. We will use labeled ‘twitter_sample’ sentiment data from NLTK.

import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples

def load_tweets():
    all_positive_tweets = twitter_samples.strings('positive_tweets.json')
    all_negative_tweets = twitter_samples.strings('negative_tweets.json')
    return all_positive_tweets, all_negative_tweets

all_positive_tweets, all_negative_tweets = load_tweets()

# Construct the training data
train_pos = all_positive_tweets[:4000]  # training set of positive tweets
train_neg = all_negative_tweets[:4000]  # training set of negative tweets
# The rest will be used as a validation data set
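The remaining 1,000 tweets of each class, i.e., 20% of the data, become the validation set. A hypothetical continuation building the validation split and integer labels (the 1-for-positive, 0-for-negative encoding is my assumption; the exact encoding lives in the accompanying notebook) might look like this,

val_pos = all_positive_tweets[4000:]  # validation set of positive tweets
val_neg = all_negative_tweets[4000:]  # validation set of negative tweets
train_y = [1] * len(train_pos) + [0] * len(train_neg)  # assumed label encoding
val_y = [1] * len(val_pos) + [0] * len(val_neg)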

Model

The DNN model in this article is constructed using Trax, a deep learning framework maintained by the Google Brain team, the same group behind TensorFlow.
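If Trax is not already available in your environment, it can be installed from PyPI,

!pip3 install trax  # install the Trax deep learning framework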

If you are unfamiliar with either Trax or TensorFlow, don’t worry about the syntax in the following code block. The GitHub repository accompanying this article can be found here, where you can find all the details of the model implementation.

from trax import layers as tl

def classifier(vocab_size=10000, embedding_dim=256, output_dim=2, mode='train'):
    # Create the embedding layer
    embed_layer = tl.Embedding(
        vocab_size=vocab_size,    # size of the vocabulary
        d_feature=embedding_dim)  # embedding dimension

    # Create a mean layer, to create an "average" word embedding
    mean_layer = tl.Mean(axis=1)

    # Create a dense layer, one unit for each output class
    dense_output_layer = tl.Dense(n_units=output_dim)

    # Create the log softmax layer (no parameters needed)
    log_softmax_layer = tl.LogSoftmax()

    # Combine the layers with the tl.Serial combinator
    model = tl.Serial(
        embed_layer,         # embedding layer
        mean_layer,          # mean layer
        dense_output_layer,  # dense output layer
        log_softmax_layer    # log softmax layer
    )

    # Return the assembled model
    return model

model = classifier()
display(model)

The printout of the above model shows us the model’s structure,

Serial[
Embedding_10000_256
Mean
Dense_2
LogSoftmax
]

The above printout shows the four simple layers of our DNN: a 10,000-word embedding of dimension 256, a mean layer, a two-unit dense output layer, and a log softmax. There are no fancy deep learning architectures involved. We can go ahead and train our model now.

I’m going to skip the model training part; more details can be found in the accompanying GitHub repository. I have also attached the trained weights for anyone who would like to play with this pre-trained DNN classifier. For the labeled NLTK Twitter sentiment data, the accuracy achieved on the validation data set (20% of the NLTK ‘twitter_samples’ data) is 99.50%. Note that the NLTK Twitter sentiment data set is not large, so if you plan to use this model on natural language that is not similar to the 10,000 NLTK ‘twitter_samples’ tweets, I recommend re-training this DNN or applying transfer learning.
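The comparison in the next section calls a predict helper built on the trained model. Its definition is in the accompanying notebook; purely as a hypothetical sketch, assuming a tweet_to_tensor function that converts a sentence into padded vocabulary indices and a model whose trained weights are already loaded, it might look like this,

import numpy as np

def predict(sentence):
    # tweet_to_tensor (from the accompanying notebook) is assumed to map a
    # sentence to a padded list of vocabulary indices.
    inputs = np.array(tweet_to_tensor(sentence))[None, :]  # add a batch dimension
    preds = model(inputs)  # log-probabilities for [negative, positive]
    sentiment = "positive" if preds[0, 1] > preds[0, 0] else "negative"
    return preds, sentiment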

Testing and comparison against TextBlob and VADER on a hand-engineered hard test set (double negation, idioms, etc.)

Since TextBlob and VADER do not consider the sequence order of natural language and treat it as a bag of n-grams, I thought it would be fun to purposefully construct a hard test set!

Figure 6. Screenshot of the tabulated hard test data. You can find this table in the accompanying GitHub repository here. These test data contain double negations and idioms.

Let’s compare the accuracy of our DNN, TextBlob, and VADER on this test set, using the following code block.

no_hand_engineered_tests = len(hand_engineered_tests)
no_correct_classification_TextBlob = 0
no_correct_classification_VADER = 0
no_correct_classification_deep_neural_nets = 0

for i in range(no_hand_engineered_tests):
    sample = hand_engineered_tests[i]
    sentence = sample[0]
    sentiment = sample[1]
    # TextBlob
    if analize_sentiment(sentence, option='TextBlob') == sentiment:
        no_correct_classification_TextBlob += 1
    # VADER
    if analize_sentiment(sentence, option='VADER') == sentiment:
        no_correct_classification_VADER += 1
    # deep neural nets
    if predict(sentence)[1] == sentiment:
        no_correct_classification_deep_neural_nets += 1

print(f"TextBlob classified {no_correct_classification_TextBlob} / {no_hand_engineered_tests} correctly. \n")
print(f"VADER classified {no_correct_classification_VADER} / {no_hand_engineered_tests} correctly. \n")
print(f"The deep neural nets classified {no_correct_classification_deep_neural_nets} / {no_hand_engineered_tests} correctly.\n")

The output will look like the following,

TextBlob classified 8 / 12 correctly.
VADER classified 6 / 12 correctly.
The deep neural nets classified 10 / 12 correctly.

Hooray! Our DNN outperforms both TextBlob and VADER.

I want to end the testing section with one final example, the sentence

“I can not believe how fantastic this movie was”,

which has a clear positive sentiment and will be classified wrongly by both VADER and TextBlob.

# try a tricky positive sentence with a negation word
sentence = "I can not believe how fantastic this movie was."
pred, sentiment = predict(sentence)
print(f"The deep neural net classifies sentiment of the sentence: '{sentence}', to be {sentiment}.")
print(f"TextBlob classifies sentiment of the sentence: '{sentence}', to be {analize_sentiment(sentence, option='TextBlob')}.")
print(f"VADER classifies sentiment of the sentence: '{sentence}', to be {analize_sentiment(sentence, option='VADER')}.")

The printed results are as follows.

The deep neural net classifies sentiment of the sentence: 'I can not believe how fantastic this movie was.', to be positive.
TextBlob classifies sentiment of the sentence: 'I can not believe how fantastic this movie was.', to be negative.
VADER classifies sentiment of the sentence: 'I can not believe how fantastic this movie was.', to be negative.

Hooray! Our DNN triumphed again.

Conclusion

By explaining how various NLP models work under the hood, I hope to convey the message that the sequence order information of natural language is very important in NLP modeling. Deep learning based architectures (DNN, RNN, or attention) tend to model the sequential nature of natural language better than n-gram based methods (naive Bayes, bag-of-n-grams, etc.).

Now that you have a simple DNN that outperforms TextBlob and VADER on a hand-engineered Twitter test set, I hope this encourages you to explore more deep learning models in the future.

TextBlob and VADER are still great prototyping tools and fantastically easy to use. However, please be aware of their bag-of-n-grams nature under the hood. If your NLP task demands higher accuracy, you can always design your own language model! n-gram models are still useful in NLP metrics, e.g., BLEU and ROUGE scores, so learning about them remains helpful.

The GitHub repository accompanying this article can be found here.
