Deep learning frameworks and vectorization approaches for sentiment analysis

Nathan
Published in Spatial AI
6 min read · Dec 13, 2017


TLDR: This is a quick tutorial comparing different deep learning architectures and text embedding schemes for the purpose of sentiment analysis.

Introduction and Background

There are many pipelines for calculating the sentiment of a given text, differing in everything from the chosen vectorization scheme (bag-of-words (BoW), word2vec, character-level embedding, etc.) to the machine learning algorithm that interprets those vectors and ultimately decides what the answer should be.

For this example, we’ll look at the task of tweet sentiment analysis, comparing two different text vectorization schemes as well as two different deep learning architectures.

The first and most common way of encoding text data is at the word level, which here takes the form of a BoW representation. In this approach, words that are correlated with positive sentiment (‘happy’, ‘love’, etc.) become signals to the classifier that the tweet is likely positive. Conversely, words such as “miserable” or “awful” get flagged as correlated with negative sentiment.
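As a toy illustration of word-level encoding (the tiny vocabulary here is made up for this sketch, not the one used later in the article): each word maps to an integer index, with 0 reserved for out-of-vocabulary words.

```python
# Toy word-level encoder: known words map to their vocabulary index,
# unknown words map to 0.

vocab = {'i': 1, 'love': 2, 'this': 3, 'movie': 4, 'happy': 5}

def encode_words(text, vocab):
    """Map each whitespace-separated word to its vocabulary index (0 if unknown)."""
    return [vocab.get(word, 0) for word in text.lower().split()]

print(encode_words('I love this miserable movie', vocab))
# [1, 2, 3, 0, 4]  -- 'miserable' is out of vocabulary
```

In a real pipeline the vocabulary would be built from the most frequent words in the training corpus (the top 8000 in this post).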

An alternative to word-level embedding is character-level embedding. In this approach, instead of keeping a list of every word, we keep a list of all possible characters. This means that instead of tracking 10k or 100k words, we track roughly 100 different characters drawn from the alphabet [a–z], the digits [0–9], and any special characters [#, $, %, &, etc.].
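A quick sketch of what that character lookup table might look like (the exact character set and index assignment here are illustrative, not the ones used later in this post):

```python
import string

# Toy character-level encoder: the "vocabulary" is just the set of
# allowed characters, so the lookup table stays tiny.
charset = string.ascii_lowercase + string.digits + ' #$%&'
char_to_idx = {ch: i + 1 for i, ch in enumerate(charset)}  # 0 reserved for padding/unknown

def encode_chars(text):
    return [char_to_idx.get(ch, 0) for ch in text.lower()]

print(len(char_to_idx))      # 41 distinct characters in this toy charset
print(encode_chars('so fun!'))
# [19, 15, 37, 6, 21, 14, 0] -- '!' is not in the charset, so it maps to 0
```

The whole table fits in a few dozen entries, which is exactly the appeal of character-level encoding over a multi-thousand-word vocabulary.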

The two different deep learning architectures we’ll be investigating are convolutional neural networks (CNNs) and recurrent neural networks (RNNs) with long short-term memory (LSTM) gates. CNNs have been used with great success for image recognition, and have also proven successful when applied to natural language. RNN-LSTMs, on the other hand, are very good with time-dependent (sequential) datasets: the LSTM memory gates allow the network to “remember” past events within a certain range and act accordingly. The downside of RNN-LSTMs is that they require substantially more training time than their CNN counterparts.

Data preparation

Before we can heat up any GPUs, we need to clean and prepare the data for training. The data we’ll be using is the Sentiment140 dataset which is a freely available corpus of nearly 1.6 million tweets with their respective sentiment polarity.

The first thing we need to do is convert everything to lowercase and remove any hyperlinks. From here, we can remove any user names and correct any formatting inconsistencies. Below is a Python function for completing this task:

import re

def clean_tweet(tweet_raw):
    # REMOVE USER NAMES
    tweet_clean = re.sub(r'@.*? ', '', tweet_raw)
    tweet_clean = re.sub(r'@\_.*? ', '', tweet_clean)
    # REMOVE LINKS
    tweet_clean = re.sub(r'http://.*?($| )', '', tweet_clean)
    # REMOVE QUOTES
    tweet_clean = tweet_clean.replace('"', '')
    tweet_clean = tweet_clean.replace("'", '')
    # MISC: collapse repeated spaces and unescape HTML entities
    tweet_clean = re.sub(r' +', ' ', tweet_clean)
    tweet_clean = tweet_clean.replace('&lt;', '<')
    tweet_clean = tweet_clean.replace('&gt;', '>')
    # REMOVE LEADING SPACES
    tweet_clean = re.sub(r'^ +', '', tweet_clean)
    # LOWERCASE
    tweet_clean = tweet_clean.lower()
    return tweet_clean

So, now that our data is cleaned and ready to go, we need to split it into training and testing sets. It’s usually best practice to use a train/validation/test split, but let’s keep it simple for this exercise and split 80/20. Additionally, when splitting the data, make sure to keep the same distribution of positive and negative examples in both the training and testing sets.
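One way to keep that class balance is a stratified split: shuffle and split each class separately, then recombine. A minimal pure-Python sketch (the variable names and toy data are illustrative; in practice you might use scikit-learn’s `train_test_split` with its `stratify` option):

```python
import random

def stratified_split(examples, labels, test_frac=0.2, seed=42):
    """Split each class separately so train and test keep the same label distribution."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(len(xs) * test_frac)
        test.extend((x, y) for x in xs[:cut])
        train.extend((x, y) for x in xs[cut:])
    return train, test

tweets = [f'tweet {i}' for i in range(100)]
labels = [0] * 50 + [1] * 50            # balanced toy labels
train, test = stratified_split(tweets, labels)
print(len(train), len(test))            # 80 20
```

Because each class is split separately, the 50/50 balance of the toy labels is preserved exactly in both halves.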

As an example to show the difference between character-level and word-level encoding, let’s vectorize the following tweet randomly taken from the Sentiment140 dataset:

Tweet: “going to sleep soon for my physics exam at 8am. i had so much fun tonight with my friends”

Using the top 8000 words plus a ‘0’ as a padding or out of range flag, the BoW vector looks like this:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52, 4, 125, 171, 15, 8, 2453, 466, 28, 2785, 2, 3, 71, 21, 92, 121, 135, 26, 8, 199]

For the character level embedding, the vectorized tweet looks like this:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 62, 128, 92, 60, 6, 31, 62, 6, 63, 29, 127, 127, 30, 6, 63, 62, 62, 92, 6, 90, 62, 93, 6, 129, 132, 6, 30, 28, 132, 63, 128, 59, 63, 6, 127, 32, 125, 129, 6, 125, 31, 6, 18, 125, 129, 78, 6, 128, 6, 28, 125, 27, 6, 63, 62, 6, 129, 131, 59, 28, 6, 90, 131, 92, 6, 31, 62, 92, 128, 60, 28, 31, 6, 64, 128, 31, 28, 6, 129, 132, 6, 90, 93, 128, 127, 92, 27, 63]

The zeros at the front of the vector represent padding, since the example tweet was not quite 140 characters (or words) long. Even though 140 words in a tweet is unrealistic, I kept both vectors the same length for consistency. In terms of information structure, the BoW vector has 140 dimensions, most of which are empty, and the few dimensions that are filled range from zero to 8000 (because we kept the top 8000 words plus an empty ‘0’ flag). Sure, we could have reduced the number of dimensions to maybe 50 or fewer, but this is a worst-case generalization. The character-level vector is also 140 dimensions, but most of the dimensions are filled, and each dimension ranges from 0 to 133. Will the more balanced character-level vectorization scheme prevail? Or will the highly concentrated BoW approach take the lead? Let’s find out!
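The front-padding described above is simple to sketch: every encoded tweet is padded with zeros on the left until it reaches a fixed length of 140. (Keras users would typically reach for `pad_sequences`; this plain-Python version just makes the idea explicit.)

```python
def pad_left(seq, length=140, pad_value=0):
    """Front-pad a sequence with zeros to a fixed length, truncating if too long."""
    return [pad_value] * (length - len(seq)) + seq[:length]

encoded = [52, 4, 125, 171, 15]       # the start of an encoded tweet
padded = pad_left(encoded)
print(len(padded))                    # 140
print(padded[-5:])                    # [52, 4, 125, 171, 15]
```

Left-padding keeps the actual content at the end of the sequence, which matters for RNNs: the most recent inputs the network sees are real tokens rather than padding.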

Model Preparation

Finally, training can begin! Again, for the sake of simplicity I’ll be using Keras, a TensorFlow wrapper, to build our deep learning models. Our RNN-LSTM in Keras couldn’t be much simpler:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential()
# embedding layer assumed: 8000-word vocab + padding flag, length-140 sequences
model.add(Embedding(8001, 128, input_length=140))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

As for the Keras CNN:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Flatten, Dropout, Dense

model = Sequential()
# embedding layer assumed: 8000-word vocab + padding flag, length-140 sequences
model.add(Embedding(8001, 128, input_length=140))
model.add(Conv1D(64, 3, padding='same'))   # Convolution1D(..., border_mode='same') in older Keras
model.add(Conv1D(32, 3, padding='same'))
model.add(Conv1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

The ease of use when it comes to Keras is off the charts 👍👍👍.

Results

After running each model for 10 epochs the accuracy for each setup is as follows:

+------------------+----------+--------+
| | RNN-LSTM | CNN |
+------------------+----------+--------+
| Character Level | 0.8174 | 0.7902 |
| Word Level (BoW) | 0.8296 | 0.8163 |
+------------------+----------+--------+

Conclusion

Although this is just one heavily simplified example, it is interesting to see the differences in accuracy across vectorization schemes and deep learning architectures. While the RNN-LSTM took longer to train, it performed better in both cases. The CNN + BoW combination, on the other hand, trained quickly and still put up solid numbers. With a little tweaking, any one of these models could become the best-performing solution. It all depends on the architecture you want running in the background: do you want a lookup table of tens of thousands of individual words, or a smaller dictionary of a hundred or so characters?
