Published in The Startup

Getting Sentimental To A Degree

by Ali Sayyed, Brett Nesfeder, Callie Gilmore, Siddhant Chauhan, and Samir Epili


The purpose of this project is to predict the keywords and phrases that portray the sentiment of a given tweet. We aim to accomplish this by using CountVectorizer and BERT-based models. Success is measured with Jaccard similarity scores, which show how closely our predicted text matches the actual text. This project builds on concepts of text analysis while adding a predictive component on structured and unstructured data.

Introduction & Background

Most businesses these days have an online presence; it has become unavoidable for companies that want to remain competitive and relevant. It can be difficult, however, to understand how a company's online platforms are perceived by users. This matters because brand image can make or break a company.

Positive and negative online interactions can both increase brand awareness, for better or worse. A brand can go “viral”, boosting its reputation, or be “canceled”, which has the potential to ruin a company.

By understanding how businesses are perceived online, we can make recommendations for how companies should use their online presence.

Data Collection & Description

The data was imported from a Kaggle competition in which the goal was to predict the word or phrase that reflected the sentiment of a given tweet. We worked with two datasets, a training set and a testing set. The training dataset contained 27,482 rows and the testing dataset had 3,535 rows. Both had columns for the TextID, the extracted tweet, and the sentiment associated with the tweet. The training dataset also contained a column for the part of the tweet that reflected the sentiment. The testing dataset did not, as the goal was to predict the text that portrays the sentiment of the tweet and then compare that to the actual selected text.

Data Pre-Processing & Exploration

Data preprocessing was slightly different for each of the models that we ran, so the preprocessing steps are discussed in each model's section.

Wanting to learn more about our data, we created a few visualizations for exploratory analysis. This first visualization displays the distribution of sentiment for the tweets in our dataset. Most tweets have a neutral sentiment, followed by positive tweets and finally negative tweets. This distribution is logical, as many people simply tweet to tweet with no sentiment intended. It is also encouraging that there are more positive tweets than negative tweets.

These next visualizations show the most common words with respect to the sentiment of the tweet. This information was good to keep in mind when we began predicting the text that reflected the sentiment of the tweet as it gave us an idea of which words we might see most.

We were also interested in seeing which words were unique to each sentiment and accomplished this by creating plots for each sentiment. A takeaway from these plots was that we now knew which words appeared strictly with a certain sentiment. Therefore, if we saw these words appearing with another sentiment, we knew we had made a mistake.

Our last group of plots created for exploratory analysis simply gave us another way to examine and understand our data. By looking at the WordClouds, we got a better idea of which words would appear most often with a certain sentiment.

Learning & Modeling

Count Vectorizer

For our first model, we decided to try an alteration of Naive Bayes using CountVectorizer. We chose this model first because it was a relatively simple, manual, and straightforward process that would help us understand the data and the problem better. The idea for this model was to predict the text that portrays the given sentiment using only the word counts from the CountVectorizer. So, we found the weights of the words used most often in tweets of each type of sentiment and calculated a score based on those weights for subsets of each tweet. For positive and negative tweets, the predicted text was the subset of the tweet that had the highest score. For neutral tweets, the model returns the entire tweet.

To start, we removed null values, cleaned and lemmatized the text, and removed stop words. However, it is important to note that we did not remove words like ‘no’ or ‘not’. Even though they are considered stop words, they also project a negative sentiment, so it is important that they be kept. We then split the train set further into a training set and a validation set, so that we could train the model on the training set and evaluate it on the validation set.
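As a rough illustration of the stop word step, here is a minimal sketch in plain Python. The stop word list is a deliberately tiny, hypothetical one, the `clean_tweet` helper is our own name for illustration, and lemmatization is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real run would use a
# full list from a library such as NLTK or scikit-learn.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "i", "no", "not"}

# Negations carry sentiment, so they are excluded from the removal set.
NEGATIONS = {"no", "not"}
REMOVABLE = STOP_WORDS - NEGATIONS


def clean_tweet(text):
    """Lowercase, strip URLs and non-letters, and drop removable stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace
    return [tok for tok in text.split() if tok not in REMOVABLE]


print(clean_tweet("It is NOT a good day, see http://example.com"))
# ['not', 'good', 'day', 'see']
```

The word "not" survives the filter even though it sits in the stop word list, which is the behavior described above.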

Next, we separated the different sentiments into their own data frames so we then had 3 data frames: tweets with a positive sentiment, tweets with negative sentiment, and tweets with a neutral sentiment. From there, we ran the CountVectorizer function on all 3 data frames.

The CountVectorizer function turns these tweets into a matrix of token counts. From this matrix, we created a dictionary of the known words in the tweets of each sentiment, with the values holding the calculated weight of each word. However, many words are used in tweets of all sentiments, so we adjusted the dictionary values by subtracting the proportion of tweets in the other sentiments that use that word.
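The weighting idea can be sketched as follows. This is an illustration under assumptions, not our exact notebook code: a plain `Counter` stands in for the CountVectorizer matrix so the example runs without dependencies, the toy tweets are made up, and we assume a word's weight is the proportion of tweets containing it, with the proportions from the other two sentiments summed before subtracting.

```python
from collections import Counter

# Hypothetical toy data: cleaned tweets grouped by sentiment.
tweets = {
    "positive": ["love this sunny day", "love my new phone"],
    "negative": ["hate this rainy day", "my phone broke"],
    "neutral": ["going to work this day", "new phone arrived"],
}


def word_proportions(docs):
    """Proportion of tweets in `docs` that contain each word."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # count each word once per tweet
    return {word: n / len(docs) for word, n in counts.items()}


proportions = {s: word_proportions(docs) for s, docs in tweets.items()}


def adjusted_weights(sentiment):
    """Own-sentiment proportion minus the proportions in the other two."""
    others = [s for s in proportions if s != sentiment]
    return {
        word: p - sum(proportions[o].get(word, 0.0) for o in others)
        for word, p in proportions[sentiment].items()
    }


pos_weights = adjusted_weights("positive")
print(pos_weights["love"])  # 1.0: appears only in positive tweets
print(pos_weights["day"])   # -0.5: shared with the other sentiments
```

Words that show up everywhere end up with low or negative weights, so they contribute little when scoring candidate phrases.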

To make our predictions, we searched each tweet for the phrases or keywords that have the highest scores from our dictionaries. The phrase or word with the highest score becomes our predicted text for positive and negative tweets. For neutral tweets, we return the entire tweet.
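A minimal sketch of this search, assuming the candidate subsets are contiguous runs of words and that a subset's score is the sum of its word weights. The `best_subset` name and the weights are hypothetical values chosen for illustration.

```python
# Hypothetical weights for positive tweets; words not listed score 0.
pos_weights = {
    "love": 0.75, "great": 0.5,
    "day": -0.25, "the": -0.5, "was": -0.125,
}


def best_subset(tweet, weights, sentiment):
    """Return the contiguous run of words with the highest summed weight."""
    if sentiment == "neutral":
        return tweet  # neutral tweets are returned whole

    words = tweet.split()
    best_score, best_span = float("-inf"), words
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            span = words[start:end]
            score = sum(weights.get(w, 0.0) for w in span)
            if score > best_score:  # strict ">" keeps the first best span
                best_score, best_span = score, span
    return " ".join(best_span)


print(best_subset("the day was great", pos_weights, "positive"))  # great
print(best_subset("the day was great", pos_weights, "neutral"))   # whole tweet
```

Checking every contiguous span is quadratic in the number of words, which is affordable for text as short as a tweet.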

To test the accuracy of this model, we compared our predicted text with the actual selected text using a Jaccard similarity score. This model performed well on the validation set, with a score of 0.66597.
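For reference, a word-level Jaccard similarity can be sketched as below: the number of shared words divided by the number of distinct words across both strings. Treating two empty strings as a perfect match is our own assumption here, and the competition's scoring function may handle that edge case differently.

```python
def jaccard(predicted, actual):
    """Word-level Jaccard similarity between two strings."""
    a = set(predicted.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as a perfect match
    overlap = a & b
    return len(overlap) / (len(a) + len(b) - len(overlap))


print(jaccard("so happy today", "happy today"))  # 2 shared / 3 distinct words
print(jaccard("great", "awful"))                 # 0.0
```

The overall score for a set of tweets is then the average of the per-tweet values.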

Lastly, we applied the model to the original test set to come up with the predicted selected text for that set of data. A Jaccard similarity score cannot be computed for the test set because it does not have the actual selected text. So, by applying it to the test set, we are not testing the model but rather using it in practice to display how this could be applied to other sets of data to determine the important keywords or phrases in the tweets.


Let’s start by first explaining what the BERT model is composed of. BERT is built on transformers, which at their core are based on attention.

Attention works by learning the contextual relationships between the different words in a sentence.

Most other models, such as RNN and LSTM models, work by taking in each input sequentially, that is, going from left to right or from right to left. BERT models are non-directional: they read the whole sentence as the input at once instead of processing it in sequential order. This is where the contextual learning occurs, as each word is understood with respect to the whole input sentence.
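To make the idea of attention concrete, here is a small sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the embeddings are made up, and the learned query, key, and value projections that real transformers apply are omitted, so every word vector serves as its own query, key, and value.

```python
import math


def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this word to every word in the sentence, all at once.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Each output is a weighted mix of all word vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-word tweet.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Every word is compared with every other word in one pass, with no left-to-right ordering, which is the contrast with RNN and LSTM models drawn above.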

BERT models are pre-trained on two tasks:

  • Masked language model
  • Next Sentence Prediction

Masked language models work by performing a fill-in-the-blank task. The model uses the context words surrounding a MASK token to try to predict the masked word.

Next Sentence Prediction takes pairs of sentences as input, and the model learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document.

In our context, we used models pre-trained on the above tasks for our implementation. The key point is that each word in the tweet was tokenized into word pieces, so a single word could be split into more than one subword.

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types — word, character, and subword tokenization.

For BERT, we used a WordPiece tokenizer, which essentially starts from individual characters as tokens and merges them when the merge maximizes the likelihood of the training data once the new token is added to the vocabulary. In other words, if a sequence appears together more frequently than its constituents appear separately, relative to all other symbol pairs, it is added to the tokenizer's vocabulary.

For example, agg is merged into a single token when agg appears frequently enough in the training data, relative to how often its letters appear separately, to be representative of the vocabulary distribution.
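The merge rule can be sketched on a toy corpus. This follows the way WordPiece training is commonly described, where each adjacent pair is scored as the pair's frequency divided by the product of its parts' frequencies. The corpus, the `best_merge` helper, and the exact scoring formula are assumptions for illustration, not the tokenizer's actual training code.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency.
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def split_word(word):
    """Split into characters; non-initial pieces carry the '##' prefix."""
    return [word[0]] + ["##" + ch for ch in word[1:]]


def best_merge(corpus):
    """Score every adjacent pair and return the highest-scoring one."""
    symbol_freq, pair_freq = Counter(), Counter()
    for word, freq in corpus.items():
        pieces = split_word(word)
        for piece in pieces:
            symbol_freq[piece] += freq
        for left, right in zip(pieces, pieces[1:]):
            pair_freq[(left, right)] += freq

    # A pair scores highly when it occurs often relative to its parts.
    scores = {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return max(scores, key=scores.get), scores


pair, scores = best_merge(corpus)
merged = pair[0] + pair[1][2:]  # drop the '##' of the second piece
print(pair, "->", merged)       # ('##g', '##s') -> ##gs
```

Note that the most frequent pair does not win here: "##u" followed by "##g" occurs more often, but both of its parts are common on their own, so the rarer and more specific pair is merged first.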

We loaded the pre-trained transformer models from the Hugging Face Transformers library and applied them to our tweet data.

Next, we wanted to visualize the attention heads. In this visualization, each column takes a tweet as input: the left column shows the selected text and the right column shows the predicted text from our dataset. BERT processes all the words of the tweet in parallel through its 12 layers in a single run. Here is a walkthrough explaining the attention head views.

Using transfer learning, we used the pre-trained BERT-base uncased model, which has 12 layers, a hidden size of 768, and 12 attention heads. We chose the uncased model because our data set was entirely lower-case. Using TensorFlow, two additional layers (dense and softmax) were added after the BERT model for our specialized QA task. The softmax layer is essential for turning the output logits into a probability, for each token, of being the start or the end of the selected text. The model weights were then updated using the Adam optimizer to reduce the error between the true and predicted probability distributions of the start and end tokens. Additionally, the learning rate, number of epochs, and batch size were set to 3e-5, 3, and 64, respectively.
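To show how start and end predictions become a selected phrase, here is a small sketch in plain Python. The logits are made-up numbers standing in for the model's output, and `decode_span` with its fallback rule is a hypothetical helper, not our TensorFlow code.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_span(tokens, start_logits, end_logits):
    """Pick the most probable start and end tokens and return the span."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start = max(range(len(tokens)), key=start_probs.__getitem__)
    end = max(range(len(tokens)), key=end_probs.__getitem__)
    if end < start:
        end = start  # assumption: fall back to a single token
    return " ".join(tokens[start:end + 1])


# Hypothetical tokens and logits for one positive tweet.
tokens = ["my", "day", "was", "really", "great"]
start_logits = [0.1, 0.2, 0.3, 2.5, 1.0]
end_logits = [0.0, 0.1, 0.2, 0.9, 3.0]
print(decode_span(tokens, start_logits, end_logits))  # really great
```

During training, the two probability distributions are compared with the true start and end positions, and that error is what the optimizer reduces.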


Our next model built on the capabilities of BERT: roBERTa, the Robustly Optimized BERT Pre-training Approach. roBERTa iterates on BERT's training procedure by training the model for longer, with bigger batches over more data, and on longer sequences, using a dynamic masking pattern.

For example, BERT was pre-trained on 16 gigabytes of text, while roBERTa was pre-trained on 161 gigabytes. The architecture of the roBERTa large model is similar to that of BERT but roughly double the depth of BERT-base, with 24 layers, a hidden size of 1024, and 16 attention heads. The learning rate and the number of epochs were kept the same as for our BERT model in this run; however, a smaller batch size of 32 was used to make more efficient use of GPU memory. The dense layer used with BERT was replaced with a 1D convolutional layer, as convolving over the length of a tweet can capture contextual patterns that may occur in similar tweets. Another insight is that the size of the roBERTa model (base vs. large) does not make a significant difference in performance with these specific parameters. This model could be improved through further experimentation with the QA answer heads, using a roBERTa model pre-trained for QA, or changing hyperparameters (e.g., epochs, learning rate).
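As a rough picture of what convolving over the length of a tweet means, here is a plain-Python sketch of a 1D convolution that slides a small window along the token axis and emits one value per token. The `conv1d_same` helper, the zero padding, and the numbers are assumptions for illustration; the real layer is a learned convolution applied to the model's hidden states.

```python
def conv1d_same(sequence, kernel, bias=0.0):
    """1D convolution over the token axis with zero ("same") padding.

    sequence: per-token feature vectors, shape [tokens][features]
    kernel:   weight vectors, shape [width][features], odd width assumed
    Returns one value per token.
    """
    width = len(kernel)
    pad = width // 2
    zero = [0.0] * len(sequence[0])
    padded = [zero] * pad + sequence + [zero] * pad
    out = []
    for i in range(len(sequence)):
        window = padded[i:i + width]  # this token and its neighbors
        total = bias
        for vec, weights in zip(window, kernel):
            total += sum(x * w for x, w in zip(vec, weights))
        out.append(total)
    return out


# Hypothetical one-feature "hidden states" for a four-token tweet.
hidden = [[1.0], [2.0], [3.0], [4.0]]
kernel = [[1.0], [1.0], [1.0]]  # window of width 3
print(conv1d_same(hidden, kernel))  # [3.0, 6.0, 9.0, 7.0]
```

Each output value depends on a token together with its neighbors, which is how the layer can pick up on short local patterns in the tweet.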

More information about roBERTa can be found here.

Results & Conclusion

Model Scores

We ran a total of three different models.

  • Count Vectorizer with a Jaccard score of 0.6660
  • BERT with a Jaccard score of 0.6020
  • roBERTa with a Jaccard score of 0.6865

Business Applications

You might be wondering, “How are companies using models like BERT to enhance their businesses?” and you’d be surprised by the answer. BERT models have been used by Google to enhance its search engine performance. Another company, Facebook, created roBERTa and has applied it to content moderation. More specifically, the algorithm was used to build statistical representations of what hate speech or bullying looks like in any language. BERT also performs well when implemented in chatbots, and Wayfair is a company that is doing exactly that. They use the model to extract information from product reviews and surveys to understand their customers better.

This makes one thing clear: natural language processing models will play a major role in how companies improve what they do for their customers.

To wrap things up, we were able to obtain insights into the correlation between keywords or phrases and their predicted sentiments, using our CountVectorizer model and the manual dictionary we built. Moreover, our models were able to identify which words and phrases in a tweet display its overall sentiment.

We recognize that our model is most useful when the sentiment has already been identified. In the future, we would like to take this one step further by scraping custom Twitter data with the Twitter API and then using our roBERTa models to first classify the tweets and then identify the sentimentally charged phrases.


