Predicting the sentiment of a tweet using NLP and Classification techniques

Published in Patrick’s notes · 9 min read · Jan 21, 2022

While many people have a love-hate relationship with Twitter, what we can all agree on is that it represents one of the best sources for understanding public opinion and sentiment. It is therefore no surprise to see growing levels of academic research applying natural language processing (NLP) techniques to data obtained from the platform to gather crucial insights across a variety of domains (commercial, government etc.). For this article, we use a Coronavirus Twitter dataset obtained from Kaggle and various classification techniques to predict the sentiment of a tweet. This tutorial explains the techniques used and how they can be implemented using various sklearn packages. If you want to support my content, please don’t forget to subscribe to Patrick’s notes for future updates.

*Figure 1: NLP and Twitter go hand in hand*

The full code and data repository can be obtained from my GitHub profile using this link. The dataset we use contains two CSV files that have been read into Pandas. The data has been pulled from Twitter and relates to COVID, with the sentiment of each tweet manually tagged. There are five sentiment classification tags for this dataset: extremely negative, negative, neutral, positive and extremely positive.

Figure 2 shows that the dataset is imbalanced, which we should correct for. If you want to understand imbalanced datasets in more detail, see my previous article. We need an equal number of instances in each sentiment class. As there is already a large number of instances in the dataset, we use random undersampling so that positive, negative, neutral and extremely positive have the same number of instances as extremely negative. The code for this is shown below as well as the updated plot:
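As a minimal sketch of random undersampling in plain Python (not the Pandas code from the notebook), the idea is to cut every class down to the size of the rarest one. The function name `undersample`, the toy tweets and the seed are all assumptions made for illustration:

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_of, seed=42):
    """Randomly drop rows so every class is as small as the rarest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    # Size of the rarest class: every other class is cut down to this.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy (text, sentiment) pairs with deliberately uneven class sizes.
tweets = (
    [("good news", "positive")] * 5
    + [("bad news", "negative")] * 4
    + [("terrible", "extremely negative")] * 2
)

balanced = undersample(tweets, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
print(counts)  # each sentiment now has 2 instances
```

Fixing the seed keeps the sample reproducible between runs, which makes the later accuracy figures easier to compare.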

*Figure 3: Plot of tweets by sentiment after undersampling*

Text preprocessing

Machine learning modelling is most effective on text data that has been processed beforehand. Firstly, any stopwords are removed from the text data. These are words that are considered unnecessary for machine learning modelling because they carry no distinctive meaning. For example, words such as “the”, “and” and “sometime” are removed from the text because they are part of the basic list provided by spaCy. If you want to learn more about stopwords then I recommend this website as a good starting point.

Although the basic stopwords are removed, we review the top words again to check whether there are any obvious words that should also be omitted. This second look shows that “I” should be added to the stopword list and the text reprocessed. The code and comments are in the Jupyter notebook on GitHub.

The other processing technique we use is passing the text through the tweet-preprocessor package. This supports cleaning and parsing of:

· URLs

· Hashtags

· Mentions

· Reserved words (RT, FAV)

· Emojis

· Smileys

Once this is completed, we remove any other punctuation from the text using the str.replace functionality.
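The cleaning steps above can be roughly sketched with the standard library alone. This is not the tweet-preprocessor or spaCy code from the notebook: the regular expressions, the tiny `STOPWORDS` set and the function name `clean_tweet` are assumptions for illustration, and emojis and smileys are not handled here.

```python
import re
import string

# Illustrative stopword set only; spaCy's list is much larger.
STOPWORDS = {"the", "and", "sometime", "i", "a", "is", "to"}


def clean_tweet(text):
    """Strip URLs, mentions, hashtags, reserved words, punctuation and stopwords."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)         # mentions and hashtags
    text = re.sub(r"\b(?:RT|FAV)\b", " ", text)  # reserved words
    # Remove any remaining punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)


example = "RT @who: I think the lockdown is working! #covid19 https://t.co/abc123"
cleaned = clean_tweet(example)
print(cleaned)  # think lockdown working
```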

Text vectorization

Text vectorization allows us to represent each tweet in a vectorized form (see this linear algebra refresher if you need a reminder on vectors), where each word in a tweet is represented by a value. The value that the word takes depends on how important it is deemed to be to the machine learning modelling process. In this example, each tweet is converted into a vector with each word value represented by its “term frequency-inverse document frequency” (TF-IDF).

The theory behind TF-IDF?

We can best explain TF-IDF by assessing both components separately. Firstly, term frequency is the number of times a word appears in a document/tweet divided by the number of words in that document/tweet. This normalizes for document length: the longer a document/tweet is, the more often a word is likely to occur in it. We don’t use term frequency alone to vectorize text, as common words in the corpus such as “coronavirus” would then have a large impact on the model, even though these words are likely to tell us very little about sentiment given that all the tweets are about COVID-19.

As a result, the term frequency is multiplied by the inverse document frequency. Document frequency is the number of documents/tweets that the word is present in as a proportion of the total. The formula for the inverse document frequency is shown below:
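One common textbook form, assuming N is the total number of documents/tweets and df(t) is the number of them that contain the term t (the choice of logarithm is also an assumption), is:

```latex
\mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t)}\right)
```

Note that scikit-learn’s TfidfVectorizer uses a smoothed variant by default, ln((1 + N) / (1 + df(t))) + 1, so its values differ slightly from this form.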

Therefore, the overall TF-IDF equation can be shown as follows:
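Using the definitions given in the text, and assuming the plain logarithmic form of the inverse document frequency (with N the total number of documents/tweets and df(t) the number containing term t), the TF-IDF of a term t in a document/tweet d can be written as:

```latex
\mathrm{tfidf}(t, d)
  = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
  = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}
    \times \log\left(\frac{N}{\mathrm{df}(t)}\right)
```

A word that appears in every tweet has df(t) = N, so its inverse document frequency, and therefore its TF-IDF, is zero under this form.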

When implementing this in Python, we first extract the text and sentiment out of Pandas as these are the only two data columns that we are using. Then we apply the scikit-learn package TfidfVectorizer to convert the text into vectorized form. The comments and code for how to do this are shown below:
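A minimal sketch of this step, assuming the text and sentiment have already been pulled out into plain Python lists (the toy tweets and variable names below are illustrative, not the notebook’s):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the text and sentiment columns.
train_texts = [
    "stock up on food",
    "stay safe at home",
    "prices are rising",
    "great support from stores",
]
train_labels = ["Negative", "Positive", "Negative", "Positive"]
test_texts = ["food prices rising"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training text only...
X_train = vectorizer.fit_transform(train_texts)
# ...then reuse them to vectorize the test text.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Fitting on the training text only and calling `transform` on the test text keeps information from the test set out of the vocabulary and IDF weights.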

Now onto the actual classification

Once the text has been vectorized, we use two notable classification techniques, naïve Bayes and support vector machines, and judge the effectiveness of each on the test dataset.

A brief explanation of naïve bayes

Naïve Bayes is a fairly rudimentary classification technique. It is based on Bayes’ theorem, which gives the following formula:
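In its standard form, for two events A and B, Bayes’ theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```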

Terminology alert: the components of the equation above are as follows:

· P(A|B) = probability of A occurring given B has occurred

· P(B|A) = probability of B occurring given A has occurred

· P(A) = prior probability of A occurring

· P(B) = prior probability of B occurring

For this problem, we are looking to establish:
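Substituting the sentiment for A and the words in the tweet for B in Bayes’ theorem, and using the same terms that are discussed in the following paragraphs, this can be written as:

```latex
P(\text{Sentiment} \mid \text{words in tweet})
  = \frac{P(\text{words in tweet} \mid \text{Sentiment})\; P(\text{Sentiment})}
         {P(\text{words in tweet})}
```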

What the equation above essentially states is that we are trying to work out the probability of each sentiment given the words in that specific tweet, which can be calculated using the Bayes formula on the right-hand side. The sentiment with the highest probability is then assigned to the tweet.

So, we will look at each of the terms on the right-hand side in turn:

P(Sentiment) is the prior probability of the tweet being classified by each sentiment. For example, the probability of a tweet being a negative sentiment is the number of tweets that have a negative sentiment divided by the total number of tweets.

P(words in tweet) is the prior probability of the words being in that tweet. So how does this work in practice? Say the tweet we are classifying contains the words “overall I liked the movie”. There is a high likelihood that this exact phrase did not occur in our training dataset. This is why the model is described as naïve: we simply treat the words independently, so:
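Under the independence treatment described here, the probability of the example tweet factorizes into a product of per-word probabilities (a reconstruction from the text, using the example sentence):

```latex
P(\text{overall I liked the movie})
  = P(\text{overall}) \cdot P(\text{I}) \cdot P(\text{liked})
    \cdot P(\text{the}) \cdot P(\text{movie})
```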

Finally, we have P(Words in tweet|Sentiment), which means the probability of the words in that tweet given a certain sentiment. Like above, we can assume that each word is independent, so for example if we had the sentence “overall I liked the movie”, we can work out the probability of the words for each sentiment (positive sentiment is given in the equation below) like so:
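With the same independence assumption, now conditioned on the sentiment, the positive case for the example sentence can be written as (again a reconstruction from the text):

```latex
P(\text{overall I liked the movie} \mid \text{positive})
  = P(\text{overall} \mid \text{positive}) \cdot P(\text{I} \mid \text{positive})
    \cdot P(\text{liked} \mid \text{positive}) \cdot P(\text{the} \mid \text{positive})
    \cdot P(\text{movie} \mid \text{positive})
```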

The difference between the prior probabilities and P(Words in tweet|Sentiment) is that the latter assumes a Gaussian distribution, based on the mean and standard deviation that scikit-learn estimates during the training process.

Gaussian distribution

While this gives the basic theory, the actual approach used in training assumes a Gaussian function. Gaussian functions are used to represent the probability distribution of a normally distributed random variable based on its mean and variance. Figure 3 below shows an example of a normal distribution.

*Figure 3: example of a normal distribution*

The actual equation for the gaussian function is as follows:
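In its standard form, with mean μ and standard deviation σ, the Gaussian probability density function is:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}}
       \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```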

How is this implemented?

Rather than build our own function (although this may be a topic for a future article), we can use the functionality available in scikit-learn. The code and comments are displayed below.
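A minimal sketch of how this could look with scikit-learn’s `GaussianNB`, assuming TF-IDF features and toy data (the tweets and variable names are illustrative, not the notebook’s). `GaussianNB` expects dense arrays, so the sparse TF-IDF matrices are converted with `toarray()`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the training and test tweets.
train_texts = [
    "great support from stores",
    "stay safe and healthy",
    "prices are rising fast",
    "panic buying is awful",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]
test_texts = ["stores are great", "awful panic"]
test_labels = ["Positive", "Negative"]

vectorizer = TfidfVectorizer()
# GaussianNB needs dense input, hence toarray().
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

model = GaussianNB()
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```

On the full dataset, converting to a dense array can use a lot of memory, which is one practical cost of this approach.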

After all that, the naïve Bayes method only achieves an accuracy of circa 30%, which is pretty useless. This highlights that, while simple, the naivety of the Bayes classification method means it often has limited effect in complex situations like sentiment classification of textual data.

Note on validation

We did not use a validation approach for this article, but I would recommend a full cross-validation approach for a more rigorous solution to this problem.
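To illustrate the mechanics of k-fold cross-validation, here is a small plain-Python sketch that splits sample indices into folds. The function name `k_fold_indices` is an assumption; in practice you would likely shuffle or stratify the data first and use scikit-learn’s own helpers such as `cross_val_score`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample is used for validation exactly once, and the model would be retrained on the remaining samples for every fold.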

Support vector machines

Given the poor accuracy from the naïve bayes solution, we use the more complex support vector machine method as an alternative. Support vector machines use the training data to find a hyperplane that separates different instances (i.e. data points) into their respective classes in a way that maximizes the margin between each class and the hyperplane. Figure 4 shows a simple support vector machine example where H3 demonstrates the largest margin separation between both of the classes.

*Figure 4: H3 maximizes the margin between the two classes*

SVM in non-linearly separable cases

In reality, datasets are very rarely linearly separable like the example above in figure 4. However, if the data is transformed into a higher-dimensional space (of dimension n), it can often be separated using a hyperplane of dimension n-1. This can be achieved using what is known as the kernel trick.

What the kernel trick does is take the dataset’s existing features, apply some transformations and create new features. There are a number of kernels possible such as ‘linear’, ‘poly’ and ‘sigmoid’. This article won’t go into depth on support vector machines but this article explains the technique in much more detail.
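The idea of separating data by moving to a higher dimension can be shown with a toy example. This sketch builds the new feature explicitly (x squared), whereas a real kernel achieves the same effect implicitly without constructing the features; the data and the helper `separable_by_threshold` are assumptions for illustration.

```python
def separable_by_threshold(values, labels):
    """True if a single cut-off on `values` puts each class on its own side."""
    class_0 = [v for v, y in zip(values, labels) if y == 0]
    class_1 = [v for v, y in zip(values, labels) if y == 1]
    return max(class_0) < min(class_1) or max(class_1) < min(class_0)


# Class 1 sits on both sides of class 0, so no single cut-off on x works.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, 0, 0, 0, 1, 1]

# Transformed feature: x squared pushes class 1 above class 0.
squared = [x ** 2 for x in xs]

print(separable_by_threshold(xs, labels))       # False
print(separable_by_threshold(squared, labels))  # True
```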

Multiclass classification using support vector machines

For multiclass classification, the same principle is used after breaking the multiclass problem down into multiple binary classification problems. This requires the dataset to be transformed into a high enough dimensionality so that the separation shown in the figure below can be achieved between each pair of classes.

*Figure 5: multi-class classification (source:* *https://www.baeldung.com/cs/svm-multiclass-classification)*
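Two common ways of breaking a multiclass problem into binary ones are one-vs-one and one-vs-rest. The article does not say which was used, so the sketch below simply enumerates both for the five sentiment tags; scikit-learn’s `SVC` handles multiclass problems with a one-vs-one scheme internally.

```python
from itertools import combinations

classes = [
    "extremely negative",
    "negative",
    "neutral",
    "positive",
    "extremely positive",
]

# One-vs-one: one binary classifier per pair of classes.
one_vs_one = list(combinations(classes, 2))

# One-vs-rest: one binary classifier per class against all the others.
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

print(len(one_vs_one))   # 10 pairwise problems
print(len(one_vs_rest))  # 5 one-against-the-rest problems
```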

How can this be implemented using sklearn?

Thankfully again, we can implement this solution using sklearn as opposed to building a full solution from scratch. The code and comments for training the model are shown below.
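A minimal sketch of training and scoring a support vector machine with scikit-learn, assuming TF-IDF features and toy data. The linear kernel, the tweets and the variable names are assumptions for illustration; the notebook may use different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Toy stand-ins for the training and test tweets.
train_texts = [
    "great support from stores",
    "stay safe and healthy",
    "prices are rising fast",
    "panic buying is awful",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]
test_texts = ["stores are great", "awful panic"]
test_labels = ["Positive", "Negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# The kernel choice is an assumption; SVC accepts sparse TF-IDF input directly.
model = SVC(kernel="linear")
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```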

Accuracy when compared on test set

The accuracy for this solution is a much more respectable 54%, which is fairly good given that it is a multi-class classification problem and we used simple packages. However, as a more complex model than naïve Bayes, the support vector machine is likely to generalise less well. You can learn more about the bias-variance tradeoff problem that plagues machine learning using this link.

Closing remarks

This article covers some of the techniques and theory that can be used to classify textual information. However, the accuracy is still less than spectacular. Other solutions that have been applied to this dataset with much higher performance include:

· Gated recurrent units (a type of neural network)

· Convolutional neural networks

· Long short-term memory

If you want to support the generation of content like this then please subscribe to me and Patrick’s notes.

References

https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496

https://medium.com/axons/natural-language-understanding-with-svms-87f1b8a63ea0

https://www.baeldung.com/cs/svm-multiclass-classification