
Can we understand sentiment from just 140 characters?

Sentiment analysis on Twitter data

When Twitter was first created in 2006, co-founder Jack Dorsey envisioned an SMS-based communications service for small groups of friends or colleagues. Now, 15 years later, in 2021, it has grown into an “information network” larger than anybody could have imagined. Used by individuals, celebrities, brands, political figures, and governments alike, it is estimated that 500 million Tweets are posted every day.

Until 2017, Twitter users were limited to expressing their thoughts and sharing their news in just 140 characters. Those 140 characters also had to cover links, images, hashtags, and tagged user handles. Just how much can you say with 140 characters, and is it enough to convey emotion?

Understanding a user’s emotion and feelings — or “sentiment” — is a common problem in today’s world. With increased use of the internet, we communicate via text more and more. The absence of facial expressions and body language can make it harder to infer the tone and feelings behind messages. Consider the following example:

“Wow, I just really love the rain sometimes.”

We might think this is a positive statement from someone who loves the wet weather. But others may detect sarcasm in the statement and assume that the writer has disdain for the rain.

This is the kind of problem that sentiment analysis tries to tackle. Sentiment analysis attempts to understand contextual and subjective information in text data by using natural language processing techniques. It has a huge range of business applications, from monitoring customer attitude towards a brand, to identifying issues with a product or service, to conducting market research.

Now more than ever, customers take to social media platforms like Twitter to interact with the brands and companies they use most, be it to report an issue, raise a complaint, or praise them for a great product. It can be beneficial for brands to quickly understand their customers’ attitudes towards them so they can identify areas of improvement.

Today, we will use machine learning to train a model that is able to identify the likely sentiment of Tweets and sort them into one of two classes — positive or negative. We will analyse the factors that led to the classification to understand what drives sentiment in social media data. Finally, we will discuss how the results could be utilised by a business.

Exploring and pre-processing the data

The data we will use was kindly provided by the user kazanova on Kaggle. The data consists of 1.6 million Tweets from 2009, extracted via the Twitter API. We are interested in only two fields from the data: the text of the Tweet (which will give us our features), and the sentiment of the Tweet (which will be our target). Helpfully, the data is exactly balanced, with 800k Tweets each for negative and positive sentiment.
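For anyone who wants to follow along, a rough sketch of loading the file with pandas might look like the snippet below. The file name, the hand-supplied column names, and the 0/4 label encoding are assumptions about the Kaggle download rather than code taken from this project.

    import pandas as pd

    # The Sentiment140 CSV has no header row, so column names are supplied by hand.
    cols = ["target", "id", "date", "flag", "user", "text"]
    df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                     encoding="latin-1", names=cols)

    # Keep only the two fields we care about: the Tweet text and the sentiment label.
    # In the raw file the target is encoded as 0 = negative and 4 = positive.
    df = df[["text", "target"]]
    df["target"] = df["target"].map({0: "negative", 4: "positive"})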

Let’s have a look at the data:

By nature, text data from social media can be messy. We often find short-hand language, slang, emoji, and inconsistent spelling, grammar, and punctuation. Twitter itself has a few additional nuances:

  • Users can reply to one another, meaning we see examples of users tagging each other’s Twitter handles (seen in rows 2, 5, and 9 in the table above)
  • Twitter users can utilise hashtags to increase the discoverability of their Tweets (seen in row 9 in the table above)
  • Photos and other media are often present in Tweets as links (seen in row 2 in the table above)

Before we embark on building a machine learning model, we will perform some steps to initially “clean” the data:

  • Replace emoji with placeholders, e.g. replacing “:D” with “happyface” and “:(” with “sadface”
  • Replace links with a placeholder of “urlplaceholder”
  • Replace hashtags with a placeholder of “hashtagplaceholder”
  • Replace user handles with a placeholder of “replyplaceholder”
  • Convert all text to lowercase

By using placeholders, we can normalize the data without losing the potentially important insight that an emoji, reply, hashtag, or URL existed in the original text. Emoji in particular could be valuable in determining sentiment.
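As an illustration, a simplified version of this cleaning step could be written with regular expressions roughly as follows. The emoji map here is a tiny illustrative subset, not the full list used in the project.

    import re

    # Illustrative subset of emoji-to-placeholder mappings.
    EMOJI_MAP = {":)": "happyface", ":D": "happyface", ":(": "sadface"}

    def clean_tweet(text):
        # Replace emoji with sentiment-preserving placeholders.
        for emoji, placeholder in EMOJI_MAP.items():
            text = text.replace(emoji, f" {placeholder} ")
        # Replace links, hashtags, and user handles with placeholders.
        text = re.sub(r"http\S+|www\.\S+", " urlplaceholder ", text)
        text = re.sub(r"#\w+", " hashtagplaceholder ", text)
        text = re.sub(r"@\w+", " replyplaceholder ", text)
        # Lowercase everything and collapse any extra whitespace.
        return re.sub(r"\s+", " ", text.lower()).strip()

    clean_tweet("@friend loving this! http://example.com/pic #sunshine :D")
    # -> 'replyplaceholder loving this! urlplaceholder hashtagplaceholder happyface'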

After our initial round of cleaning, our Tweets look as follows:

Now we’ve cleaned our data, we will perform a few more pre-processing steps.

Negations and stop word removal

A common step of natural language processing is the removal of “stop words”, which are common words considered to not add any information to text, e.g. “and”, “an”, and “the”. Negation words like “no” and “not” are often removed as part of this process.

In our Tweets data, we add a step to expand contracted negation words to their full form, e.g. expanding “couldn’t” to “could not”. We also remove “not” and “no” from our list of stop words, as these can be important in understanding the true sentiment of a statement. For example, consider the phrase “not happy” — with default stop word removal, we would simply be left with “happy”, which is the complete opposite of the original statement!
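A minimal sketch of this step, assuming NLTK's English stop word list and a small illustrative map of contractions (the full list used in the project would be longer):

    from nltk.corpus import stopwords  # requires nltk.download("stopwords")

    # Small illustrative subset of negative contractions and their expansions.
    CONTRACTIONS = {"couldn't": "could not", "can't": "can not",
                    "won't": "will not", "isn't": "is not", "don't": "do not"}

    # Keep "no" and "not" out of the stop word list so negation survives.
    stop_words = set(stopwords.words("english")) - {"no", "not"}

    def handle_negations(text):
        for contraction, expanded in CONTRACTIONS.items():
            text = text.replace(contraction, expanded)
        return " ".join(word for word in text.split() if word not in stop_words)

    handle_negations("she is not happy and she couldn't sleep")
    # -> 'not happy could not sleep'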

Normalization and tokenization

The next step is to normalize the text and tokenize it. First, we remove punctuation from the text so we are left with only letters and numbers. Next, we use a tokenization method to split each text string into a list of individual words, known as tokens.

Lemmatization

Finally, we lemmatize each token, returning each word to its root form. This reduces the size of our vocabulary by grouping together words that have the same underlying meaning. For example, words like “walking”, “walks”, and “walked” may be lemmatized to their root form, “walk”.
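Putting the normalization, tokenization, and lemmatization steps together, a rough sketch using NLTK (one reasonable choice of tooling; the project's exact implementation may differ) could look like this:

    import re
    from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")
    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()

    def tokenize(text):
        # Normalize: keep only letters and numbers.
        text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
        # Split the string into individual word tokens...
        tokens = word_tokenize(text)
        # ...and lemmatize each token back to its root form.
        return [lemmatizer.lemmatize(token) for token in tokens]

    tokenize("She walks the dogs")
    # -> ['she', 'walk', 'the', 'dog']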

What are the most common words in our Tweets?

Now we have pre-processed and tokenized our data, we have a vocabulary of all the words used in our Tweets dataset. We can use word clouds to explore the most common words observed in both the positive and negative sets of Tweets.
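Word clouds like the ones below can be generated with the wordcloud package (one common choice; the exact tooling used here isn't shown). A minimal sketch, assuming positive_tokens is a flat list of the tokens found in positive Tweets, and similarly for the negative set:

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # positive_tokens is assumed to be a flat list of tokens from the positive Tweets.
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(" ".join(positive_tokens))

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()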

Word cloud containing most common words present in Tweets of positive sentiment. Words like “love”, “thank”, and “lol” are among the most prominent.
Figure 1: Most common words present in positive Tweets

The most common words in Tweets of positive sentiment are not too surprising, with “love”, “thank”, and “lol” being some of the most prominent. Here we also get a small insight into the topics most associated with positive Tweets — “weekend”, “friend”, and “sleep”.

Word cloud containing most common words present in Tweets of negative sentiment. Words like “work”, “miss”, and “wish” are among the most prominent.
Figure 2: Most common words present in negative Tweets

Again, some of the words that are most common in negative Tweets are not surprising, like “work”, “school”, and “miss”. We can identify some words that are common across all Tweets whether they are positive or negative, as they appear in both word clouds.

Training a machine learning pipeline

Now our data has been through several steps of cleaning and pre-processing, it’s time to start building a machine learning pipeline and train a model.

We will take a subset of 10k Tweets from our dataset (5k relating to each sentiment) to speed up the training and testing process, as we will test a variety of classification algorithms.

We create a pipeline consisting of our transformers (applied first) and our selected classification method. We will utilise scikit-learn's FeatureUnion for our transformers so they are all applied to the raw input data in the same step rather than being applied sequentially.

Transformers

We will utilise two transformers from the scikit-learn library to create input features for our model — CountVectorizer and TfidfTransformer.

CountVectorizer will take our original text data and convert each Tweet into a collection of token counts. This helps us to understand the range of words present in our dataset and the counts of the words that are present in individual Tweets.

TfidfTransformer (where TF-IDF stands for Term Frequency–Inverse Document Frequency) is more complex and provides information about the importance of words. It considers the frequency of a word in an individual Tweet, but also examines the number of other Tweets the word appears in. This strikes a balance: words that appear more often in a Tweet get a higher score, but the score is offset if the word appears in a large number of Tweets.
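As a rough sketch, the pipeline could be assembled along the following lines. For simplicity this version chains the two transformers sequentially and plugs in Logistic Regression (the model we end up choosing below); the project's FeatureUnion arrangement and the other candidate classifiers would slot into the same structure. The tokenize function and the cleaned text X and labels y are assumed from the earlier steps.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    pipeline = Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenize)),  # token counts per Tweet
        ("tfidf", TfidfTransformer()),   # re-weight counts by inverse document frequency
        ("clf", LogisticRegression()),   # or any other candidate classifier
    ])

    # X is the series of cleaned Tweet text, y the positive/negative labels.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)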

Evaluation metrics

We will use precision, recall, F1-score, and accuracy as our primary metrics to evaluate the performance of our model. Precision, recall, and F1-score give good indicators of how well our model assigns Tweets to each individual class, as all consider the number of true positives, false positives, and false negatives.

We will also use overall accuracy, which measures the proportion of correctly classified Tweets in our test data. This is a good choice for our use case as we are dealing with a balanced dataset.

We will also consider the ROC curve and the area under the ROC curve (AUC) to understand how well our model performs. The ROC curve will show us the relationship between the true positive rate and the false positive rate. The AUC will tell us how likely it is that our model scores a randomly-selected positive instance higher than a randomly-selected negative instance.
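All of these metrics are available in scikit-learn. Evaluating a fitted pipeline like the one sketched above might look roughly like this, treating the "positive" label as the positive class:

    from sklearn.metrics import (accuracy_score, classification_report,
                                 roc_auc_score, roc_curve)

    y_pred = pipeline.predict(X_test)
    print(classification_report(y_test, y_pred))        # precision, recall, F1 per class
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # Probability scores for the "positive" class drive the ROC curve and AUC.
    pos_idx = list(pipeline.classes_).index("positive")
    y_score = pipeline.predict_proba(X_test)[:, pos_idx]

    y_test_bin = (y_test == "positive").astype(int)
    print("AUC:", roc_auc_score(y_test_bin, y_score))
    fpr, tpr, _ = roc_curve(y_test_bin, y_score)        # points for plotting the ROC curve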

Training a variety of classification models on our subset of 10k Tweets, we see some varied results in terms of our evaluation metrics:

The testing above provides a baseline score for each model before tuning any parameters. We will proceed with the Logistic Regression model, as it provides one of the highest accuracy scores while remaining easier to interpret than some of the other classifiers.

Training on the whole dataset and hyper-parameter tuning

Having chosen to proceed with our Logistic Regression model, we can train a new pipeline on the entire set of 1.6 million Tweets and think about the best way to tune the hyper-parameters relating to our pipeline.

We will allow scikit-learn’s GridSearchCV to do a lot of the heavy lifting for us, testing every possible combination of our chosen parameters. We will implement three-fold cross-validation to ensure we are choosing a robust model.

We will provide the following set of parameters to test with Grid Search:

  • The n-gram range — we will consider unigrams only, unigrams-bigrams, and unigrams-bigrams-trigrams. Looking at bigrams and trigrams will allow the model to understand more context from groups of words, especially in cases of negations which we discussed earlier.
  • The minimum document frequency — we will test selecting only terms appearing in at least 1 Tweet, at least 10% of Tweets, and at least 20% of Tweets. This will give the option to remove words or phrases that appear so infrequently that they are considered of low importance.
  • The maximum document frequency — we will test selecting only terms appearing in up to 50% of Tweets, up to 65% of Tweets, up to 80% of Tweets, and up to 100% of Tweets. This will give the option to remove words or phrases that occur so frequently across our dataset that they are not considered to contribute to sentiment.
  • The maximum number of iterations — we will test setting limits of 100 and 250 iterations allowed for the model to converge. This will allow the model to find the optimal solution in the case that more iterations are required.
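In terms of the pipeline sketched earlier (with the step names vect and clf), this parameter grid could be expressed roughly as follows:

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],  # unigrams, +bigrams, +trigrams
        "vect__min_df": [1, 0.1, 0.2],                  # minimum document frequency
        "vect__max_df": [0.5, 0.65, 0.8, 1.0],          # maximum document frequency
        "clf__max_iter": [100, 250],                    # iterations allowed to converge
    }

    # Three-fold cross-validation over every combination of the parameters above.
    grid = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1, verbose=1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)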

The Grid Search query provides the following set of parameters as the best-performing for our model:

This means that in the transformation steps we will consider both unigrams and bigrams. We will only keep terms that appear in at least one Tweet (which in effect means we will not remove any terms at this step), and we will remove terms that appear in more than 65% of Tweets. We will allow the model 100 iterations to converge.

These parameters significantly increase the performance of our model:

Our model has an accuracy of around 81% and is able to classify both positive and negative Tweets equally well. The AUC of our model is 0.891, showing that there is around an 89% probability that our model will score a randomly-selected positive case higher than a randomly-selected negative case.

The ROC curve for our model is also promising:

A ROC curve with strong curvature to the top left corner of the chart.
Figure 3: ROC Curve of the trained Logistic Regression model

There is a strong curvature to the top-left corner of the chart, indicating our good AUC score.

What helps us to predict sentiment?

It’s all well and good to know the sentiment of a Tweet, but it’s even better if we understand the features that led to that Tweet’s classification. Thankfully, Logistic Regression models have a good level of interpretability to help us understand just that.

A bar chart showing the top 20 important features to make classifications using our Logistic Regression model.
Figure 4: Top 20 features for our Logistic Regression model

In the bar chart above we can see the top 20 most important features in helping our model make a classification. Our features here are either individual words or bigrams. Features with a negative score are important in making a negative classification, while features with a positive score are important in making a positive classification.
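One way to recover these importances is to pair the fitted logistic regression coefficients with the vocabulary learned by the vectorizer. A rough sketch, assuming the step names from the earlier pipeline (get_feature_names_out is the method name in recent scikit-learn releases; older versions call it get_feature_names):

    import numpy as np

    best = grid.best_estimator_
    feature_names = np.array(best.named_steps["vect"].get_feature_names_out())
    coefs = best.named_steps["clf"].coef_[0]

    # The most negative coefficients push a Tweet towards the negative class,
    # the most positive coefficients towards the positive class.
    order = np.argsort(coefs)
    print("Top negative features:", feature_names[order[:20]])
    print("Top positive features:", feature_names[order[-20:]][::-1])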

It is somewhat comforting to see the terms appearing in our most important features — we would expect words like “sad”, “sick”, “hate”, and “hurt” to indicate negative sentiment, for example.

Let’s look a bit more closely at the top 20 features important to each class.

Figure 5: Top 20 features for positive classifications

We can see several words and phrases that would normally indicate happiness and excitement in the top 20 most important features. If any of these words appear in a Tweet, there is a good chance the Tweet would be classed as showing positive sentiment.

Interestingly, “thanks” and “thank” appear in the most important features for positive Tweets. Some might consider these more neutral terms, but perhaps in the context of Twitter their presence indicates higher levels of politeness or agreeability.

Figure 6: Top 20 features for negative classifications

The word “sad” has a significantly larger importance than the other top 20 features for negative sentiment classifications, which is expected. We see words relating to sadness and other negative emotions like anger. We can also see some words relating to mortality and illness.

How can we use what we’ve learned?

One question you might ask at this point is how we can utilise everything we’ve done today, from a business point of view. While this analysis was more of an introductory exploration of sentiment analysis of social media data, there are several next steps that could be taken.

A brand or company might consider collecting a new set of Tweets that are directed at them or that mention their name or products. They could use the model trained in this analysis to understand customers’ views and opinions of them.

They could also add an additional layer of topic analysis to their Tweets to gain further insight into what their customers are talking about when they display negative or positive sentiment towards the brand.

In terms of improving the performance of the model itself, there are a few additional steps we might consider in the future:

  • Retraining the model using a set of more recent Tweets — our dataset only contains Tweets from 2009 and it is likely that language used on social media has evolved since then.
  • Engineering additional features to be used as inputs to the model.
  • Trying out different machine learning methods such as Neural Networks.
  • Testing additional classes for more complex emotions — i.e. rather than simply using positive/negative, trying to detect happiness, anger, excitement, sadness, etc.

We return to the question posed at the beginning of this article — can we understand sentiment from just 140 characters? We have found that a machine learning model needs only individual words or pairs of words to confidently identify sentiment, so I would answer this question with a firm, “yes”.

But if you’re not yet convinced, I’ll leave you with a short sentence and you can try and decide the sentiment yourself —

“Thank you very much for reading! :)”

This article was written as a deliverable of my final project of the Udacity Data Science Nanodegree. The supporting notebook and full code can be found on my GitHub page. The data can be obtained from Kaggle.
