Oil painting by Joseph Turner. Photo credit: Pixabay

A practical guide to emotional sentiment analysis on Twitter

A coronavirus vaccine online storm

Catalao Alves
8 min read · Sep 15, 2020

The ongoing competition for a viable coronavirus vaccine is arguably the race of the century. With its hundreds of millions of users, Twitter is particularly well-suited for research into the sentiments and emotions that are quickly emerging in this regard.

Back in March, against the backdrop of the harshest lockdown in history, the first news of a possible Covid-19 vaccine emerged in the midst of a tough geopolitical affair.

I scraped the Twitter API for signs of public reaction to one of the first news reports about the possibility of a viable coronavirus vaccine. What I found closely resembles what digital marketing research refers to as an online firestorm.

In this 8-minute read you will find little more than a few glimpses of analysis, for this article is mainly designed as a learning expedition. For the most part, I will approach the topic as a use case for exploratory data analysis, text mining and natural language processing (NLP) in social media.

For a more complete view of the techniques involved in the case, I strongly advise you to look at the Jupyter notebook and other files on GitHub. There you will find the Python code and a PDF file with the full set of results, some of them not included in this article.

So, without any further delay, let’s get down to it …

Overview

This post is organised as follows:

Step 1: Exploratory analysis

Step 2: Text processing

Step 3: Sentiment analysis

Step 4: Word frequency

Step 5: LDA topics extraction

Step 6: Emotion analysis

Step 1: Exploratory analysis

I collected the data by scraping tweets from Twitter’s application programming interface (API) in Python, with the help of a dedicated library, TwitterScraper. The search term was “Curevac”, the name of a German vaccine maker backed by the Bill & Melinda Gates Foundation and currently working on a Covid-19 vaccine.

The resulting data covers tweets from a six-year period, from March 3, 2014 to March 18, 2020 (N = 14,991).

A first look at CureVac’s life on Twitter over the past six years shows a steady and regular path, until one day, on the 15th of March 2020, everything changed:

In three days, we have more tweets (N = 11,364) than in the previous 6 years. What happened?
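A spike like this is easy to spot once tweets are counted per day. The snippet below is a minimal sketch of that idea using only the standard library; the dates are made up for illustration, and the names `tweets_per_day` and `split_at` are hypothetical helpers, not functions from the notebook.

```python
from collections import Counter
from datetime import date

# Hypothetical tweet dates; the real ones would come from the scraped timestamps.
tweet_dates = [
    date(2020, 3, 13),
    date(2020, 3, 14),
    date(2020, 3, 15), date(2020, 3, 15), date(2020, 3, 15),
    date(2020, 3, 16), date(2020, 3, 16), date(2020, 3, 16), date(2020, 3, 16),
]


def tweets_per_day(dates):
    """Count tweets per calendar day, sorted chronologically."""
    return dict(sorted(Counter(dates).items()))


def split_at(dates, storm_start):
    """Return (tweets before storm_start, tweets from storm_start onwards)."""
    before = sum(1 for d in dates if d < storm_start)
    return before, len(dates) - before


daily = tweets_per_day(tweet_dates)
before, during = split_at(tweet_dates, date(2020, 3, 15))
print(daily)
print(before, during)
```

Plotting `daily` as a time series would give the kind of chart described above, with the cut-off date passed to `split_at` separating the two periods compared in the rest of the article.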

Apparently, it all started with an article in Germany’s Welt am Sonntag newspaper. The Washington Post was quick to echo the news that the coronavirus vaccine chased by CureVac would be developed “only for the USA”, and that the White House was keen to secure the rights and move the research to the United States. The online storm was on!

Step 2: Text processing

Twitter is a rich source for opinion mining. But first one has to deal with its informal language, character limit, frequent misspellings, incorrect word order and abundant codes.

Depending on the purpose of the analysis, tweets must undergo a pipeline of tokenisation, filtering, case normalisation, lemmatisation or stemming, as well as overall cleaning of HTML and other codes.

Alongside a general cleaning of the text, I used the Python NLTK (Natural Language Toolkit) for tokenisation, POS tagging and lemmatisation.
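To give an idea of what the general cleaning step can look like, here is a minimal sketch using only Python's standard library. The regular expressions and the function name `clean_tweet` are my own assumptions for illustration, not the exact rules in the notebook, and the POS tagging and lemmatisation done with NLTK are deliberately left out because they require downloading NLTK's data packages.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Strip HTML entities, URLs, mentions and punctuation, then tokenise."""
    text = html.unescape(text)            # e.g. "&amp;" becomes "&"
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @user mentions
    text = NON_LETTER_RE.sub(" ", text.lower())  # case-normalise, drop '#', digits, punctuation
    return text.split()                   # simple whitespace tokenisation


tokens = clean_tweet(
    "RT @user: CureVac &amp; #coronavirus vaccine news https://t.co/abc123"
)
print(tokens)
```

Hashtag words are kept while the `#` symbol is dropped, which is one possible design choice; a stop-word filter (which would remove tokens such as `rt`) and a lemmatiser would typically follow this step.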

After text processing and duplicate removal, the final sample amounts to 5,508 English-language tweets, with an average length of 30 words (SD 12.5, ranging from 4 to 61 words).

Here are a few lines, where you can see the difference between the original tweet (column “text”) and the lemmatised, cleaned tweet (column “edited”).

Step 3: Sentiment analysis

For sentiment analysis, a growing NLP sub-field, I used VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based system that performs especially well on social media data.

The most useful metric is the Compound score. It is calculated by summing the valence scores of each word and normalising the result to a value between -1, the most extreme negative score, and +1, the most extreme positive. For a complete understanding of how VADER computes its Compound score, see this conference paper.
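The normalisation step can be sketched in a couple of lines. I am assuming here the form x / sqrt(x² + α) with α = 15, which is what the VADER reference implementation uses as far as I can tell; the function name `normalise` is mine, and the rule-based adjustments VADER applies to word scores before summing them are not reproduced.

```python
import math


def normalise(score_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the interval (-1, 1).

    Assumption: same functional form as VADER's normalisation, alpha = 15.
    """
    return score_sum / math.sqrt(score_sum * score_sum + alpha)


for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalise(total), 3))
```

The larger the absolute sum, the closer the result gets to -1 or +1, which is why very emotional tweets saturate near the extremes rather than growing without bound.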

From this normalised score, I created a categorical variable (“sentiment”) that labels each tweet as positive, neutral or negative, using the following thresholds:

  • Positive sentiment: compound score >= 0.05
  • Neutral sentiment: -0.05 < compound score < 0.05
  • Negative sentiment: compound score <= -0.05
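The thresholds above translate directly into a small labelling function. This is a sketch rather than the notebook's code: `label_sentiment` and `sentiment_shares` are hypothetical names, and the compound scores are made up. In practice the score for each tweet would come from VADER's `polarity_scores(text)["compound"]`.

```python
from collections import Counter


def label_sentiment(compound):
    """Map a compound score to a label using the +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def sentiment_shares(scores):
    """Return the share of positive, neutral and negative tweets."""
    counts = Counter(label_sentiment(s) for s in scores)
    return {
        label: counts[label] / len(scores)
        for label in ("positive", "neutral", "negative")
    }


# Hypothetical compound scores for five tweets.
scores = [0.6, 0.0, -0.3, 0.04, -0.05]
shares = sentiment_shares(scores)
print(shares)
```

Computing these shares separately for each period gives the ratios needed to compare sentiment before and during the storm.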

Here is the online storm …

And here is a comparison of the sentiments before and during the online storm.

In sentiment analysis, neutral tweets usually outnumber the negative or positive ones. This is what actually happened during the six-year period before the online storm. Also, research has shown that scientists tend to use neutral language while communicating among peers, particularly in social media.

The picture clearly changed during the three-day online storm. Sentiments became less neutral, likely because the majority of the tweets came from a wider public. The percentage of positive tweets increased, suggesting higher expectations about a viable vaccine for coronavirus.

It is also worth paying attention to an even stronger increase in the percentage of negative sentiments during the online storm. This calls for a deeper look at the data. That is what we will do now.

Step 4: Word frequency

Now that our text is pre-processed and cleaned, it is time to try to spot key patterns of word frequency in tweets posted before and during the online storm.
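Counting words and trigrams on tokenised tweets needs very little code. The following is a minimal sketch with the standard library; the helper names `ngrams` and `top_ngrams` are hypothetical, and the three token lists are invented examples in the style of the lemmatised tweets, not rows from the data set.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(docs, n, k):
    """Return the k most frequent n-grams across a list of token lists."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))  # n-grams never span two tweets
    return counts.most_common(k)


# Hypothetical lemmatised tweets.
docs = [
    ["trump", "try", "buy", "exclusive", "right"],
    ["trump", "offer", "large", "sum", "money"],
    ["offer", "large", "sum", "money", "vaccine"],
]

print(top_ngrams(docs, 1, 5))   # single words
print(top_ngrams(docs, 3, 5))   # trigrams
```

Building the n-grams tweet by tweet, rather than on one concatenated text, avoids counting trigrams that straddle the end of one tweet and the start of the next.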

Tweets before the online storm

These are the 20 most frequent words before the online storm.

And now the 10 most frequent trigrams (three consecutive words).

There are some noteworthy features in the above plots:

  • Along with ‘gate’ (i.e., Bill Gates), the most frequent words in six years of tweets are ‘develop’, ‘therapeutic’, ‘deal’ and ‘news’. Unsurprisingly, these were times when tweets were used mainly as public relations devices to communicate the core business of CureVac.
  • The trigrams reinforce these trends, with a stronger focus on collaboration. These are mainly about ‘next generation in health care’ and ‘pharmaceutical deals’ carried out in ‘broad strategic collaborations’.

Tweets during the online storm

The word and trigram frequencies for this period show obvious differences from CureVac’s mainstream life on Twitter:

  • The top word is no longer ‘gate’ but ‘trump’ (i.e., Donald Trump), immediately followed by ‘coronavirus’.
  • Gone are the days of collaboration for the next generation of new and innovative therapies.
  • ‘Exclusive’ takes the lead, ‘collaboration’ is out of the league.
  • The most frequent trigram is ‘try buy exclusive’. These are now times for ‘exclusive large gain’.
  • ‘Buy’ becomes a new key word. ‘large sum money’ and ‘offer large sum’ are now among the top trigrams in the chart.

Step 5: LDA topics extraction

LDA (Latent Dirichlet Allocation) is an unsupervised machine learning technique that is increasingly popular in text mining toolkits. You can find here a comprehensive article on the subject, published on Medium, which covers the assumptions and the math behind the algorithm extensively.

I applied LDA to the two periods (before and during the online firestorm) to check whether the findings corroborate the trends that we have seen in our previous analysis of word frequency.
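To make the idea behind LDA more concrete, here is a deliberately tiny, pure-Python sketch that fits the model with collapsed Gibbs sampling. This is not the code from the notebook, where a dedicated library would be the sensible choice; the function name `lda_gibbs`, the hyperparameter values and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics=2, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (topic-word, doc-topic) counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens in topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


# Hypothetical lemmatised tweets.
toy_docs = [
    ["broad", "strategic", "collaboration", "develop", "therapeutic"],
    ["collaboration", "develop", "therapeutic", "deal"],
    ["trump", "buy", "exclusive", "vaccine", "right"],
    ["trump", "offer", "large", "sum", "exclusive", "vaccine"],
]
topic_word, doc_topic = lda_gibbs(toy_docs)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

Each token is repeatedly reassigned to a topic with probability proportional to how common that topic is in its tweet and how common the word is in that topic. With so few documents the resulting topics depend on the random seed, so the printed words should be read as an illustration of the mechanics, not as a result.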

Let us see what we got.

Topics before the online storm

Topics during the online storm

A comparison between topics before and during the online storm shows contrasting themes.

For a period of six years, the major topic emerging from tweets is about collaborative developments. In contrast, during the online storm, the two topics that stand out are clearly about the alleged attempt of the US administration to secure exclusive rights to the coronavirus vaccine.

Step 6: Emotion analysis

Drawing on Robert Plutchik’s wheel of basic emotions, I attempted to uncover the presence of words associated with eight emotions: anger, fear, sadness, disgust, anticipation, joy, surprise and trust.

Plutchik’s wheel of emotions

My approach was to create a matrix to compute and display the connection between each word in each tweet and one or more emotions in the National Research Council Canada (NRC) lexicon. The NRC lexicon is a dictionary with 14,182 words and 10 columns, corresponding to positive and negative sentiment plus the eight emotions. For a full understanding of the NRC lexicon read this article.
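The matrix can be sketched as one row per tweet and one column per emotion, where each cell counts the words in the tweet that the lexicon associates with that emotion. In the snippet below, `TOY_LEXICON` is a hypothetical mini-lexicon invented for illustration; its entries are not copied from the NRC lexicon, and `emotion_matrix` is my own helper name.

```python
EMOTIONS = [
    "anger", "anticipation", "disgust", "fear",
    "joy", "sadness", "surprise", "trust",
]

# Hypothetical word-to-emotion associations, NOT real NRC entries.
TOY_LEXICON = {
    "vaccine": {"anticipation", "trust"},
    "exclusive": {"anger"},
    "threat": {"fear", "anger"},
    "hope": {"joy", "anticipation"},
}


def emotion_matrix(tweets, lexicon):
    """Return one {emotion: count} row per tweet."""
    rows = []
    for tokens in tweets:
        row = dict.fromkeys(EMOTIONS, 0)
        for token in tokens:
            # A word may be linked to several emotions, or to none.
            for emotion in lexicon.get(token, ()):
                row[emotion] += 1
        rows.append(row)
    return rows


# Hypothetical lemmatised tweets.
tweets = [
    ["hope", "vaccine", "soon"],
    ["exclusive", "vaccine", "threat"],
]
matrix = emotion_matrix(tweets, TOY_LEXICON)
for row in matrix:
    print(row)
```

Summing the rows by day, or by period, yields the totals per emotion that the remaining plots are based on.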

Here are some lines of the resulting matrix:

And here is a plot of the words in the tweets that contributed the most to each of the 8 emotions:

Let us look at the dynamics of the emotions during the online storm, normalised with z-score values for each entry in the matrix:
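A z-score expresses each value as its distance from the mean, measured in standard deviations, which puts emotions with very different raw counts on a comparable scale. Here is a minimal sketch, assuming the population standard deviation and normalisation along one series of daily counts; the notebook may normalise along a different axis, and the numbers are invented.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise a series to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no spread; report every point as average.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical daily counts of 'anger' words over three days.
daily_anger = [2, 4, 6]
z = z_scores(daily_anger)
print(z)
```

A positive value means the emotion was more present than its own average on that day, a negative value less so.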

And here is how each one of them evolved during this period:

One can see that the initial emotions of surprise, anticipation and trust start to fade out, whereas anger, disgust and fear increase by the third day of the storm.

Another way of looking at this dynamic is to display opposite emotions, such as ‘joy’ and ‘sadness’.

Let us conclude with a final comparison between the two periods, before and during the online storm. Values for each emotion take into account the number of tweets in each of these two periods.

Again, the results suggest that the first news of a potential coronavirus vaccine was received with joy, anticipation, surprise and trust. However, emotions of fear, sadness and anger also took hold, notably as a reaction to news of alleged exclusivity and business-driven ideas on such a sensitive topic. Indeed, as I am writing this post, the WHO has issued an alert against “Covid-19 vaccine nationalism”.

I hope you enjoyed this short read. Again, for the complete code and full set of results, you are welcome to visit my GitHub page.

Carlos Catalao Alves

Catalao Alves
iNOVAMedialab

Education PhD, researcher at iNOVA Media Lab and Invited Professor of science communication at NOVA FCSH.