Sentiment analysis of Uganda’s first presidential debate — using Twitter data and R

Richard Zulu
Feb 13, 2016 · 10 min read
  • Anticipation and Trust were the two most prominent emotions displayed by users in their tweets during the debate.
  • The overall polarity of the tweets was mostly positive.
  • There was a lot of joy expressed around the debate moderators; people tweeting were generally impressed with them.
  • Almost half of the participants were creating original conversations: roughly 45% of tweets were original, against 55% retweets and replies to one another.
  • Abed vs. Mbabazi vs. Kyalya vs. Besigye vs. Mabirizi vs. Kasujja conversations were happening, meaning there was close interaction between those presidential candidates. There were almost zero mentions of presidential candidates Biraro and Baryamureeba. Allan Kasujja was a moderator of the debate, yet he featured heavily in the audience's conversations on Twitter.
  • Presidential candidate Mabirizi was the most discussed topic, appearing in approximately 30% of our tweets, a demonstration of the humor surrounding the debate in relation to that specific candidate.

How did we do this

We used the open-source statistical programming language R; the images were produced with the ggplot2 library and the open-source graphics tool Inkscape.

How we did it and what we observed in detail

  • Scores of all tweets
  • Highest scoring handles
  • Who tweets, retweets or replies the most?
  • Wordcloud
  • etc

Tweets Over 1Hr?

We collected roughly 20k tweets over just an hour, from 20:46 Hrs to 21:47 Hrs, as you can see from the trends graph below: on average almost 300–500 tweets a minute. As noted earlier, we expect that at the beginning of the debate many tweeps were just jumping onto the hashtag, and that towards the end many dropped off it as the debate went into the wee hours of the night.
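The analysis itself was done in R, but the per-minute binning behind a trends graph like this can be sketched in a few lines of Python (the timestamps below are made up for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet timestamps; the real dataset had ~20k tweets in the hour.
timestamps = [
    "2016-01-15 20:46:10",
    "2016-01-15 20:46:42",
    "2016-01-15 20:47:05",
]

# Bucket tweets by minute to get a tweets-per-minute trend.
per_minute = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%H:%M")
    for ts in timestamps
)
print(per_minute["20:46"])  # 2 tweets fell in the 20:46 bucket
```

Plotting the resulting counts per minute gives the trend line shown.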

With Hashtags?

We looked further into the tweets to see how many tweeps had got the memo and kept the conversation on the hashtag. A large percentage of users (~87%) maintained the #UGDebate hashtag, while the rest tweeted without the # sign.
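A hashtag check of this kind reduces to a case-insensitive substring test. A minimal sketch, with hypothetical tweets:

```python
# Hypothetical tweets; a case-insensitive substring check covers both
# #UGDebate and #UGDebate16 spellings.
tweets = [
    "Great opening statements tonight #UGDebate",
    "The moderators are keeping good time",
    "Waiting for the economy question #UGDebate16",
]

def has_debate_hashtag(text):
    """True if the tweet carries the debate hashtag in any casing."""
    return "#ugdebate" in text.lower()

tagged_share = sum(has_debate_hashtag(t) for t in tweets) / len(tweets)
```

Here `tagged_share` is the fraction of tweets that kept the hashtag, the ~87% figure in our data.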

Replied Tweets

We further looked into the conversations among tweeps to see whether they engaged one another. As you might notice, most tweeps kept listening to or watching the debate while tweeting, and had little or no time to respond to others on Twitter. This shows that there was actually a lot of original content coming in from the tweeps.

Retweeted?

Having noticed a lot of originality in the tweets, we wanted to look further into whether tweeps were just retweeting or coming up with their own content. As you might notice, the split was nearly even: original tweets at ~45% versus retweets at ~55%.

Tweet vs RT vs Reply

On further analysis we dived into the trends in how people were tweeting, replying and retweeting (RT). Could it be that people tweeted, waited for the end, and then responded to others' tweets? From the trends below we noticed that users were tweeting, retweeting and replying to other tweets concurrently as the debate went on. A further analysis we want to carry out another day is to look at how the tweets dropped off towards the end (24:00 Hrs EAT) and how tweeps switched to responding to one another.
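Telling tweets, retweets and replies apart can be approximated from the text alone. A heuristic sketch (Twitter's own metadata fields, such as retweeted_status and in_reply_to_status_id, are more reliable when available):

```python
import re
from collections import Counter

def tweet_kind(text):
    """Classify a status as a retweet, a reply, or an original tweet.

    Purely text-based heuristic: "RT @..." marks a retweet, a leading
    "@handle" marks a reply, everything else counts as an original tweet.
    """
    if text.startswith("RT @"):
        return "retweet"
    if re.match(r"^@\w+", text):
        return "reply"
    return "tweet"

# Hypothetical sample statuses.
sample = [
    "RT @newswire: The debate is underway #UGDebate",
    "@kasujja great question!",
    "Mabirizi is on fire tonight #UGDebate",
]
kinds = Counter(tweet_kind(t) for t in sample)
```

Grouping these kinds per minute, as in the earlier trend, gives the tweet-vs-RT-vs-reply chart.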

Characters Per Tweet

So how do you know whether these tweeps were really engaging? We wanted to see how many of the standard 140 characters that Twitter allows they were using. As you might notice from the graph below, a large percentage were hitting the 140-character limit, which indicates they had so much to say but were constrained. The histogram is skewed towards the right (the limit). Interestingly, looking again you will notice a few odd tweets of fewer than 10 characters. Could these be users new to Twitter? Could the few characters have been entered by mistake on mobile phones? Why would almost 1k tweets be that short? Tweets of fewer than 10 characters seem odd when so many other users were running into the 140-character limit.
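Computing the character counts behind this histogram is straightforward; a small sketch with invented tweets:

```python
# Hypothetical tweets to illustrate the length histogram.
tweets = ["ok", "a" * 140, "Counting down to the next question #UGDebate"]

lengths = [len(t) for t in tweets]
at_limit = sum(1 for n in lengths if n >= 140)        # tweets hitting the cap
suspiciously_short = [t for t in tweets if len(t) < 10]  # the odd < 10-char ones
```

A histogram of `lengths` then shows the right-skew towards the limit described above.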

Who was tweeting the most? Handles or bots? Media, journalists, individuals or presidential candidates?

We took a deep dive into the actual Twitter handles that were tweeting. Who was tweeting most? Were they from the media? Were they paid bots? We wondered as we plotted the data on the graph below. We noticed two outliers: one tweep with over 200 tweets in the hour and another with over 100. Why would anyone tweet that much, even more than the media houses? Could they be paid bots? We kept asking ourselves.
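Spotting such outliers amounts to counting tweets per author handle. A sketch with hypothetical handles:

```python
from collections import Counter

# Hypothetical author handles, one entry per tweet.
authors = ["@suspect_acct"] * 5 + ["@media_house"] * 2 + ["@citizen"]

# most_common surfaces the heavy tweeters straight away.
top = Counter(authors).most_common(2)
```

In the real data, the two outlier handles appear at the top of exactly this kind of count.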

Polarity — What was the sentiment displayed by participants?

This approach uses a supervised learning algorithm to build a classifier that detects the polarity of textual data and classifies it as either positive or negative. Here we used the NRC Emotion Lexicon (http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) and the Naive Bayes classifier (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). Feel free to read up on these two methods; we leave the discussion of why we chose them for another time.
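The model code itself is left out of this post, but the Naive Bayes idea can be illustrated with a dependency-free sketch. The training sentences here are invented, and Laplace smoothing keeps unseen words from zeroing out a class:

```python
import math
from collections import Counter

# Toy labeled corpus, purely illustrative of the training step.
train = [
    ("great debate strong answers", "positive"),
    ("impressive moderators well done", "positive"),
    ("weak answers poor performance", "negative"),
    ("boring evasive responses", "negative"),
]

# Count words per class and documents per class.
word_counts = {"positive": Counter(), "negative": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """argmax over log P(class) + sum of log P(word | class), Laplace-smoothed."""
    best_label, best_score = None, float("-inf")
    total_docs = sum(class_counts.values())
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.split():
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, `classify("strong impressive debate")` comes out positive because those words appear only in the positive training rows.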

What emotions were displayed by the participants?

We further analysed which emotions were most displayed in the tweets. You might notice that Trust and Anticipation came out as the top emotions from users tweeting under #UGDebate16.
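Lexicon-based emotion tallying works by looking each word up in the lexicon and counting the emotions it signals. A sketch with a tiny invented stand-in for the NRC lexicon (the real one maps thousands of English words to eight emotions):

```python
from collections import Counter

# Tiny, invented stand-in for the NRC Emotion Lexicon: word -> emotions.
lexicon = {
    "hope": ["anticipation"],
    "wait": ["anticipation"],
    "trust": ["trust"],
    "honest": ["trust"],
    "laugh": ["joy"],
}

def emotion_counts(tweets):
    """Tally every emotion triggered by a lexicon word in the tweets."""
    counts = Counter()
    for t in tweets:
        for word in t.lower().split():
            counts.update(lexicon.get(word, []))
    return counts

scores = emotion_counts(
    ["We hope and wait for real answers", "I trust this honest candidate"]
)
```

A bar chart of these totals is exactly the emotions graph shown for #UGDebate16.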

What were participants mentioning most on Twitter?

A word cloud gives us the best visualization of the words most tweeted within the sample of tweets we analyzed. Note that the larger a word, the more often it was tweeted; also pay close attention to the words near the center, as these were tweeted most.

Comparison cloud — How were participants engaging in relation to the various emotions discovered?

Looking further into how tweeps used words in relation to the emotions discovered, we noticed some interesting findings here.

What was the relation between the words mentioned on Twitter?

We then used hierarchical clustering to see which words were most statistically related to one another. For a dive into the methodology we used, you might want to read about Ward's minimum variance method and hierarchical clustering: https://en.wikipedia.org/wiki/Hierarchical_clustering
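In R this is typically a call like hclust(d, method = "ward.D2") on a word-distance matrix. As a dependency-free illustration of the agglomerative idea, here is a simpler single-linkage sketch (single linkage rather than Ward's method, to keep the code short) on an invented toy distance matrix:

```python
# Toy words and hypothetical pairwise distances (lower = co-occur more often).
words = ["debate", "moderator", "mabirizi", "humor"]
dist = {
    ("debate", "moderator"): 0.2,
    ("debate", "mabirizi"): 0.7,
    ("debate", "humor"): 0.8,
    ("moderator", "mabirizi"): 0.75,
    ("moderator", "humor"): 0.85,
    ("mabirizi", "humor"): 0.1,
}

def d(a, b):
    """Look up the distance regardless of pair order."""
    return dist.get((a, b)) or dist.get((b, a))

# Agglomerative, single-linkage: repeatedly merge the two clusters whose
# closest pair of members is closest, until two clusters remain.
clusters = [{w} for w in words]
while len(clusters) > 2:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: min(d(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
    )
    clusters[i] |= clusters.pop(j)

# Ends with two clusters: {debate, moderator} and {mabirizi, humor}.
```

A dendrogram of the merge order is what the clustering figure visualizes.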

What issues/topics were people talking about most?

So what topics were people talking about? If, for example, people mentioned Makerere, school or university, we would model this as an Education topic. As you might notice from our findings, a high percentage (~30%) of people were talking about the topic "ugdebate, rt, i, mabirizi", which just shows how the debate drifted into humor tweets on Twitter.
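The keyword-to-topic mapping described above can be sketched directly; the topic names and keyword sets here are hypothetical:

```python
# Hypothetical keyword-to-topic map, mirroring the Makerere/school/university
# -> Education example above.
topic_keywords = {
    "education": {"makerere", "school", "university"},
    "economy": {"jobs", "tax", "shilling"},
}

def topics_for(tweet):
    """Return every topic whose keyword set intersects the tweet's words."""
    words = set(tweet.lower().split())
    return sorted(t for t, kws in topic_keywords.items() if words & kws)

print(topics_for("Makerere university fees came up #UGDebate"))  # ['education']
```

Counting topic hits across all tweets then gives the topic percentages reported above.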

Summary

We have the data and are continuing to analyse it. Some questions we are already looking into are listed below; the results can be shared on request or in further discussion with our analysis team.

  1. Social network diagram analysis of handles: who influences whom?
  2. In-depth topic modeling, removing noise and stopwords and applying stemming
  3. Choropleth map of district mentions
  4. Association mining: take some of the interesting frequent terms and find patterns, correlations and associations

Outbox research

Stories about our work and learnings in civic data projects and data analysis work for organisations. We are a technology innovation hub supporting African entrepreneurs.
