Sentiment analysis of Uganda’s first presidential debate — using Twitter data and R
At Outbox, we analysed Uganda’s first presidential debate of 2016 in real time. Our objective was to understand how individuals participating in the conversation felt about the issues discussed in the debate. In addition, we wanted to develop insights into the issues being discussed and the quality of the conversation. As such, we settled on a sentiment analysis of the data collected.
The analysis confirmed a number of the trends we have been seeing. These include:
- Presidential candidate Abed Bwanika surprised several people who tweeted: he was the most mentioned candidate under the emotion class of surprise
- Anticipation and trust were the two top emotions displayed by users in their tweets during the debate
- The polarity of the tweets about the debate was mostly positive
- There was a lot of joy expressed around the debate moderators; people tweeting were generally impressed with them
- Almost half of the participants were creating original conversations: ~45% of tweets were original, with the remaining ~55% retweets and replies to one another
- Abed vs. Mbabazi vs. Kyalya vs. Besigye vs. Mabirizi vs. Kasujja was a recurring pattern, meaning there was close interaction between those presidential candidates, while presidential candidates Biraro and Baryamureeba received almost zero mentions. Allan Kasujja was a moderator of the debate, yet he featured heavily in the audience's conversations on Twitter
- Presidential candidate Mabirizi dominated the most discussed topic, which covered approximately 30% of our tweets: a demonstration of the humor surrounding the debate in relation to that specific candidate
We also understand that these Twitter users come from a specific demographic: middle-class, tech-savvy individuals and corporates between 18 and 45 years of age. These are a good proportion of the urban semi-affluent and affluent voter cohort, so extending any of these conclusions to a larger percentage of the population would be baseless.
How did we do this
We can confirm that by midnight on the night of 15th January 2016 there were over 100k tweets under the hashtag #UGDebate16, from tweets posted between 18:00 and 24:00 EAT. Sources: tweetreach.com & www.hashtracking.com
How we did it and what we observed in detail
As most of you might know, we didn't buy this data from Twitter but used the open APIs to access the datasets in real time. As a result, we hit Twitter rate limits and had errors thrown at us while scraping the tweet datasets.
Nevertheless, we managed to scrape 20k tweets in real time from the #UGDebate16 timeline. That was approximately 20% of all the meaningful tweets during the heated debate.
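The collection step can be sketched as below. This is a hedged illustration, not our exact pipeline: it assumes the twitteR package (a common choice in 2016 for the open search API), and the OAuth credentials are placeholders you would replace with your own app's keys.

```r
# Sketch of real-time collection via the Twitter search API using twitteR.
# The four credentials below are placeholders, not real keys.
library(twitteR)

setup_twitter_oauth(consumer_key    = "YOUR_KEY",
                    consumer_secret = "YOUR_SECRET",
                    access_token    = "YOUR_TOKEN",
                    access_secret   = "YOUR_TOKEN_SECRET")

# searchTwitter is rate-limited; large n requests can fail part-way through,
# which is exactly where the "hit limits" issues mentioned above come from.
raw_tweets <- searchTwitter("#UGDebate16", n = 20000, retryOnRateLimit = 120)

# Flatten to a data frame and save a neat CSV for the later steps.
tweets_df <- twListToDF(raw_tweets)
write.csv(tweets_df, "ugdebate16_tweets.csv", row.names = FALSE)
```

`retryOnRateLimit` makes the client sleep and retry instead of aborting when the API pushes back.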
A little disclaimer: as we all know, correlation doesn't imply causation, and we don't want anyone drawing firm conclusions from our basic tweet analyses. An image we love reusing to explain this is the old NRA message in the US claiming that as 170 million new guns entered the community, violent crime decreased by a total of 51% :). We should be careful with such conclusions.
So what came next? A breakdown of the process we went through to produce these findings. First was the data preparation, which we found really boring, and which involved the steps below.
- Data preparation (So booooring)
  - Get data from Twitter
  - Hit limits -> Issues
  - Understand Twitter terms
  - Produce a neat CSV file
  - Clean up the data
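The clean-up step above can be sketched in base R as follows. The sample tweets are invented for illustration; the real input was the scraped CSV.

```r
# A minimal tweet clean-up function: strip links, mentions, retweet markers
# and punctuation, then normalise whitespace and case.
clean_tweet <- function(x) {
  x <- gsub("http[^ ]+", "", x)                # drop links
  x <- gsub("@\\w+", "", x)                    # drop @mentions
  x <- gsub("\\bRT\\b", "", x)                 # drop the retweet marker
  x <- gsub("[^[:alnum:][:space:]#]", "", x)   # drop punctuation, keep hashtags
  tolower(trimws(gsub("\\s+", " ", x)))        # squeeze whitespace, lowercase
}

sample_tweets <- c("RT @someone: Great point on jobs! http://t.co/abc #UGDebate16",
                   "Candidates dodging the health question again... #UGDebate16")
clean_tweet(sample_tweets)
```

The hashtag is deliberately preserved so the hashtag-usage analysis later still works on the cleaned text.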
Next we went through the data exploration, which we found much more fun and could go on and on about. For this report we share just some interesting bits and leave the detailed findings for those who want to follow up with us on request.
- Exploration (So much funnnn!!)
  - Scores of all tweets
  - Highest-scoring handles
  - Who tweets/RTs/replies the most?
Tweets over 1 hour
We collected ~20k tweets over just an hour, from 20:46 to 21:47, as you can see from the trends graph below: on average almost 300–500 tweets a minute. As noted earlier, we expect that at the beginning of the debate many tweeps were just jumping onto the hashtag, and that towards the end many dropped off as it went into the wee hours of the night.
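The per-minute volume behind that trends graph can be sketched as below. The timestamps here are fabricated for illustration; in practice they come from the `created` column that twitteR's data frame provides.

```r
# Bin tweet timestamps into minutes and summarise the per-minute volume.
# 5k fake timestamps spread over one hour stand in for the real data.
created <- as.POSIXct("2016-01-15 20:46:00", tz = "Africa/Kampala") +
           sort(runif(5000, 0, 3600))

per_minute <- table(format(created, "%H:%M"))   # tweets in each minute
summary(as.integer(per_minute))                 # typical volume per minute
# plot(as.integer(per_minute), type = "l")      # the trends graph in the post
```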
We then analysed the tweets further to see how many tweeps had got the memo and kept their conversations on the hashtag. We noticed that a large percentage of users, ~87%, maintained the hashtag while the rest just tweeted without the # sign.
We also looked into the conversations among tweeps to see whether any of them engaged one another. As you might notice, most tweeps kept listening to or watching the debate while tweeting, and had little or no time to respond to others on Twitter. This shows that there was actually a lot of original content coming in from the tweeps.
Having noticed a lot of originality in the tweets, we wanted to look further into whether tweeps were just re-tweeting or coming up with their own content. As you might notice, there was an almost even split between tweeps with original tweets (~45%) and those re-tweeting (~55%).
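One simple way to get that split is to classify each tweet by its leading text, as sketched below on a few invented examples: retweets start with "RT @", replies start with a mention, and everything else counts as original.

```r
# Classify tweets as retweet / reply / original from the raw text.
texts <- c("RT @ntvuganda: Debate starting now #UGDebate16",
           "@friend I think he dodged that one #UGDebate16",
           "Strong answer on the economy #UGDebate16")

is_rt    <- grepl("^RT @", texts)
is_reply <- !is_rt & grepl("^@", texts)
kind     <- ifelse(is_rt, "retweet", ifelse(is_reply, "reply", "original"))

round(100 * prop.table(table(kind)))   # percentage split, cf. the ~45/55 figure
```

twitteR's data frame also carries an `isRetweet` flag, which is more reliable than text matching when it is available.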
Tweet vs RT vs Reply
On further analysis we dived into the trends in how people were tweeting, replying and retweeting (RT). Could it be that people tweeted, waited for the end, and then responded to others' tweets? From the trends below we noticed that users were tweeting, retweeting and replying concurrently as the debate went on. A further analysis we will save for another day is how the tweets dropped off towards the end (24:00 EAT) and how tweeps then jumped onto responding to one another.
For a better visualization we converted this into percentages, just to confirm our earlier reading of the tweet versus re-tweet patterns. As you might notice at a glance, a slightly larger percentage, ~55%, were re-tweets, with the remaining ~45% split between original tweets and replies.
Characters per tweet
So how do you know whether these tweeps were really engaging? We wanted to see how many of the standard 140 characters that Twitter allows they were actually using. As you might notice from the graph below, a large percentage were hitting, and even surpassing, the 140-character limit, which indicates they had plenty to say but were constrained. The histogram is skewed towards the right (the limit). Interestingly, looking again you will notice a few awkward tweeps with fewer than 10 characters. Could these be users new to Twitter? Could they have entered the few characters by mistake on mobile phones? Why would almost 1k tweets look like that? Tweets of fewer than 10 characters seem odd when so many other users were running into the 140-character limit.
A further analysis of those datasets showed us that we were thinking the wrong way. From the previous analyses you might have noticed a small proportion of reply tweets, and when you respond to another person you normally don't write much. It's usually a quick "i agree" or "sorry please check again", and these were exactly the kind of tweets we noticed.
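The character-count check is a one-liner with base R's `nchar()`; the sketch below uses made-up tweets to show both ends of the histogram.

```r
# Measure tweet lengths and flag the suspiciously short ones.
texts <- c("Short", "ok",
           strrep("a", 140),   # a tweet right at the 140-character limit
           "@x yes")           # a terse reply, like the ones we found

chars <- nchar(texts)
hist(chars, breaks = 14, main = "Characters per tweet")  # the skewed histogram
texts[chars < 10]              # inspect the very short tweets directly
```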
Who was tweeting the most? Handles or bots? Media, journalists, individuals, or presidential candidates?
We took a deep dive into the actual Twitter handles doing the tweeting. Who was tweeting most? Were they from the media? Were they paid bots? We wondered as we plotted the data on the graph below. We noticed two outliers: one tweep with over 200 tweets in an hour and another with over 100 tweets in an hour. Why would anyone tweet that much, even more than the media houses? Could they be paid bots? We kept asking ourselves.
To further answer the question of who was tweeting most, we scaled down to the top 30 tweeps within the debate. Do you recognise any journalists or media houses? These are the top people or Twitter handles that were tweeting. So who owns these two accounts, @paulsenbulya and @fortunedavid?
We dug deeper into the Twitter pages of the top two tweeps with over 100 tweets per hour, and these two accounts, with their profile images in the background, definitely answered our doubts.
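Ranking handles by tweet count is a straightforward `table()` over the author column; the handles below are invented stand-ins, not real accounts from the dataset.

```r
# Count tweets per handle and rank them; 'screenName' mirrors the column
# name twitteR's twListToDF produces.
screenName <- c(rep("news_house", 5), rep("busy_tweep", 9), rep("casual", 2))

top <- sort(table(screenName), decreasing = TRUE)
head(top, 30)   # the top-30 table -- outliers stand out immediately
```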
Polarity — What was the sentiment displayed by participants?
This approach uses a supervised learning algorithm to build a classifier that detects the polarity of textual data and classifies it as either positive or negative. Here we used the NRC Emotion Lexicon (http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) and the Naive Bayes classifier (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). Feel free to read up on these two classification models; we leave the discussion of why we chose them for another time.
As you might notice, a high percentage of the tweets had a positive sentiment polarity compared to negative polarity.
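A minimal sketch of the polarity scoring, assuming the syuzhet package's NRC method (the post names the NRC lexicon but not the exact R package, so this is one plausible implementation):

```r
# Score tweets against the NRC lexicon: positive scores lean positive,
# negative scores lean negative.
library(syuzhet)

tweets <- c("Great debate, really impressed with the moderators",
            "Terrible answer, he completely dodged the question")

scores   <- get_sentiment(tweets, method = "nrc")
polarity <- ifelse(scores > 0, "positive",
                   ifelse(scores < 0, "negative", "neutral"))
prop.table(table(polarity))   # share of positive vs negative tweets
```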
What emotions were displayed by the participants
We further analysed which emotions were most displayed in the tweets. You might notice that trust and anticipation came out as the top emotions from users on Twitter during the #UGDebate16 session.
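The emotion tally can be sketched with syuzhet's `get_nrc_sentiment()`, which scores each tweet against the eight NRC emotions plus the two polarity categories; the tweets below are invented examples.

```r
# Tally the eight NRC emotions across a set of tweets.
library(syuzhet)

tweets <- c("I trust him to deliver on jobs",
            "Can't wait to hear the closing statements",
            "That answer made me so angry")

nrc <- get_nrc_sentiment(tweets)          # one row per tweet, one column per emotion
emotion_totals <- colSums(nrc[, 1:8])     # drop the positive/negative columns
sort(emotion_totals, decreasing = TRUE)   # top emotions, cf. trust & anticipation
```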
What were participants mentioning most on Twitter
A word cloud gives us the best visualization of the words that were most tweeted within the sample of tweets we analyzed. Note that the larger a word, the more it was tweeted; also pay close attention to the words near the center, as these were tweeted most.
What if we look into the top 100 words that were tweeted? A little drill-down shows words like Debt, Mabirizi, Bwanika, Mbabazi, and the usual stop words like "the", "what" and "if", which are really not helpful.
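The word-frequency table behind both the cloud and the top-100 list can be sketched with the tm and wordcloud packages (again, plausible tooling rather than our exact scripts; the tweets are stand-ins):

```r
# Build a term-document matrix, drop English stop words, rank word frequencies,
# and draw the cloud.
library(tm)
library(wordcloud)

tweets <- c("mabirizi on land again", "bwanika strong on agriculture",
            "mbabazi quiet tonight", "mabirizi has the crowd laughing")

corpus <- Corpus(VectorSource(tweets))
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop the/what/if...
tdm    <- TermDocumentMatrix(corpus)

freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freqs, 100)                              # the top-100 words table
wordcloud(names(freqs), freqs, min.freq = 1)  # the word cloud itself
```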
Comparison cloud — How were participants engaging in relation to the various emotions discovered?
Looking further into how tweeps used words in relation to the emotions, we notice some interesting findings here.
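A comparison cloud takes a term matrix with one column per group, here one column per emotion, and sizes each word by how specific it is to its group. A sketch with two invented emotion buckets:

```r
# Comparison cloud over per-emotion word counts.
library(tm)
library(wordcloud)

joy_text   <- "great moderators brilliant moderators fun"
anger_text <- "dodged terrible dodged angry"

corpus <- Corpus(VectorSource(c(joy_text, anger_text)))
tdm    <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- c("joy", "anger")

comparison.cloud(tdm, max.words = 50)  # words sized by emotion-specificity
```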
What was the relation between the words mentioned on twitter?
We further used hierarchical clustering to see which words were statistically most related to one another. For a dive into the methodology, you might want to read up on Ward's minimum variance method and hierarchical clustering: https://en.wikipedia.org/wiki/Hierarchical_clustering
You will notice that "bwanika" and "people" were really close, while "mbabazi", "amamambabazi", "kyalya", "besigye" and others were used in closely similar ways.
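The clustering step can be sketched as below: a term-document matrix with sparse terms removed, then `hclust()` with Ward's method. Words that co-occur across the same tweets join low in the resulting dendrogram.

```r
# Hierarchically cluster words by their co-occurrence profile across tweets.
library(tm)

tweets <- c("bwanika people bwanika", "people bwanika agriculture",
            "mbabazi besigye kyalya", "besigye mbabazi kyalya")

tdm <- TermDocumentMatrix(Corpus(VectorSource(tweets)))
m   <- as.matrix(removeSparseTerms(tdm, sparse = 0.9))

fit <- hclust(dist(scale(m)), method = "ward.D2")  # Ward's minimum variance
plot(fit)   # the dendrogram: closely co-used words merge early
```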
What issues/topics were people talking about most
So what topics were people talking about? If people mentioned, for example, Makerere, school or university, we would model this as the topic "Education". As you might notice from our findings, a high percentage (~30%) of tweets fell under the topic "ugdebate, rt, i, mabirizi", which just shows how the debate diverted into humor tweets on Twitter.
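Topic discovery of this kind is commonly done with LDA; the sketch below uses the topicmodels package as one plausible implementation (the post doesn't name the exact method used), with invented tweets about the two themes mentioned above.

```r
# Fit a small LDA model over a document-term matrix and show the top terms
# per topic.
library(tm)
library(topicmodels)

tweets <- c("makerere university school fees", "education school teachers",
            "mabirizi jokes mabirizi laughing", "mabirizi humour rt")

dtm <- DocumentTermMatrix(Corpus(VectorSource(tweets)))
lda <- LDA(dtm, k = 2, control = list(seed = 1234))  # two topics for illustration
terms(lda, 4)   # top four terms per topic, e.g. an "education" cluster
```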
We have the data and are analysing these datasets further. We have questions we are already looking into, and these can be shared on request or in further discussions with our analysis team.
On request / In pipeline:
- Sentiment Analysis on Candidates →Mabirizi(humor!) + Abed(on point!)
- Social Network diagram analysis of handles. Who influences who?
- In-depth topic modeling, removing all the noise/stop words/stemming
- Choropleth Map of district mentions
- Association Mining — take some of the interesting terms from the frequent terms and find patterns, correlations and associations.
A huge thanks to the team behind this: Michael Niyitegeka, Richard Zulu, Solomo Opio and Richard Ngamita.
Finally, as the second debate takes place tomorrow, 13th February 2016, we shall be diving into the same hashtag #UGDebate16, performing more detailed analyses, and sharing further analyses like this, including comparisons between the two debates. We welcome all your queries, suggestions and criticism of our approach.