Twitter Deep-Dive Analysis: US Presidential Election

A Method for Predicting the Winner of the 2020 US Presidential Election Using Data Extracted from Twitter

Kamal Chouhbi
Analytics Vidhya
10 min read · Feb 10, 2020

General Introduction :

Lately, the data science community has turned to analyzing web data, such as blog posts and social network activity, as an alternative way to predict the outcome of the upcoming US presidential election in November 2020.

Traditional polls are costly, while online information has become easy to obtain and freely available. Twitter is one of the social media platforms that lets researchers use its data. It has more than 200 million monthly visitors and carries 500 million messages a day. Using tweets as a data source has two main advantages:

  • First, the volume of tweets is huge and they are publicly available.
  • Second, tweets contain people’s opinions, including their political views.

As a result, Twitter has become the first place voters go to seek accurate information and breaking news from political candidates during the presidential campaign. The company understood the significance of this responsibility, and its teams started building new tools to help people who use Twitter identify original sources and authentic information.

In fact, Twitter decided to bring back candidate labels for the 2020 US elections. The labels will appear on a candidate’s profile page, as well as on any tweets and retweets they share. They tell you which office the individual is running for, and in which state and district. (Read more)

Image: Twitter

Case Study :

Twitter was one of the major battlegrounds of the 2016 presidential election. It was one of the most publicized US presidential elections in history, and Twitter played an increasingly prominent role in it.

Now let’s focus on the two then-front-runners of the 2016 US presidential election, Donald Trump and Hillary Clinton. This study examines differences in how they present themselves and communicate with voters through their Twitter accounts.

In line with findings from previous studies on how women candidates tweet, this analysis helped me find that Hillary Clinton out-tweeted the Republican nominee, attacked him more often than he attacked her, and tweeted about political issues far more often.

By the way, I’ve conducted another analysis on how candidates for the US Congress use Twitter; you can have a look at the article here.

Tweets Collection :

This dataset contains tweets related to the 2016 United States presidential election. They were collected between July 13, 2016 and November 10, 2016 from the Twitter API using Social Feed Manager (SFM), an open source tool that harvests social media data and web resources from Twitter, Tumblr and Flickr. I highly recommend using SFM to collect data, and this is how I did it:

1- Define and organize collections

2- Add collections with various harvest types

3- Add seeds

4- Turn it on. SFM will harvest the data on an ongoing basis or according to the schedule you specify

5- Export a collection to a spreadsheet

You can also feed data into your own processing pipeline from the command line. No specific expertise is needed to use SFM, but it helps to read the SFM documentation and be familiar with social media APIs.
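As a rough illustration of picking up an exported spreadsheet in your own pipeline, here is a minimal sketch using only Python’s standard library. The column names (`handle`, `text`, `time`, `is_retweet`) and the sample rows are assumptions made for this example, not SFM’s actual export schema.

```python
import csv
import io

def load_tweets(csv_file):
    """Read tweets from an open CSV file into a list of dicts (one per row)."""
    return list(csv.DictReader(csv_file))

# In-memory stand-in for an exported file; the rows are made up for illustration.
sample = io.StringIO(
    "handle,text,time,is_retweet\n"
    "HillaryClinton,Example tweet one,2016-09-27T01:10:00,False\n"
    "realDonaldTrump,Example tweet two,2016-09-27T14:05:00,False\n"
)
tweets = load_tweets(sample)
print(len(tweets), "tweets loaded")
```

With a real export you would pass `open("export.csv", newline="", encoding="utf-8")` instead of the in-memory sample, where `export.csv` is a placeholder file name.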

Exploratory Data Analysis

I used a tweets dataset which contains ~3000 recent tweets from Hillary Clinton and Donald Trump, the two major-party presidential nominees.

So we have approximately the same number of tweets for the two candidates. We can say that our data set is well balanced.

1. Tweets over time

We can see that Trump had been tweeting since the beginning of the year, while the influx of Clinton tweets starts around July 16.

2. When do candidates tweet?

It’s also interesting to look at the hourly trend: candidates tend to communicate more in the afternoon and at night.
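An hourly breakdown like this can be computed by bucketing timestamps by hour of day. The sketch below assumes ISO 8601 timestamps that are already in the timezone you care about; the sample values are invented.

```python
from collections import Counter
from datetime import datetime

def tweets_per_hour(timestamps):
    """Return a 24-item list: number of tweets for each hour of the day (0-23)."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [hours.get(h, 0) for h in range(24)]

# Made-up timestamps for illustration.
sample_times = ["2016-09-27T14:05:00", "2016-09-27T14:40:00", "2016-09-28T21:15:00"]
counts = tweets_per_hour(sample_times)
print(counts[14], counts[21])
```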

3. Percentage of retweets

From this we can say that Donald Trump has used Twitter a little bit more than Hillary Clinton, but Clinton’s tweets have been retweeted more than Trump’s.
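One way to get a retweet percentage per account is to count, for each handle, how many of its tweets are flagged as retweets. This sketch assumes each tweet is a dict with a `handle` and a boolean `is_retweet` field (hypothetical names), and it measures the share of an account’s own timeline that consists of retweets.

```python
from collections import defaultdict

def retweet_share(tweets):
    """Return {handle: percentage of that account's tweets that are retweets}."""
    totals = defaultdict(int)
    retweets = defaultdict(int)
    for tweet in tweets:
        totals[tweet["handle"]] += 1
        if tweet["is_retweet"]:
            retweets[tweet["handle"]] += 1
    return {handle: 100.0 * retweets[handle] / totals[handle] for handle in totals}

# Made-up rows for illustration.
sample = [
    {"handle": "HillaryClinton", "is_retweet": True},
    {"handle": "HillaryClinton", "is_retweet": False},
    {"handle": "realDonaldTrump", "is_retweet": False},
]
shares = retweet_share(sample)
```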

4. Tweets Language

Tweets languages in %

As we can see, most of the tweets used by the two candidates were written in English. But we have more than 100 tweets in Spanish :

This does make sense because Clinton’s campaign has made efforts to mobilize Latino voters who will be critical in the general election. In addition, she has a strong network of Latino leaders and activists.

5. Original authors of retweets

Data Analysis :

Extract Trump’s and Clinton’s Tweets

1. Word Clouds

  • Hillary Clinton’s Word Cloud :
  • Donald Trump’s Word Cloud :

Hillary Clinton spent significantly more time on political issues than Trump, who barely mentioned any issues on Twitter.

2. Mentions :

Extract mentions from tweets
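A simple way to pull mentions out of tweet text is a regular expression. The sketch below is an illustrative approach, not necessarily the code used for the charts in this article; the pattern and the 15-character handle limit are assumptions about what counts as a mention.

```python
import re
from collections import Counter

# "@" followed by up to 15 word characters, not preceded by a word character
# (so e-mail addresses such as a@b.com are skipped).
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})")

def extract_mentions(text):
    """Return the list of @handles found in one tweet."""
    return ["@" + name for name in MENTION_RE.findall(text)]

def top_mentions(texts, n=10):
    """Return the n most mentioned handles, compared case-insensitively."""
    counts = Counter(m.lower() for text in texts for m in extract_mentions(text))
    return counts.most_common(n)

print(extract_mentions("Thanks @POTUS and @TimKaine! Mail me at a@b.com"))
```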

Graph mentions most used by candidates (top 10)

Hillary Clinton mentions other Twitter users in only every fifth tweet. Most of her user mentions are for @POTUS (+125 mentions), followed by @realDonaldTrump, who is mentioned on her account more often than any other user apart from @POTUS, including @TimKaine and @JoeBiden.

Donald Trump, on the other hand, does not mention @HillaryClinton’s Twitter handle. Donald Trump consistently mentions other Twitter users in two-thirds of his tweets, often mentioning his own account @realDonaldTrump (+280 mentions), but also @FoxNews (+70 mentions), @CNN (+60) and Fox News anchor @MegynKelly (+52 mentions) for their allegedly biased and unfair reporting.

It seems that one thing both candidates had in common was their frequent mention of Trump. Surprisingly, while Trump was Hillary’s second most mentioned account, she was not on Trump’s list.

It remains to be seen which strategy is better: engaging with and mentioning your opponent’s Twitter account or simply ignoring it.

3. Hashtags Wars:

Hillary Clinton is more sparing, using hashtags in only 18% of her tweets; most commonly #DemsInPhilly and #DebateNight. She has also used hashtags to react to her opponent during the Republican National Convention (#RNCinCLE used 35 times).

Donald Trump includes a hashtag in almost every other tweet, including #Trump, used +340 times, and #MakeAmericaGreatAgain, used 246 times.

It seems Trump was much more fond of hashtags than Hillary, using his favorite hashtag (#Trump) almost 9x more often than she used her favorite (#DemsInPhilly).
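Figures like these (share of tweets containing a hashtag, and how often each hashtag is used) can be computed with a short helper. This is a minimal sketch on made-up texts; the regular expression is an assumption about what counts as a hashtag.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")

def hashtag_stats(texts):
    """Return (percentage of tweets with at least one hashtag, Counter of hashtags)."""
    counts = Counter()
    with_tag = 0
    for text in texts:
        tags = ["#" + tag.lower() for tag in HASHTAG_RE.findall(text)]
        if tags:
            with_tag += 1
        counts.update(tags)
    share = 100.0 * with_tag / len(texts) if texts else 0.0
    return share, counts

# Made-up tweets for illustration.
sample_texts = [
    "Great rally #Trump #MakeAmericaGreatAgain",
    "No tags here",
    "#trump again",
    "Thanks",
]
share, counts = hashtag_stats(sample_texts)
```

Running it once per candidate gives the numbers needed to compare their hashtag habits.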

Network of hashtags

Account to Hashtags : @realDonaldTrump

Account to Hashtags : @FoxNews

Sentiment Analysis

I tried to categorize the tweet column into positive and negative sentiments using TextBlob. Its default analyzer relies on a built-in sentiment lexicon and returns a polarity score between -1 and 1 for each text.

TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.
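Here is a minimal sketch of how a polarity score can be turned into a sentiment label. The zero threshold and the helper names are assumptions for this example. TextBlob is imported inside the function so that the labelling helper still works on its own when the package is not installed.

```python
def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

def tweet_sentiment(text):
    """Label one tweet using TextBlob's default sentiment analyzer."""
    from textblob import TextBlob  # requires: pip install textblob
    return label_from_polarity(TextBlob(text).sentiment.polarity)

print(label_from_polarity(0.5), label_from_polarity(-0.2), label_from_polarity(0.0))
```

Raising `threshold` slightly above zero would treat weakly scored tweets as neutral, which is a judgment call rather than a rule.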

Contrary to what we might have thought, most of the tweets are positive or neutral. Now I’ll focus on each candidate separately.

It’s really interesting to see that Donald Trump has a large number of positive tweets: they represent nearly double the number of his neutral or negative tweets.

Impersonating the candidates

Thanks to Markovify, we can use a Markov Chain Model to create synthetic tweets based on an existing library of actual tweets. Right now, its main use is for building Markov models of large corpora of text and generating random sentences from that.

Sometimes this works and it helps point the fake tweets in a certain direction.
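To show the idea behind a Markov chain text generator, here is a hand-rolled, first-order, word-level chain in plain Python. It is not Markovify’s implementation, only a sketch of the same principle, and the corpus consists of invented placeholder sentences rather than real tweets.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_words=30):
    """Walk the chain from START until END is drawn or max_words is reached."""
    word, output = START, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

# Invented placeholder sentences, not real tweets.
corpus = ["we will win big", "we will make it great", "they will not win"]
chain = build_chain(corpus)
fake = generate(chain, random.Random(0))
print(fake)
```

Because each next word depends only on the current one, the output can stitch together halves of different sentences, which is where the "fake tweet" effect comes from.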

Predicting who said it: Trump or Clinton

In this last section I’ll attempt to build models that predict the author of a given tweet. To identify each author, I’ll create a bag of words containing the most common words of the two authors combined. Here, the bag of words is the list of the most common words in the source text, and each tweet is described by how often those words occur in it. This set later becomes the basis for feature engineering.
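A bag of words can be sketched in a few lines: build a vocabulary from the most common words, then turn each tweet into a vector of counts. The tokenizer and the vocabulary size below are assumptions for this example, and the texts are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts, size=1000):
    """Return the `size` most common words across all texts."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return [word for word, _ in counts.most_common(size)]

def vectorize(text, vocabulary):
    """Return one count per vocabulary word for the given text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Made-up texts for illustration.
texts = ["Make America great", "Great, great jobs"]
vocabulary = build_vocabulary(texts, size=2)
print(vocabulary[0], vectorize("great jobs great", ["great", "jobs"]))
```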

I started by applying a Multinomial Naive Bayes algorithm as the classifier.
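To make the algorithm concrete, here is a from-scratch multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustrative sketch on invented token lists, not the model trained for this article; in practice a library implementation such as scikit-learn’s `MultinomialNB` is the usual choice.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for token in tokens:
                if token in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][token]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data for illustration only.
docs = [["great", "again", "rally"], ["crooked", "media", "rally"],
        ["families", "health", "plan"], ["plan", "for", "families"]]
labels = ["trump", "trump", "clinton", "clinton"]
model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["families", "plan"]), model.predict(["rally", "media"]))
```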

Confusion Matrix on the training set

Then I decided to improve the model by using Pipeline and Cross Validation :
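For readers unfamiliar with cross-validation, the splitting step can be sketched by hand as follows. This is an illustrative, library-free sketch, not the pipeline used for the results in this article, and it leaves out shuffling, which you would normally do before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=5))
for train, test in folds:
    print(len(train), "training samples,", len(test), "test samples")
```

Each sample lands in the test set exactly once, so averaging the score over the folds gives a steadier estimate than a single train/test split.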

Confusion Matrix on the testing set

This time we got an accuracy of 92% on the testing set. In other words, the model correctly identified the author of 92% of the previously unseen tweets in the test set.
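The link between a confusion matrix and accuracy is simple: accuracy is the sum of the diagonal divided by the total number of examples. Here is a small sketch on made-up labels (the numbers are not the results reported above).

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Share of examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Made-up labels for illustration.
y_true = ["trump", "trump", "clinton", "clinton", "clinton"]
y_pred = ["trump", "clinton", "clinton", "clinton", "trump"]
matrix = confusion_matrix(y_true, y_pred, labels=["clinton", "trump"])
print(matrix, accuracy(matrix))
```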

Now let’s test our prediction model and see what

Sum-Up :

The concept behind this study is to collect tweets using Twitter’s API and apply different algorithms to classify them and find trends in what candidates are saying about a specific topic. We were able to show how social media, and Twitter in particular, can be used to extract the sentiment or views of people who are likely to vote in the general election or to influence those who will vote, to classify that sentiment with sentiment analysis, and ultimately to help predict future outcomes such as an election.

So the goal was to gather tweets that refer to the elections and more specifically to the two main candidates of the US presidential elections: Hillary Clinton and Donald J. Trump.

Further Work:

Still, we can apply the same study to the candidates of the 2020 presidential election.

The approach uses selected data from just the weeks before the election. The prediction could be derived by comparing the number of tweets mentioning each candidate, or by comparing the number of tweets that have positive sentiment towards each candidate. The number of tweets mentioning a party can reflect the election result, and sentiment analysis of tweets can reduce the error of the prediction.

We can also use other methods, such as utilizing interaction information between potential voters and the candidates in each state separately. We can also create a trend line from the changes in the candidates’ follower counts and the size of their networks (followers on Twitter and friends on Facebook).

What you should keep in mind is that despite the huge amount of information we can extract from social media, it still has a small effect on election results. Therefore, it only makes a difference in a closely contested election.

Congratulations if you managed to get here. Thanks for reading, I hope you’ve liked it. For personal contact or discussion on Machine Learning, feel free to reach out to me on LinkedIn and don’t forget to follow me on GitHub and Medium.