Trump and Clinton may have used some Machine Learning

The US Presidential Election has undoubtedly been one of the topics of greatest attention in the past year. Social media has disrupted traditional political campaigning strategies and allowed for better understanding of people’s views towards political issues and candidates. Paradoxically, despite the large amount of available data, the election results went against most polls, analyses, and predictions.

In this study, we use social media data (specifically Twitter) to provide insights on each candidate’s popularity, tweeting patterns, and most common topics. Additionally, we attempt to model and predict the success of a new candidate’s tweet.

The dataset

We use a public Twitter dataset containing a total of 6 thousand tweets from the candidates’ official Twitter accounts: @realDonaldTrump and @HillaryClinton (about 3 thousand tweets each). Each tweet contains its text, date, number of times it was retweeted by users, number of times it was marked as favorite, along with some other metadata.

The dataset can be downloaded from the Kaggle website:
https://www.kaggle.com/benhamner/clinton-trump-tweets

The code

The code was written in R using the IBM Watson Data Platform. You can sign up for free at http://datascience.ibm.com/. A Jupyter Notebook containing the source code is available here:

https://github.com/IBMDataScience/election2016

Exploring the tweets

Donald Trump’s most frequently used words are as follows:

Note how the most common words in Trump’s tweets (i.e., great, will, thank) have very positive meanings, which is ideal for political campaigning.

Interestingly, Hillary Clinton’s most frequently used word on Twitter is “trump”.
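Word-frequency rankings like these can be sketched in a few lines. The snippet below is a minimal illustration in Python, not the R code from the notebook. The sample texts, the stop-word list, and the function name `top_words` are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "you", "i", "we"}

def top_words(tweets, n=3):
    """Return the n most frequent non-stop-words across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        # Lowercase and keep alphabetic tokens only (drops punctuation and digits).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample texts, not taken from the dataset.
sample = [
    "Thank you Ohio, we will make America great again",
    "Great rally tonight, thank you",
    "We will win and it will be great",
]
print(top_words(sample))
```

On real tweets, one would likely also strip URLs and apply stemming before counting.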

Trump’s most viral tweets

| Tweet text | Retweets | Likes |
| --- | --- | --- |
| How long did it take your staff of 823 people to think that up — and where are your 33,000 emails that you deleted? | 167,274 | 294,162 |
| The media is spending more time doing a forensic analysis of Melania’s speech than the FBI spent on Hillary’s emails. | 120,817 | 247,883 |
| Happy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics! | 82,653 | 115,107 |

Clinton’s most viral tweets

| Tweet text | Retweets | Likes |
| --- | --- | --- |
| Delete your account | 490,180 | 660,384 |
| “I never said that.” — Donald Trump, who said that. | 91,670 | 134,808 |
| Great speech. She’s tested. She’s ready. She never quits. That’s why Hillary should be our next @POTUS. (She’ll get the Twitter handle, too) | 63,628 | 190,992 |

Basic statistics on tweets

Note that Trump’s tweets are retweeted twice as often as Clinton’s. However, Trump’s tweets span December 2015 to September 2016, whereas Clinton’s span April 2016 to September 2016. Since the dataset holds about 3 thousand tweets from each account, Clinton tweeted more frequently than Trump over the period covered.
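The difference in tweeting frequency can be checked with rough arithmetic. The sketch below uses the approximate figure of 3 thousand tweets per account and counts whole calendar months inclusively. Both are simplifying assumptions, so the rates are ballpark values only.

```python
# Approximate figure from the dataset description: about 3,000 tweets per candidate.
tweets_each = 3000

# Month counts assume inclusive calendar months, which is a simplification.
trump_months = 10    # December 2015 through September 2016
clinton_months = 6   # April 2016 through September 2016

trump_rate = tweets_each / trump_months      # tweets per month
clinton_rate = tweets_each / clinton_months  # tweets per month

print(f"Trump:   ~{trump_rate:.0f} tweets/month")
print(f"Clinton: ~{clinton_rate:.0f} tweets/month")
print(f"Ratio:   ~{clinton_rate / trump_rate:.1f}x")
```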

Note that after both the Democratic and Republican nominations, both candidates received a great deal of attention. However, after the first debate (arguably won by Clinton), attention to Clinton rose sharply, which was not the case for Trump.

Modeling tweet success

After exploring the dataset, one interesting problem that comes to mind is to model and predict how successful a new tweet would be.

The first step is to define success. In our case, we assume a successful tweet is one that is retweeted many times. This leads naturally to a regression problem where the goal is to predict the number of retweets.

The second step is to engineer features that capture the meaning of the text. A very common and simple approach is TF/IDF (Term Frequency/Inverse Document Frequency) weighting. R’s tm (text mining) package offers built-in capabilities to extract TF/IDF features. The resulting matrix has one row per tweet and one column (feature) per term (i.e., word) in the corpus.
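To make the shape of this feature matrix concrete, here is a minimal TF/IDF sketch in plain Python. It is not the tm implementation used in the notebook. The length-normalized term frequency, the base-2 logarithm, the whitespace tokenizer, and the name `tfidf_matrix` are choices made for this illustration, and the three documents are made up.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a TF/IDF matrix: one row per document, one column per term.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            tf = counts[term] / len(tokens)      # length-normalized term frequency
            idf = math.log2(n_docs / df[term])   # 0 when the term is in every document
            row.append(tf * idf)
        rows.append(row)
    return vocab, rows

# Made-up documents for illustration.
docs = ["make america great", "great rally tonight", "america first"]
vocab, rows = tfidf_matrix(docs)
print(vocab)
print(len(rows), "rows x", len(vocab), "columns")
```

In the first document, "make" appears in only one document and therefore gets a higher weight than "great", which appears in two.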

The third step is to build a regression model to predict the number of retweets. We chose the Multivariate Adaptive Regression Splines (MARS) algorithm because it can model non-linear relationships in the data (which may be the case for TF/IDF features) and works well with high-dimensional data.
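MARS captures non-linearity by combining hinge functions of the form max(0, x - c) and max(0, c - x), where c is a knot chosen during fitting. The sketch below only evaluates such a model for a single feature and does not fit one. The coefficients, knots, and function names are hypothetical and are not taken from the trained model.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) if direction is +1, max(0, knot - x) if -1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate a one-variable MARS-style model: intercept plus weighted hinge functions.

    terms is a list of (coefficient, knot, direction) tuples.
    """
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)

# Hypothetical model: flat below 0.2, rising with slope 1000 between 0.2 and 0.5,
# then flat again because the second hinge cancels the slope of the first.
terms = [(1000.0, 0.2, 1), (-1000.0, 0.5, 1)]
for x in (0.0, 0.2, 0.35, 0.5, 0.8):
    print(x, mars_predict(x, 5000.0, terms))
```

A fitted model would have many such terms, possibly including products of hinges across different features.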

Prediction results

A model was trained on Trump’s tweets (the reader can use a similar approach for Clinton’s tweets). After training, we tested it on 11 tweets covering various topics.

Tweet number 1 was part of the training set and was in fact Donald Trump’s most retweeted tweet (167,274 retweets). The predicted number of retweets was very close (161,918). Because the model saw this tweet during training, this is an optimistic indication of model quality rather than a measure of performance on unseen tweets. R-squared was around 0.75 during training.
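For reference, R-squared compares the model's squared errors with those of a baseline that always predicts the mean. A minimal sketch follows. The retweet counts are made up for illustration and do not reproduce the 0.75 reported above.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up retweet counts for illustration only.
actual = [1000, 5000, 20000, 160000]
predicted = [3000, 4000, 25000, 150000]
print(round(r_squared(actual, predicted), 3))
```

A value of 1 means perfect predictions, and a value of 0 means the model does no better than predicting the mean.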

Tweet number two is a random text completely unrelated to Trump’s vocabulary. As expected, it got a much smaller predicted number of retweets (4,517).

The third tweet is related to Trump’s ideology and obtains almost 20K predicted retweets. Notice that the number of times the words appear in the text makes no difference: tweet number 4 repeats the same sentence five times and receives the same prediction.

| # | Tweet text | Predicted retweets |
| --- | --- | --- |
| 1 | How long did it take your staff of 823 people to think that up — and where are your 33,000 emails that you deleted? | 161,918 |
| 2 | I love crocodiles | 4,517 |
| 3 | I will defeat Isis | 19,886 |
| 4 | I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis I will defeat Isis | 19,886 |
| 5 | will take trump realdonaldtrump hillari | 3,323 |
| 6 | I will make America great again | 23,866 |
| 7 | I like Hillary Clinton | 14,138 |
| 8 | I hate Hillary Clinton | 14,138 |
| 9 | Climate change is a hoax, and a very expensive one! | 40,023 |
| 10 | Climate change is real. We must act to save the planet. | 2,521 |
| 11 | College education should be free for everyone! | 4,517 |

Tweet number five is a random permutation of Trump’s most common words (see word cloud above). It is predicted to be very unsuccessful (~3K retweets). This is because TF/IDF not only considers term frequency but also normalizes feature values by document frequency, so words that appear in many tweets carry little weight.
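One possible reading of the results for tweets 4 and 5 is sketched below. It assumes length-normalized term frequency and a base-2 IDF, which may differ from the weighting actually used in the notebook. The document-frequency counts are hypothetical.

```python
import math
from collections import Counter

def normalized_tf(text):
    """Term frequency divided by document length."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

# Repeating a sentence scales every count and the length by the same factor,
# so the normalized term frequencies are unchanged.
once = "I will defeat Isis"
five_times = " ".join([once] * 5)
print(normalized_tf(once) == normalized_tf(five_times))

def idf(n_docs, doc_freq):
    """Inverse document frequency: small for terms found in many documents."""
    return math.log2(n_docs / doc_freq)

# Hypothetical counts: a word found in half of 3,000 tweets versus one found in 15.
print(idf(3000, 1500), idf(3000, 15))
```

Under these assumptions, tweets 3 and 4 map to the same feature vector, and a tweet built only from very common words receives small feature values.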

Tweet number 6 (Trump’s slogan) is predicted to be successful, as is tweet number 9, given that Trump tweeted a similar idea using slightly different words.

Tweets 10 and 11 are predicted to do as badly as or even worse than “I love crocodiles”, which is not surprising since they go against Trump’s political views.

Interestingly, tweets 7 and 8 have exactly the same predicted success. This may be because the frequency of the words “like” and “hate” does not correlate with tweet success. The interested reader may use sentiment analysis (for example, with R’s qdap package) to enhance the model in this regard.
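As a rough idea of what a sentiment feature could look like, the sketch below scores text with a tiny word-polarity lexicon. It is written in Python and does not use qdap. The lexicon, the scoring rule, and the name `sentiment_score` are hypothetical and only illustrate the principle.

```python
# Tiny hypothetical lexicon; real lexicons contain thousands of scored words.
POLARITY = {"like": 1, "love": 1, "great": 1, "hate": -1, "bad": -1, "hoax": -1}

def sentiment_score(text):
    """Sum of word polarities divided by the number of tokens; 0.0 for empty text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(POLARITY.get(t, 0) for t in tokens) / len(tokens)

for text in ("I like Hillary Clinton", "I hate Hillary Clinton"):
    print(text, sentiment_score(text))
```

Appending such a score as an extra column next to the TF/IDF features would give the regression model a signal that separates tweets 7 and 8.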

Lessons learned

We have shown how the IBM Watson Data Platform can be leveraged to explore, visualize, and model data using the R language.

Simple statistics such as the number of retweets/favorites were already favoring Trump, which is consistent with the election results.

Despite the small size of both the dataset and the vocabulary (i.e., 3K tweets and 6K words), the predictions of Twitter success were shown to align with the candidate’s political views. We believe a larger dataset would improve accuracy and broaden the range of topics the model can handle.

The model was not able to differentiate negative vs. positive views towards a particular topic. We believe this can be improved by incorporating additional features from the sentiment analysis domain.

The MARS algorithm selects a subset of the features. This means that some words were discarded, reducing the size of the vocabulary the model can handle. Using all the features would be ideal, but it raises scalability issues. In fact, algorithms such as Linear Regression and Support Vector Machines did not succeed when run on vanilla R. We suggest using SparkR to create a regression model with all features (i.e., words).


Originally published at datascience.ibm.com on December 21, 2016.
