Detecting Fake Political Tweets with Machine Learning

Arjun Govind · Analytics Vidhya · Aug 25, 2019

Amidst all the talk about Russian collusion and obstruction in the 2016 US election, one facet that has gone largely overlooked is how Russia attempted to use social media to influence the outcome. One way Russia did so was through “troll factories” that produced incendiary tweets on Twitter to polarize public sentiment stateside.

When I saw that FiveThirtyEight had written about and compiled a dataset of these fake Russian tweets, I was curious to explore them. As you’ll see in the wordclouds, some features of these tweets seemed reminiscent of Trump’s tweets, so I decided to compare the two! For this project, I used this dataset of 200,000 Russian troll tweets and this dataset of Trump’s tweets between mid-2015 and Election Day 2016, both from Kaggle.

To compare Trump’s tweets to the Russian ones, I set out to answer the following questions:

  • What are the Trump and Russian tweets about? What words do they use most often?
  • How does the sentiment of each vary over the election cycle?
  • Which words are the most “Trumpian”, and which ones are the least?
  • How reliably can we distinguish a Trump tweet from a fake Russian one?

Cleaning the Data

Our first task was cleaning the data to get meaningful text for analysis. For each dataset, I started by selecting the columns that pertained to the analysis, namely the date tweeted, the hashtag, the time tweeted, and the tweet text.

From there, I noticed two key elements of the tweet text that needed cleaning: the presence of corrupted or non-English characters in the Russian tweets, and the presence of links. I figured that the text in links would not be germane to this analysis, so I removed both “https://…” and “t.co/…” URLs.
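A minimal sketch of this step, assuming the tweet text sits in a column called `text` of a data frame `tweets` (stand-in names, not necessarily those in the Kaggle datasets):

```r
library(stringr)

# Strip links and corrupted characters from the raw tweet text
clean_tweet_text <- function(text) {
  text <- str_replace_all(text, "https?://\\S+", "")  # full URLs
  text <- str_replace_all(text, "t\\.co/\\S+", "")    # shortened t.co links
  # Drop corrupted / non-English characters (e.g. Cyrillic artifacts)
  text <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")
  str_squish(text)  # collapse leftover whitespace
}

tweets$text <- clean_tweet_text(tweets$text)
```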

From here, I began feature extraction: converting the timestamp into meaningful dates, and extracting the sentiment and tweet length.

I then used the syuzhet package to get a numeric “score” for the sentiment of each tweet, and used stringr to get the length of each tweet. Each of these columns was appended to the datasets, and the two datasets were then combined into a single data frame we could analyze.
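A sketch of this step, assuming the two datasets sit in data frames `trump` and `russian` with `date` and `text` columns (again, hypothetical names):

```r
library(syuzhet)
library(stringr)
library(dplyr)

add_features <- function(df) {
  df %>%
    mutate(
      month     = format(as.Date(date), "%Y-%m"),          # bucket by month
      sentiment = get_sentiment(text, method = "syuzhet"), # numeric sentiment score
      length    = str_length(text)                         # tweet length in characters
    )
}

# Tag each source and stack the two datasets into one data frame
tweets <- bind_rows(
  add_features(trump)   %>% mutate(source = "Trump"),
  add_features(russian) %>% mutate(source = "Russian")
)
```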

What Do These Tweets Talk About?

So what do these tweets talk about anyway?

As you’ll see, on the face of it, the words used in the two can be awfully similar. In fact, can you tell which is which?

Scroll for the answer!

A closer look reveals that the first wordcloud is Trump’s.

That said, we see a number of commonalities in what they talk about. Beyond the obvious focus on Trump, “Hillary” features prominently in both, for instance. Trump’s focus tends to be broader, covering topics like the news (Fox News and CNN), polling, and his signature “Make America Great Again” campaign slogan. Conversely, the Russian wordcloud places a larger emphasis on Obama than Trump’s does. Interestingly, the Russian one also seems to put more emphasis on race, with words like “black” and “white” featuring in it but not in Trump’s.
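The post doesn’t say how the wordclouds were built; here is a minimal sketch using the tm and wordcloud packages (my choice of packages, continuing the hypothetical `tweets` data frame):

```r
library(tm)
library(wordcloud)

plot_wordcloud <- function(text_vec) {
  corpus <- VCorpus(VectorSource(text_vec))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  # Word frequencies across the whole corpus
  # (as.matrix is fine for a sketch; keep the DTM sparse for all 200k tweets)
  freq <- sort(colSums(as.matrix(DocumentTermMatrix(corpus))), decreasing = TRUE)
  wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
}

plot_wordcloud(tweets$text[tweets$source == "Trump"])
plot_wordcloud(tweets$text[tweets$source == "Russian"])
```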

Tweet Sentiment

With the numeric sentiment scores we obtained, we can now plot sentiment over time for the months in which we have both Trump’s and Russian tweets.
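A sketch of such a plot with ggplot2, continuing the hypothetical `tweets` data frame and its `month`, `sentiment`, and `source` columns:

```r
library(ggplot2)
library(dplyr)

# Average sentiment per source per month
monthly <- tweets %>%
  group_by(source, month) %>%
  summarise(mean_sentiment = mean(sentiment), .groups = "drop")

ggplot(monthly, aes(x = month, y = mean_sentiment, colour = source, group = source)) +
  geom_line() +
  labs(x = "Month", y = "Mean sentiment score",
       title = "Tweet sentiment over the election cycle")
```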

Unsurprisingly, as alluded to in the wordclouds, the Russian tweets tend to be far more negative than Trump’s. In fact, while almost all of Trump’s months had net positive sentiment scores, only a handful of months saw net positive Russian tweets.

What I find interesting in this figure is how the spike in Trump’s sentiment around April 2016 broadly aligns with the Russian spike in May 2016, and how both dip sharply in the subsequent months. While it’s hard to pinpoint a precise reason for these spikes, April 2016 saw Trump take a large lead in the Republican primaries; perhaps his success in the primaries and ultimate confirmation as the nominee contributed to the spike in sentiment. Later, events like Bernie Sanders’ endorsement of Hillary Clinton in July and the Democratic National Convention could have contributed to Trump’s distinctly harsher tone.

Do Hashtags Impact Tweet Content?

An interesting dimension of this dataset was the use of hashtags. One of the most distinctive parts of Trump’s Twitter feed is the abundant #MakeAmericaGreatAgain hashtag. I was curious to see how Russian tweets used hashtags, and whether the presence of a hashtag impacted the content of a tweet. The graph below shows how the use of hashtags changed with time.
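One way to produce the underlying numbers, continuing the earlier sketches (a tweet counts as hashtagged if it contains at least one `#` term):

```r
library(dplyr)
library(stringr)

# Flag tweets containing at least one hashtag, then compute the
# monthly share of hashtagged tweets for each source
tweets <- tweets %>%
  mutate(has_hashtag = str_detect(text, "#\\w+"))

hashtag_share <- tweets %>%
  group_by(source, month) %>%
  summarise(pct_hashtag = 100 * mean(has_hashtag), .groups = "drop")
```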

While Trump’s hashtags were more iconic, it appears that the Russian tweets used far more hashtags in almost every month! Trump’s use of hashtags crashed in May 2016 from 40% of tweets down to just over 20%, while the Russian use of hashtags fluctuated between 40% and 60% of all tweets. So, does the presence of a hashtag affect the substance of the tweet?

The graph above shows how sentiment varies based on both the tweeter and the presence of a hashtag. For the most part, there isn’t a substantial difference in the month-to-month sentiment of Russian tweets with and without hashtags. Strikingly, that isn’t the case with Trump! Trump’s tweets with a hashtag were overwhelmingly positive, with a sentiment score of around 0.5 on average. Without a hashtag, they were consistently lower, even dipping below the sentiment of Russian tweets in one month!

A possible explanation is that Trump’s #MAGA tweets carry a sense of optimism and hope for the future, leading to a positive sentiment score.

How Can We Distinguish a Trump Tweet From a Fake Tweet?

Having explored features of the tweets like sentiment, length, and hashtags, we now arrive at the core question: how can we distinguish a Trump tweet from a fake Russian one, and how reliably can we do so using text analytics?

To do this, we first split our tweets into training and test datasets. Then:

  • I created a binary label marking each tweet as Trump’s or not.
  • I built a Document-Term Matrix, logging the frequency of each word in each tweet.
  • From there, I used a LASSO model to select the terms most important in predicting whether a tweet is Trump’s.
  • With these variables, I created a logistic regression model to give a probabilistic estimate of whether a tweet is Trump’s (a sketch of the full pipeline follows this list).
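The post describes LASSO selection followed by a separate logistic regression; for brevity, this sketch collapses the two steps into a single cross-validated LASSO logistic fit with glmnet, reusing the hypothetical `tweets` data frame from the earlier snippets:

```r
library(tm)
library(glmnet)

# Binary label: 1 = Trump, 0 = Russian troll
y <- as.numeric(tweets$source == "Trump")

# Document-term matrix of word frequencies
corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
dtm    <- DocumentTermMatrix(corpus)
dtm    <- removeSparseTerms(dtm, 0.99)  # keep only reasonably common terms
X      <- as.matrix(dtm)

# Train/test split
set.seed(42)
train <- sample(nrow(X), 0.7 * nrow(X))

# Cross-validated LASSO logistic regression: selects terms and
# estimates their coefficients in one penalized fit
cv_fit <- cv.glmnet(X[train, ], y[train], family = "binomial", alpha = 1)

# Probabilistic predictions on the held-out test set
probs <- predict(cv_fit, X[-train, ], s = "lambda.min", type = "response")
mean((probs > 0.5) != y[-train])  # misclassification rate
```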

From this model, a highly positive coefficient means a word is very “Trumpian”, or distinctive of Trump, while a highly negative coefficient means a word is very “non-Trumpian”, or distinctive of tweets that aren’t Trump’s. This is because the presence of a word with a highly positive coefficient increases the probability of a tweet being classified as Trump’s.
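Continuing the sketch, the fitted coefficients can be extracted and ranked:

```r
# Non-zero coefficients at the chosen lambda: large positive values
# are the most "Trumpian" terms, large negative the most "non-Trumpian"
cmat  <- as.matrix(coef(cv_fit, s = "lambda.min"))
coefs <- data.frame(term = rownames(cmat), beta = cmat[, 1])
coefs <- coefs[coefs$beta != 0 & coefs$term != "(Intercept)", ]

head(coefs[order(-coefs$beta), ], 10)  # most "Trumpian"
head(coefs[order(coefs$beta), ], 10)   # most "non-Trumpian"
```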

So, which words are most “Trumpian”? Below are wordclouds of the most “Trumpian” and “non-Trumpian” words.

These words are the most “Trumpian”
…and these words are the best indicator of a fake Russian tweet

The figures above show which words are most and least Trumpian, respectively. Unsurprisingly, the positive, grand words we commonly associate with Trump’s discourse, like “great” and “big”, appear to be the best indicators that a tweet is by Trump. He also focused more on polling, a topic largely omitted by the Russian trolls.

Turning to the non-Trumpian words, the term #tcot jumps out. A little Googling tells us it’s an abbreviation for “top conservatives on Twitter”, a way conservative Twitter users identify themselves. More tellingly, the word “black” also stands out. According to FiveThirtyEight’s article on these tweets, many of them were fake Black Lives Matter posts meant to stir up controversy and promote racial division, so it makes sense that this word features so prominently.

How Well Can This Model Spot Fake Tweets?

Quite simply put, really well!

To measure how well this model works, we applied it to our test dataset and got a misclassification rate of around 15%, meaning the model classified a tweet correctly 85% of the time. To measure the model’s efficacy more robustly, we used an ROC curve, which plots the model’s true positive rate against its false positive rate across classification thresholds. Our ROC curve on the testing data is shown below.
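The post doesn’t name the package used for the ROC curve; pROC is one common choice. Continuing the sketch above:

```r
library(pROC)

# ROC curve and AUC on the held-out test data
roc_obj <- roc(response = y[-train], predictor = as.numeric(probs))
plot(roc_obj)  # ROC curve
auc(roc_obj)   # area under the curve
```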

Our model gives us an Area Under Curve, or AUC, of 0.855, which is far better than the baseline of 0.50. Overall, this tells us that this model is very effective in predicting whether a tweet is Trump’s or not.

Conclusions

This project has explored a range of topics surrounding troll Russian tweets and how they compare to President Trump’s in the year before the election. Here’s what I found:

  • Russian tweets eclipsed Trump’s many times over in volume, and the number of Russian tweets spiked towards Election Day 2016
  • Trump’s tweets were more positive than the Russian ones, and the sentiments had similar peaks and troughs
  • Trump’s tweets were longer than the Russian ones, but they got shorter as Election Day approached while the Russian tweets got longer
  • Hashtags matter with Trump! While Russian tweets’ sentiment didn’t vary very much whether there was a hashtag or not, Trump tweets with a hashtag were substantially more positive
  • We can reliably distinguish Trump’s tweets from fake Russian ones using LASSO and logistic regression
  • Some of the most “Trumpian” words are “great”, “big” and “very”, while the least Trumpian word was #tcot, which stands for Top Conservatives on Twitter, a common hashtag used by conservatives.
  • Our model was able to classify tweets correctly 85% of the time and achieved an AUC of 0.855 on the ROC curve with testing data

Originally published here: https://arjungovind55.wixsite.com/trolltweetsandtrump
