Who’s Tweeting from the Oval Office?

Did Trump type out that tweet? Or was it an aide in Trump clothing?

This is Part 1 of a 4-part series. Check out the whole series! And be sure to follow @whosintheoval on Twitter to see who is actually tweeting on Donald Trump’s account!

  1. Who’s tweeting from the Oval Office?
  2. Choosing features
  3. How did the different models do?
  4. Let’s get this bot live on Twitter!

Motivation

On December 1st, 2017, Michael Flynn pleaded guilty to lying to the FBI. The next day, Trump’s personal Twitter account posted a tweet saying that Trump had to fire Flynn because he had lied to the Vice President and the FBI.

This was quite controversial because on February 14th of that year, the day after Flynn resigned, Trump had asked James Comey, then the director of the FBI, to back off any investigations of Flynn. If Trump knew at the time of his request to Comey that Flynn had indeed lied to the FBI, then Trump’s tweet could be seen as evidence that Trump attempted to obstruct justice. After several legal experts argued this point, Trump defended himself by claiming that his lawyer John Dowd wrote and posted the tweet. But did he really?


Background

Forensic text analysis was an early field in machine learning and has been used in cases as varied as identifying the Unabomber, revealing J.K. Rowling as the true identity of the author Robert Galbraith, and determining the specific authors of each of the Federalist Papers. This project uses machine learning and these same techniques to classify tweets on @realDonaldTrump as written either by Trump himself or by his staff using his account. The task is particularly challenging because tweets are so short: there just isn’t much signal to pick up in such a brief text. In the end, though, I did succeed, with almost 99% accuracy. You can follow my Twitter bot @whosintheoval to watch it post predictions in real time whenever Trump tweets, or read on to learn how I built it.


The Data

Prior to March 26, 2017, Trump tweeted from a Samsung Galaxy device while his staff tweeted from an iPhone. The metadata of each tweet records the device it was sent from, so for this period we know whether it was Trump himself or his staff tweeting (see Further Reading below for some articles discussing this assumption). After March, however, Trump switched to using an iPhone as well, so the tweeter can no longer be identified from the metadata alone and must be deduced from the content of the tweet.
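The device rule described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `label_by_device`, the label strings, and the idea that an Android client can be spotted by the word "Android" in the tweet's source field are all assumptions made for the example.

```python
from datetime import date

# Cutoff taken from the text: before this date the device identifies the tweeter.
DEVICE_CUTOFF = date(2017, 3, 26)


def label_by_device(source, posted):
    """Return 'trump', 'staff', or None when the device is uninformative.

    `source` is the client string from the tweet metadata (assumed to
    contain the word "Android" for tweets sent from an Android device).
    `posted` is the date the tweet was sent.
    """
    if posted >= DEVICE_CUTOFF:
        # After the switch to iPhone, metadata alone cannot tell us who tweeted.
        return None
    # Assumption: Android means Trump, anything else means staff.
    return "trump" if "android" in source.lower() else "staff"
```

Tweets that come back as None are exactly the ones a content-based model has to classify.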

I used Brendan Brown’s Trump Tweet Data Archive to collect all tweets from the beginning of Trump’s account in mid-2009 up until the end of 2017. This set consists of nearly 33,000 tweets. Even though I know which device a tweet originated from, there is still some ambiguity around authorship. Trump is known to dictate tweets to assistants, so a tweet may have Trump’s characteristics but be posted from a non-Trump device. He is also known (especially during the campaign) to write tweets collaboratively with aides, making true authorship unclear.

From the beginning of Trump’s Twitter account, on May 4th, 2009, until he stopped using an Android device in early 2017, there are over 30,000 tweets for which I know (or at least have a good guess about) the author. Crucially, the Flynn tweet doesn’t fall into this date range, so I had my models make their best guess as to the true tweeter; more on this in the third post of this series. These 30,000 tweets are fairly evenly split between Android and non-Android (47% / 53%), so class imbalance wasn’t an issue. This was my training data. Using several different techniques, I created almost 900 different features from this data, which my models could use to predict the author.
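A class-balance check like the one mentioned above takes only a few lines. The sketch below is an assumption about how one might do it, not the code used in this project; the helper name `class_shares` and the toy labels are made up for illustration and simply mirror the 47% / 53% split.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its fraction of all labelled tweets."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy labels chosen to mirror the 47% / 53% split; this is not the real data.
toy_labels = ["android"] * 47 + ["non_android"] * 53
shares = class_shares(toy_labels)
```

When the shares are this close to even, plain accuracy is a reasonable metric; a heavily skewed split would call for resampling or a different metric.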

In the next post, I’ll go into more detail about these features! Stay tuned!


Who Made This?

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my GitHub and see what else I’ve been up to on my LinkedIn.