Who’s Tweeting from the Oval Office?

Greg Rafferty
11 min read · Feb 15, 2018


How Did the Models Do?

They did well, very well

This is Part 3 of a 4-part series. Check out the whole series! And be sure to follow @whosintheoval on Twitter to see who is actually tweeting on Donald Trump’s account!

  1. Who’s tweeting from the Oval Office?
  2. Choosing features
  3. How did the different models do?
  4. Let’s get this bot live on Twitter!

Who’s tweeting from the Oval Office?

I’ve built a Twitter bot @whosintheoval which retweets each of Donald Trump’s tweets and offers a prediction for whether the tweet was written by Trump himself or by one of his aides. Go ahead and read the first post in this series if you’re curious about the genesis of this little project, or read on to learn about the models I built!

In the first half of this post, I’ll discuss the models I used to arrive at predictions. It might get a bit technical, so if that doesn’t interest you and you just want to skip to the results, go ahead! You’ll find them in the second half.

Models

To begin, as is standard in the field, I split my data into an 80% training set and 20% testing set. I set aside the testing set until I was satisfied that all of my models were as accurate as possible, and then sent the testing set through them to get the performance measures I’ll be reporting here.
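(In scikit-learn terms, that split is a single call. Here’s a minimal sketch; X and y are just placeholders for the feature matrix and author labels built in the previous post, and the random seed is my own choice for the example.)

from sklearn.model_selection import train_test_split

# X: feature matrix, y: author labels (0 = aide, 1 = Trump) -- placeholders
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training, 20% held-out testing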

Feature importances

One of the more important tasks was to sort my features in order of their influence on the outcomes of the models. To do this, I used scikit-learn’s Ridge Classifier. Ridge regression is a linear model with an L2 regularization penalty controlled by a factor, alpha. At alpha = 0, ridge regression is the same as an unregularized linear regression; at low alpha levels, the coefficients of the least-influential features are shrunk to effectively zero, removing them from the model; at higher alpha levels, many more features are squeezed out. I iterated over a range of alpha values, noting the order in which features dropped out, until none remained.

As you can see in the plot above, at an alpha level just above 10²², the first (least influential) feature drops out. In the range of 10²⁵, feature dropout rapidly accelerates, leaving only the most influential features at alpha levels above 10²⁶.
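If you’d like a rough idea of how that dropout ordering can be computed, here’s a minimal sketch built on scikit-learn’s RidgeClassifier. The alpha grid and the near-zero threshold are illustrative assumptions, not the exact values I used; X_train and y_train are the placeholders from above, and feature_names is simply the list of column names.

import numpy as np
from sklearn.linear_model import RidgeClassifier

def rank_features_by_dropout(X, y, feature_names, alphas=np.logspace(0, 30, 60)):
    # Record the first (smallest) alpha at which each feature's coefficient
    # becomes negligible; features that survive longer are more influential.
    dropout_alpha = {}
    for alpha in alphas:
        coefs = np.abs(RidgeClassifier(alpha=alpha).fit(X, y).coef_.ravel())
        threshold = 1e-6 * coefs.max()  # "effectively zero" cutoff (an assumption)
        for name, coef in zip(feature_names, coefs):
            if coef <= threshold and name not in dropout_alpha:
                dropout_alpha[name] = alpha
    # Sort from least influential (drops out first) to most influential
    return sorted(feature_names, key=lambda n: dropout_alpha.get(n, np.inf))

ranked = rank_features_by_dropout(X_train, y_train, feature_names)
# ranked[0] is the least influential feature, ranked[-1] the most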

The individual models

In total, I built 9 models: Gaussian Naive Bayes, Multinomial Naive Bayes, K Nearest Neighbors, Logistic Regression, Support Vector Classifier, Support Vector Machine and the ensemble methods of AdaBoost, Gradient Boosting, and Random Forest. Each model was carefully tuned using 10-fold cross validation on the training data alone, and evaluated on the test data.

Cross validation is an effective technique for training these models without biasing them too much towards the specific data they’re trained on; in other words, it helps them generalize much better to unseen data. In 10-fold cross validation, the data is split into 10 equally sized groups, groups 1–10. In the first iteration, the model is trained on groups 1–9 and tested on group 10. The process repeats, this time training on groups 1–8 plus group 10 and testing on group 9. This step is repeated 10 times in total, so each group is withheld from training exactly once and used as an unseen test set. Finally, the combination of model parameters with the best average performance across all 10 folds is the set of parameters used in the final model.
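In practice I leaned on scikit-learn for this; a sketch of tuning one of the models with 10-fold cross validation might look like the following (the parameter grid here is purely illustrative and much smaller than anything I actually searched):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# An illustrative grid only -- the real searches covered many more combinations
param_grid = {"n_estimators": [100, 300], "learning_rate": [0.01, 0.1], "max_depth": [3, 5]}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # tuned on the training split only
print(search.best_params_, search.best_score_)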

The algorithms behind these models are all fascinating; each has its own strengths and weaknesses, with different balances along the bias-variance tradeoff and sometimes vastly different processing times (training naive Bayes, for instance, takes fractions of a second, whereas the support vector classifier and the gradient boosting model each took an entire weekend to grid search). If you’re interested in learning more, I would start with the Wikipedia entries for these models.

Furthermore, using those feature importances generated above, I trained each model on a subset of the total of almost 900 features. Naive Bayes, for instance, performed best with only the top 5 features, whereas both boosting models were happiest crunching through the top 300. This is partly due to the curse of dimensionality: in higher-dimensional space, two points that seem to be near each other (when imagined in our 3-dimensional minds) can actually be very, very far apart. The k-nearest neighbors model (knn) in particular is highly sensitive to too many dimensions, so I also applied principal component analysis (PCA) to the data fed into this model.

PCA is a technique which can both reduce dimensionality and eliminate any collinearity between the features. If you can imagine a set of vectors in higher-dimensional space, PCA will twist and massage these vectors so that each and every one of them is perpendicular to all of the others. If these vectors represent features, then by forcing them all to be orthogonal, we’ve also ensured that no collinearity exists between them. This vastly improves the predictions of a model such as knn, and allows us to reduce the number of features sent to the model without losing much of the information. In short, this enabled me to get much better performance out of my knn model.
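Here’s a sketch of what that looks like in code, reusing the placeholder training split from earlier; the 95% variance target and k=5 are illustrative choices for the example, not the values I actually tuned to:

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# PCA decorrelates the features and trims the dimensionality before knn sees them
knn_model = Pipeline([
    ("pca", PCA(n_components=0.95)),              # keep components explaining 95% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=5)), # k is illustrative
])
knn_model.fit(X_train, y_train)
print(knn_model.score(X_test, y_test))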

The ensemble

Lastly, I created two different ensembles of these models. The first was a simple majority vote: with an odd number of models and a binary output, there can never be a tie, so I simply added up all of the predictions for Trump and all of the predictions for an aide, and offered whichever was greater as my final prediction. My second ensemble was a bit more sophisticated: I took the results of those first nine models and fed them into a new decision tree. This final model had near-perfect accuracy on my test set, but as I’ll discuss in my next post about building a Twitter bot, it didn’t perform quite so well on current tweets.
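Roughly, the two ensembles look like this in code. This is a sketch of the idea rather than my exact implementation (in reality each model saw its own slice of the features); models stands for the list of nine fitted classifiers, and the tree depth is an assumption for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# models: the nine fitted classifiers; each prediction is 0 (aide) or 1 (Trump)
train_preds = np.array([m.predict(X_train) for m in models]).T  # shape (n_tweets, 9)
test_preds = np.array([m.predict(X_test) for m in models]).T

# Ensemble 1: simple majority vote -- nine models and a binary label, so no ties
majority_vote = (test_preds.sum(axis=1) > len(models) / 2).astype(int)

# Ensemble 2: feed the nine models' predictions into a new decision tree
stacker = DecisionTreeClassifier(max_depth=3).fit(train_preds, y_train)
final_prediction = stacker.predict(test_preds)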

And now, finally, the results…

Results

As you can see, the gradient boosting model and random forest performed best, with an error rate of only 1 out of 20.

The other models performed less well individually, but contributed a great deal to the final ensemble. The decision tree that I built from the results of the first set of 9 models had an accuracy score of over 99%!

If you’re unclear what all those measures are, here’s a brief explanation. Accuracy is the most intuitive of these measures: it is simply the number of correct guesses divided by the total number of guesses, i.e., out of all my guesses, how many were right? Precision answers the question: out of all the tweets I guessed to be Trump, how many actually were Trump? Recall is almost-but-not-quite the opposite of precision; it answers the question: out of all the tweets that actually were written by Trump, how many did I catch? F1 score is a blend of precision and recall, technically the harmonic mean (a type of average) of the two. It is not nearly as intuitive as accuracy, but when the class imbalance is large, F1 score is a much better measure than accuracy. In the case of this tweet data, though, my classes were very well balanced, which is why all of the measures are more-or-less equal in the above chart. If this is at all confusing to you, or you’d just like to learn more, here is an excellent blog post about these measures.
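All four of these come straight out of scikit-learn’s metrics module; here’s a quick sketch, with y_test as the true authors and y_pred as one model’s guesses (both placeholders):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("accuracy: ", accuracy_score(y_test, y_pred))   # correct guesses / all guesses
print("precision:", precision_score(y_test, y_pred))  # of tweets I called Trump, how many were Trump?
print("recall:   ", recall_score(y_test, y_pred))     # of actual Trump tweets, how many did I catch?
print("f1:       ", f1_score(y_test, y_pred))         # harmonic mean of precision and recall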

So what characterizes a Trump tweet?

  • Quoted retweet
  • @mentions
  • Between 10pm and 10am
  • Surprise, anger, negativity, disgust, joy, sadness, fear
  • Exclamation points
  • Fully capitalized words
  • @realDonaldTrump

As I expected, the quoted retweet I described in my first post was highly predictive of a Trump tweet. So were @mentions of other users. Trump often tweets during the night and early morning, and on weekends. He displays surprise, anger, negativity, disgust, … in fact all of the emotions, not just the negative ones emphasized so much in the press. He does indeed use exclamation points and fully capitalized words more than is grammatically necessary. And lastly, he mentions himself an awful lot.

His aides, on the other hand, post tweets characterized by:

  • True retweets
  • The word “via”
  • Between 10am and 4pm
  • Semicolons
  • Periods
  • URLs
  • @BarackObama

If a tweet is a proper retweet, you can bet confidently it was posted by an aide. Interestingly, the word “via” came up a lot in aides’ tweets — they often would quote an article or image and attribute it with that word. Predictably, they tweet during the workday and not very often outside of it. Their grammar is more sophisticated, with better sentence structure and punctuation, and they post URLs to other sources very frequently. Interestingly, if Barack Obama’s Twitter username is mentioned in a tweet, it’s usually an aide. Trump would mention him by name, but not by @mention.

With regards to the parts-of-speech tags, Trump’s most frequent combination is NN PRP VBP, or a noun, personal pronoun, and verb. These tweets frequently take the form of an @mention followed by “I thank…” or “I have…” Aides often write NNP NNP NNP, three proper nouns in a row, which is often the name of an organization. They also use #hashtags following text whereas Trump uses #hashtags following an @mention.
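If you want to see what those tag sequences look like, here’s a minimal sketch using NLTK’s tweet tokenizer and off-the-shelf Penn Treebank tagger (I’m not claiming this is the exact tagging setup the project used; it’s just an illustration of where tags like NNP and PRP come from):

import nltk
from nltk.tokenize import TweetTokenizer

nltk.download("averaged_perceptron_tagger", quiet=True)  # one-time tagger download

tokens = TweetTokenizer().tokenize("@foxandfriends I thank you for the great coverage!")
tags = [tag for _, tag in nltk.pos_tag(tokens)]
print(" ".join(tags))  # the sequence of tags is what gets counted as a feature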

I was a bit disappointed that the parts-of-speech tags weren’t more significant to the model. I knew that the specific vocabulary in a tweet would change over time, so I wanted to capture more of the grammatical structure, which I reasoned would be more constant. However, the main challenge of this project is the shortness of a tweet, which greatly reduces the amount of grammatical signal my models can pick up. What this means for my model is that although it has near-perfect accuracy on historical tweets, that accuracy drops off quite a bit on current tweets.

Additionally, three features which were highly predictive on historical tweets were tweet length, number of times favorited, and number of times retweeted. However, I had to drop all three of these features and retrain my model for deployment on real-time tweets. For the latter two, favorite count and retweet count, the reason is fairly obvious: I’m trying to predict the author immediately after the tweet is posted, before it has been favorited or retweeted. Tweet length, however, was dropped for a different reason. In all 33,000 tweets in my training data, Twitter had limited the character count to 140, and only recently has Twitter increased this limit to 280. This means everything the models had learned from this feature had to be thrown away.

A little game

So with those characteristics in mind, let’s play a little game. I’ll offer a tweet and I invite you to guess the author.

Is it Trump or one of his aides?

Don’t scroll down too far, because the answer will be right below! Here’s the first one; who wrote this, Trump or an aide?

This is a bit easy. What do you see? There’s that word “via,” highly indicative of an aide tweet. It includes a link, again another telltale sign of an aide. It’s posted in the middle of the day (I scraped this tweet from California, so the timestamp is 3 hours behind Washington DC), and it’s very formal and unemotional: all signs of an aide.

And yes, you’re correct, that was posted by an aide! OK, here’s another one:

Is that Trump or an aide? Again, let’s go over it together. This tweet contains more emotion than the other, that’s usually a Trump sign. There’s that exclamation point: another Trumpian touch. Remember to add 3 hours to the timestamp; that puts it at 7:30pm, after the workday has ended. With that in mind, we can confidently guess that this was written by…

Trump! Yep, correct again!

The Flynn Tweet

So, this is the big one, the tweet that started this whole project:

Now, this tweet came after March 26, 2017, which if you remember from my first post is the date after which there are no labels to identify the true tweeter. All we’ve got to go on is my model. In truth, this is a difficult tweet to guess. It contains the words “lied,” “guilty,” “shame,” and “hide.” Those are all very emotionally charged words — possibly indicating Trump as the author. But it’s also somewhat formal; the grammar is well composed and it contains some longer-than-average words: those are signs of an aide. It was tweeted around midday, also suggesting an aide. But it’s very personal, suggesting Trump. So what did the models say? Here’s the raw output:

rf [ 0.23884372  0.76115628]
ab [ 0.49269671 0.50730329]
gb [ 0.1271846 0.8728154]
knn [ 0.71428571 0.28571429]
nb [ 0.11928973 0.88071027]
gnb [ 0.9265792 0.0734208]
lr [ 0.35540594 0.64459406]
rf [1]
ab [1]
gb [1]
knn [0]
nb [1]
gnb [0]
svc [1]
svm [0]
lr [1]
([1], [ 0.15384615, 0.84615385])

That “rf” at the top, that’s the random forest. It predicted a 1, or Trump, with 76% probability (the first seven rows show probabilities of first an aide and then Trump; the next nine rows show each model’s prediction: 0 for aide, 1 for Trump). “ab” is AdaBoost, which also predicted Trump, but with only 51% to 49% probability, which is not very confident at all. The gradient boosting model was more confident: an 87% likelihood it was Trump. KNN, however, disagreed: 71% probability the tweet was written by an aide. The multinomial naive Bayes predicted Trump, but the Gaussian naive Bayes predicted an aide. There was also disagreement between the two support vector machine models: SVC predicted Trump and SVM predicted an aide (these models don’t naturally produce probability estimates, which is why they’re absent from the top half of the results). Logistic regression was a bit on the fence, with a 64% probability of Trump and 36% probability of an aide. That’s 6 models for Trump, 3 for an aide.

In reality, after spending weeks reading over and analyzing thousands of Trump tweets, I think this tweet is one of the best examples of a collaboratively written tweet. Topically and emotionally, it’s 100% Trumpian. But stylistically and grammatically, it appears to have come from an aide. In my opinion, Trump probably worked together with Dowd to craft the tweet: Trump told Dowd what he wanted to say and how he wanted to say it, and Dowd composed the actual tweet. That’s my best guess.

This just goes to show that these models aren’t perfect (there’s a lot of disagreement), and also that a tweet contains very little information for machine learning to train on. My final ensemble, the decision tree, which was over 99% accurate on my testing set, did offer a final prediction of Trump, with 85% probability (that’s the last line in the output above). So that’s what we’ll go with: Trump. Not John Dowd, his lawyer. As for their claim that Dowd wrote the tweet and not Trump, we can only assume that it’s:

In my final post, I’ll show how I built a bot to sniff Twitter constantly and capture any tweet from @realDonaldTrump, send it through my model’s prediction algorithm, and retweet it on my account @whosintheoval with a prediction of the author. Stay tuned!

Who Made This?

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my GitHub and see what else I’ve been up to on my LinkedIn.
