Who’s Tweeting from the Oval Office?

Greg Rafferty
6 min read · Feb 14, 2018


Choosing Features

So many important decisions!

This is Part 2 of a 4-part series. Check out the whole series! And be sure to follow @whosintheoval on Twitter to see who is actually tweeting on Donald Trump’s account!

  1. Who’s tweeting from the Oval Office?
  2. Choosing features
  3. How did the different models do?
  4. Let’s get this bot live on Twitter!

Who’s tweeting from the Oval Office?

I’ve built a Twitter bot @whosintheoval which retweets each of Donald Trump’s tweets and offers a prediction for whether the tweet was written by Trump himself or by one of his aides. Go ahead and read the previous post in this series if you’re curious about the genesis of this little project, or read on to learn about the features the model considers!

Features

I looked at six broad categories of features to build my model:

  • Trump quirks
  • Style
  • Sentiment
  • Emotion
  • Word choice
  • Grammatical structure

Trump quirks

Data science can sometimes be more art than science. To start off my model, I first thought about how I, as a human, would identify a tweet as Trumpian, and then did my best to translate these “feelings” into rule-based code. Some obvious quirks that can reveal whether Trump himself is behind the keyboard are an abuse of ALL CAPITAL LETTERS in his tweets, Randomly Capitalizing Specific Words, or gratuitous! use of exclamation points!!!!!
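A sketch of what these rule-based checks might look like (the feature names and thresholds here are illustrative, not the model’s exact feature set):

```python
# Illustrative rule-based "quirk" features; names and thresholds are
# examples rather than the model's exact feature set.
def quirk_features(tweet):
    words = tweet.split()
    return {
        # Words shouted in ALL CAPS (longer than one letter, to skip "I" and "A")
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        # Words Capitalized mid-sentence that aren't fully capitalized
        "random_capitals": sum(
            1 for prev, w in zip(words, words[1:])
            if w[:1].isupper() and not w.isupper()
            and not prev.endswith((".", "!", "?"))
        ),
        # Gratuitous exclamation points!!!!!
        "exclamation_points": tweet.count("!"),
    }

print(quirk_features("MAKE AMERICA GREAT AGAIN! So many Fake News stories!!!"))
# {'all_caps_words': 4, 'random_capitals': 2, 'exclamation_points': 4}
```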

In fact, one of the most influential features in my model was what I came to refer to as the quoted retweet. Trump, it seems, does not know how to retweet someone on Twitter. In the entire corpus of 33,000 tweets, there is only a single proper retweet that comes from an Android device. Instead, Trump copies someone else’s tweet, @mentions the user and surrounds the tweet in quotation marks, then posts it himself.

These are often, but not always, self-congratulatory tweets, which is why, as you’ll see in my next post discussing results, Donald Trump tends to @mention himself a lot.
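A check along these lines can flag the pattern; the regular expression and the sample handle below are just for illustration, not the exact rule in my code:

```python
import re

# A tweet that opens with a quotation mark followed by an @mention is
# very likely one of these copy-paste "quoted retweets."
QUOTED_RT = re.compile(r'^["“]\s*@\w+')

def is_quoted_retweet(tweet):
    return bool(QUOTED_RT.match(tweet.strip()))

print(is_quoted_retweet('"@SomeUser: Donald Trump was fantastic on the show last night!"'))  # True
print(is_quoted_retweet('RT @SomeUser: Donald Trump was fantastic on the show last night!'))  # False, a proper retweet
```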

Style

Stylistic features are those which aren’t specific to Trump’s own personal style, but instead could be used to identify any Twitter user. These types of features include the average length of a tweet, of sentences, and of words. I also looked at how many times various punctuation marks are used (Trump hardly ever uses a semi-colon; his aides do quite a bit more often). The number of @mentions in a tweet, the number of #hashtags, and the number of URLs all turned out to be strongly predictive features. Finally, the day of the week and the time of the day in which the tweet was posted were quite revealing.
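Most of these stylistic counts are simple to compute. Something like the following works; the feature names are mine, and the timestamp handling is simplified:

```python
import re
from datetime import datetime

def style_features(tweet, created_at):
    """created_at is assumed to already be a datetime for the tweet's timestamp."""
    words = tweet.split()
    sentences = [s for s in re.split(r"[.!?]+", tweet) if s.strip()]
    return {
        "tweet_length": len(tweet),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "semicolons": tweet.count(";"),
        "mentions": len(re.findall(r"@\w+", tweet)),
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "urls": len(re.findall(r"https?://\S+", tweet)),
        "day_of_week": created_at.weekday(),   # 0 = Monday
        "hour_of_day": created_at.hour,
    }

print(style_features("Big day tomorrow; watch @FoxNews at 7pm! #MAGA",
                     datetime(2018, 2, 14, 7, 30)))
```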

Sentiment

I used C.J. Hutto’s VADER package to extract the sentiment of each tweet. VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner (because, I suppose, VADSR sounded silly?), is a lexicon and rule-based tool that is specifically tuned to social media. Given a string of text, it outputs a decimal between 0 and 1 for each of negativity, positivity, and neutrality for the text, as well as a compound score from -1 to 1 as an aggregate measure.
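Getting those scores out of VADER takes only a couple of lines (shown here with the standalone vaderSentiment package; NLTK bundles the same analyzer). The example tweet is made up, just to show the output format:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The Fake News Media is at it again. Sad!")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```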

A complete description of the development, validation, and evaluation of the VADER package can be read in this paper, but the gist is that the package’s authors first constructed a list of lexical features (or, “words and phrases” in simple English) correlated with sentiment, and then combined that list with some rules describing how the grammatical structure of a phrase intensifies or diminishes the sentiment. When tested against individual human raters, VADER outperforms them, scoring 96% accuracy to the humans’ 84%.

Emotion

The National Research Council of Canada created a lexicon of over 14,000 words, each scored as associated or not associated with each of two sentiments (negative, positive) and eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). They kindly provided me access to the lexicon, and I wrote a Python script which looped over each word in a tweet, looked it up in the lexicon, and output whichever emotions the word was associated with. Each tweet was then assigned a score for each emotion, corresponding to how many words associated with that emotion it contained.
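The scoring loop is straightforward. Here’s a sketch; the tab-separated word/emotion/flag layout is an assumption about the NRC word-level download, and the path is a placeholder:

```python
from collections import Counter, defaultdict

def load_nrc_lexicon(path):
    """Load the NRC word-level lexicon, assumed to be tab-separated
    rows of (word, emotion, 0-or-1 flag)."""
    lexicon = defaultdict(set)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            word, emotion, flag = line.strip().split("\t")
            if flag == "1":
                lexicon[word].add(emotion)
    return lexicon

def emotion_scores(tweet, lexicon):
    """Count how many words in the tweet are associated with each emotion."""
    scores = Counter()
    for word in tweet.lower().split():
        for emotion in lexicon.get(word.strip(".,!?:;\"'"), ()):
            scores[emotion] += 1
    return scores
```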

Word choice

To analyze word choice, I used a technique called tf–idf, which stands for Term Frequency – Inverse Document Frequency. It’s basically a measure of how descriptive and unique a word is to a document. Let’s say you want to group some news articles together so you can recommend similar articles to a reader. You set your computer up to read each article and one of them features the word “baseball” 10 times. That must be a pretty significant word in the article! That’s the Term Frequency part.

But now, that same article also has the word “said” 8 times. That seems to also be a pretty significant word. But we humans know otherwise; we know that if several articles mention “baseball,” they’re probably about the same topic, but if several articles mention “said,” that doesn’t tell us much about the articles’ similarity. So we then look at all of the articles in the collection and count how many of them contain the words “baseball” and “said.” Out of, say, 1000 articles, only 30 have the word “baseball” but 870 have the word “said.” So we take the inverse of that count — 1/30 and 1/870 — and multiply it by the Term Frequency — 10 and 8. This is the Inverse Document Frequency part. So the word “baseball” gets a score of 10/30 = 0.333 and the word “said” gets a score of 8/870 = 0.009. We do this for every word in every document and, in a nutshell, look at which articles share the same high-value words. This is tf–idf.

In order to reduce the computing needs of my model, I only looked at unigrams (single words) instead of bigrams and trigrams (tf–idf handles these short phrases exactly the same way it would handle a single word). Each additional n-gram length requires exponentially more processing time, and I figured that “Crooked Hillary” or “Lyin’ Ted Cruz” would still be picked up by the words “crooked” and “lyin’” on their own. I also ignored words that came up in over 99% of the tweets (known as corpus-specific stop words) or in fewer than 1% of them. I used Python’s scikit-learn package heavily throughout this project, including its tf–idf implementation.
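In scikit-learn terms, that boils down to something like this (the two-tweet corpus is obviously just a toy example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 1),  # unigrams only
    min_df=0.01,         # ignore words in fewer than 1% of tweets
    max_df=0.99,         # ignore corpus-specific stop words (over 99% of tweets)
)

tweets = ["Crooked Hillary is at it again!", "MAKE AMERICA GREAT AGAIN!"]
X = vectorizer.fit_transform(tweets)   # sparse tf-idf matrix, one row per tweet
print(vectorizer.get_feature_names_out())
```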

Grammatical structure

One of the main challenges of using natural language processing on current events is that the events change over time. While the phrases “Crooked Hillary” and “Lyin’ Ted Cruz” came up a lot during Trump’s presidential campaign, they’re all but absent in current tweets. I wanted to capture a more basic form of Trump’s tweets, so I converted each tweet to a part-of-speech representation using the Natural Language Toolkit.

This essentially converts each word into its part of speech, staying aware of the word’s role in the sentence so as to differentiate the noun “insult” in the sentence “‘Crooked Hillary’ is used as an insult when Trump refers to his political opponent” from the same word used as a verb in the sentence “You insult the political process by reducing it to childish name-calling.”

This changes the phrase “I had to fire General Flynn because he lied to the Vice President and the FBI” into its more basic part-of-speech form, “PRP VBD TO VB NNP NNP IN PRP VBD TO DT NNP NNP CC DT NNP”, using the Penn Treebank part-of-speech tags (PRP = personal pronoun, VBD = past-tense verb, TO = to, VB = base-form verb, NNP = singular proper noun, etc.). Using the same tf–idf process as before, but this time ignoring unigrams and focusing instead on bigrams and trigrams, I could extract a more general picture of how Trump or his aides tweet.
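With NLTK, this conversion is only a few lines once the tokenizer and tagger models are downloaded:

```python
import nltk

nltk.download("punkt")                        # tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger models

def to_pos_string(tweet):
    tagged = nltk.pos_tag(nltk.word_tokenize(tweet))  # [(word, tag), ...]
    return " ".join(tag for _, tag in tagged)

print(to_pos_string("I had to fire General Flynn because he lied to the Vice President and the FBI"))
# roughly: "PRP VBD TO VB NNP NNP IN PRP VBD TO DT NNP NNP CC DT NNP"
```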

Lastly, I used the Stanford Named Entity Recognition (NER) Tagger to replace all names with “PERSON”, all locations with “LOCATION”, and all organizations with “ORGANIZATION”. This was yet another attempt to generalize the tweets away from specifics that might change over time. The NER step was by far the most computationally expensive part of processing these tweets and, if I were to do this project again, I would seriously consider a less state-of-the-art NER tagger that doesn’t rely on an advanced statistical learning algorithm, which would speed up processing time significantly. You’ve been warned!
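One way to run this step from Python is NLTK’s wrapper around the Stanford tagger; the model and jar paths below are placeholders for wherever the Stanford NER download lives on disk (the wrapper shells out to Java, which is a big part of why it’s so slow):

```python
from nltk import word_tokenize
from nltk.tag import StanfordNERTagger

# Placeholder paths to the Stanford NER download
st = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # 3-class model: PERSON / LOCATION / ORGANIZATION
    "stanford-ner.jar",
)

def mask_entities(tweet):
    # tag() returns [(word, 'PERSON' | 'LOCATION' | 'ORGANIZATION' | 'O'), ...]
    tagged = st.tag(word_tokenize(tweet))
    return " ".join(tag if tag != "O" else word for word, tag in tagged)

print(mask_entities("I had to fire General Flynn because he lied to the Vice President and the FBI"))
# roughly: "I had to fire PERSON PERSON because he lied to the Vice President and the ORGANIZATION"
```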

Next up, I’ll be discussing the models I used and what the results were. Be sure to follow me and come back in a few days — I’ll be sharing who my model predicts is behind the Flynn tweet!

Who Made This?

I’m Greg Rafferty, a data scientist in the Bay Area. You can check out the code for this project on my GitHub and see what else I’ve been up to on my LinkedIn.
