“How emotional IS Donald Trump?” and other quantifiable markers for people who (professionally) dwell on the past

This is Blog 1 of 3 for the Fall 2017 ChiPy Mentorship, where mentees post regular updates for their projects.

Hello! This is my first blog for the Fall 2017 ChiPy Mentorship, which I’ll be participating in up through December 2017. Before we get started, some quick background, which may help explain why I’ve chosen to inflict the unforgiving task of analyzing tweets on myself:

Background

Never let a historian tell you that they’re clever at naming things.

  • A few years ago, I graduated from the University of Virginia with a double major in Computer Science and History.
  • If you listen closely, you can hear a soft chorus of, “That’s really interesting,” from the clouds, which is what nearly everyone says when I tell them that. Anyway…
  • Recently, I began learning Python via Codecademy and HackerRank. I’d never used it long-term, either in industry or during college. My primary language is Java, which is great in that it has robust error catching and beautiful, legible stack traces, but is a pain to use for web modules or large-scale data processing. Python, which has been one of the top choices for both of those areas for several years running, is a language I’ve always wanted to pick up. After hearing about the ChiPy mentorship at my first meeting, I applied, and here I am!
  • Oh, and while all this was going on, a lot of politics happened.
  • Which brings me back to the history major I mentioned above. Typically, the areas I focused on involved nationalism run amok in several foreign countries. And as many historians, political scientists, and social scientists have studied, the common rhetoric used during those eras changes considerably; it often becomes more emotional, more knee-jerk, and deliberately sets up a group mentality. One commonly used technique is to refer back to an earlier, uniting event in the nation’s lifetime. These range from relatively benign (Louis Napoleon calling back to his uncle, Napoleon Bonaparte, in France during the mid-1800s) to severely emotional (pro-Serbian rhetoric in the 1980s and 90s, which often referenced Croatian atrocities towards Serbs during WWII. Awkward).
  • In our current political climate, I wondered: what rhetorical quirks might we find our current politicians using, and could we link them back to politicians of the past? The example of Donald Trump using “law and order”, a phrase popularized by Nixon, is perhaps the best documented. Were there others we could find? Or even just differing word choices and grammatical structures between people on opposite sides of the political spectrum?

Project

I met with my mentor, Joerg Rings, to hash out some project goals and timelines. After all, the beast does have to be achievable in 2.5 months, and I was someone who had a good programming background, but not much recent experience, and no experience at all when it came to large-scale projects in Python. We had our work cut out for us. Walking into my first meeting with my mentor, my general project idea was to use tweets to look for word and symbol frequencies over time, then compare them to other politicians, but I was well aware there were already some issues with that concept. But hey! That’s what mentors are for, right?

Joerg, who is a data scientist at Capital One working with text-based data, helped me refine my concept into a specific project plan. He also pointed me towards several Python extensions and towards Jupyter Notebooks as the best programming platform for this task. (I’d been working in Sublime, or in Eclipse for Java.)

It’s good to start with the basics.

When it came to refining the project itself, we had a trickier time. From the analysis side, we could almost definitely use pandas and matplotlib, but we’d still need a good text-based API. We also needed data, but it had to be useful. That meant a large sample size, over a long period of time, and, most importantly, in an easily accessible form.

Initially, I’d thought of using tweets, since tweets are accessible through Twitter’s API, but that would run into a few snags almost immediately. For one, it’s possible that a politician may have deleted one, and it would no longer be available to access. On a more serious note, it also didn’t appear possible to batch-download tweets, which would put any data collection under a severe time crunch.

Luckily, the Internet came to our rescue.

As it turned out, a helpful fellow named Brendan Brown had decided it would be smart to have an archive of all of Trump’s tweets, and had built a script in September 2016 to download tweets from Trump’s page several times a day and archive them to trumptwitterarchive.com. Currently, the script operates in real time, and all in all, he claims to only be missing about 4,000 tweets out of Trump’s 35k+ feed (not including deletions made prior to Sept 2016). Not bad. And to make matters even better…

I may cry.

…they’re all archived in JSON format on GitHub. He’s even broken them down into two files, one with “condensed” meta information (date, source, associated user info only) and a master file with everything, AND there are several other politicians and associates being tracked. This was looking a lot more manageable.
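
To give a feel for why JSON makes life easier, here’s a minimal sketch of reading a tweet file with nothing but Python’s standard library. Fair warning: the sample records below are invented placeholders, and the field names ("created_at", "source", "text") and the timestamp format are my assumptions about what such a file might look like, not something I’ve confirmed against the archive.

```python
import json
from datetime import datetime

# Invented stand-in for one of the archive's JSON files (NOT real tweets).
# The field names and timestamp format are assumptions for illustration.
SAMPLE_JSON = """
[
  {"created_at": "Wed Oct 04 12:00:00 +0000 2017",
   "source": "Twitter for iPhone",
   "text": "Placeholder tweet one"},
  {"created_at": "Thu Oct 05 08:30:00 +0000 2017",
   "source": "Twitter Web Client",
   "text": "Placeholder tweet two"}
]
"""

# Assumed timestamp layout: weekday, month, day, time, UTC offset, year.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def load_tweets(raw_json):
    """Parse a JSON array of tweets and turn each timestamp into a datetime."""
    tweets = json.loads(raw_json)
    for tweet in tweets:
        tweet["created_at"] = datetime.strptime(
            tweet["created_at"], TIMESTAMP_FORMAT
        )
    return tweets


tweets = load_tweets(SAMPLE_JSON)
for tweet in tweets:
    print(tweet["created_at"].date(), "|", tweet["source"], "|", tweet["text"])
```

With a real file, the only change would be reading the text from disk (for example with open(...).read()) instead of using the inline string.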

Joerg and I decided to break the project into three chunks (and there are three blogs! How convenient!), which we could adjust depending on how it was progressing:

  • Blog 1 (today!): Project breakdown, timeline, and set up. This includes getting Jupyter Notebooks set up, identifying key Python extensions for analysis and plotting, and of course, finding data. Our goal for Blog 1 is to start with a small chunk of several hundred tweets and build an algorithm to identify commonly used words. You may have noticed that trumptwitterarchive already does some of this for their homepage; this will actually be pretty helpful, as it’ll give us something to compare our findings against.
  • Blog 2 (10/19): Expand our algorithm to more data to make sure it can handle larger-scale processing. At this point, we also start bringing in other political figures from among Trump’s allies: selection TBA, but figures like Mitch McConnell, Paul Ryan, Mike Pence, and Jeff Sessions would likely feature. We’re looking to see if they use some of the same language that Trump does; if we can narrow it down to specific topics, great, but otherwise, we’re just looking for the same grammatical beats.
  • Blog 3 (11/16): Into the home stretch! Here, we hope to expand the analysis to the opposite side of the field: Bernie Sanders, Elizabeth Warren, and other prominent liberals. Are there similarities? Are there differences? Who knows? I don’t, that’s why I’m building this project.
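
For a rough idea of what the Blog 1 goal (identifying commonly used words in a small batch of tweets) might look like, here’s a minimal sketch. It is not the finished algorithm, just one plausible starting point: the sample tweets are invented placeholders, the stopword list is a tiny hypothetical one, and the helper names (tokenize, word_frequencies) are mine.

```python
import re
from collections import Counter

# Invented placeholder tweets for illustration (NOT real data).
SAMPLE_TWEETS = [
    "Law and order! We need law and order.",
    "Sad! Very sad to see this. #Sad",
    "We will win, and we will keep winning.",
]

# Tiny hypothetical stopword list; a real project would want a fuller one.
STOPWORDS = {"and", "to", "we", "this", "will", "the", "a"}


def tokenize(text):
    """Lowercase the text and pull out words, #hashtags, and @mentions."""
    return re.findall(r"[#@]?\w+", text.lower())


def word_frequencies(tweets, stopwords=STOPWORDS):
    """Count every non-stopword token across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tok for tok in tokenize(text) if tok not in stopwords)
    return counts


freqs = word_frequencies(SAMPLE_TWEETS)
print(freqs.most_common(3))
```

One design note: this simple pattern keeps hashtags and mentions as their own tokens (so “#sad” and “sad” are counted separately), but it splits contractions like “don’t” into pieces, which is something a proper text library would handle better.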

We also came up with a few “stretch goals”, like any good Kickstarter.

  • Expanding our analysis to other countries, such as Great Britain, Canada, and Australia, or India, Iran, Israel, and Turkey, if we wanted to branch out of the British colonies. That brings with it another set of complications; namely, that I don’t speak any languages besides English. Things to consider for the future…
  • Another idea was to expand our analysis towards longform documents, instead of tweets. The UVA Miller Center maintains a database of most presidential addresses made throughout history; what might we see from comparing those? If we wanted to standardize it, we could even limit our analysis to State of the Union addresses or inauguration speeches. We would also be limited to speeches made while president, not any campaign speeches.
  • And the last one, for really ambitious data scientists: mapping out how Trump-isms spread through popular lexicons. (None of us were saying “Sad!” a year ago). Like the man or not, there’s no one more gifted at creating memes in today’s society. Honestly, I have no idea where I’d even start with this one, and even Joerg wasn’t sure, but I’m documenting it in case someone else has ideas. (Seriously. Tell me if you have ideas. If I don’t get to them, I’ll pass them to some aspiring grad student.)

Either way, I’m excited to see how far we get! I’ve been in Chicago for just over a month, and I’m excited to become part of the Python and ChiPy community here. And of course, any time I can wield my history degree for non-history reasons is always a fun time.

Next month, on Blog 2: discussions of how I’ll break tweets down to process individual words, tally up frequencies, and expand that analysis to politicians other than the elephant in the room. See you then!