How to Break Down a Tweet (or, ChiPy Post 2 of 3)
All right. If we’re processing tweet information, the first step is to get tweets. We’ve achieved that, thanks to trumptwitterarchive.com, at least when it comes to the man himself.
How do we break down the tweets into individual words, though?
As shouldn’t surprise anyone, there’s been a lot of development in recent years toward processing spoken and written language — “natural language”, as it’s generally called. We’ve all had our interactions with it, too — autocorrect sometimes seems the bane of our collective existence. It’s especially tricky when trying to parse emotion, nuance, or sarcasm. In this particular case, though, we don’t need to go that far; we’re primarily concerned with the words themselves at the moment. No need to jump into machine learning yet!
The process of breaking down a piece of text into word sets is generally called “tokenizing”: a tokenizer splits the text into an array of individual words and punctuation marks. Word tokenizing is a major component of text analysis, as it breaks text into discrete pieces that are more easily manipulated by an algorithm. As you might also imagine, there are several tokenizers available for Python, generally as part of larger natural language processors.
We ended up using the Natural Language Toolkit, possibly the most common and most extensive natural language toolkit for Python. It’s a bit old and slow, but it has a few things that made it easy to work with: first, extensive documentation, which makes it quite easy to troubleshoot, and second, a Tweet-specific word tokenizer built right in. Yes, it can recognize @’s and hashtags.
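A minimal sketch of that tokenizer in action, assuming NLTK is installed (the sample tweet is made up):

```python
from nltk.tokenize import TweetTokenizer

# The Tweet-aware tokenizer keeps @mentions and #hashtags as single
# tokens, where a generic word tokenizer would split them on punctuation.
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("Thank you @FoxNews! #MAGA")
# ['Thank', 'you', '@FoxNews', '!', '#MAGA']
```

TweetTokenizer also accepts a preserve_case=False argument, which lowercases tokens as it goes.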
NLTK, as it’s commonly referred to, also had something else of value — a dictionary of “stopwords”, or English words considered too common and functional to be of much analytical value. Think pronouns, short function words, and the like. Import that list, apply it to the tokenized data, and we can easily reduce our word set down to more characteristic words.
Of course, there isn’t an official list of English stopwords — most language processors maintain one, but they’re usually separate from the main algorithms. And ours required a few more tweaks to clean up — for example, I quickly noticed that pieces of bit.ly-shortened URLs kept appearing in the processed data, so I added “//” to my local stopwords dictionary to strip them out.
(It’s true that someone might have a legitimate, non-URL usage of “//” in a tweet, but since our subject matter is politicians, not Java and C# developers, I figured taking it out was safe.)
(It’s also true that the URLs each person chooses to tweet may have value, but since those URLs are no longer human-readable, they’re officially out of scope for this project.)
After all that, we just need a quick sort to count the frequencies of each word, order them from most common to least, and voila — our first data set.
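That counting-and-sorting step is nearly a one-liner with the standard library’s Counter; the toy tokens here stand in for the real cleaned-up tweet data:

```python
from collections import Counter

tokens = ['great', 'america', 'great', 'jobs', 'great', 'jobs', 'america', 'great']
counts = Counter(tokens)
top = counts.most_common()  # sorted from most frequent to least
# [('great', 4), ('america', 2), ('jobs', 2)]
```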
Take a second to celebrate that…and it’s time to move on. The first issue I noticed at this stage was that I’d forgotten to make the tweets lowercase before tokenizing them — don’t want to miss combining “president” and “President”! I also realized that we needed some way of procedurally combining words with a common root — “American” versus “Americans”, for example.
You know what that means — back to NLTK.
My mentor directed me towards the NLTK Stemmers and Lemmatizers, which are implementations of algorithms for breaking down and combining words with common roots. Stemmers, specifically, attempt to reduce every word down to its original root word (the “stem”). Most of these algorithms were developed by academics and professional linguists for the purpose of identifying word frequencies. On one hand, that means they’ve been rigorously tested, but on the other…
They also reduce and combine letters for easier comparison, which makes the results a bit hard to read for us dabblers.
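For a taste of what that means, here’s NLTK’s Porter stemmer run over a few sample words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Great for matching, not so great for reading:
stems = [stemmer.stem(w) for w in ['caresses', 'ponies', 'president', 'happiness']]
# ['caress', 'poni', 'presid', 'happi']
```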
Mentor to the rescue again — NLTK also supports an implementation of WordNet’s lemmatizer, which compares every word to a known word in the English dictionary, and if it’s been modified, reduces it down to that original form only. None of this “all y’s are now i’s” nonsense.
Now, with our data mostly cleaned up, it’s time for comparisons. I was able to download Hillary Clinton’s tweets from 2016 for a comparison (she is, after all, frequently accused of being unemotional, so she seemed a good choice). We’ll run those sets of data against each other, and limit it to the top 100 most common words to avoid too much noise -
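That comparison can be sketched as a set intersection over each feed’s most common words; the token lists and top-3 cutoff here are toy stand-ins for the real corpora and the top-100 cutoff:

```python
from collections import Counter

trump_words = ['great', 'america', 'jobs', 'fake', 'news', 'great', 'jobs']
clinton_words = ['america', 'jobs', 'families', 'jobs', 'america', 'plan']

TOP_N = 3  # the real run used 100
trump_top = {w for w, _ in Counter(trump_words).most_common(TOP_N)}
clinton_top = {w for w, _ in Counter(clinton_words).most_common(TOP_N)}

shared = trump_top & clinton_top  # words in both top lists
# {'america', 'jobs'}
```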
And now we have the top 100 words that appear in both of their Twitter feeds. Great! Tune in next time, where we throw pandas and several graphing APIs at our data to see what comes up.