Get Bent: An Ideological Classifier. Part 2: Noisy Data

Steven Markoe · Published in The Startup · Oct 13, 2020

As I sat down to write this next chapter, Senator Mitt Romney of Utah released a statement calling our national discourse “a vile, vituperative, hate-filled morass.”

Regrettably, that quote is the sad undercurrent of my project. I set out to prove that there is a clear distinction between two competing political ideologies, a difference that even a computer could easily perceive. This project is evidence not only that the divide exists, but that it is vast and seemingly insurmountable.

Part Dos: Scrape It All, Let Pandas Sort it Out

Here’s where we are so far: I can’t tell users apart by simply reading one tweet mentioning the hashtag. My working assumption is that anyone who tweets the hashtag must be an active political user, and active political users must have other political tweets that can provide new clues.

Finding these “political tweets” was a classic needle-in-a-haystack problem. On top of that, the workaround scraper I found gave me almost no ability to filter what I received. That scraper, snscrape, is somehow able to get around the new Twitter API restrictions and HTML changes, but a trade-off exists: if I want some of a specific user’s tweets, I have to take them all.

So I built a function that went through my chronological list of users and scraped every single tweet they had ever posted. The scraper was brutally efficient; at one point I pulled over 350,000 tweets in a single hour. This was my first experience with that sort of horsepower, and I learned some important lessons about data usefulness and data storage that I’m sure will be pertinent in my future employment.

For every user, I downloaded their full Twitter presence: every tweet, every emoji, every retweet, from 2020 all the way back to 2009 in some cases. The only information I could not recover was the content of the retweets, but I did get the Twitter handle of the user they retweeted, which turns out to be a surprisingly important piece of the overall project, so stay tuned.
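For the curious, here is roughly what that scraping loop looked like. This is a minimal sketch rather than my exact code: it assumes snscrape’s (at the time, undocumented) Python module, and attribute names like content and retweetedTweet reflect the library as of 2020, so they may differ in current versions. The usernames list stands in for the chronological list of users gathered in Part 1.

```python
import pandas as pd
import snscrape.modules.twitter as sntwitter

def scrape_user_history(username):
    """Pull every available tweet from one user's timeline into rows."""
    rows = []
    for tweet in sntwitter.TwitterUserScraper(username).get_items():
        rows.append({
            "user": username,
            "date": tweet.date,
            "content": tweet.content,
            # Retweet bodies aren't recoverable, but the original
            # author's handle is, and that turns out to matter later.
            "retweeted_user": (tweet.retweetedTweet.user.username
                               if tweet.retweetedTweet else None),
        })
    return pd.DataFrame(rows)

# Stack every user's timeline into one big frame
# and let pandas sort it out.
corpus = pd.concat(
    (scrape_user_history(u) for u in usernames), ignore_index=True
)
```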

Between the two camps of users, the distinction was easy to see, and my assumption that those who tweet on political hashtags are political was immediately confirmed. Visit a user’s timeline and find constant calls for the arrest of Barack Obama alongside a heavy Bible presence, and it does not take much to discern their political party. That was the norm, not the exception; very, very rarely did I find an ambiguous timeline.

Taking a step back here: as political as this project is, I try not to take sides. I think from any reasonable perspective there are two competing ideologies in this country, and they share very few similarities. The crux of this project is that liberals sound like liberals, and conservatives sound like conservatives. All of this relies on being able to clearly differentiate between the two, and I am hyper-aware of the presence of inherent bias in this project.

Google “Pierre Delecto” and thank me later.

My scraper is now running hot, and I need to start thinking about how I can feed these words to my computer in a way that it can understand. My first option would be to use a Count Vectorizer to represent these users as a term-document matrix. The features for the model would then be the number of times a certain word is used in one document compared to another. Theoretically, with this representation, a user who uses the word Trump more often than another is more likely a conservative. The problem here is that I’m comparing users who range from prodigious retweeters to those who post rarely, if ever. Comparing a user who tweets the word Trump 457 times to one who tweets it 4 times will yield nothing but noise. Count Vectorization is wholly inappropriate for our needs.
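To make the noise problem concrete, here is a toy example with scikit-learn’s CountVectorizer (my assumed implementation; any count vectorizer behaves the same way). The two documents are invented for illustration, each standing in for one user’s entire flattened timeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical users, each joined into a single "document".
docs = [
    "trump " * 457 + "economy stimulus",  # prodigious tweeter
    "trump " * 4 + "economy stimulus",    # occasional tweeter
]

vec = CountVectorizer()
counts = vec.fit_transform(docs).toarray()
print(sorted(vec.vocabulary_))  # ['economy', 'stimulus', 'trump']
print(counts)
# [[  1   1 457]
#  [  1   1   4]]
# Both users mention exactly the same words; the raw counts
# measure tweet volume, not ideology.
```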

In a similar vein, term frequency-inverse document frequency (TF-IDF) counts the frequency of word use but multiplies it by the inverse of its frequency across all the other documents. This gives us a similar document-term matrix, but with a bit more context on word choice. It gets us closer to what we’re looking for, but everyone who tweets about politics is going to use the word Trump. And stimulus. And economy. Again, with such a huge and varied set of “documents,” anything to do with counting word occurrences is almost useless. We need to go deeper and find more ways to represent the context of the words used by these active tweeters.
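The same toy setup, again with invented documents and scikit-learn as my assumed implementation, shows both the improvement and the limitation: the shared political vocabulary gets discounted, but since everyone shares it, almost nothing distinctive is left.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical timelines: both sides use the same headline words,
# and only one token per user is actually distinctive.
docs = [
    "trump stimulus economy maga",    # notionally conservative
    "trump stimulus economy resist",  # notionally liberal
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
for word, col in sorted(tfidf.vocabulary_.items()):
    print(f"{word:>8}  {weights[:, col]}")
# "trump", "stimulus", and "economy" appear in every document, so
# their inverse document frequency deflates them equally in both
# rows; only "maga" and "resist" carry weight that separates users.
```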

Stay tuned for our next exciting chapter, where I use Doc2Vec with Gensim to begin training a model. Don’t miss it!
