Python, US Senators and Text Analysis: The Smart Way to Do Political Research

Jakub Kvapilik · Published in CEU Threads · Mar 23, 2024
Cover photo: Chris Grafton / Unsplash

So imagine yourself as a political researcher of the next generation. Or maybe, for starters, just a political science student. Your task could be something interesting, like comparing US senators in terms of the similarity of their political speeches. How would you do that, fast and smart?

In the old days, you would probably download (yes, the “recent old” days, not the “very old” days without the Internet) the speeches from the website of the US Senate, maybe print them and go through the never-ending, excruciating boredom of reading through every single speech, noticing patterns, underlining familiar expressions and desperately watching out for anything at least slightly amusing. The smarter (and maybe more pragmatic) students would potentially hire their friends or high school students to do the work for them.

Modern political scientists, however, have more advanced tools available. For example, they have Python, and that is a powerful tool. You can read more about my initial experience with Python here. As part of an assignment in another course, we were asked to analyse the speeches of US senators from the 105th session of the US Senate (actually, kind of the “recent old” days, back in 1998).

So, how does one do that in Python? All future political scientists, listen up. First, you need all the text files available. We did not need to go through this step, as we had the necessary documents from our instructor. Normally, however, you would need to do some automated web scraping (unless, again, you want to hire some cash-hungry juniors to download the speeches manually).
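Fetching pages aside (that part is just `requests` or `urllib`), the scraping step boils down to pulling the speech text out of HTML. Here is a minimal sketch using only the standard library; the page layout below is entirely made up for illustration, real Senate pages look different:

```python
from html.parser import HTMLParser

class SpeechExtractor(HTMLParser):
    """Collect the text found inside <p> tags (hypothetical page structure)."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data.strip())

# An invented page standing in for a downloaded speech record.
page = "<html><body><h1>Senate record</h1><p>Mr. President, I rise today...</p></body></html>"
extractor = SpeechExtractor()
extractor.feed(page)
speech = " ".join(extractor.parts)
print(speech)  # Mr. President, I rise today...
```

In a real project you would more likely reach for BeautifulSoup, but the idea is the same: isolate the speech, discard the page furniture.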

Second, once you have the text files set up and have extracted the actual speeches from your documents (there may be extra, unnecessary information around them), you need to do some text pre-processing. In a very basic setting, that might mean removing all the stop words. What are stop words? Words that are really common and do not add any particular value to your text (words like a, an, and, but, how, in, on, and many more). To simplify the content further, you may also want to lowercase all the words and remove the punctuation.
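All three steps fit in a few lines. The stop-word list below is a tiny illustrative sample; in practice you would take the full list from NLTK or spaCy:

```python
import string

# Toy stop-word set for illustration; real projects use NLTK's or spaCy's lists.
STOP_WORDS = {"a", "an", "and", "but", "how", "in", "on", "the", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, split into words, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

print(preprocess("The Senate, in session, debated the bill."))
# ['senate', 'session', 'debated', 'bill']
```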

All set? Now that you have the essential pieces, you want to analyse the actual similarity. One way to do that is with a measure called cosine similarity. Technicalities aside, if you represent texts as vectors (think of them as arrows), then, based on the words they contain, these arrows point in different directions. Cosine similarity measures how similar two vectors are by looking at the angle between them rather than their length, so a long speech and a short one can still come out as similar.

And that, eventually, is a number. Coming back to our senator example, once you are done with all the necessary code (and there are nice tutorials out there), you can rank the cosine similarity scores for each senator. Say we want to compare Senator Biden's speeches to those of all the other senators (and that, indeed, was the task in our assignment). Who was the most similar senator to Joe Biden, the senator from Delaware, in the 105th session?
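Here is a bare-bones version of the measure, computed over word counts with no external libraries; the word lists stand in for real pre-processed speeches and are invented for illustration:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented word lists standing in for real pre-processed speeches.
biden = "crime bill health care crime".split()
others = {
    "Kyl": "tax cut defense tax".split(),
    "Moynihan": "health care bill welfare".split(),
}

# Rank the other senators by similarity to Biden, most similar first.
ranking = sorted(others, key=lambda s: cosine_similarity(biden, others[s]), reverse=True)
print(ranking)  # ['Moynihan', 'Kyl']
```

Real analyses typically weight the counts with TF-IDF (scikit-learn's `TfidfVectorizer` plus `cosine_similarity` does both steps), but the geometry is exactly this.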

Well, it turned out that my analysis pointed to Jon Kyl, the Republican senator from Arizona. That, however, seems a priori pretty wrong. First, Kyl ranked very differently on the conservative-liberal scale, according to, for example, a 2006 National Journal analysis. Second, Kyl was from Arizona and Biden from the East Coast. Sure, that does not mean they must have differed politically, but geography could be a good predictor of similarity in this case. It could still be that they were indeed very like-minded and the above, trivially basic analysis is right. I am not betting on that card, though.

So what could help? Try lemmatisation. Le-mma-what? Well, we are talking about a technique that transforms words into their base or dictionary form (the lemma). For example, the word “meeting” would be transformed to “meet”, “was” to “be” and “mice” to “mouse”. Lemmatisation aims to remove inflectional endings while keeping the core meaning of words. This could make our analysis more precise.
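Conceptually, it is just a lookup from word to lemma. The tiny table below is a toy stand-in built from the three examples above; in practice you would use a real lemmatiser such as NLTK's `WordNetLemmatizer` or spaCy, which handle the general case:

```python
# Toy lemma table for illustration; a real lemmatiser (NLTK, spaCy)
# covers the whole vocabulary, not just these three mappings.
LEMMAS = {"meeting": "meet", "was": "be", "mice": "mouse"}

def lemmatise(words):
    """Replace each word with its lemma, leaving unknown words unchanged."""
    return [LEMMAS.get(w, w) for w in words]

print(lemmatise(["the", "mice", "was", "meeting"]))
# ['the', 'mouse', 'be', 'meet']
```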

When applied to my example above, the picture suddenly flipped. The most similar senator to Joe Biden now happened to be Daniel Patrick Moynihan, a Democrat from New York. Yay! Both the party and the geography now fit (sure, these are not perfect measures), so this is a kind reminder that an overly simple analysis is best avoided (unless you are deciding whether you want pineapple on your pizza).

And there is so much more you can do. Using more advanced tools like the Naive Bayes algorithm, you can try to predict which party your senator belongs to. You can do sentiment analysis, showing how “positive” or “negative” the messaging of your favourite politician appears to be. And so on.
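To give a flavour of the party-prediction idea, here is a hand-rolled multinomial Naive Bayes classifier on invented toy speeches; scikit-learn's `MultinomialNB` is what you would actually use, but the maths underneath looks like this:

```python
from collections import Counter
from math import log

# Invented labelled word lists standing in for real pre-processed speeches.
training = [
    ("D", "health care union wage".split()),
    ("D", "health union".split()),
    ("R", "tax cut defense".split()),
    ("R", "tax defense".split()),
]

vocab = {w for _, words in training for w in words}
priors = Counter(label for label, _ in training)
word_counts = {label: Counter() for label in priors}
totals = Counter()
for label, words in training:
    word_counts[label].update(words)
    totals[label] += len(words)

def predict_party(words):
    """Multinomial Naive Bayes: pick the class with the highest log-probability."""
    def log_prob(label):
        lp = log(priors[label] / sum(priors.values()))
        for w in words:
            # Laplace (+1) smoothing keeps unseen words from zeroing out a class.
            lp += log((word_counts[label][w] + 1) / (totals[label] + len(vocab)))
        return lp
    return max(priors, key=log_prob)

print(predict_party("tax defense cut".split()))  # R
```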

To conclude, Python again turns out to be a powerful tool. It can help you work with the text of your interest, leave the unnecessary behind and analyse the essential, all with the aim of reaching an informed conclusion. Political science has never had better tools at hand. If only they helped people make better decisions.
