Text Analysis: Comparing Similarities of Speeches between U.S. Senators

Analysis of Different Methods of Text Preprocessing and Creating a Model for Predicting Senator Party Affiliations

Nicolas Fernandez
CEU Threads
6 min read · Mar 24, 2024


Politics can be a divisive subject among friends, families, and the public at large. Whether someone supports one side of an issue over another or simply holds an ideology that doesn’t mesh with someone else’s, politics is one of those topics on which almost everyone has an opinion. Knowing this, politicians generally try to cater to the audience from which they are most likely to receive support in order to secure votes and win whichever position they are running for. This practice is standard regardless of which side of the political spectrum a candidate falls on. What people may not realize, however, is how similar or dissimilar the rhetoric of one politician may be to that of another, possibly from a different region of the country or a different party altogether.

Using aggregated speeches from the U.S. Senate during 1997–98 (the 105th Congress) for each senator sitting at that time, I conducted a study to do just that: compare how similar each senator’s speeches were to one another. For this analysis, the aggregated speeches of Joe Biden were used as the baseline against which all comparisons were made, given his current position as President of the United States, to check how distinct he was from his peers at the time.

Data and Methodology

To conduct this analysis, Python was used to load the speech data from text files and served as the basis for all of the code. Over 1,300 speeches from the selected timeframe were recorded as text for each of the 100 senators, each ranging from several minutes of dialogue to statements of only a few words. For the purposes of this analysis, all speeches for each senator were aggregated into a single document rather than comparing each individual speech to a single selected speech from Joe Biden.
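
A minimal sketch of the loading and aggregation step is shown below. It assumes one plain-text file per senator with one speech per line; the directory layout and file naming are illustrative, not the actual dataset.

from pathlib import Path

def load_aggregated_speeches(speech_dir):
    """Return a dict of {senator_name: all speeches joined into one document}."""
    documents = {}
    for path in Path(speech_dir).glob("*.txt"):
        speeches = path.read_text(encoding="utf-8").splitlines()
        documents[path.stem] = " ".join(s.strip() for s in speeches if s.strip())
    return documents

documents = load_aggregated_speeches("speeches")  # hypothetical folder of per-senator .txt files
print(f"Loaded {len(documents)} senator documents")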

Speeches of 10 words or fewer were excluded from the aggregated documents so that only more substantive speeches, those providing more meaningful comparison points, were retained. In addition, words with little standalone meaning such as me, I, and you, otherwise known as stop words, were removed from each document, along with any punctuation. These steps standardize the documents and improve results by throwing out words considered to be “noisy”. Additionally, two different methods of text preprocessing were applied to the documents, discussed in more detail below: stemming and lemmatization.
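
A sketch of that basic cleaning step, assuming NLTK is available for the English stop-word list, might look like the following.

import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_speech(speech):
    """Drop speeches of 10 words or fewer; strip stop words and punctuation."""
    words = speech.split()
    if len(words) <= 10:
        return None
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return " ".join(w for w in cleaned if w and w not in STOP_WORDS)

example = "Mr. President, I rise today to speak on the budget resolution now before this chamber."
print(clean_speech(example))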

Stemming

Stemming is the process through which a given word is aggressively reduced to its root, for example turning the word ‘introduction’ into ‘introduc’ and ‘original’ into ‘origin’. The general idea is to strip away inflectional endings so that different forms of the same word (e.g. introduction, introducing, introduced) can be treated as one and the same.
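
A short stemming sketch using NLTK’s PorterStemmer is below; the exact roots produced (e.g. ‘introduct’ vs. ‘introduc’) depend on which stemmer is chosen.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["introduction", "introducing", "introduced", "original"]:
    # related forms collapse toward a shared root and can be counted together
    print(word, "->", stemmer.stem(word))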

Lemmatization

Lemmatization, on the other hand, while also potentially reducing certain words, generally preserves more context than chopping every word down to its root. In the previous example, ‘introducing’ is kept as a recognizable word instead of being reduced to ‘introduc’, though other words can still be reduced or changed. While this typically produces more sophisticated results, the drawback is that it can be more computationally taxing and time-consuming, and so may not be the best fit for very large amounts of text data.
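
A lemmatization sketch using NLTK’s WordNetLemmatizer is shown below. Note that the lemma it returns depends on the part-of-speech tag supplied: without one, ‘introducing’ is left unchanged, while with a verb tag it becomes ‘introduce’.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("introducing"))           # default noun POS -> unchanged
print(lemmatizer.lemmatize("introducing", pos="v"))  # verb POS -> introduce
print(lemmatizer.lemmatize("speeches"))              # -> speech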

TF-IDF Vectorization and Cosine Similarity

The final portion of preprocessing needed before a proper similarity analysis can be performed is TF-IDF vectorization. TF-IDF stands for term frequency–inverse document frequency; applying it transforms text into a meaningful numerical representation that a computer can work with. The TF-IDF process also assigns each word a weight based on how often it appears in a document, offset by how common it is across all documents. Once completed, this data can be fed into a calculation known as cosine similarity, which measures the similarity between two vectors as the cosine of the angle between them; in this case the vectors are the TF-IDF representations of the documents.
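
A sketch of this step with scikit-learn is below, assuming the `documents` dict built earlier (senator name mapped to aggregated, preprocessed text) and that one of its keys is "biden"; both names are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = list(documents.keys())
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([documents[n] for n in names])  # one row per senator

biden_row = tfidf[names.index("biden")]
scores = cosine_similarity(biden_row, tfidf).ravel()  # similarity of Biden's document to every senator's

ranking = sorted(zip(names, scores), key=lambda x: x[1], reverse=True)
for name, score in ranking[:10]:
    print(f"{name}: {score:.3f}")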

Cosine Similarity Results

The result of this analysis using stemming was that Senator Lieberman, a fellow Democrat from Connecticut, had the most similar document to Joe Biden’s, whereas the senator with the highest cosine similarity score under lemmatization was Senator Kyl, a Republican from Arizona. The lemmatization result was surprising given that Joe Biden was a Democrat from Delaware, neither the same party as Kyl nor from a state geographically close to Arizona. The top 10 results from each text preprocessing method can be seen below:

We can also view the average scores grouped by party affiliation:

We can see that the scores are generally close to each other and all quite high overall. It makes some logical sense that similarity scores would be high across the board, since politicians likely need a certain type of speech pattern to be persuasive for whatever their cause may be while maintaining a standard level of decorum. It is still a surprising result that Senator Kyl would score most similarly to Biden under any metric. In general, we note that with both methods of text preprocessing, Democrats score slightly higher than Republicans when compared to Biden. Below we can see a comparison of the top 20 words found for each of the highest-scoring senators:
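
One way such a “top 20 words” comparison can be produced is to pull each senator’s highest-weighted terms from the TF-IDF matrix built earlier; the sketch below reuses `tfidf`, `names`, and `vectorizer` from the previous snippet, and the senator keys are illustrative.

import numpy as np

def top_terms(name, n=20):
    """Return the n highest-weighted TF-IDF terms in a senator's document."""
    row = tfidf[names.index(name)].toarray().ravel()
    vocab = np.array(vectorizer.get_feature_names_out())
    return list(vocab[row.argsort()[::-1][:n]])

print(top_terms("biden"))
print(top_terms("lieberman"))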

Here we can see more clearly how the type of text preprocessing can affect a similarity comparison. Lemmatization retains much more context in the words that are selected and shows clearer differences between the most common words. Overall, even if the result is surprising, lemmatization is likely the more reliable method for discerning similarity between texts, given that it preserves more context and gives less weight to words that stemming might erroneously emphasize.

Predicting Party Affiliation by Speech

A Naïve Bayes machine learning model was created to see whether the party affiliation of a senator could be correctly predicted from their speech alone. Rather than use either stemming or lemmatization for this model, only the base text preprocessing was applied, removing stop words and punctuation. The available data was split 80/20 into training and test samples: the model was built on the training data and then evaluated on the test sample, data the model had never seen before. This simulates live data and prevents the model from simply memorizing the data, forcing it to generalize instead. The results from the prediction model on the test set can be seen below:
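
A sketch of such a classifier is shown below, assuming the `documents` dict and `names` list from earlier plus a hypothetical `party_by_senator` mapping of senator name to "D"/"R". Whether the original model used raw word counts or another feature representation is not specified, so plain counts are assumed here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

texts = [documents[n] for n in names]
labels = [party_by_senator[n] for n in names]  # hypothetical party lookup, not in the original

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

vectorizer = CountVectorizer(stop_words="english")  # tokenization drops punctuation and stop words
X_train_vec = vectorizer.fit_transform(X_train)     # fit vocabulary on training data only
X_test_vec = vectorizer.transform(X_test)

model = MultinomialNB().fit(X_train_vec, y_train)
print(classification_report(y_test, model.predict(X_test_vec)))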

From the results, we can see that precision is 82% when predicting a Democrat and 100% when predicting a Republican, with the recall scores mirrored (100% for Democrats, 82% for Republicans). This means the model over-predicts senators as Democrats, but also that every time it labels a senator as Republican it is correct. Since the model fails to identify every member of the Republican party, there is room for improvement. Using lemmatization here could help because, as mentioned earlier, it provides more context for the analysis while de-emphasizing words that carry little meaning, helping to refine the results. This might reduce the 100% precision on Republican predictions, but that would be a welcome trade-off for significantly improving the prediction accuracy for Democrats overall.

One limitation of this exercise and the predictive model was the amount of data compiled and used. The text on which the analysis was based (the 105th Congress mentioned earlier) covers only a brief period of time and only members of the U.S. Senate during that period. As with most models, the more data you have, the better the model will perform; in this case the number of documents used for analysis, while not small, was not especially large. One potential robustness check would be to see how the model performs on other iterations of Congress, for example the 106th Congress, and compare results.

Given access to more data, a better model could certainly be constructed, one that might not predict members of the Republican party with 100% precision, as that may be unrealistic, but one that could significantly reduce the overall prediction error.
