Utilizing Text Analysis in Predicting Party Affiliations and Identifying Speech Similarities in the 105th Senate

Ian Brandenburg
Published in CEU Threads · Mar 24, 2024
Image Generated By ChatGPT: “Political Speech Text Analysis”

Text data has rapidly become more important to analyze quantitatively. Whether the source is comments, reviews, or political speeches, quantitative text analysis can feed machine learning models as well as similarity analysis.

This research project looks into speech similarities in the 105th Congress (January 3, 1997 to January 3, 1999) and explores some of the predictive models available for determining a politician's party affiliation from their speech data.

The ultimate goal is to determine which senators' speeches are most similar to Biden's and to compare their political party affiliations and states.

Text Preprocessing

The speech data underwent a series of text preprocessing methods: standard preprocessing, stemming, and lemmatizing. These techniques were kept separate for the purpose of comparison. The first method involved lowercasing the text, tokenizing the words, and removing punctuation, numbers, symbols, and stop words. Each word is treated as a token so that it can be examined individually. The stop words are a list of words deemed uninformative for the analysis and therefore removed; scikit-learn's ENGLISH_STOP_WORDS list was chosen over other options because it is a robust set of words. The function used for preprocessing is shown below.
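A minimal sketch of this kind of preprocessing function, assuming NLTK's word_tokenize for tokenization (an assumption, since the exact tokenizer is not specified here):

```python
import re

from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text: str) -> str:
    # Lowercase and replace anything that is not a letter with whitespace,
    # which removes punctuation, numbers, and symbols in one pass
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Split the cleaned text into individual word tokens
    tokens = word_tokenize(text)
    # Drop scikit-learn's built-in English stop words
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)
```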

The next method employed the Porter stemmer, which reduces each word to its root form, simplifying the text and increasing the chance of matching words that share a root. For example, "running" and "runs" are both reduced to "run". This method was applied to the already preprocessed text. The following function was created for stemming the preprocessed text.
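A sketch of this stemming step, using NLTK's PorterStemmer on the output of the preprocessing function above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(preprocessed_text: str) -> str:
    # Reduce each token to its Porter stem, e.g. "running" -> "run"
    return " ".join(stemmer.stem(token) for token in preprocessed_text.split())
```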

Text lemmatizing was the next text processing method incorporated, using the WordNet lemmatizer. Lemmatizing maps words to their base dictionary form, known as a "lemma". Unlike stemming, lemmatization does this using lexical knowledge bases. For example, the correct lemma for the word "mice" is "mouse", so the algorithm changes the word to "mouse". Lemmatization is more complex than stemming because it attempts to consider the meaning and role of a word in a sentence to produce the correct lemma. The following function was developed for running the WordNet lemmatizer on the speeches.
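A comparable sketch for the lemmatizing step, using NLTK's WordNetLemmatizer (note that without part-of-speech tags it treats every token as a noun):

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(preprocessed_text: str) -> str:
    # Map each token to its dictionary form, e.g. "mice" -> "mouse"
    return " ".join(lemmatizer.lemmatize(token) for token in preprocessed_text.split())
```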

Vectorizing and Cosine Similarity

Two vectorizing methods were used for performance comparison: count vectorizing and TF-IDF vectorizing. Count vectorizing, also known as the Bag of Words (BoW) method, simply counts how many times each word occurs in a document. This method has clear limitations because it weights all words equally, no matter how common they are. The Term Frequency-Inverse Document Frequency (TF-IDF) method was therefore used as well. It builds on the count technique but goes a step further by down-weighting words that appear in many documents and giving higher weight to words that are distinctive to a particular text, highlighting the words that carry its meaning more accurately. Once these vectorizing methods were applied, the resulting vectors were compared through cosine similarity.
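In scikit-learn, the two vectorizers might be applied roughly as follows, where `speeches` is an illustrative list holding one concatenated, preprocessed speech string per senator:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One row per senator, one column per vocabulary term
bow_vectors = CountVectorizer().fit_transform(speeches)
tfidf_vectors = TfidfVectorizer().fit_transform(speeches)
```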

Cosine similarity was used to determine which senators have the most similar speeches to Biden's in the 105th Senate. This method measures the cosine of the angle between two vectors: if the vectors point in the same direction, the angle is 0 degrees and the cosine similarity is 1; if the vectors are orthogonal (the documents share no terms), the angle is 90 degrees and the cosine similarity is 0. This was calculated on both the TF-IDF and BoW vectors.
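A sketch of the ranking step, assuming `biden_idx` (an illustrative name) marks Biden's row in the TF-IDF matrix from above:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of Biden's speech vector to every senator's vector
similarities = cosine_similarity(tfidf_vectors[biden_idx], tfidf_vectors).flatten()

# Highest scores first; position 0 is Biden himself (similarity 1), so skip it
top_five = similarities.argsort()[::-1][1:6]
```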

Determining the Top Five Most Similar Senators’ Speeches to Biden’s

Four different similarity measures were computed in analyzing which speeches are most similar to Biden's. Political party affiliations were included in the visualizations for comparison with Biden's (Democratic). None of the senators in any top five came from the same state as Biden. Each measure was computed for the purpose of comparing vectorizing and preprocessing methods. The first cosine similarity measure was calculated on the preprocessed text, without stemming or lemmatization, using TF-IDF vectorizing.

Cosine Similarity Rankings on TF-IDF Standard Preprocessed Speeches

Senator John Kerry of Massachusetts has the most similar speeches to Joe Biden's, with a mixture of Republican and Democratic senators following and Senator Kyl in second place. This could suggest that the speeches are not good measures for predicting political party affiliation. Furthermore, none of these senators are from the same state as Biden. Cosine similarity was next calculated on the stemmed text using TF-IDF vectorizing.

Cosine Similarity Rankings on TF-IDF Stemmed Speeches

The results are similar, with Senators Kerry and Kyl remaining the two most similar. There is still a mixture of political parties, but with one noticeable change: Senator Hutchison drops out of the top five and is replaced by Senator Byrd. The next text processing method analyzed with cosine similarity was lemmatization, again using TF-IDF vectorizing.

Cosine Similarity Rankings on TF-IDF Lemmatized Speeches

Senators Kerry and Kyl remain the two most similar, and the last three match the standard preprocessing method without stemming or lemmatization. Finally, cosine similarity was calculated on the Bag of Words count vectors of the standard preprocessed text, to see how this compares to the stemming and lemmatizing techniques.

Cosine Similarity Rankings on BoW Standard Preprocessed Speeches

Senator Kerry remains the most similar senator, while Senator Kyl drops out of the ranking and Senator Feinstein enters it. This could suggest that Senator Feinstein uses many of the same common words as Biden, while Senator Kyl may use more contextually important verbiage comparable to Biden's.

The cosine similarity averages of these methods are displayed below to compare the text analysis methods:

Stemming yields the highest similarity scores. This could be viewed as a limitation, since the stemming algorithm can sometimes collapse unrelated words to the same root, inflating the similarity rate. Nevertheless, based on these results, stemming appears to be the best method to proceed with. The top five most similar senators' speeches to Biden's are thus by Kerry (D), Kyl (R), Lieberman (D), Roberts (R), and Byrd (D). Party affiliation does not appear to line up consistently with Biden's. To see which party most closely tracks Biden, the average of the cosine similarity scores on the TF-IDF stemmed speeches was calculated per party and visualized.
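A sketch of this per-party average, assuming illustrative lists `parties` and `similarities_to_biden` holding each senator's party label and TF-IDF-stemmed cosine similarity to Biden (Biden's own row excluded):

```python
import pandas as pd

# One row per senator, with party label and similarity score
df = pd.DataFrame({"party": parties, "similarity": similarities_to_biden})

# Mean cosine similarity to Biden for each party, shown as a bar chart
party_means = df.groupby("party")["similarity"].mean()
party_means.plot(kind="bar", title="Average Cosine Similarity to Biden by Party")
```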

The Democratic Party has a slightly larger average cosine similarity than the Republican Party. This suggests that political affiliation has some influence on similarity, but the relationship is not necessarily causal. With similarity scores this close across both parties, these text analysis techniques may not be very effective at determining a politician's affiliation, and party may not explain why a senator's speeches are so similar to Biden's. There are large differences in the positions held by each party, so narrowing the text down to more meaningful terms may be necessary in future research.

Top Politician Comparison

The most consistent top cosine similarity score, regardless of text processing or vectorizing technique, belonged to Senator John Kerry of Massachusetts: his speeches remained the most similar to Joe Biden's throughout. Senator Kerry has several political connections to Biden. He served as Secretary of State during the Obama administration while Biden was Vice President, and he later became the first Special Presidential Envoy for Climate, sitting on the United States National Security Council under the Biden administration. Given these clear connections, the speech similarity should not come as a huge surprise.

The top words from Biden's and Kerry's speeches were extracted and displayed in count bar charts for comparison. Since the stemming method had the highest similarity scores on average, the most frequent words were extracted from the stemmed speeches.
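The frequency counts behind these charts can be produced with a sketch like the following, where `stemmed_speech` stands in for a senator's full stemmed speech text:

```python
from collections import Counter

def top_words(stemmed_speech: str, n: int = 10):
    # Return the n most common stemmed tokens with their counts
    return Counter(stemmed_speech.split()).most_common(n)
```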

Kerry’s Most Frequent Words
Biden’s Most Frequent Words

Some of the displayed words are not actual words, a known artifact of stemming. Nevertheless, their meanings can be understood easily enough. The words the two senators had most in common were "senat", "state", "make", "nation", and "year". These likely boost the similarity score between Biden's and Kerry's speeches considerably, yet they are fairly generic political words, suggesting that future research should build a politician-specific stop-word list for more accurate similarities. This project looks for the speeches most similar to Biden's, but the analysis may contain too much textual noise to represent those similarities accurately.

Predicting Political Party Using Text

Three predictive modelling techniques were employed and compared: Logistic Regression, Multinomial Naive Bayes, and Gaussian Naive Bayes. Logistic regression works well for binary classification tasks such as political party. Multinomial Naive Bayes performs well on word count vectors such as the Bag of Words representation. Gaussian Naive Bayes is typically used for continuous features and is not a standard choice for text analysis, but it was included for comparison. The test set was a 50% split of the 100 senators in the full set.
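A sketch of the model comparison, where `X` stands in for one of the document-term matrices above and `y` for the senators' party labels (both illustrative names):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score

# 50/50 split of the 100 senators into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial NB": MultinomialNB(),
    "Gaussian NB": GaussianNB(),  # requires dense input
}
for name, model in models.items():
    dense = name == "Gaussian NB"
    model.fit(X_train.toarray() if dense else X_train, y_train)
    preds = model.predict(X_test.toarray() if dense else X_test)
    print(f"{name}: {accuracy_score(y_test, preds):.2f}")
```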

The best accuracy score shows that the speeches can predict a politician's party 86% of the time. The same text preprocessing methods from above were reused to determine whether certain techniques yielded greater predictive power.

Each of the text processing methods was tested in the predictive modelling to see how the results vary with the processing and vectorizing type.

The most accurate models are the logistic regressions. They achieve 86% accuracy in predicting political party on the 50% test set from the processed text; however, there are only minimal discrepancies between the different text processing methods within the logistic regressions. This is surprising, as the cosine similarity scores varied with the text processing method.

In the logistic regression, TF-IDF on the standard preprocessed text achieves an 86% accuracy, suggesting that this combination of processing and vectorizing has strong predictive power. The TF-IDF stemmed text has the same accuracy, which could mean that stemming does not change the predictive power compared to standard preprocessing. Furthermore, BoW achieves the same accuracy, suggesting that the choice between TF-IDF and Count Vectorizer does not significantly alter the model's predictive power. However, TF-IDF on the lemmatized text decreases the accuracy to 0.84, suggesting that lemmatizing slightly reduces predictive power.

As expected, BoW on the standard processed text performs best in the Multinomial Naive Bayes model, with a 0.84 accuracy score, as this model is best suited to count vectors. The three TF-IDF text processing methods did not perform as well, all at an accuracy of 0.54. This also suggests that adding stemming or lemmatizing does not change the predictive power of the Multinomial NB model.

The Gaussian Naive Bayes model does not perform well compared to the Multinomial Naive Bayes or Logistic Regression models. This was expected, as Gaussian NB is designed for continuous features. The TF-IDF and count vectors can be passed through the Gaussian NB model (after converting them to dense arrays), but it is not ideal for predicting a binary variable from text data.

Conclusion

The use of text preprocessing methods, vectorization, and cosine similarity allowed us to determine that John Kerry's speeches were the most similar to Biden's in the 105th Congress. Stemming yielded the highest similarity scores across the different preprocessing methods. The party with the highest average similarity to Biden was the Democratic Party, although the Republican Party's score was comparable. This could be related to the amount of noise in the data, which is a limitation of the study. Further preprocessing measures should be considered in the future, such as developing a stop-word list specifically for political speeches.

To extend the study, the vectorized text data was used to predict a politician's party from their speech data. This was achieved at an 86% accuracy rate on the 50% test set. The best model was the logistic regression, which is very effective for binary variables such as a senator's party in a two-party system. Increasing the sample size could improve accuracy, and investigating and tuning the models' hyperparameters could further enhance performance in future research.

GitHub

The GitHub associated with this project can be found here.

Comparison to Previous Projects

Several students have conducted similar studies in the past. This section reviews their findings, highlights differences in methodology and results, and considers how this analysis could be improved.

Sherkhan: What do US Senators say? A text similarity analysis of the Senatorial speeches of the 105th US Congress

One of the key differences between my project and Sherkhan's is the preprocessing function, which yields starkly different results. Sherkhan removed words shorter than three characters, which could significantly influence the results. Removing words by length could strengthen the results, as many words of that length add noise without carrying meaning; however, the stop-word list applied here should already remove most of them. Additionally, Sherkhan identified many Republican senators with speeches similar to Biden's. This is consistent with my analysis, in which political party is not well predicted by TF-IDF similarity, or rather, Biden tended to be more centrist in the 105th Congress.

Monisso: Exploring the language patterns of US Senators: Uncovering insights into political discourse

Monisso explored several methodological approaches to determining the most similar speeches to Biden's. She likewise compares TF-IDF and BoW vectorizing. Her preprocessing function removes words of two letters or fewer and incorporates a large number of stop-word sources. This could strengthen the comparison between senators by removing highly repeated words, and incorporating these stop words in future research could benefit the predictive modelling of party affiliation. Monisso's final similarity scores are significantly lower than mine, likely due to her more aggressive stop-word removal. One choice of hers I found interesting was analyzing only the first 50% of each senator's speeches; with an already small sample size, this could make developing a predictive model more difficult.

Hamberger: Deciphering polititalk: A natural language processing approach

The article published by Hamberger displays a very interesting word frequency visualization, a diagram that makes the most frequent words immediately visible. This is worth analyzing because it can help identify important stop words to add. In Hamberger's preprocessing function, words with fewer than four characters are removed, which is much stricter than in the previous articles. There is ground to argue for this decision, as many words of three or fewer letters contribute little to the meaning of a speech. I suspect there may have been an indexing error in Hamberger's analysis, because a cosine similarity of 1 should only occur when Biden's speech is compared with itself; perhaps Biden's text was not removed from the list of speeches, since Hamberger read it in separately. Nevertheless, Hamberger made use of the Jaccard similarity and noted that it was effective, which could inform future research on speech similarity.

Moon: Text similarity analysis on Speeches by US senators

Moon also makes use of very interesting visualizations that allow for an easy understanding of the results. Notably, the word frequency visualization is very effective, and I think it could definitely be used in future research. Moon removes words shorter than three characters in the preprocessing function, which may cause differences between our results. Additionally, Moon found the Jaccard similarity significantly less effective than cosine similarity on TF-IDF vectors, which contradicts the findings in Hamberger's report. Moon also compares BoW, n-grams, and Euclidean distance, but still determined that cosine similarity on TF-IDF was the most effective method. The end results display an array of politicians from different parties, supporting my finding that Biden's speeches are similar not only to those of other Democrats.

Assan: I talk just like my friends

The report Assan published shows a significantly lower cosine similarity score for the most similar senator than my project does. This could stem from two methodological differences. In the loading stage, Assan refers to the HTML code to extract the text from within the `<DOC> </DOC>` tags, notably using BeautifulSoup (frequently found in web scraping scripts) to extract the text from the `<TEXT>` key. This could change the results by splitting each individual speech into a separate list entry, which could decrease the similarity scores. Additionally, Assan removes words shorter than three characters, which may also have pushed the results away from mine. Assan notably compared several similarity measures, including cosine, Jaccard, Euclidean distance, Manhattan, and Pearson correlation, and found that cosine similarity produced the most effective results for speech comparison. The end results show Republicans with an overall similarity score comparable to the Democrats'.

Overview

Across these reports, one of the most recurrent themes was removing words shorter than three characters in the preprocessing function. This was not taken into consideration in my report and would most likely strengthen the results in future research. The additional visualizations presented by Moon and Hamberger are also good options to consider when presenting these types of projects. Nevertheless, cosine similarity appears to be the best option for comparing TF-IDF vectors.
