Exploring the language patterns of US Senators: Uncovering insights into political discourse

Anna Monisso
13 min read · Feb 16, 2023


One of the main applications of machine learning is text analysis, which lets us extract meaningful information from text itself.

In this article, some examples of text analysis are provided, focusing on the speeches of American senators during the 105th United States Congress, which met in Washington, DC between January 1997 and January 1999 (during Bill Clinton’s presidency). These speeches can be retrieved from the GitHub repository folder 105-extracted-date. The list of US Senators is included in sen105kh_fix.csv, a cross-sectional dataset containing information on the 100 US Senators in office for the different states, whose speeches are indeed contained in the 105-extracted-date folder.

So, we load the file and call it doc, then sort the dataset alphabetically by the senators’ last names (lname).

import pandas as pd

doc = pd.read_csv('https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/sen105kh_fix.csv', sep=';')
doc = doc.sort_values('lname')

The following image presents an overview of how the data appears.

First 5 observations in doc

The most relevant variables for us are lname, lstate and party; these will be useful later on.

So, in this article we will first compare the political speeches of two US Senators from the dataset above. We will use the Term Frequency-Inverse Document Frequency (TF-IDF) approach and compute the corresponding cosine similarity. A discussion of similarity measures will follow, as well as a comparison between the bag-of-words and TF-IDF approaches. Finally, we will compare the speeches of many US politicians with that of Senator Joe Biden of Delaware (in office between 1973 and 2009), with the aim of finding the speech that is closest to his.

A comparison between T. Kennedy’s and J. Kerry’s speeches

As we can see from the dataset loaded previously, Ted Kennedy and John Kerry were both United States Senators from Massachusetts. The former was in office between 1962 and 2009; the latter took office more than 20 years later, serving from 1985 to 2013. Both politicians belong to the Democratic Party, so we expect their speeches not to differ much, despite the time gap between their mandates.

doc[doc['lstate']=='MASSACH']

So, we start by downloading the two text files from the GitHub folder mentioned above and then combine them into a list called speech:

import requests

kerry_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-kerry-ma.txt"
kennedy_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/105-extracted-date/105-kennedy-ma.txt"

kerry = requests.get(kerry_url)
speech_kerry = kerry.text
kennedy = requests.get(kennedy_url)
speech_kennedy = kennedy.text
speech = [speech_kennedy, speech_kerry]
speech

The first step is defining a function that preprocesses the data before the actual text analysis; this is given by the following code.

import string
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

def preprocessing_text(text):
    # lowercase and tokenize
    words = word_tokenize(text.lower())
    # remove punctuation, drop-list words and very short tokens
    tokens = [word for word in words if word not in string.punctuation]
    tokens = [token for token in tokens if token not in complete_drop_list]
    tokens = [word for word in tokens if len(word) >= 3]
    # stem each token with the Porter stemmer and rejoin into a single string
    stemmer = PorterStemmer()
    tokens_stemmed = [stemmer.stem(word) for word in tokens]
    preprocessed_text = ' '.join(tokens_stemmed)
    return preprocessed_text

So, as we can see from the function above, we first convert the words to lower case, then filter out punctuation, remove the words in complete_drop_list and drop tokens with fewer than three characters, before applying the most commonly used stemming algorithm, the Porter stemmer, and joining all tokens back together. We choose stemming over lemmatization because the precise meaning of the words is not relevant for the purpose of this analysis: we only aim at measuring the similarity between the two speeches, without focusing on the meaning of the words used. Nevertheless, this does not mean that we want to keep words that add no useful information to our texts; otherwise they would make the two speeches look more similar than they actually are. This is why we drop the words included in complete_drop_list.
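To make the stemming-versus-lemmatization choice concrete, here is a minimal, illustrative sketch (the example words are made up, and the lemmatizer assumes the NLTK wordnet corpus has been downloaded): the stemmer simply truncates words to a crude root, while the lemmatizer maps them to a dictionary form.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # assumes nltk.download('wordnet') has been run

for word in ["studies", "families", "voting"]:
    # stem: crude truncated root; lemma: dictionary form (default noun part of speech)
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))

Since we only need a rough notion of overlap between the two speeches, the cruder (and faster) stemmer is enough. The complete_drop_list used in the function is created by the code below.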

from nltk.corpus import stopwords

drop_list_url = "https://raw.githubusercontent.com/ariedamuco/ML-for-NLP/main/Inputs/droplist.txt"
drop_list = requests.get(drop_list_url).text

# parse the downloaded drop list: one quoted word per line
droplist = [line.strip().replace('"', '') for line in drop_list.splitlines()]

# combine the drop list with the English stop-words and a few extra tokens
stopw = set(stopwords.words('english'))
complete_drop_list = stopw.union(droplist)
other_useless_words = ['docno', 'text', '/text', '/doc', 'doc', '/docno', 'mr.', "n't", 'would', 'President', 'Senator', 'Senators']
complete_drop_list = complete_drop_list.union(other_useless_words)
len(complete_drop_list)

In complete_drop_list we first include droplist.txt from the GitHub repository, an already prepared list of words to remove. We use this list because the words it contains do not actually add information to the speeches and could lead to misleading results in our text analysis. We also add another list of useless words, called other_useless_words. These are identified by applying the preprocessing_text function to the data and then using the code below to generate a bar plot of the 50 most frequent tokens; among these, we pick out those that do not add useful information to the speeches and include them in other_useless_words. Finally, we add the English stop-words to complete_drop_list. In total, this list contains 786 words.

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# preprocess both speeches together and count token frequencies
preprocessed_words = preprocessing_text(" ".join(speech)).split()
dict_counts = Counter(preprocessed_words)
dict_counts

%matplotlib inline

# sort tokens by decreasing frequency and keep the 50 most common ones
labels, values = zip(*dict_counts.items())
indSort = np.argsort(values)[::-1]
labels = np.array(labels)[indSort][0:50]
values = np.array(values)[indSort][0:50]

# bar plot of the most frequent tokens
indexes = np.arange(len(labels))
plt.bar(indexes, values, color="red")
plt.xticks(indexes, labels, rotation=45)

For simplicity, we plot only the 50 most common words across both speeches, obtained after preprocessing the data and before adding other_useless_words to complete_drop_list. From this plot, we can see why we added ‘docno’, ‘text’, ‘/text’, ‘/doc’, ‘doc’, ‘/docno’, ‘mr.’, “n’t”, ‘would’, ‘President’, ‘Senator’ and ‘Senators’ to complete_drop_list.

In our analysis we will use the Term Frequency-Inverse Document Frequency (TF-IDF) approach, an algorithm that reweights the importance of words according to their frequency. It can be written as follows:

Formula of TF-IDF

where i is the token and j is the document (speech).
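As a reference, the textbook formulation multiplies the term frequency of token i in document j by the logarithm of the total number of documents divided by the number of documents containing i. The minimal sketch below computes this textbook version on a made-up toy corpus; note that scikit-learn’s TfidfVectorizer, which we use later, applies a smoothed IDF and L2-normalizes the rows by default, so its values differ slightly.

import math
from collections import Counter

toy_corpus = ["the senator supports the bill", "the senator opposes the amendment"]
tokenized = [text.split() for text in toy_corpus]
n_docs = len(tokenized)

def toy_tf_idf(token, doc_tokens):
    tf = Counter(doc_tokens)[token] / len(doc_tokens)   # term frequency within the document
    df = sum(token in toks for toks in tokenized)       # number of documents containing the token
    return tf * math.log(n_docs / df)                   # textbook TF-IDF

print(toy_tf_idf("bill", tokenized[0]))     # appears in one document only -> positive weight
print(toy_tf_idf("senator", tokenized[0]))  # appears in every document -> log(1) = 0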

According to the formula above, rare or extremely frequent tokens are considered less representative and are associated with a lower TF-IDF. On the contrary, tokens with a high TF-IDF are the most representative words, and indeed the most relevant ones within the document. This logic is clearly illustrated in the following graph:

Distribution of TF-IDF

Thus, we invoke this algorithm through the TfidfVectorizer class, which uses the preprocessing function defined above and takes into account both unigrams and bigrams within the speeches. Next, we fit this vectorizer on the list of documents.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(preprocessor=preprocessing_text, ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(speech)

The subsequent step is to create a dataframe of TF-IDFs. To do so, we first transpose the document-term matrix so that the variables (the tokens) become the rows and the observations (the two speeches) become the columns of the dataframe.

tfidf_transpose = tfidf.toarray().transpose()
df = pd.DataFrame(tfidf_transpose, index=tfidf_vectorizer.get_feature_names_out())
df.columns = ['Kennedy', 'Kerry']
df

An overview of the dataframe generated is shown in the following image:

Dataframe produced by TF-IDF approach

We proceed by splitting the dataframe into two vectors of 208,502 tokens each, one per speech. In this way, we can compute their cosine similarity.

for i in range(1, 3):
    globals()["txt" + str(i)] = df[df.columns[i-1]].values.reshape(1, -1)

from sklearn.metrics.pairwise import cosine_similarity
print("Similarity txt1 and txt2:", cosine_similarity(txt1, txt2))

The cosine similarity turns out to be 0.75, fairly close to 1. This implies that the speeches of Ted Kennedy and John Kerry are quite similar, as we expected given that both Senators belong to the Democratic Party.
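As a sanity check, the cosine similarity can also be recomputed by hand: it is simply the dot product of the two TF-IDF vectors divided by the product of their norms. The short sketch below assumes txt1 and txt2 are the reshaped vectors created above.

import numpy as np

v1, v2 = txt1.ravel(), txt2.ravel()
manual_cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(manual_cosine)  # should match the value returned by cosine_similarity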

Bag of words vs TF-IDF approach.

The bag-of-words approach weights each word according to its frequency within a text, disregarding grammar and word order. The Term Frequency-Inverse Document Frequency approach, on the contrary, reweights the words, giving particular importance to words that are neither too frequent nor too rare; words that are too frequent or too rare are considered unrepresentative and are heavily downweighted.
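As a small illustration of the fact that word order plays no role, the sketch below (with two made-up sentences) shows that sentences containing the same words in a different order receive identical bag-of-words vectors:

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the senator praised the bill", "the bill praised the senator"]
cv = CountVectorizer()
counts = cv.fit_transform(toy).toarray()
print(cv.get_feature_names_out())
print(counts)  # the two rows are identical: only word frequencies matter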

Because the bag-of-words and TF-IDF approaches work differently, we expect them to produce different cosine similarities.

We can check this reasoning with the following code. First of all, we define vectorizer, a CountVectorizer implementing the bag-of-words approach. As with TF-IDF, it uses the preprocessing_text function and considers both unigrams and bigrams. Then, we create the dataframe containing the frequency of every token in both speeches. Finally, we compute the cosine similarity, which turns out to be 0.79. So, the bag-of-words approach suggests a higher degree of similarity between the two speeches than the TF-IDF approach does. This can be explained by the fact that bag of words does not downweight the most frequent words in the two speeches, so it is reasonable that they look more similar.

from sklearn.feature_extraction.text import CountVectorizer

# bag-of-words vectorizer with the same preprocessing and n-gram range as the TF-IDF one
vectorizer = CountVectorizer(preprocessor=preprocessing_text, ngram_range=(1, 2))

vector = vectorizer.fit_transform(speech)
vector_transpose = vector.toarray().transpose()

df_bag = pd.DataFrame(vector_transpose, index=vectorizer.get_feature_names_out())
df_bag.columns = ['Kennedy', 'Kerry']
df_bag

for i in range(1, 3):
    globals()["txt_bag" + str(i)] = df_bag[df_bag.columns[i-1]].values.reshape(1, -1)

print("Cosine similarity txt1 and txt2:", cosine_similarity(txt_bag1, txt_bag2))

This is how the dataframe of token frequencies looks.

Dataframe produced by bag of words approach

Both the bag-of-words and TF-IDF approaches can take n-grams into account without splitting them into single terms. In fact, both the CountVectorizer and TfidfVectorizer classes have the argument ngram_range, in which you specify the lower and upper boundary of the range of n-values for the different n-grams to be extracted.
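For illustration, the small sketch below (on a made-up phrase) shows how the extracted features change with ngram_range:

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["health care reform"]
for n_range in [(1, 1), (1, 2), (2, 2)]:
    cv = CountVectorizer(ngram_range=n_range)
    cv.fit(sentence)
    print(n_range, "->", list(cv.get_feature_names_out()))
# (1, 1): unigrams only; (1, 2): unigrams and bigrams; (2, 2): bigrams only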

What is an alternative similarity measure to the cosine similarity?
There are other possible measures of similarity, which vary according to the type of variables you are using. For instance, for continuous variables we can use either the Euclidean distance or the Manhattan distance; for categorical variables we can use the simple similarity score or the Jaccard coefficient. If the variables are mixed, i.e. some are continuous and others are binary, we can employ Gower’s index. However, these are only a few examples.

Since the TF-IDF weights are continuous variables, we can use the Euclidean distance as an alternative to the cosine similarity. This is the square root of the sum of squared differences between the corresponding TF-IDF values. The closer this measure is to 0, the more similar the two speeches are.
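As a quick check of this definition, the sketch below recomputes the distance by hand from the two vectors created earlier (assuming txt1 and txt2 are still in memory) and compares it with scikit-learn’s euclidean_distances:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

v1, v2 = txt1.ravel(), txt2.ravel()
manual_euclidean = np.sqrt(np.sum((v1 - v2) ** 2))  # root of the sum of squared differences
print(manual_euclidean, euclidean_distances(txt1, txt2)[0, 0])  # the two values coincide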

The image below compares the cosine similarity, denoted by θ, with the Euclidean distance, denoted by d. As we can notice, there are differences between these two measures. In particular, the Euclidean distance does not normalize for the length of the texts, so it can be very large for two documents that are very similar in content but different in length. On the contrary, the cosine similarity takes the length of the texts into account and adjusts accordingly. Thus, it is more reliable than the Euclidean distance when documents have very different lengths.

Euclidean distance vs Cosine similarity
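The length effect can be demonstrated with a toy example (made-up sentence): a document and the same document repeated twice have identical content, so their cosine similarity is 1, while their Euclidean distance on raw counts is clearly positive.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

doc_a = "the senate passed the budget resolution"
doc_b = doc_a + " " + doc_a  # same content, twice the length

counts = CountVectorizer().fit_transform([doc_a, doc_b]).toarray()
print(cosine_similarity(counts[[0]], counts[[1]]))    # 1.0: length is normalized away
print(euclidean_distances(counts[[0]], counts[[1]]))  # > 0: sensitive to the length difference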

So, we compute the Euclidean distance for our text analysis problem.

from sklearn.metrics.pairwise import euclidean_distances
print("Euclidean distance txt1 and txt2:", euclidean_distances(txt1, txt2))

In our case, the Euclidean distance is 0.69, which would suggest that the two speeches are relatively different. However, the cosine similarity tells us that the two speeches are more similar than the Euclidean distance would imply. This discrepancy may be due to their different text lengths, which we check with the following code, obtaining the output below.

speech_kennedy = [speech_kennedy]
speech_kerry = [speech_kerry]

tfidf1 = tfidf_vectorizer.fit_transform(speech_kennedy)
tfidf2 = tfidf_vectorizer.fit_transform(speech_kerry)

# dimensions of the two document-term matrices
print(tfidf1.shape)
print(tfidf2.shape)

Dimensions of the document-term matrices of TF-IDFs

So, after putting the two speeches into separate lists, we apply the TF-IDF vectorizer created above and observe that the resulting matrices have different dimensions: the preprocessed speech file of T. Kennedy is far longer than that of J. Kerry. This explains why the Euclidean distance of the two texts tells a different story than the cosine similarity, and why we should rely on the latter. The two speeches are pretty similar to each other!

Note that we could have skipped the TF-IDF step altogether and directly observed that the lengths of the two preprocessed speeches are clearly different, as shown below.

# the speeches were wrapped in lists above, so we take the underlying strings
preproc_kennedy = preprocessing_text(speech_kennedy[0])
preproc_kerry = preprocessing_text(speech_kerry[0])

print(len(preproc_kennedy))
print(len(preproc_kerry))

Lengths of the two preprocessed speeches

Which US senator’s speech is the closest to Biden’s?

We load all 100 text files included in the GitHub repository folder 105-extracted-date through glob.glob and sort them alphabetically. Then, we call each senator’s speech speechX, where X is a number from 0 to 99 giving the position of the speech (in alphabetical order) within the folder.

import glob

# collect the 100 speech files and sort them alphabetically by file name
files = sorted(glob.glob('C:/Users/Anna Monisso/Desktop/UNIBO/CEU/corsi/Machine Learning for NLP/ML-for-NLP-main/Inputs/105-extracted-date/*.txt'))
files

for i in range(0, 100):
    globals()["speech" + str(i)] = open(files[i]).read()

Next, we create a list of some of the speeches. Because of limited computational power, we will consider only a subset of these 100 text files, but note that the same code below can be run on all of them.

So, we create the list subset containing the first 50 speeches. Note that there are more compact ways to do this (through a loop, as sketched after the code below), but we use the manual approach.

subset = [speech0, speech1, speech2, speech3, speech4, speech5, speech6, speech7, speech8, speech9,
          speech10, speech11, speech12, speech13, speech14, speech15, speech16, speech17, speech18, speech19,
          speech20, speech21, speech22, speech23, speech24, speech25, speech26, speech27, speech28, speech29,
          speech30, speech31, speech32, speech33, speech34, speech35, speech36, speech37, speech38, speech39,
          speech40, speech41, speech42, speech43, speech44, speech45, speech46, speech47, speech48, speech49]
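For reference, here is the loop-based alternative mentioned above; it builds the same list programmatically from the speechX variables created earlier:

# equivalent to the manual list above
subset = [globals()["speech" + str(i)] for i in range(0, 50)]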

So, now we apply the same process as before, just considering more speeches. The first step is to apply the tfidf_vectorizer function previously created, which uses the same preprocessing_text function and considers both unigrams and bigrams, to the list just generated. Note that running the following code takes a fairly long time given the amount of data, so parallel computing can be employed to speed up the process.

tfidf_all = tfidf_vectorizer.fit_transform(subset)

Again, we create a dataframe of TF-IDFs, where the variables are the tokens and the observations are the 50 US Senators considered. Because we took the first 50 speeches in the alphabetically ordered folder, we name the columns of this new dataframe after the first 50 US Senators’ last names in the dataset doc, loaded at the beginning.

tfidf_all_transpose = tfidf_all.toarray().transpose()
df_all = pd.DataFrame(tfidf_all_transpose, index=tfidf_vectorizer.get_feature_names_out())

name = doc['lname']
subset_name = name[0:50]
df_all.columns = subset_name
df_all

Here is how the dataframe looks.

Dataframe produced by TF-IDF approach

Similarly, we split the dataframe into 50 vectors, one per speech.

for i in range(0, 50):
    globals()["txt" + str(i)] = df_all[df_all.columns[i]].values.reshape(1, -1)

So, now we are ready to compute the cosine similarity between Joe Biden’s vector (txt6) and each of the other speech vectors. First, we create the dataframe cos_sim_df to store the cosine similarities in the first column and the names of the 50 US Senators in the second one. For the second column we take the lname column of the dataset doc, keep the first 50 names, convert them into a list and use it in cos_sim_df.

name = doc['lname']
subset_name = name[0:50]
rows = list(subset_name)

d = {'CosineSimilarity': [0.0] * 50, 'SenatorName': rows}
cos_sim_df = pd.DataFrame(data=d)

Now we can compute the cosine similarity between txt6 and each senator’s vector. Because the following loop also computes the cosine similarity between txt6 and itself, we drop the corresponding row from the dataframe afterwards.

from sklearn.metrics.pairwise import cosine_similarity

for i in range(0, 50):
    cos_sim_df.loc[i, 'CosineSimilarity'] = cosine_similarity(txt6, globals()["txt" + str(i)])[0, 0]

cos_sim_df = cos_sim_df.drop([6])

The following image shows what cos_sim_df looks like.

Overview of cos_sim_df

To retrieve the maximum cosine similarity, i.e. the US senator’s speech closest to J. Biden’s, we write the following code and obtain the output below.

print(cos_sim_df.loc[cos_sim_df['CosineSimilarity'] == cos_sim_df['CosineSimilarity'].max()])

Output of the maximum cosine similarity
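An equivalent and slightly more concise way to obtain the same row is pandas’ idxmax, as sketched below:

best = cos_sim_df['CosineSimilarity'].idxmax()  # index label of the maximum cosine similarity
print(cos_sim_df.loc[best])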

Apparently, the US senator whose speech is the closest to Joe Biden’s is Jesse Helms. This is quite surprising, given that the former was part of the Democratic Party while the latter was a leader of the conservative movement in the USA, a political movement within the Republican Party influenced by conservative and Christian media organizations. Moreover, in the subset of speeches considered, almost half of the Senators were, like J. Biden, part of the Democratic Party. However, the cosine similarity is not that high either, reaching only 0.36, so the two speeches are actually rather different from each other. It is also worth mentioning that in the past Joe Biden was more conservative and traditionalist than his current political positions suggest, and thus “closer” to Republican Party ideologies. For example, he was one of the main opponents in the Senate of race-integration busing during the 1970s. Also, in 1993 Biden voted for a provision banning homosexuals from the military forces, and three years later he voted for the Defense of Marriage Act, which barred the federal government from recognizing same-sex marriages.
