Deciphering polititalk: A natural language processing approach

Caroline Hamberger
8 min read · Feb 16, 2023


In 2011, Business Insider very aptly asked: Why do all politicians sound the same? The average American, bored by the repetitive political discourse that the 24-hour news cycle produces, may have been asking themselves this question for even longer. Today, I’ll be dipping my toes into the treacherous waters of politics to try to determine just how similar all those long-winded speeches end up being.

The speeches

Imagine, if you will: the year is 1997. Bill Clinton is president, Elton John’s “Candle in the Wind” is at the top of the charts, and I wasn’t even born yet — simpler times indeed. With the beginning of the year, the 105th United States Congress is starting its legislative period — and that means lots and lots of speeches. We’re starting off our little experiment with Tom Campbell (Democrat) and John McCain (Republican). After reading in one speech from each of them, it was time to preprocess: take out stopwords, punctuation, and irrelevantly short words, then tokenize everything.

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('punkt') and nltk.download('stopwords') may be needed first

def text_preprocessor(text):
    # replace anything that isn't a word character with a space
    text = re.sub(r'\W', ' ', text)
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    # also throws out irrelevantly short words and dataset markup / speaker names
    tokens = [word for word in tokens
              if len(word) >= 4 and word not in ["text", "docno", "campbell", "mccain"]]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

I then tokenized both speeches and counted every instance of each word used. This gave me a first, very simple bar chart of the most frequent words in each speech. And because this wouldn’t be data science without it, I also threw in a word cloud.
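In case you want to follow along, a minimal sketch of that counting step could look something like this. I’m assuming the raw speech texts are already sitting in campell and mccain, and that the matplotlib and wordcloud packages are installed:

from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# tokenize the preprocessed speeches (these token lists come in handy again later)
campell_tokens = text_preprocessor(campell).split()
mccain_tokens = text_preprocessor(mccain).split()

# bar chart of the ten most frequent words in the Campbell speech
words, counts = zip(*Counter(campell_tokens).most_common(10))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# and, of course, the word cloud
cloud = WordCloud().generate(' '.join(campell_tokens))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()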

Now it’s time to get into the actual meat of the story: applying a TF-IDF vectorizer. It’s a common tool for converting text into numerical vectors. Each vector element represents the importance of a word in a document, based on how often it appears in that document, offset by how common the word is across all documents. I also set min_df=2 to weed out one-off terms: with an integer value, scikit-learn keeps only terms that appear in at least two documents, which here means in both speeches.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# TF-IDF vector; min_df=2 keeps only terms that appear in both speeches
tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocessor, min_df=2)

speeches = [campell, mccain]
counts_tfidf = tfidf_vectorizer.fit_transform(speeches)

# turning the speeches into a pandas data frame
df_tfidf = pd.DataFrame(counts_tfidf.toarray().transpose(),
                        index=tfidf_vectorizer.get_feature_names_out())
df_tfidf.columns = ["Campbell", "McCain"]

# one row vector per speech
txt1 = df_tfidf["Campbell"].values.reshape(1, -1)
txt2 = df_tfidf["McCain"].values.reshape(1, -1)

As our first measure of similarity, we’re using cosine similarity:

print("Cosine similarity Campell and McCain speeches:", cosine_similarity(txt1, txt2))
# Cosine similarity Campell and McCain speeches: 0.84615221.

What this tells us is that the two speeches are quite similar, with a similarity score of 0.846 out of a possible perfect score of 1.
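If you’re wondering what cosine_similarity actually computes, here is a minimal re-implementation with NumPy. The helper manual_cosine is my own illustrative name, not part of scikit-learn:

import numpy as np

def manual_cosine(a, b):
    # cosine of the angle between two vectors:
    # dot product divided by the product of their lengths
    a, b = np.ravel(a), np.ravel(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(manual_cosine(txt1, txt2))  # should match sklearn's result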

Bag of Words

Now, in natural language processing, lots of terms get thrown around: for example, “bag of words”. The difference, in this case, is using a CountVectorizer (raw word counts) instead of a TF-IDF vectorizer. So let’s do that and see what it tells us!

from sklearn.feature_extraction.text import CountVectorizer

# creates count vector
# with an integer min_df, sklearn counts documents:
# min_df=2 keeps only terms that appear in both speeches
count_vectorizer = CountVectorizer(preprocessor=text_preprocessor, min_df=2)

speeches = [campell, mccain]
counts_bow = count_vectorizer.fit_transform(speeches)

# one row vector per speech, as before
arr = counts_bow.toarray()
txt1, txt2 = arr[0].reshape(1, -1), arr[1].reshape(1, -1)

We’re using our dear cosine similarity again.

print("Cosine similarity Campell and McCain speeches:", cosine_similarity(txt1, txt2))
# Cosine similarity Campell and McCain speeches: 0.84615221

You may notice that the cosine similarity is exactly the same in both cases. That’s actually what we should expect here, though not for the reason you might think. With only two documents and min_df=2, every term that survives appears in both speeches, so the IDF weight works out to the same constant for every term. That makes each TF-IDF vector just a scaled copy of the corresponding count vector, and cosine similarity only measures the angle between vectors, which scaling leaves unchanged. With a larger corpus, where the IDF weights differ from term to term, the two methods would generally produce different scores.
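A quick toy demonstration of that scale invariance, with made-up numbers:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[3, 1, 0, 2]])
b = np.array([[1, 2, 1, 1]])

# scaling a vector by any positive constant leaves the angle unchanged
print(cosine_similarity(a, b))      # [[0.70710678]]
print(cosine_similarity(5 * a, b))  # identical result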

This also stays approximately the same when including n-grams (short sequences of consecutive words). Let’s first look at the code.

# adding an n-gram range to the count vectorizer:
# unigrams, bigrams and trigrams
count_vectorizer = CountVectorizer(preprocessor=text_preprocessor, min_df=2,
                                   ngram_range=(1, 3))

counts_bow_n = count_vectorizer.fit_transform(speeches)

## also creating a pandas data frame and rebuilding txt1 and txt2, as before

print("Cosine similarity Campbell and McCain speeches:", cosine_similarity(txt1, txt2)[0, 0])
# Cosine similarity Campbell and McCain speeches: 0.83232158

The slightly lower score reflects the fact that n-grams capture more specific phrasing: while Campbell and McCain may use many of the same individual words, they are less likely to share exact two- and three-word phrases. The n-grams also add many extra features to the vectorizer, and features that carry little signal can dilute the weight of the more important shared terms, nudging the similarity score down.
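If you want to see what those extra features actually look like, you can peek at the vocabulary the fitted vectorizer built. This is a small inspection snippet; the slicing is arbitrary:

features = count_vectorizer.get_feature_names_out()
print(len(features))  # far more features than with unigrams alone

# show a handful of the multi-word features
ngrams = [f for f in features if ' ' in f]
print(ngrams[:10])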

Another way of measuring

Let’s now look at a new way of comparing these two speeches: the Jaccard similarity!

Jaccard similarity is a measure of similarity between two sets (in this case the speeches). It is defined as the size of the intersection of the sets divided by the size of the union of the sets. In the context of text similarity, Jaccard similarity measures the overlap between the unique words in the two documents, ignoring the frequency or order of the words. A Jaccard similarity of 1 means that the two sets are identical, while a Jaccard similarity of 0 means that the two sets have no elements in common.

# define Jaccard similarity function
def jaccard_similarity(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    return len(intersection) / len(union)

# calculate Jaccard similarity between the speeches
campbell_set = set(campell_tokens)
mccain_set = set(mccain_tokens)
jaccard_sim = jaccard_similarity(campbell_set, mccain_set)

print("Jaccard similarity between Campbell and McCain speeches:", jaccard_sim)
# Jaccard similarity between Campbell and McCain speeches: 0.3683385579937304

There is a clear difference between the Jaccard similarity (0.3683) and the cosine similarity (0.8462): the Jaccard score is notably lower. This is because cosine similarity considers not just which words the speeches share, but the frequency with which those words appear, while Jaccard similarity only considers the presence or absence of words. If some words are repeated frequently in one speech but not in the other, cosine similarity gives those words more weight; Jaccard similarity does not.
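A toy example makes the difference concrete (the two mini “speeches” here are invented):

import numpy as np

doc_a = "budget budget budget budget reform".split()
doc_b = "budget reform".split()

# identical vocabularies, so the Jaccard similarity is a perfect 1.0
print(jaccard_similarity(doc_a, doc_b))  # 1.0

# but the count vectors (4, 1) and (1, 1) point in noticeably
# different directions, so the cosine similarity is lower
a, b = np.array([4, 1]), np.array([1, 1])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.857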

In this case, Jaccard similarity makes sense to use because it focuses on the overlap of the unique words between the two speeches, rather than the frequency of each word. Since both speeches are about a similar topic (presumably politics or government), it is likely that they will have some common themes and topics, but will not necessarily use the same words or phrasing.

Final playing around

To avoid any crashes, I also used a fun and, at least to me, new technique: pickling! You save your preprocessed data as “pickles”, serialized binary files that can be loaded back in an instant, so you don’t have to redo all your work once you close your notebook.

import os
import pickle

input_folder = "105-extracted-date/"
preprocessed_folder = "105-extracted-date-preprocessed/"

# preprocesses and tokenizes a single file, then saves the tokens as a pickle
def preprocess_and_save(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    preprocessed = text_preprocessor(text)
    tokens = preprocessed.split()
    file_name = os.path.splitext(os.path.basename(file_path))[0] + ".pickle"
    with open(os.path.join(preprocessed_folder, file_name), "wb") as f:
        pickle.dump(tokens, f)

# preprocessing and tokenizing each file in the input folder,
# and saving the result to a file in the preprocessed folder
if not os.path.exists(preprocessed_folder):
    os.makedirs(preprocessed_folder)
for file_name in os.listdir(input_folder):
    file_path = os.path.join(input_folder, file_name)
    preprocess_and_save(file_path)

Since everything is now sufficiently crash-proof, it’s time to load the files back in and get on with the analysis.

# loop through all files in the directory
for file_name in os.listdir("105-extracted-date-preprocessed"):
    if file_name.endswith(".pickle"):
        # e.g. "105-biden-de.pickle" -> "biden"
        name = file_name.split("-")[1]
        with open(f"105-extracted-date-preprocessed/{file_name}", "rb") as f:
            tokens = pickle.load(f)
        text = " ".join(tokens)
        with open(f"105-extracted-date-preprocessed/{name}.txt", "w") as f:
            f.write(text)

# collecting all speeches into one big dictionary, keyed by speaker name
path = "105-extracted-date-preprocessed"
all_speeches = {}

for file_name in os.listdir(path):
    name, ext = os.path.splitext(file_name)
    if ext == ".txt":
        with open(os.path.join(path, file_name), "r", encoding="iso-8859-1") as f:
            content = f.read()
        all_speeches[name] = content

Finally, let’s also pre-process Biden and generate that final list of sorted cosine similarities!

# loading the "biden" speech
biden = open("105-extracted-date/105-biden-de.txt").read()

## regular biden preprocessing

# weighing those words that appear less as less important
tfidf = tfidf_vectorizer.fit_transform(list(all_speeches.values()))

# getting the index of the "biden" speech (should be 0)
biden_index = list(all_speeches.keys()).index('105-biden-de.txt')

# getting the vector representation of the "biden" speech
biden_vector = tfidf[biden_index].toarray().flatten()

# looping through all speeches and calculate cosine similarity
similarity_dict = {}
for name, speech in all_speeches.items():
speech_index = list(all_speeches.keys()).index(name)
speech_vector = tfidf[speech_index].toarray().flatten()
similarity = cosine_similarity(biden_vector.reshape(1, -1), speech_vector.reshape(1, -1))[0, 0]
similarity_dict[name] = similarity

# sorting the similarity dictionary by descending order of cosine similarity
sorted_dict = {k: v for k, v in sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True)}

This gives us our final output!

# printing the sorted dictionary
print("Cosine similarity between Biden speech and other speeches:\n")
for name, similarity in sorted_dict.items():
    print("{:<25} {:<10.4f}".format(name, similarity))

# Cosine similarity between Biden speech and other speeches:
abraham 1.0000
dodd 0.7440
dewine 0.7420
craig 0.7273
bingaman 0.7226
ashcroft 0.7177
campbell 0.7109
enzi 0.7009
daschle 0.6849
durbin 0.6836
boxer 0.6812
coats 0.6753
baucus 0.6715
byrd 0.6706
collins 0.6616
bond 0.6552
cleland 0.6528
dorgan 0.6524
bryan 0.6486
brownback 0.6474
conrad 0.6423
burns 0.6391
breaux 0.6385
allard 0.6326
bumpers 0.6177
coverdell 0.6151
bennett 0.6048
chafee 0.5877
akaka 0.5708
domenici 0.5693
cochran 0.5657
damato 0.5294

Spencer Abraham, with a perfect similarity score of 1.0, can probably be disregarded as an outlier. Well, either that, or there’s been some serious plagiarism!

Looks like Biden shares the largest similarity with Senator Christopher J. Dodd — a Democrat. Any surprise on your part?

Final thoughts

I really liked this little look into some political rhetoric. It saved me a fair amount of time that I would otherwise have spent actually reading all those speeches! The Jaccard similarity proved notably helpful, especially when you want to look at vocabulary overlap independently of how often each word is repeated.
