I talk just like my friends

11 min read · Feb 16, 2023

Exploring Similarity Measures for NLP Tasks

Natural Language Processing (NLP) is an exciting field that involves training machines to understand human language. Many problems in the NLP domain involve computing the similarity or distance between two pieces of text. For example, we might want to compare the similarity between two speeches or two product reviews to identify whether they discuss the same topic or opinion. There are several similarity measures in the NLP domain, and in this article, we will explore some of the most commonly used measures, including Cosine Similarity, Jaccard Similarity, Euclidean Distance, Manhattan Distance, and Pearson Correlation Coefficient.

To showcase how these similarity measures work in practice, we will use a dataset of speeches from the United States Senate. We will extract the text from each speech and preprocess it by removing non-word characters, stop words, and very short tokens. After that, we will vectorize the speeches using the TF-IDF (Term Frequency-Inverse Document Frequency) approach and explore the frequency of words in the speeches. Then, we will calculate the similarity measures and analyze the results.

Data Acquisition

The first step in our analysis is to extract the text from the Senate speeches. We have a folder containing several files with XML markup (stored with a .txt extension) that hold the speeches. We can extract the text content from each file using Python’s Beautiful Soup library. The code below shows how we can achieve this.

import os
import pandas as pd
import chardet
from bs4 import BeautifulSoup

# Path to the folder containing the files
folder_path = "/Users/mukhamejan/Desktop/school/Winter_23/ECBS6253/ML-for-NLP-main/Inputs/105-extracted-date"

# Create an empty dictionary to store the text content with senator names as keys
senator_text = {}

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    if filename.endswith(".txt") and filename.startswith("105-"):
        # Extract senator name from the filename
        senator_name = filename.split("-")[1]

        # Detect encoding and read the file content
        with open(os.path.join(folder_path, filename), "rb") as f:
            result = chardet.detect(f.read())
            file_encoding = result["encoding"]
        with open(os.path.join(folder_path, filename), "r", encoding=file_encoding) as f:
            xml_content = f.read()

        # Remove any extra content after the XML document
        xml_content = xml_content[:xml_content.rfind("</DOC>") + len("</DOC>")]

        # Parse the file content using BeautifulSoup
        soup = BeautifulSoup(xml_content, "xml")

        # Extract the text content from the <TEXT> element
        text = soup.find("TEXT").text.strip()

        # Store the text content with senator name as the key in the dictionary
        senator_text[senator_name] = text

# Create a Pandas DataFrame with a column for each senator's text content
df = pd.DataFrame.from_dict(senator_text, orient="index", columns=["Text"])

The senator_text dictionary stores the text of the speeches with senator names as keys, and the df DataFrame holds the same text with one row per senator and a single Text column.
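To make the parsing step concrete, here is a minimal sketch that runs on an in-memory string instead of real files. The filename `105-biden-de.txt`, the sample document, and the helper name `extract_speech` are hypothetical, and the sketch swaps Beautiful Soup for the standard library's `xml.etree.ElementTree` so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET


def extract_speech(filename, raw):
    """Return (senator_name, text) for the raw content of one file."""
    # The senator name is the second dash-separated field of the filename
    senator_name = filename.split("-")[1]

    # Drop anything that trails the closing </DOC> tag
    end = raw.rfind("</DOC>")
    if end == -1:
        raise ValueError("no closing </DOC> tag found")
    xml_content = raw[: end + len("</DOC>")]

    # Parse the document and pull out the <TEXT> element
    root = ET.fromstring(xml_content)
    text_node = root.find("TEXT")
    if text_node is None or text_node.text is None:
        raise ValueError("no <TEXT> element found")
    return senator_name, text_node.text.strip()


# Hypothetical sample: one document followed by trailing junk
sample = (
    "<DOC><DOCNO>105-biden-de</DOCNO>"
    "<TEXT>\nMr. President, I rise today.\n</TEXT></DOC>\ntrailing junk"
)
name, text = extract_speech("105-biden-de.txt", sample)
print(name, "->", text)
```

Real files may contain characters that a strict XML parser rejects, which is one reason a more forgiving parser such as Beautiful Soup can be the better choice in practice.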

Text Preprocessing

Text preprocessing is a critical step in NLP that helps to clean the text and make it suitable for analysis. In this code, a text_preprocesser function is defined to preprocess the speeches. The function performs the following operations:

Replaces all non-word characters in the text with a space using the re.sub function from the re module.
Converts the text to lowercase using the lower method.
Tokenizes the lowercased text into individual words using the word_tokenize function from the nltk library.
Filters out any stop words (common words such as ‘a’, ‘the’, etc.) using a list of stop words from the nltk library's stopwords corpus.
Filters out any tokens that have a length of less than 3.
Joins the remaining tokens into a single string using the join method.
Returns the preprocessed text.

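import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def text_preprocesser(text):
    # Replace non-word characters with spaces, then lowercase and tokenize
    text = re.sub(r'\W', ' ', text)
    tokens = word_tokenize(text.lower())
    # Drop stop words and tokens shorter than 3 characters
    tokens = [token for token in tokens if token not in STOP_WORDS and len(token) >= 3]
    return ' '.join(tokens)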
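To see what these steps do to a single sentence, here is a dependency-free sketch. It is an approximation, not the function above: `simple_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop word list is a tiny hand-picked stand-in for the NLTK corpus, and `str.split` replaces `word_tokenize`.

```python
import re

# Tiny illustrative stop word list (an assumption, far shorter than NLTK's)
TOY_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "for", "we"}


def simple_preprocess(text):
    # Replace every non-word character with a space
    text = re.sub(r"\W", " ", text)
    # Lowercase and split on whitespace
    tokens = text.lower().split()
    # Remove stop words
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # Remove tokens shorter than 3 characters
    tokens = [t for t in tokens if len(t) >= 3]
    return " ".join(tokens)


cleaned = simple_preprocess("The budget is up 5%, and we owe it to the taxpayers!")
print(cleaned)  # budget owe taxpayers
```

Note that the length filter also removes short but meaningful tokens such as "up" and numbers such as "5", which is worth keeping in mind when reading the word counts later.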

TF-IDF Vectorization

The term frequency-inverse document frequency (TF-IDF) is a numerical statistic that reflects the importance of a word in a document. The TF-IDF value increases with the frequency of a word in the document and decreases with the number of documents in the corpus that contain the word. The code implements TF-IDF vectorization using the TfidfVectorizer class from the sklearn.feature_extraction.text module. The vectorizer is initialized with the text_preprocesser function defined earlier as its preprocessor. The fit_transform method of the vectorizer is used to convert the text data into a matrix of TF-IDF values. A Pandas DataFrame is then created with the tokens as rows and their TF-IDF values in one column per document.


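from sklearn.feature_extraction.text import TfidfVectorizer

# The two speeches to compare (assumes senator_text is keyed by lowercase last name, like 'biden' later on)
roth = senator_text['roth']
murray = senator_text['murray']

# TFIDF Vectorize using the predefined preprocesser.
# Note: with only two documents, min_df=2 keeps just the terms that occur in both speeches
tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser, min_df=2)
# fit the vectorizer
tfidf = tfidf_vectorizer.fit_transform([roth, murray])
# build a data frame with the tokens and their tfidf value, one column per document
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
tfidf_rm = pd.DataFrame(tfidf.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())
tfidf_rm.columns = ['roth', 'murray']
# print to see
tfidf_rm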
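To show where the TF-IDF numbers come from, here is a small by-hand sketch. It follows what I understand to be scikit-learn's documented defaults (raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector). The function name `tfidf_vectors` and the two toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with one L2-normalized TF-IDF vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = {term: sum(term in toks for toks in tokenized) for term in vocab}
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[term] * idf[term] for term in vocab]
        # Scale the vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors


vocab, (vec_a, vec_b) = tfidf_vectors(["tax budget tax", "budget health"])
for term, a, b in zip(vocab, vec_a, vec_b):
    print(f"{term:8s} {a:.3f} {b:.3f}")
```

In the first toy document, "tax" gets a higher weight than "budget" because it appears twice and only in that document, while "budget" appears in both.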

To better understand the texts of the two speeches, I plotted the frequency distribution of the top 10 words in the combined speeches of Roth and Murray using the Python code provided below:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Apply text preprocesser to a combined string
tokens = text_preprocesser(roth + " " + murray).split()
# Count the tokens
dict_counts = Counter(tokens)
# Plot the frequency of top 10 words
labels, values = zip(*dict_counts.items())
# sort the values in descending order
indSort = np.argsort(values)[::-1]

# rearrange the data and keep the top 10 words
labels = np.array(labels)[indSort][0:10]
values = np.array(values)[indSort][0:10]
indexes = np.arange(len(labels))

plt.bar(indexes, values, color="red")
# add labels
plt.xticks(indexes, labels, rotation=45)

This code first combines the speeches of Roth and Murray and preprocesses the result by removing non-word characters, stop words, and short tokens. It then counts the frequency of each remaining word. The top 10 words with the highest frequency are then plotted in a bar graph.

The resulting graph shows the frequency distribution of the top 10 words in the combined speeches of Roth and Murray. The x-axis represents the top 10 words, and the y-axis shows the frequency of these words.

From the graph, we can see that the most frequent words are “program,” “million,” and “federal,” followed by “year,” “tax,” “budget,” “spending,” “health,” “education,” and “congress.” These words are commonly used in political speeches, particularly when discussing policy and budget issues.

The frequency distribution graph can help us gain insight into the topics discussed in the speeches of Roth and Murray. It shows the most common words used by the two senators, giving us an idea of the focus of their speeches. This information can be used in combination with the similarity measures to understand the similarities and differences between the speeches of Roth and Murray.

Similarity measures

1. Cosine Similarity:

Cosine similarity measures the similarity between two non-zero vectors of an inner product space. In the context of natural language processing, cosine similarity is commonly used to measure the similarity between two documents represented as vectors of word frequencies or TF-IDF scores. The measure calculates the cosine of the angle between two vectors. In general this ranges from -1 to 1, but word-count and TF-IDF vectors have no negative entries, so for documents the value falls between 0 and 1, where 1 represents the highest similarity between the two vectors.

Example usage in the code:

from sklearn.metrics.pairwise import cosine_similarity

# Get the tfidf values and calculate the cosine similarity value
similarity_rm = cosine_similarity(tfidf_rm['roth'].values.reshape(1, -1), tfidf_rm['murray'].values.reshape(1, -1))[0][0]
print("Cosine similarity between Senators Roth and Murray's speeches (TFIDF): {:.2f}%".format(similarity_rm * 100))
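For intuition, the same quantity can be computed by hand as the dot product divided by the product of the vector lengths. This is a minimal sketch on made-up vectors; the helper name `cosine` is mine and is not part of scikit-learn.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1, 2, 0], [2, 4, 0])  # one vector is a multiple of the other
no_overlap = cosine([1, 0, 0], [0, 3, 0])      # no shared non-zero components
opposite = cosine([1, 2], [-1, -2])            # only possible with negative entries

print(same_direction, no_overlap, opposite)
```

The first case shows that cosine similarity ignores vector length: a speech that repeats the same words twice as often still scores as maximally similar.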

2. Jaccard Similarity:

The Jaccard similarity measures the similarity between two sets of elements. It is calculated as the ratio of the intersection of the two sets to their union. In the context of natural language processing, the Jaccard similarity can be used to compare the similarity of the vocabulary between two documents.

Example usage in the code:

# Preprocess the documents and create a set of each
# sets contain only the unique elements in the preprocessed text
set1 = set(text_preprocesser(roth).split())
set2 = set(text_preprocesser(murray).split())

# Define Jaccard Similarity function for two sets
# The measure is equal to the count of shared unique tokens over the count of all unique tokens
def jaccard_set(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union

# Apply function
similarity = jaccard_set(set1, set2)

# Print the similarity as a percentage
print("Jaccard similarity: {:.2f}%".format(similarity * 100))
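A tiny worked example makes the ratio easy to check by hand. The two token lists are made up, and the helper name `jaccard` is hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Shared unique tokens divided by all unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)


a = "tax budget tax health".split()
b = "budget health education defense".split()

# Shared: {budget, health}; all: {tax, budget, health, education, defense}
score = jaccard(a, b)
print(score)  # 0.4
```

Because the inputs are turned into sets, the repeated "tax" counts only once: Jaccard similarity looks at which words are used, not how often.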

3. Euclidean Distance:

The Euclidean distance measures the distance between two points in space. In the context of natural language processing, Euclidean distance can be used to compare the distance between two vectors representing two documents.

Example usage in the code:

from scipy.spatial.distance import euclidean

# Calculate the Euclidean distance between the TFIDF vectors for each senator's speech
# (scipy's euclidean expects 1-D vectors, so the columns are passed without reshaping)
distance = euclidean(tfidf_rm['roth'].values, tfidf_rm['murray'].values)

# Print the distance
print("Euclidean distance: {:.2f}".format(distance))
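The sketch below computes the distance by hand on two made-up vectors and checks a useful identity: for vectors scaled to unit length, the squared Euclidean distance equals 2 - 2 × cosine similarity. This matters here because, as far as I know, TfidfVectorizer L2-normalizes its output by default (norm='l2'), so the Euclidean distance and the cosine similarity of TF-IDF vectors carry the same information. The helper names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def l2_normalize(u):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([0.0, 4.0, 3.0])

d = euclidean_distance(u, v)
cos = sum(a * b for a, b in zip(u, v))  # dot product of unit vectors = cosine
d_from_cos = math.sqrt(2 - 2 * cos)

print(round(d, 4), round(d_from_cos, 4))
```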

4. Manhattan Distance:

The Manhattan distance is similar to the Euclidean distance, but instead of calculating the distance as the straight line between two points, it calculates the sum of the absolute differences between the corresponding components of the two points. In the context of natural language processing, the Manhattan distance can be used to compare the distance between two vectors representing two documents.

Example usage in the code:

from scipy.spatial.distance import cityblock

# Calculate the Manhattan distance between the TFIDF vectors for each senator's speech
# (scipy's cityblock expects 1-D vectors, so the columns are passed without reshaping)
distance = cityblock(tfidf_rm['roth'].values, tfidf_rm['murray'].values)

# Print the distance
print("Manhattan distance: {:.2f}".format(distance))
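As a quick check of the definition, here is the calculation by hand on two made-up vectors, next to the Euclidean distance for the same pair. The helper name `manhattan_distance` is hypothetical.

```python
import math


def manhattan_distance(u, v):
    """Sum of absolute component-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


u = [0.5, 0.0, 0.2]
v = [0.1, 0.3, 0.2]

# |0.5 - 0.1| + |0.0 - 0.3| + |0.2 - 0.2|
manhattan = manhattan_distance(u, v)
# The straight-line distance for the same pair, for comparison
straight_line = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(round(manhattan, 4), round(straight_line, 4))
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of vectors, and it tends to grow with the number of terms, so the two numbers should not be compared on the same scale.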

5. The Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two variables, with a value ranging from -1 to 1, where -1 indicates a perfectly negative correlation, 1 indicates a perfectly positive correlation, and 0 indicates no correlation.

In the context of text similarity, the Pearson correlation coefficient can be used to measure the similarity of the TF-IDF vectors for two texts. In the given code, the Pearson correlation coefficient is calculated between the TF-IDF vectors of the speeches of Senators Roth and Murray. The coefficient is calculated using the pearsonr function from the scipy.stats module, which takes two arrays as input and returns the correlation coefficient and the p-value.

The Pearson correlation coefficient between the TF-IDF vectors of Senators Roth and Murray’s speeches is printed using the following code:

from scipy.stats import pearsonr
correlation, p_value = pearsonr(tfidf_rm['roth'].values, tfidf_rm['murray'].values)
print("Pearson correlation coefficient: {:.2f}".format(correlation))

The output shows the Pearson correlation coefficient between the TF-IDF vectors for the two speeches. In this case, the coefficient is 0.13.

Based on the Pearson correlation coefficient of 0.13, we can say that there is a very weak positive correlation between the speeches of Senators Roth and Murray. This is not surprising since they belong to different political parties and may have different stances on various issues.

However, it is important to note that Pearson correlation is a measure of linear correlation, which means that it may not capture non-linear relationships between variables. Its sign gives the direction of the relationship and its magnitude gives the strength, but a positive correlation does not necessarily mean that the variables are directly proportional to each other.
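The behavior of the coefficient can be checked with a short by-hand implementation on made-up data. The helper name `pearson` is hypothetical; the formula is the usual covariance divided by the product of the standard deviations.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # y rises with x
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])    # y falls as x rises
quadratic = pearson([-2, -1, 1, 2], [4, 1, 1, 4])   # y = x^2, a non-linear link

print(increasing, decreasing, quadratic)
```

The third case returns zero even though y is fully determined by x, which illustrates why a low coefficient rules out only a linear relationship.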

So far, we have explored several similarity measures to compare the speeches of two US Senators, William Roth and Patty Murray. We have used these measures to evaluate the degree of similarity between their speeches and gain insight into their political agendas and speaking styles.

Our analysis shows that the speeches of the two Senators have a very low Jaccard similarity score of 0.06%, indicating that they do not share many common words. This may suggest that they have very different political agendas and ideologies.

The Euclidean distance between the TFIDF vectors for each senator’s speech is 0.83. Because TfidfVectorizer scales each vector to unit length by default, this distance can range from 0 to about 1.41, so 0.83 indicates a moderate level of similarity between the speeches. The Manhattan distance of 3.80 is harder to judge on its own because its scale grows with the number of terms, but it is consistent with the speeches sharing some themes and topics.

However, the Pearson correlation coefficient of 0.13 is relatively low, suggesting that there is not a strong linear relationship between the two speeches. This may indicate that they have different speaking styles and may prioritize different issues in their speeches.

Overall, these results suggest that while there may be some similarities between the speeches of Senators Roth and Murray, they have distinct political agendas and speaking styles. It is important to note that these results are based on a single sample of their speeches and may not generalize to their overall political careers or other samples of their speeches. Further analysis would be needed to make more generalizable conclusions.

Finding Biden’s bestie

We can perform a similar analysis and get the similarity scores comparing the speeches of all senators to President Biden’s.

# Find the senator whose speeches are the most similar to Biden's speeches
# Combine all the senator texts into a single list
all_texts = list(senator_text.values())

# Vectorize the texts using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser, min_df =2)
tfidf = tfidf_vectorizer.fit_transform(all_texts)

# Calculate the cosine similarity between each text and Biden's text
cosine_similarities = cosine_similarity(tfidf, tfidf_vectorizer.transform([senator_text['biden']]))

# Create a DataFrame with the cosine similarities
similarity_df = pd.DataFrame({
    'senator': list(senator_text.keys()),
    'cosine_similarity': cosine_similarities.flatten()
})

# Sort by similarity, most similar first
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
# Drop Biden
similarity_df = similarity_df.drop(similarity_df[similarity_df['senator'] == 'biden'].index)
# Print the DataFrame
similarity_df

The first step is to combine all the senator texts into a single list. Then, TfidfVectorizer is used to vectorize the texts. As a reminder, TfidfVectorizer is a scikit-learn class that builds a document-term matrix of TF-IDF (term frequency-inverse document frequency) scores, a statistic that reflects how important a word is to a document within a collection of documents. Here it converts the texts into a matrix where each row represents a senator and each column represents a word in the vocabulary.

After the texts are vectorized, the cosine similarity between each senator's text and Biden's text is calculated. The result is a similarity score that ranges from 0 to 1, where 1 means that the two TF-IDF vectors point in the same direction, as they do for identical texts.

The code then creates a data frame with the cosine similarities and sorts it in descending order. The senator with the highest similarity score to Biden’s speeches is the one with the highest cosine similarity value. Finally, the data frame is printed with the senator and the cosine similarity score.
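The ranking logic described above can be reduced to a few lines of plain Python. This is only a sketch: the vectors and the names `target`, `alpha`, `beta`, and `gamma` are made up, and `rank_by_similarity` is a hypothetical helper, not part of the code above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(vectors, target):
    """Rank every other entry by cosine similarity to the target, highest first."""
    target_vec = vectors[target]
    scores = {
        name: cosine(vec, target_vec)
        for name, vec in vectors.items()
        if name != target  # drop the target itself, like dropping Biden's own row
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up document vectors over a three-word vocabulary
toy_vectors = {
    "target": [0.6, 0.8, 0.0],
    "alpha": [0.8, 0.6, 0.0],
    "beta": [0.0, 0.0, 1.0],
    "gamma": [0.0, 1.0, 0.0],
}

ranking = rank_by_similarity(toy_vectors, "target")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A score of exactly zero, as for "beta" here, means the two documents share no vocabulary terms at all.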

This code can be useful to understand which senators' speeches are similar to Biden's speeches. By comparing the similarity score between each senator and Biden, it is possible to identify which senator has the most similar speeches. The result can hint at which senators share Biden's vocabulary and topics, although similar wording alone does not establish ideological alignment.

The output shows that Bennet is the most similar senator to Biden with a cosine similarity value of 0.178762, followed by Ashcroft with 0.170275 and McCain with 0.144165. The least similar senator to Biden is Stevens with a cosine similarity value of 0.0000, followed by Murkowski with 0.004488, and Hutchison with 0.006821.

It is important to note that cosine similarity values can in general range from -1 to 1, but because TF-IDF weights are never negative, the scores here fall between 0 (no shared terms) and 1 (vectors pointing in the same direction). The cosine similarity values between Biden and the other senators in this output are therefore relatively low, indicating that the similarity between their speeches is not very high.

Here is the overall scatter plot of speeches’ cosine similarity.

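import seaborn as sns

# merge the party affiliation data with the similarity data frame
# (parties is a DataFrame of senator metadata that is loaded separately and not shown in this article)
merged_df = similarity_df.merge(parties, left_on='senator', right_on='lname').drop(['lname', 'cong', 'id', "dist"], axis=1)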
merged_df
# Create the scatter plot with hue based on party affiliation
sns.scatterplot(data=merged_df, x="cosine_similarity", y="party", hue="party")
plt.savefig("scatter.png")

In the party column, 100 codes Democrats and 200 codes Republicans. The similarity scores of Biden’s Democratic colleagues are less dispersed than those of Republicans. It is quite surprising to see that the right tail of the Republican score distribution reaches higher similarity values. I lack the domain knowledge to explain the reason behind this. Nonetheless, it is an interesting observation.
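One way to put numbers on "dispersion" and "right tail" is to compare the standard deviation and the maximum per party. The sketch below uses made-up scores purely to show the calculation; they are not the values from the plot, and `toy_scores` is a hypothetical name.

```python
from statistics import mean, pstdev

# Made-up similarity scores per party code (100 and 200 as in the plot)
toy_scores = {
    100: [0.05, 0.06, 0.07, 0.08],
    200: [0.01, 0.03, 0.09, 0.17],
}

# Mean, population standard deviation, and maximum for each party
summary = {
    party: (mean(values), pstdev(values), max(values))
    for party, values in toy_scores.items()
}

for party, (avg, spread, highest) in summary.items():
    print(f"party {party}: mean={avg:.3f} std={spread:.3f} max={highest:.3f}")
```

With the real merged_df, the same summary could be obtained by grouping on the party column.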

Overall, the code provides a useful way to identify the senators whose speeches are most similar to Biden’s speeches, which could be useful for further analysis and understanding of political discourse.


Written by Mukhamejan Assan