Text similarity analysis on speeches by US senators

Seng Moon Ja
9 min read · Feb 16, 2023


Text similarity analysis measures how similar two pieces of text, such as words, phrases, documents, or speeches, are to one another. This is useful, for instance, for finding common themes mentioned across articles, speeches, and news stories.

This article illustrates the following analyses:

  1. Testing similarity between two speeches by US senators, namely Senator Barbara Boxer, representing California, and Senator Edward Moore Kennedy, representing Massachusetts.
  2. Comparing cosine similarity scores across the Bag-of-Words, n-gram, and TF-IDF models.
  3. Comparing similarity using Euclidean distance and Jaccard similarity.

First, to get the text files from GitHub, I use the GitHub API and the “requests” library in Python. The code below retrieves two text files, “105-boxer-ca.txt” and “105-kennedy-ma.txt”, located in the “Inputs/105-extracted-date/” directory of the repository. The API response includes a download URL from which the raw file contents are fetched.


import requests

# API endpoint for a file in a GitHub repository
url = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"

# Replace {owner}, {repo}, and {path} with the repository's owner, name, and file path
file1_url = url.format(owner="ariedamuco", repo="ML-for-NLP", path="Inputs/105-extracted-date/105-boxer-ca.txt")
file2_url = url.format(owner="ariedamuco", repo="ML-for-NLP", path="Inputs/105-extracted-date/105-kennedy-ma.txt")

# Make the API request for file1
response1 = requests.get(file1_url)

# Check if the request was successful
if response1.status_code == 200:
    # The API request was successful, parse the response
    file1_info = response1.json()
    speech1 = requests.get(file1_info["download_url"]).text
    print("Speech 1:")
    print(speech1)
else:
    # The API request was unsuccessful
    print("Failed to retrieve file 1 information")

# Make the API request for file2
response2 = requests.get(file2_url)

# Check if the request was successful
if response2.status_code == 200:
    # The API request was successful, parse the response
    file2_info = response2.json()
    speech2 = requests.get(file2_info["download_url"]).text
    print("Speech 2:")
    print(speech2)
else:
    # The API request was unsuccessful
    print("Failed to retrieve file 2 information")

The two retrieved speeches, stored in the variables “speech1” and “speech2”, are then appended to an empty list called “documents” so that they are easier to analyse together.

documents = []
documents.append(speech1)
documents.append(speech2)

Import the required toolkits for the similarity analysis: TfidfVectorizer from scikit-learn, and stopwords and word_tokenize from their respective NLTK packages.

TfidfVectorizer is a tool for converting text documents into a matrix of term frequency-inverse document frequency (TF-IDF) features. The stopwords corpus from the nltk.corpus module provides a list of stop words commonly used in English text (e.g., “the”, “a”, “and”). word_tokenize is a function for splitting text into individual words. re is Python’s built-in regular expression module for pattern matching and text processing.

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize
import pandas as pd

# Download the NLTK data used below (only needed once)
nltk.download('stopwords')
nltk.download('punkt')

Before comparing the two speeches, the text should be preprocessed with a standard recipe such as:

  • Remove capitalization and punctuation
  • Discard stop words
  • Discard word order (the Bag-of-Words assumption)
  • Create equivalence classes: stem, lemmatize, or map synonyms (a minimal lemmatization sketch follows this list)
  • Discard less useful features (depends on the application)
  • Other reduction and specialization
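
The preprocessing function used later in this article does not apply stemming or lemmatization. As a minimal sketch of the equivalence-class step, assuming NLTK and its WordNet data are available, lemmatization could look like this:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # dictionary data used by the lemmatizer (only needed once)

lemmatizer = WordNetLemmatizer()
tokens = ['senators', 'voted', 'budgets']
# The default part of speech is noun, so 'voted' is left unchanged
print([lemmatizer.lemmatize(token) for token in tokens])  # ['senator', 'voted', 'budget']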

To remove punctuation, import the “string” module and use string.punctuation, which contains the standard punctuation marks in English (e.g., “ ! ”, “ . ”, “ ? ”). This constant can be used to remove punctuation from a text string or to tokenize a string into words while ignoring punctuation.

import string
string.punctuation
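
As an illustration (the article’s pipeline below uses a regular expression instead), str.translate can strip these characters from a string:

import string

sample = "Mr. President, I rise today!"
# Remove every character listed in string.punctuation
print(sample.translate(str.maketrans('', '', string.punctuation)))
# Mr President I rise today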

Then, use Python’s re (regular expression) module to perform a substitution on the text. The regular expression “\W” matches any character that is not a word character (i.e., letters, digits, and underscores), and re.sub replaces these characters with a space.

Then, use word_tokenize, an NLTK function that splits a given sentence into words. Convert the text to lower case and remove the speaker names that appear in every paragraph, such as [‘campbell’, ‘boxer’, ‘kennedy’]. After that, stop words are removed from the text. Stop words are common words like ‘the’, ‘and’, ‘I’, etc. that are very frequent in text and so don’t convey insight into the specific topic of a document.

def text_preprocesser(text):
    text = re.sub(r'\W', ' ', text)
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in ['DOC', 'DOCNO', 'TEXT', 'President', 'would', '105']]
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if token not in ['campbell', 'boxer', 'kennedy']]
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    tokens = [word for word in tokens if len(word) >= 3]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
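
A quick sanity check of the function on a made-up sentence (the exact output depends on NLTK’s English stop word list):

print(text_preprocesser("Mr. President, I would like to THANK Senator Boxer for the bill."))
# expected output: something like "like thank senator bill"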

After pre-processing the speeches, the TfidfVectorizer class converts the collection of raw documents into a matrix of TF-IDF features. It calls the text_preprocesser function to perform the necessary pre-processing, and min_df=2 sets the minimum number of documents a term must appear in to be included in the vocabulary. The matrix is then converted to an array and transposed, and the terms from tfidf_vectorizer.get_feature_names_out() are used as the row index of the resulting DataFrame.

tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser, min_df=2)
tfidf = tfidf_vectorizer.fit_transform(documents)
# Create a DataFrame from the TF-IDF matrix
df = pd.DataFrame(tfidf.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())

# Set the column names (one column per speech)
df.columns = ['boxer', 'kennedy']

Before testing text similarity, it is interesting to look at which words are used most frequently by both speakers.

Most frequent words from both speeches
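
A ranking like this can be produced from the df built above by treating the TF-IDF weight as a rough proxy for how prominent a term is in each speech; a minimal sketch:

# Terms with the highest combined TF-IDF weight across both speeches
print(df.sum(axis=1).sort_values(ascending=False).head(10))

# Highest-weighted terms per speech
for column in df.columns:
    print(column, df[column].sort_values(ascending=False).head(5).index.tolist())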

To check similarity, three methods commonly used in Natural Language Processing (NLP) are described below (a toy numerical comparison of the three follows the list):

  1. Cosine Similarity: measures the cosine of the angle between two vectors in a high-dimensional space. It ranges from -1 to 1, with 1 indicating that the two vectors point in exactly the same direction and -1 that they point in opposite directions; for non-negative text vectors such as TF-IDF, the value lies between 0 and 1.
  2. Jaccard Similarity: measures the similarity between two sets as the ratio of the size of their intersection to the size of their union. It ranges from 0 to 1, with 1 indicating that the two sets are identical.
  3. Euclidean Distance: measures the straight-line distance between two points in a multi-dimensional space. It ranges from 0 to infinity, with 0 indicating that the two vectors are identical; larger values indicate greater dissimilarity.
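
As a toy numerical comparison of the three measures (the vectors and sets below are made up for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Two small count vectors over a shared four-word vocabulary
a = np.array([[1, 2, 0, 1]])
b = np.array([[1, 1, 1, 0]])

print(cosine_similarity(a, b))    # about 0.71: similar direction
print(euclidean_distances(a, b))  # about 1.73: straight-line distance

# Jaccard compares the sets of words each document contains
doc_a = {"budget", "health", "tax"}
doc_b = {"budget", "health", "energy"}
print(len(doc_a & doc_b) / len(doc_a | doc_b))  # 2 shared out of 4 distinct = 0.5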

First, compute the cosine similarity between the two speeches using the TF-IDF vectors.

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity between the two documents
cos_sim_TFID = cosine_similarity(tfidf[0], tfidf[1])
print(cos_sim_TFID)

Then, it is interesting to compare the cosine similarity obtained with the TfidfVectorizer against the Bag-of-Words and N-gram representations.

#Bag of words approach

from sklearn.feature_extraction.text import CountVectorizer

# Preprocess the documents
processed_speech1 = text_preprocesser(documents[0])
processed_speech2 = text_preprocesser(documents[1])

# Create a CountVectorizer object and fit it to the documents
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform([processed_speech1, processed_speech2])

# Calculate the cosine similarity between the two documents
cos_sim_bag = cosine_similarity(count_matrix[0], count_matrix[1])
print(cos_sim_bag)



#N-gram approach

# Create a CountVectorizer object with unigrams and bigrams and fit it to the documents
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
count_matrix = count_vectorizer.fit_transform([processed_speech1, processed_speech2])

# Calculate the cosine similarity between the two documents
cos_sim_ng = cosine_similarity(count_matrix[0], count_matrix[1])
print(cos_sim_ng)

After running this code, the results are as follows:

Comparing TF-IDF, Bag of Words, and N-grams

TF-IDF Vectorizer: calculates the importance of each word in a document based on how frequently it appears in that document, offset by how often it appears across all documents in the corpus. The cosine similarity with TF-IDF is higher than with Bag of Words or N-grams.

Bag of Words: builds a vocabulary of all words in the corpus and counts the number of occurrences of each word in each document. It gives a cosine similarity almost identical to TF-IDF, but slightly lower.

N-gram CountVectorizer: counts the frequency of contiguous sequences of n words in each document. It helps capture information about word order and context, and n-grams are also useful for sentiment analysis.
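
To make the n-gram idea concrete, here is what ngram_range=(1, 2) extracts from a made-up sentence (scikit-learn returns the vocabulary alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the senator supports the bill"]
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectorizer.fit(toy)
print(ngram_vectorizer.get_feature_names_out())
# ['bill' 'senator' 'senator supports' 'supports' 'supports the' 'the' 'the bill' 'the senator']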

The results suggest that the TF-IDF method may be slightly more effective at capturing the important words in the documents.

Next, it is interesting to compare the results of the three similarity methods: cosine similarity, Jaccard similarity, and Euclidean distance.

# Jaccard similarity between the two documents:

# Jaccard works on sets rather than weighted vectors, so it is computed here
# from the sets of feature indices present in each document
# (taken from the n-gram count matrix built above)
doc1 = set(count_matrix[0].indices)
doc2 = set(count_matrix[1].indices)
jaccard_sim = len(doc1.intersection(doc2)) / len(doc1.union(doc2))
print(jaccard_sim)



# Euclidean distance between the two documents using the TF-IDF Vectorizer:
from sklearn.metrics.pairwise import euclidean_distances


# Create a TfidfVectorizer object and fit it to the documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([processed_speech1, processed_speech2])

# Calculate the Euclidean distance between the two documents
euclidean_sim = euclidean_distances(tfidf_matrix[0], tfidf_matrix[1])
print(euclidean_sim)

The following results suggest that, while the documents are similar in terms of their word usage, the similarity score differs considerably depending on the measuring method.

Similarity scores based on measuring method

Overall, cosine similarity using TF-IDF seems to be the most effective technique among the four. It gives the highest similarity score of 0.83040142, which means the two documents are highly similar. Cosine similarity using Bag of Words comes in second with a score of 0.82670145. Cosine similarity using N-grams gives a slightly lower score of 0.81094346, indicating that this technique may not be as effective as the previous two.

The Jaccard similarity appears to be the least effective of the four, with a score of 0.0752673413613115. This low score suggests that Jaccard similarity, which compares only the sets of features present in each document and ignores their weights, may not be the best choice for measuring the similarity between two long text documents.

The Euclidean distance using TF-IDF gives a score of 0.59476808. This measure differs from the previous three in that it reports the distance between the two vectors in n-dimensional space: a lower score indicates a smaller distance, meaning the vectors are more similar. In this case, the score suggests that the two documents are somewhat similar but not highly similar.
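
If a bounded similarity score is preferred over a raw distance, one common transformation (not used in the table above) is 1 / (1 + d):

d = 0.59476808  # the Euclidean distance reported above
print(1 / (1 + d))  # roughly 0.63; values closer to 1 mean more similar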

In conclusion, the choice of similarity technique depends on the specific use case and the nature of the text data. However, cosine similarity using TF-IDF is a generally effective technique for measuring similarity between text documents.

Which Senator’s speech is most similar to Senator Biden’s speech?

To explore this, I downloaded the folder of 100 speeches from the repository and ran the following code to find the 10 most similar speeches.

import os

# Set the path to the folder containing the text files
folder_path = '../../Machine Learning /ML-for-NLP-main/Inputs/105-extracted-date'

# Get a list of all the text files in the folder
file_list = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith(".txt")]

# Initialize an empty list to hold the speech documents
speech_doc = []

# Loop through each file in the file list
for file in file_list:
    with open(file, 'r', encoding='ISO-8859-1') as f:
        # Read the contents of the file and add it to the speech_doc list
        speech_doc.append(f.read())

tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser)

# Fit and transform the vectorizer to the speeches
tfidf = tfidf_vectorizer.fit_transform(speech_doc)

# Index of Biden's speech in the speech_doc array
biden_index = 6

# Calculate cosine similarity between Biden's speech and all the other speeches
cosine_similarities = cosine_similarity(tfidf[biden_index], tfidf)

# Get the index of the speech with the highest cosine similarity
most_similar_index = cosine_similarities.argsort()[0][-2]

# Get the name of the speech with the highest cosine similarity
speech_name = os.path.basename(file_list[most_similar_index])

# Print the name of the speech with the highest cosine similarity
print(f"The speech most similar to Senator Biden's speech is: {speech_name}")

# Get the top 10 most similar speeches and their cosine similarity scores
most_similar_indices = cosine_similarities.argsort()[0][::-1][1:11]
most_similar_scores = cosine_similarities[0][most_similar_indices]

# Print the top 10 speeches and their cosine similarity scores
print("Top 10 most similar speeches:")
for i in range(len(most_similar_indices)):
    speech_index = most_similar_indices[i]
    speech_name = os.path.basename(file_list[speech_index])
    similarity_score = most_similar_scores[i]
    print(f"{i+1}. {speech_name} (Cosine similarity score: {similarity_score:.2f})")

The results show:

Speech Similarity

Conclusion

Overall, this analysis provides insights into the similarity of senator speeches and how NLP techniques can be used to analyze large volumes of text data. The findings may be useful in various applications, such as identifying potential plagiarism or detecting speech patterns of politicians.
