Comparing Text Documents Using TF-IDF and Cosine Similarity in Python

Miftahul Ulyana Hutabarat
6 min read · Dec 17, 2023


How can we know whether two text documents are similar? Humans can spot the differences, but how can a computer tell whether two text documents are exactly the same or not alike at all? We can figure it out using TF-IDF and cosine similarity.

The Explanation of TF-IDF and Cosine Similarity

TF-IDF (Term Frequency–Inverse Document Frequency) is a method used to assess how important a word is in a document relative to a collection of documents. The TF-IDF score is obtained by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF). Here is the formula:
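In its usual form, the weight of a term t in a document d is

TF-IDF(t, d) = TF(t, d) × IDF(t), with IDF(t) = log(N / DF(t))

where TF(t, d) is how often t appears in d, N is the total number of documents, and DF(t) is the number of documents that contain t. (scikit-learn uses a slightly smoothed variant of IDF, shown later in this article.)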

Cosine Similarity, in turn, is a method used to measure how similar two text documents are to each other. Its value ranges from 0 to 1; the closer it is to 1, the more similar the documents. Here is the formula:
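For two TF-IDF vectors A and B,

cosine similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

that is, the dot product of the two vectors divided by the product of their (Euclidean) lengths. Because TF-IDF weights are never negative, the result always falls between 0 and 1.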

Steps for Using TF-IDF and Cosine Similarity

In natural language processing (NLP), preprocessing is the first step: it cleans and simplifies the text so it can be processed more effectively by the algorithms that follow. Here, the aim is to give the TF-IDF and cosine similarity calculations a more organized and relevant text representation to work with. Here are the steps:

Below, I’ll explain each preprocessing step one by one. I use these five steps, but depending on what’s needed you might use only three or four, and the order can also be adjusted to suit your case.

  1. Lowercasing: Changing all letters to lowercase for consistency and uniformity in analysis. For example, “Data Analysis” becomes “data analysis.”
  2. Cleaning: Removing irrelevant or disruptive characters in the text. For instance, “Hello! @World! :)” becomes “Hello World.”
  3. Tokenization: Breaking the text into separate tokens, where each word becomes a unit that can be processed. For example, “This process is very interesting.” becomes [“This”, “process”, “is”, “very”, “interesting”].
  4. Stopwords: Eliminating common words that don’t contribute significantly to the analysis. For instance, “This is a very interesting process.” becomes [“process”, “interesting”].
  5. Stemming: Changing words to their base form to treat words with the same root as the same entity. For example, “Interestingly, I am learning analysis.” becomes [“interest”, “I”, “am”, “learn”, “analysis”].

After we clean up the text, we calculate TF-IDF to see how important each word is in a document. Then we use cosine similarity to figure out how similar two documents are.

Implementation in Python

To run Python scripts, I use Jupyter Notebook (Anaconda3). Since the data I have is in CSV format, I use pandas to read it, as shown in the example below.

import multiprocessing as mp
import numpy as np
import pandas as pd
import nltk

# NLTK resources: 'punkt' is needed by word_tokenize, 'stopwords' by the stop-word filter
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

data = pd.read_csv('data.csv', delimiter=';', encoding='latin')
data

Here is the output:
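The exact rows depend on your own CSV; the only thing the functions below rely on is a DOCUMENT column containing the raw text of each document. If you don’t have the original file, a small inline corpus (hypothetical example data, not the data used in this article) works just as well:

# Hypothetical stand-in for data.csv: any DataFrame with a DOCUMENT column will do
data = pd.DataFrame({
    'DOCUMENT': [
        "Data analysis is a very interesting process.",
        "I am learning text mining with Python.",
        "Cosine similarity compares two documents.",
        "TF-IDF and cosine similarity are common in text analysis."
    ]
})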

The first step is to implement preprocessing, which involves the five steps described above: lowercasing, cleaning, tokenization, stop-word removal, and stemming. Here is the code:

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_text(text):
    # Lowercasing
    lowercased_text = text.lower()

    # Cleaning: remove punctuation and surrounding whitespace
    remove_punctuation = re.sub(r'[^\w\s]', '', lowercased_text)
    remove_white_space = remove_punctuation.strip()

    # Tokenization = breaking down each sentence into an array of words
    tokenized_text = word_tokenize(remove_white_space)

    # Stop words / filtering = removing irrelevant words
    stop_words = set(stopwords.words('english'))
    stopwords_removed = [word for word in tokenized_text if word not in stop_words]

    # Stemming = transforming words into their base form
    ps = PorterStemmer()
    stemmed_text = [ps.stem(word) for word in stopwords_removed]

    # Putting all the results into a one-row DataFrame
    df = pd.DataFrame({
        'DOCUMENT': [text],
        'LOWERCASE': [lowercased_text],
        'CLEANING': [remove_white_space],
        'TOKENIZATION': [tokenized_text],
        'STOP-WORDS': [stopwords_removed],
        'STEMMING': [stemmed_text]
    })

    return df

def preprocessing(corpus):
    # Create an empty DataFrame
    df = pd.DataFrame(columns=['DOCUMENT'])

    # Run preprocessing on each document, one by one
    for doc in corpus['DOCUMENT']:
        # Call the preprocess_text function
        result_df = preprocess_text(doc)

        # Concatenate the result of preprocessing to the main DataFrame
        df = pd.concat([df, result_df], ignore_index=True)

    return df

result_preprocessing = preprocessing(data)
result_preprocessing

Here is the output:

After preprocessing, we will proceed with the TF-IDF calculation. Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer

def calculate_tfidf(corpus):
    # Call the preprocessing result
    df = preprocessing(corpus)

    # Join each stemmed-token array back into a single sentence string
    stemming = df['STEMMING'].apply(' '.join)

    # Count TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(stemming)

    # Get the vocabulary words to use as column headers
    feature_names = vectorizer.get_feature_names_out()

    # Combine header titles and weights
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf = pd.concat([df, df_tfidf], axis=1)

    return df_tfidf

result_tfidf = calculate_tfidf(result_preprocessing)
result_tfidf

Here is the output:

Each word is assigned a weight. In the TF part, a word is weighted by how many times it appears in a document: the more often it appears, the higher its weight. In the IDF part, the weight reflects how common the word is across all documents in the collection: the more common it is, the lower its weight.
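As a quick sanity check (a toy example, not part of the data used in this article), you can see the IDF effect directly with TfidfVectorizer: a word that appears in every document ends up with a lower weight than words that appear in only one.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "movie" appears in every document, the other words in only one each
toy_corpus = ["good movie", "bad movie", "boring movie"]
vec = TfidfVectorizer()
weights = pd.DataFrame(vec.fit_transform(toy_corpus).toarray(),
                       columns=vec.get_feature_names_out())
print(weights.round(3))
# In every row, "movie" gets the smallest weight because it is common to all documents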

For your information, scikit-learn computes TF-IDF using the following formula.
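With TfidfVectorizer’s default settings (smooth_idf=True, norm='l2'), the weights are

IDF(t) = ln((1 + N) / (1 + DF(t))) + 1
TF-IDF(t, d) = TF(t, d) × IDF(t)

and each document’s vector is then normalized to unit length, which is why the values in the table above all lie between 0 and 1.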

Next, we calculate cosine similarity, using the document at index 0 as the reference and comparing it against every document. Here is the code:

from sklearn.metrics.pairwise import cosine_similarity

def cosineSimilarity(corpus):
    # Call the TF-IDF result
    df_tfidf = calculate_tfidf(corpus)

    # The first 6 columns hold the preprocessing results; TF-IDF weights start at column 6
    # Get the TF-IDF vector for the first document (index 0)
    vector1 = df_tfidf.iloc[0, 6:].values.astype(float).reshape(1, -1)

    # Get the TF-IDF vectors for all documents (including the first)
    vectors = df_tfidf.iloc[:, 6:].values.astype(float)

    # Calculate cosine similarity between the first document and every document
    cosim = cosine_similarity(vector1, vectors)

    # Flatten the result into a one-dimensional array, then wrap it in a DataFrame
    cosim = cosim.flatten()
    df_cosim = pd.DataFrame(cosim, columns=['COSIM'])

    # Combine the TF-IDF DataFrame with the cosine similarity result
    df_cosim = pd.concat([df_tfidf, df_cosim], axis=1)

    return df_cosim

cosim_result = cosineSimilarity(result_tfidf)
cosim_result

Here is the output in the COSIM column:

The COSIM column shows that, among the other documents, document 3 is the most similar to document 0, with a cosine similarity of 0.104634.
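If you prefer to pick out the most similar document programmatically instead of scanning the table, a small follow-up on the DataFrame above does the job (index 0 is dropped first, since a document is always perfectly similar to itself):

# Drop the reference document (index 0), then take the index with the highest similarity
most_similar = cosim_result['COSIM'].drop(0).idxmax()
print(f"Most similar to document 0: index {most_similar} "
      f"(COSIM = {cosim_result['COSIM'][most_similar]:.6f})")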

Thank you, friends, for reading. This case is part of the research for my final project at university. I hope it helps you explore TF-IDF and cosine similarity further. The full code is on GitHub: https://github.com/mifthulyn07/ComparingTextDocument-TfidfCosim.git
