Comparing Text Documents Using TF-IDF and Cosine Similarity in Python

Miftahul Ulyana Hutabarat
6 min read · Dec 17, 2023


How can we know whether two text documents are similar? Humans can spot the differences, but how can a computer tell whether two text documents are exactly the same or not alike at all? We can figure it out using TF-IDF and cosine similarity.

The Explanation of TF-IDF and Cosine Similarity

TF-IDF (Term Frequency–Inverse Document Frequency) is a method used to assess how important a word is in a document relative to a collection of documents. The TF-IDF score is obtained by multiplying Term Frequency (TF) and Inverse Document Frequency (IDF). Here is the formula:
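In its usual form, the weight of a term t in a document d is

TF-IDF(t, d) = TF(t, d) × IDF(t), with IDF(t) = log(N / DF(t))

where TF(t, d) is how often t appears in d, N is the total number of documents, and DF(t) is the number of documents that contain t. (scikit-learn uses a slightly smoothed variant of IDF, shown later in this article.)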

Cosine Similarity, in turn, is a method used to measure how similar two text documents are to each other. Its value ranges from 0 to 1; the closer it is to 1, the more similar the documents. Here is the formula:
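For two TF-IDF vectors A and B,

cosine similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

that is, the dot product of the two vectors divided by the product of their (Euclidean) lengths. Because TF-IDF weights are never negative, the result always falls between 0 and 1.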

Steps for Using TF-IDF and Cosine Similarity

In natural language processing (NLP), preprocessing is the first step: it cleans and simplifies the text so it can be processed more effectively by the algorithms that follow. Here, the aim is to give the TF-IDF and cosine similarity calculations a more organized and relevant text representation to work with. Here are the steps:

Below, I’ll explain each preprocessing step one by one. I use these five steps, but depending on what’s needed you might use only three or four, and the order can also be adjusted to suit your case.

  1. Lowercasing: Changing all letters to lowercase for consistency and uniformity in analysis. For example, “Data Analysis” becomes “data analysis.”
  2. Cleaning: Removing irrelevant or disruptive characters in the text. For instance, “Hello! @World! :)” becomes “Hello World.”
  3. Tokenization: Breaking the text into separate tokens, where each word becomes a unit that can be processed. For example, “This process is very interesting.” becomes [“This”, “process”, “is”, “very”, “interesting”].
  4. Stopwords: Eliminating common words that don’t contribute significantly to the analysis. For instance, “This is a very interesting process.” becomes [“process”, “interesting”].
  5. Stemming: Changing words to their base form to treat words with the same root as the same entity. For example, “Interestingly, I am learning analysis.” becomes [“interest”, “I”, “am”, “learn”, “analysis”].

After we clean up the text, we calculate TF-IDF to see how important each word is in a document. Then we use cosine similarity to figure out how similar two documents are.

Implementation in Python

To run Python scripts, I use Jupyter Notebook (Anaconda3). Since the data I have is in CSV format, I use pandas to read it, as shown in the example below.

import multiprocessing as mp
import numpy as np
import pandas as pd
import nltk

# NLTK resources: 'punkt' is needed by word_tokenize, 'stopwords' by the stop-word filter
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

data = pd.read_csv('data.csv', delimiter=';', encoding='latin')
data

Here is the output:
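The exact rows depend on your own CSV; the only thing the functions below rely on is a DOCUMENT column containing the raw text of each document. If you don’t have the original file, a small inline corpus (hypothetical example data, not the data used in this article) works just as well:

# Hypothetical stand-in for data.csv: any DataFrame with a DOCUMENT column will do
data = pd.DataFrame({
    'DOCUMENT': [
        "Data analysis is a very interesting process.",
        "I am learning text mining with Python.",
        "Cosine similarity compares two documents.",
        "TF-IDF and cosine similarity are common in text analysis."
    ]
})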

The first step is to implement preprocessing, which involves the five steps described above: lowercasing, cleaning, tokenization, stop-word removal, and stemming. Here is the code:

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_text(text):
    # Lowercasing
    lowercased_text = text.lower()

    # Cleaning: remove punctuation and surrounding whitespace
    remove_punctuation = re.sub(r'[^\w\s]', '', lowercased_text)
    remove_white_space = remove_punctuation.strip()

    # Tokenization = breaking down each sentence into an array of words
    tokenized_text = word_tokenize(remove_white_space)

    # Stop words / filtering = removing irrelevant words
    stop_words = set(stopwords.words('english'))
    stopwords_removed = [word for word in tokenized_text if word not in stop_words]

    # Stemming = transforming words into their base form
    ps = PorterStemmer()
    stemmed_text = [ps.stem(word) for word in stopwords_removed]

    # Putting all the results into a one-row DataFrame
    df = pd.DataFrame({
        'DOCUMENT': [text],
        'LOWERCASE': [lowercased_text],
        'CLEANING': [remove_white_space],
        'TOKENIZATION': [tokenized_text],
        'STOP-WORDS': [stopwords_removed],
        'STEMMING': [stemmed_text]
    })

    return df

def preprocessing(corpus):
    # Create an empty DataFrame
    df = pd.DataFrame(columns=['DOCUMENT'])

    # Run preprocessing on each document, one by one
    for doc in corpus['DOCUMENT']:
        # Call the preprocess_text function
        result_df = preprocess_text(doc)

        # Concatenate the result of preprocessing to the main DataFrame
        df = pd.concat([df, result_df], ignore_index=True)

    return df

result_preprocessing = preprocessing(data)
result_preprocessing

Here is the output:

After preprocessing, we will proceed with the TF-IDF calculation. Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer

def calculate_tfidf(corpus):
    # Call the preprocessing result
    df = preprocessing(corpus)

    # Join each stemmed-token array back into a single sentence string
    stemming = df['STEMMING'].apply(' '.join)

    # Count TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(stemming)

    # Get the vocabulary words to use as column headers
    feature_names = vectorizer.get_feature_names_out()

    # Combine header titles and weights
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf = pd.concat([df, df_tfidf], axis=1)

    return df_tfidf

result_tfidf = calculate_tfidf(result_preprocessing)
result_tfidf

Here is the output:

Each word is assigned a weight. In the TF part, a word is weighted by how many times it appears in a document: the more often it appears, the higher its weight. In the IDF part, the weight reflects how common the word is across all documents in the collection: the more common it is, the lower its weight.
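As a quick sanity check (a toy example, not part of the data used in this article), you can see the IDF effect directly with TfidfVectorizer: a word that appears in every document ends up with a lower weight than words that appear in only one.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "movie" appears in every document, the other words in only one each
toy_corpus = ["good movie", "bad movie", "boring movie"]
vec = TfidfVectorizer()
weights = pd.DataFrame(vec.fit_transform(toy_corpus).toarray(),
                       columns=vec.get_feature_names_out())
print(weights.round(3))
# In every row, "movie" gets the smallest weight because it is common to all documents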

For your information, scikit-learn computes TF-IDF using the following formula.
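With TfidfVectorizer’s default settings (smooth_idf=True, norm='l2'), the weights are

IDF(t) = ln((1 + N) / (1 + DF(t))) + 1
TF-IDF(t, d) = TF(t, d) × IDF(t)

and each document’s vector is then normalized to unit length, which is why the values in the table above all lie between 0 and 1.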

Next, we calculate cosine similarity, using the document at index 0 as the reference and comparing it against every document. Here is the code:

from sklearn.metrics.pairwise import cosine_similarity

def cosineSimilarity(corpus):
    # Call the TF-IDF result
    df_tfidf = calculate_tfidf(corpus)

    # The first 6 columns hold the preprocessing results; TF-IDF weights start at column 6
    # Get the TF-IDF vector for the first document (index 0)
    vector1 = df_tfidf.iloc[0, 6:].values.astype(float).reshape(1, -1)

    # Get the TF-IDF vectors for all documents (including the first)
    vectors = df_tfidf.iloc[:, 6:].values.astype(float)

    # Calculate cosine similarity between the first document and every document
    cosim = cosine_similarity(vector1, vectors)

    # Flatten the result into a one-dimensional array, then wrap it in a DataFrame
    cosim = cosim.flatten()
    df_cosim = pd.DataFrame(cosim, columns=['COSIM'])

    # Combine the TF-IDF DataFrame with the cosine similarity result
    df_cosim = pd.concat([df_tfidf, df_cosim], axis=1)

    return df_cosim

cosim_result = cosineSimilarity(result_tfidf)
cosim_result

Here is the output in the COSIM column:

The COSIM column shows that, among the other documents, document 3 is the most similar to document 0, with a cosine similarity of 0.104634.
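If you prefer to pick out the most similar document programmatically instead of scanning the table, a small follow-up on the DataFrame above does the job (index 0 is dropped first, since a document is always perfectly similar to itself):

# Drop the reference document (index 0), then take the index with the highest similarity
most_similar = cosim_result['COSIM'].drop(0).idxmax()
print(f"Most similar to document 0: index {most_similar} "
      f"(COSIM = {cosim_result['COSIM'][most_similar]:.6f})")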

Thank you, friends, for reading. This case is part of the research for my final project at university. I hope it helps you explore TF-IDF and cosine similarity further. The full code is on GitHub: https://github.com/mifthulyn07/ComparingTextDocument-TfidfCosim.git
