TF-IDF: An In-Depth Understanding

Implementation in Python

Dec 14, 2023

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in text mining, NLP and machine learning. It measures the importance of a word or term within a document relative to a collection of documents (a.k.a. a corpus).

TF-IDF is a type of text vectorization in which the words or terms within a document are transformed into numbers that reflect their importance. The TF-IDF score of a word is calculated by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).
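As a rough illustration of that multiplication, here is a minimal sketch in plain Python. The toy corpus and the helper names (tf, idf, tf_idf) are made up for this example, and it assumes the simplest textbook definitions: relative frequency for TF and log(N / number of documents containing the term) for IDF. It also assumes the term occurs in at least one document.

```python
import math

# Toy corpus (hypothetical data): each document is already split into words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "far"],
]

def tf(term, doc):
    # Relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    # The TF-IDF score is simply the product of the two parts.
    return tf(term, doc) * idf(term, docs)

# "the" occurs in every document, so its idf is log(1) = 0.
print(tf_idf("the", docs[0], docs))
# "dog" occurs in a single document, so it gets a higher score.
print(tf_idf("dog", docs[1], docs))
```

A word that appears everywhere ends up with a score of zero under this definition, while a word concentrated in one document stands out.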

TF (Term Frequency):

Term Frequency [tf(t,d)] is the relative frequency of a term (t) of interest within a document (d).

There are multiple ways of calculating term frequency:

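Raw count -> Number of times the word (t) appears in the document (d)
Boolean frequency -> tf(t,d) = 1 if the word (t) occurs in the document (d), else 0
Logarithmically scaled -> tf(t,d) = log(1 + raw count)
Augmented frequency -> tf(t,d) = 0.5 + 0.5 * (raw count / highest raw count of any word in the document), which prevents a bias towards longer documents
Term frequency -> tf(t,d) = raw count / total number of words in the document
Double normalization K -> tf(t,d) = K + (1 - K) * (raw count / highest raw count of any word in the document); augmented frequency is the case K = 0.5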
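The variants can be compared side by side with a small sketch. The function name tf_variants is hypothetical, and the "term_frequency" and "augmented" entries assume the commonly cited definitions (raw count divided by document length, and 0.5 + 0.5 * raw count / highest raw count in the document). The sketch also assumes a non-empty list of tokens.

```python
import math
from collections import Counter

def tf_variants(term, tokens):
    """Return several term-frequency weightings for one term in one document."""
    counts = Counter(tokens)
    raw = counts[term]                 # 0 if the term is absent
    max_count = max(counts.values())   # assumes tokens is not empty
    return {
        "raw_count": raw,
        "boolean": 1 if raw > 0 else 0,
        "term_frequency": raw / len(tokens),
        "log_scaled": math.log(1 + raw),
        "augmented": 0.5 + 0.5 * raw / max_count,
    }

tokens = "this document is the second document".split()
for name, value in tf_variants("document", tokens).items():
    print(name, value)
```

Here "document" occurs twice among six words, so its raw count is 2, its relative frequency is 2/6, and its augmented frequency is 1.0 because it is the most frequent word in the document.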

IDF (Inverse Document Frequency)

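IDF [idf(t,D)] is a measure of how much information the term (t) provides across all documents (D). A term that appears in only a few documents gets a high IDF, while a term that appears in most documents gets a low one.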

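There are multiple ways of calculating inverse document frequency. The most common is idf(t,D) = log(N / n_t), where N is the number of documents in the corpus and n_t is the number of documents that contain the term (t).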
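To make the variants concrete, here is a small sketch of two of them: the plain log(N / df) form, and the form documented for scikit-learn's TfidfTransformer, which adds 1 to the result and optionally smooths the counts. The helper names (document_frequency, idf_plain, idf_sklearn) and the toy documents are assumptions made for this example.

```python
import math

def document_frequency(term, docs):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

def idf_plain(term, docs):
    # log(N / df); raises ZeroDivisionError if no document contains the term.
    return math.log(len(docs) / document_frequency(term, docs))

def idf_sklearn(term, docs, smooth_idf=True):
    # ln((s + N) / (s + df)) + 1, where s is 1 when smoothing is on, else 0.
    s = 1 if smooth_idf else 0
    n = len(docs) + s
    df = document_frequency(term, docs) + s
    return math.log(n / df) + 1

# Toy documents (hypothetical), each reduced to its set of words.
docs = [
    {"first", "document"},
    {"second", "document"},
    {"third", "one"},
    {"first", "document"},
]

print(idf_plain("document", docs))                       # ln(4/3)
print(idf_sklearn("document", docs, smooth_idf=False))   # ln(4/3) + 1
print(idf_sklearn("document", docs, smooth_idf=True))    # ln(5/4) + 1
```

With smoothing enabled, a term that appears in no document still gets a finite value, because the formula behaves as if one extra document contained every term once.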

Libraries to calculate Tf-idf

The sklearn library has built-in classes such as TfidfVectorizer, TfidfTransformer and CountVectorizer to calculate tf-idf:

CountVectorizer — Converts a collection of text documents to a matrix of token counts

TfidfVectorizer — Convert a collection of raw documents to a matrix of TF-IDF features

TfidfTransformer — Transform a count matrix to a normalized tf-idf representation

TfidfVectorizer Vs. TfidfTransformer Vs. CountVectorizer

TfidfVectorizer is used on raw sentences / documents, whereas TfidfTransformer is used on an existing count matrix, such as the one returned by CountVectorizer.

With TfidfTransformer, word counts are first computed using CountVectorizer, and then the Inverse Document Frequency (IDF) and tf-idf scores are computed using TfidfTransformer.

With TfidfVectorizer, all three values are computed at once. This class computes the word counts, IDF and tf-idf scores in a single call.
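The relationship between the three classes can be checked with a short sketch. It assumes scikit-learn and NumPy are installed and leaves every parameter at its default, so no stop words are removed here.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Two-step route: token counts first, then tf-idf weighting of the counts.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text straight to tf-idf features.
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is handy when the raw counts are needed for something else as well; otherwise the one-step class is shorter.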

Implementation of CountVectorizer in Python

Steps:

Convert every word in the corpus to lower case
Remove the stop words from the corpus
Build the list of unique words (the union of the words in each document)
Count how many times each word occurs in each document

import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def CountVectorizer_IL(corpus, stop_words='english'):
    # use scikit-learn's built-in English stop word list when requested
    stops = ENGLISH_STOP_WORDS if stop_words == 'english' else set(stop_words or [])
    rows = []
    for line in corpus:
        # lower-case the line, split it into words and remove the stop words
        tokens = [word for word in line.lower().split() if word not in stops]

        # count the frequency of each word occurrence
        # (whole tokens are counted; str.count would also match substrings)
        term_counter = dict()
        for word in tokens:
            term_counter[word] = term_counter.get(word, 0) + 1
        rows.append(term_counter)

    # one row per document, one column per word; absent words get a count of 0
    vector = pd.DataFrame(rows).fillna(0).astype(int)
    vector = vector[sorted(vector.columns)]
    return vector

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

cv = CountVectorizer(stop_words='english')
cv_terms = cv.fit_transform(corpus)  # fit the vocabulary and count the words
df_cv = pd.DataFrame(data=cv_terms.toarray(),
                     columns=cv.get_feature_names_out())
print("CountVectorizer")
display(df_cv)

print("Custom CountVectorizer")
display(CountVectorizer_IL(corpus))

Implementation of TfidfVectorizer in Python

Steps

Follow the steps mentioned for CountVectorizer to preprocess the text and build the unique wordlist
Count the word frequency (raw count) of each word in each document
Calculate the inverse document frequency (IDF) of each word, with smooth_idf either enabled or disabled
Multiply TF by IDF, and optionally normalize each document's scores by setting norm to 'l1', 'l2' or None
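The normalization option in these steps can be sketched on its own. The helper below is hypothetical and works on one document's row of tf-idf scores: 'l1' rescales the row so that its absolute values sum to 1, 'l2' rescales it so that its squares sum to 1, and anything else leaves the row unchanged.

```python
import math

def normalize(row, norm="l2"):
    """Rescale one document's tf-idf scores; an all-zero row is left as is."""
    if norm == "l1":
        scale = sum(abs(x) for x in row)
    elif norm == "l2":
        scale = math.sqrt(sum(x * x for x in row))
    else:
        # norm=None (or any other value): no normalization
        return list(row)
    if scale == 0:
        return list(row)
    return [x / scale for x in row]

row = [3.0, 4.0]
print(normalize(row, "l1"))   # each score divided by 7
print(normalize(row, "l2"))   # each score divided by 5
print(normalize(row, None))   # unchanged
```

Normalizing each row makes documents of different lengths comparable, since a long document no longer gets larger scores just because it contains more words.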

import numpy as np
import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# lower-case each line, split it into words, remove the stop words
# and collect the union of the remaining words across the corpus
def TextPreProcessing(corpus, stop_words='english'):
    stops = ENGLISH_STOP_WORDS if stop_words == 'english' else set(stop_words or [])
    wordlist = set()
    for line in corpus:
        wordlist |= {word for word in line.lower().split() if word not in stops}
    return wordlist

# count the frequency (raw count) of each word in each document
def calculateTF(corpus, wordlist, stop_words='english'):
    rows = []
    for line in corpus:
        # compare whole tokens; str.count would also match substrings
        tokens = line.lower().split()
        rows.append({word: tokens.count(word) for word in wordlist})
    # one row per document, one column per word, columns in sorted order
    vector = pd.DataFrame(rows, columns=sorted(wordlist)).astype(int)
    return vector


# calculate the IDF of each word using the following:
#     idf(t) = ln((s + N) / (s + df(t))) + 1
# where N is the number of documents, df(t) is the number of documents
# that contain the word t, and s is 1 if smooth_idf is enabled, else 0

# smooth_idf -
#      Enabling smooth_idf adds 1 to every document frequency, as if an
# extra document contained every term in the collection exactly once.
# It mainly prevents division by zero.

def calculateIDF(corpus, tf, wordlist, smooth_idf=True, norm=None):
    # norm is accepted for parity with scikit-learn but is not applied here
    _idf = int(smooth_idf)
    term_counter = dict()
    for word in wordlist:
        doc_freq = int((tf[word] > 0).sum())  # documents containing the word
        term_counter[word] = np.log((_idf + len(corpus)) / (doc_freq + _idf)) + 1

    # a single row holding the IDF of every word, columns in sorted order
    vector = pd.DataFrame([term_counter], columns=sorted(wordlist))
    return vector

# tfidf = tf * idf (the single IDF row is broadcast across every document row)
def calculateTFIDF(tf, idf):
    idf_row = idf[tf.columns].to_numpy(dtype=float)  # align columns with tf
    return pd.DataFrame(tf.to_numpy(dtype=float) * idf_row, columns=tf.columns)

def TfidfVectorizer_custom(corpus, smooth_idf=True, norm=None, stop_words='english'):
    wordlist = TextPreProcessing(corpus, stop_words)
    tf = calculateTF(corpus, wordlist, stop_words)
    idf = calculateIDF(corpus, tf, wordlist, smooth_idf, norm)
    tfidf = calculateTFIDF(tf, idf)
    return tfidf


corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

vectorizer = TfidfVectorizer(smooth_idf=False, norm=None, stop_words='english')
vec_model = vectorizer.fit_transform(corpus)

print("TfidfVectorizer")
display(pd.DataFrame(vec_model.toarray(), columns=vectorizer.get_feature_names_out()))

print("Custom TfidfVectorizer")
display(TfidfVectorizer_custom(corpus, smooth_idf=False))

Implementation of TfidfTransformer in Python

Steps

Follow the steps mentioned for CountVectorizer to preprocess the text, build the unique wordlist and count the words
Pass the resulting count matrix to the IDF and TF-IDF steps described for TfidfVectorizer, with smooth_idf either enabled or disabled and norm set to 'l1', 'l2' or None

Please note that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.

import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']


cntvec = CountVectorizer(stop_words='english')
words = cntvec.fit_transform(corpus)

# print(cntvec.get_feature_names_out())
vectorizer = TfidfTransformer(smooth_idf=False,norm=None)
vec_model =  vectorizer.fit_transform(words)

print("TfidfTransformer")
display(pd.DataFrame(vec_model.toarray(),columns=cntvec.get_feature_names_out()))

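# CountVectorizer_IL, calculateIDF and calculateTFIDF are the custom
# functions defined in the previous sections
print("TfidfTransformer_custom")
tf = CountVectorizer_IL(corpus, stop_words='english')
idf = calculateIDF(corpus, tf, tf.columns, smooth_idf=False, norm=None)
display(calculateTFIDF(tf, idf))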