TFIDF custom implementation

Jul 4, 2023

What is TF-IDF?

We have previously written an exhaustive blog explaining the TF-IDF concept. Please go through that post at https://www.technoskool.com/post/term-frequency-inverse-document-frequency-tf-idf before reading the Python implementation below.

How to implement it?

TF-IDF consists of two terms:

Term Frequency: It measures how frequently a particular term occurs in a document. Documents differ in length, so a term may occur many more times in a longer document. To normalize for this, we divide the term's count by the length of the document.

TF = (number of times a word appears in a document) / (total number of words in the document)
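
As a small illustration of the formula above, the sketch below computes term frequencies for a single document. The helper name `term_frequency` is hypothetical and is not part of the implementation later in this post; whitespace tokenization is assumed.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: count / total words} for a whitespace-tokenized document."""
    tokens = document.split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

tf = term_frequency("this document is the second document")
# "document" occurs 2 times out of 6 words
print(tf["document"])
```

The term frequencies of one document always sum to 1, which is what makes documents of different lengths comparable.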

Inverse Document Frequency: It measures how important a term is. It weighs down terms that appear in many documents of the corpus and scales up terms that are rare.

IDF = log(total number of documents / number of documents containing the word)
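
The plain IDF formula can be sketched in a few lines. The function name `inverse_document_frequency` is hypothetical, whitespace tokenization is assumed, and the natural logarithm is used.

```python
import math

def inverse_document_frequency(corpus):
    """Return {word: log(N / df)} using the plain (unsmoothed) formula."""
    # One set of words per document, so each document counts a word once.
    tokenized = [set(doc.split()) for doc in corpus]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        word: math.log(n_docs / sum(word in doc for doc in tokenized))
        for word in vocabulary
    }

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
idf = inverse_document_frequency(corpus)
print(idf["this"])    # in all 4 documents, so log(4/4) = 0.0
print(idf["second"])  # in 1 document, so log(4/1), about 1.386
```

A word that occurs in every document gets an IDF of exactly zero here, so it would vanish from the TF-IDF output entirely.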

scikit-learn adjusts this recipe to make it more robust:

The vocabulary is kept in alphabetically sorted order
IDF = 1 + log((1 + total number of documents) / (1 + number of documents containing the word))
L2 normalization is applied to each row of the output
The final output is a sparse matrix
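
The adjustments listed above can be traced by hand for the first document of the sample corpus. The sketch below is a plain-Python illustration, not scikit-learn's code. It assumes whitespace tokenization and raw counts as the term frequency, and the variable names are chosen for illustration only.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Sorted vocabulary, mirroring the alphabetical ordering described above.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Smoothed IDF: 1 + ln((1 + N) / (1 + df)).
idf = {
    word: 1 + math.log((1 + n_docs) / (1 + sum(word in doc for doc in tokenized)))
    for word in vocabulary
}

# Raw count times IDF for the first document, then L2-normalize the row.
counts = Counter(tokenized[0])
row = [counts[word] * idf[word] for word in vocabulary]
norm = math.sqrt(sum(value * value for value in row))
row = [value / norm for value in row]

# Print only the non-zero entries of the row.
for word, value in zip(vocabulary, row):
    if value:
        print(f"{word}: {value:.4f}")
```

Because each row is L2-normalized, dividing the counts by the document length first would scale every entry of the row by the same factor and leave the normalized values unchanged.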

Implementation:

corpus = [  'this is the first document',  'this document is the second document',  'and this is the third one',  'is this the first document', ]

from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
import numpy as np

# fit() takes the corpus and returns a dictionary mapping each word to its idf
def fit(corpus):
    # split every document once; a set per document makes the
    # "does this document contain the word" test fast
    tokenized = [set(row.split()) for row in corpus]

    # words is a set which contains all the unique words of the corpus
    words = set()
    for tokens in tokenized:
        words.update(tokens)

    # sorting the words so that the columns follow alphabetical order
    sorted_words = sorted(words)

    # defining the idf dict
    idf_dict = {}

    for word in sorted_words:
        # counting the number of documents in the corpus which contain the word
        count = sum(1 for tokens in tokenized if word in tokens)
        # calculating idf according to the smoothed formula mentioned above
        idf = 1 + np.log((1 + len(corpus)) / (1 + count))
        # adding idf to the dictionary
        idf_dict[word] = idf
    # returning idf_dict
    return idf_dict

def transform(corpus, idf_dict):
    # three lists which will hold the row indices, column indices and
    # values of the sparse matrix
    rows = []
    columns = []
    values = []
    # map each word to its column index once, instead of searching the
    # list of keys again for every word
    column_of = {word: i for i, word in enumerate(idf_dict)}
    # going through the documents of the corpus one by one
    for index, row in enumerate(tqdm(corpus)):
        tokens = row.split()
        # Counter maps each word to its frequency, {word: frequency}
        word_frequency = Counter(tokens)
        # for every unique word in the document
        for word, frequency in word_frequency.items():
            # words that were not seen during fit have no idf, so skip them
            if word not in idf_dict:
                continue
            # term frequency (count / document length) times idf
            tf_idf = idf_dict[word] * (frequency / len(tokens))
            rows.append(index)
            columns.append(column_of[word])
            values.append(tf_idf)

    # creating the csr_matrix
    matrix = csr_matrix((values, (rows, columns)), shape=(len(corpus), len(idf_dict)))
    # L2-normalizing each row of the csr_matrix
    matrix = normalize(matrix, copy=False)

    return matrix

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
# calling the fit function
idf_dict = fit(corpus)
# calling the transform function
cust_output = transform(corpus, idf_dict)
print(cust_output[0])
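
To check the custom output, one option is to run scikit-learn's `TfidfVectorizer` with its default settings on the same corpus and compare the printed values. The sketch below is a separate, self-contained snippet; the variable names `sk_output` and `sk_vocabulary` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

vectorizer = TfidfVectorizer()
sk_output = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index; list the words in column order.
sk_vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(sk_vocabulary)

# idf_ holds one smoothed idf value per column.
print(vectorizer.idf_)

# First row of the sparse TF-IDF matrix, to compare with the custom output.
print(sk_output[0])
```

The default tokenizer of `TfidfVectorizer` lowercases the text and drops single-character tokens, so the two outputs are expected to agree on this corpus but may differ on text where whitespace splitting gives different tokens.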

Conclusion:

In this post we implemented TF-IDF in Python in a way that mirrors the implementation in scikit-learn. Stay tuned for the next blog, where we will learn how to implement the max_features functionality of scikit-learn's TF-IDF implementation.

Written by Rajneesh Jha