TF-IDF: An In-Depth Understanding

Implementation in Python

Annamalai Swamy
5 min read · Dec 14, 2023

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in text mining, NLP and machine learning. It measures the importance of a word or term within a document relative to a collection of documents (a.k.a. the corpus).

TF-IDF is a type of text vectorization in which the words or terms within a document are transformed into numbers that reflect their importance. The TF-IDF score of a word is calculated by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).

TF (Term Frequency):

Term Frequency [tf(t,d)] is the relative frequency of a term (t) of interest within a document (d).

Term Frequency Formula
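
A common way to write this, using the notation of the Wikipedia article listed under Reference, where f_{t,d} is the raw count of term t in document d (the symbols are borrowed for this sketch and may differ from other texts):

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```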

There are multiple ways of calculating term frequency:

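  • Raw count -> Number of times the word appears in the document
  • Boolean frequency -> tf(t,d) = 1 if the word (t) occurs in the document (d), else 0
  • Logarithmically scaled -> tf(t,d) = log(1 + raw count)
  • Augmented frequency -> tf(t,d) = 0.5 + 0.5 * (raw count / raw count of the most frequent word in the document), to prevent bias towards longer documents
  • Term frequency -> Raw count divided by the total number of words in the document
  • Double normalization -> tf(t,d) = K + (1 - K) * (raw count / raw count of the most frequent word in the document); augmented frequency is the case K = 0.5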
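
The variants listed above can be sketched in a few lines of plain Python. This is only an illustration: the helper name tf_variants, the parameter k and the naive whitespace tokenization are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter


def tf_variants(term, document, k=0.5):
    """Return several term-frequency weightings for `term` in `document`.

    `document` is a plain string; tokens come from a naive lower-case
    whitespace split (an assumption of this sketch).
    """
    tokens = document.lower().split()
    counts = Counter(tokens)
    raw = counts[term]
    max_count = max(counts.values()) if counts else 0
    return {
        "raw_count": raw,
        "boolean": 1 if raw > 0 else 0,
        "log_scaled": math.log(1 + raw),
        # raw count divided by the number of tokens in the document
        "relative": raw / len(tokens) if tokens else 0.0,
        # double normalization K; K = 0.5 is the "augmented" form
        "double_norm_k": k + (1 - k) * raw / max_count if max_count else 0.0,
    }


weights = tf_variants("document", "this document is the second document")
print(weights)
```

Here "document" occurs twice among six tokens, so the raw count is 2 and the relative frequency is 2/6.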

IDF (Inverse Document Frequency)

IDF [idf(t,D)] is a measure of how much information the term (t) provides across all documents (D), that is, how common or rare the term is in the corpus.

Inverse Document Frequency Formula
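
A common way to write the basic form, using the notation of the Wikipedia article listed under Reference, where N is the number of documents in the corpus D and the denominator is the number of documents that contain the term t (the symbols are borrowed for this sketch):

```latex
\mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|}
```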

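There are multiple ways of calculating inverse document frequency, for example the plain logarithmic form log(N / n), where N is the number of documents and n is the number of documents containing the term, and smoothed forms that add 1 to the counts.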
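
A few IDF weightings can be compared side by side with a small sketch. The helper name idf_variants and the whitespace tokenization are assumptions for illustration; the two "sklearn" entries follow the formulas documented for scikit-learn's smooth_idf option.

```python
import math


def idf_variants(term, corpus):
    """Return a few inverse-document-frequency weightings for `term`."""
    docs = [set(doc.lower().split()) for doc in corpus]
    n_docs = len(docs)
    doc_freq = sum(1 for tokens in docs if term in tokens)
    result = {
        # smoothed forms stay finite even when doc_freq is 0
        "smooth": math.log(n_docs / (1 + doc_freq)) + 1,
        "sklearn_smooth": math.log((1 + n_docs) / (1 + doc_freq)) + 1,
    }
    if doc_freq > 0:
        # unsmoothed forms divide by doc_freq, so they need doc_freq > 0
        result["plain"] = math.log(n_docs / doc_freq)
        result["sklearn_no_smooth"] = math.log(n_docs / doc_freq) + 1
    return result


corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

idf_second = idf_variants("second", corpus)
idf_document = idf_variants("document", corpus)
print(idf_second)
print(idf_document)
```

The rarer word ("second", found in one document) receives a higher weight than the more common one ("document", found in three).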

Libraries to calculate Tf-idf

The sklearn library has built-in classes such as TfidfVectorizer, TfidfTransformer and CountVectorizer for calculating tf-idf:

CountVectorizer: converts a collection of text documents to a matrix of token counts

TfidfVectorizer: converts a collection of raw documents to a matrix of TF-IDF features

TfidfTransformer: transforms a count matrix to a normalized tf-idf representation

TfidfVectorizer Vs. TfidfTransformer Vs. CountVectorizer

TfidfVectorizer is used on sentences / documents, whereas TfidfTransformer is used on an existing count matrix, such as the one returned by CountVectorizer.

With TfidfTransformer, word counts are first computed using CountVectorizer, and then the Inverse Document Frequency (IDF) and Tf-idf scores are computed using TfidfTransformer.

With TfidfVectorizer, all three values are computed at once. This class computes the word counts, IDF and Tf-idf scores in a single call.
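
The relationship between the three classes can be checked with a short sketch, assuming scikit-learn and NumPy are installed and default settings are used on both routes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

# Two-step route: token counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same (default) settings.
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should produce the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The comparison only holds when the tokenization options (for example stop_words) and the weighting options (smooth_idf, norm) match on both routes.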

Implementation of CountVectorizer in Python

Steps:

  1. Convert every word in the corpus to lower case
  2. Remove the stopwords from the corpus
  3. Build the unique word list for each document (the union of the words in that document)
  4. Count how many times each word occurs in that document

import pandas as pd
from IPython.display import display  # built in when running in Jupyter
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

def CountVectorizer_IL(corpus, stop_words='english'):
    # 'english' selects sklearn's built-in list; any other value is
    # treated as a custom collection of stop words (or none at all)
    stop_set = ENGLISH_STOP_WORDS if stop_words == 'english' else set(stop_words or [])
    rows = []
    for line in corpus:
        # split each line to words
        # make it lower
        # remove the stopwords
        words = [word for word in line.lower().split()
                 if word not in stop_set]

        # count the frequency of each word occurrence (whole words only)
        term_counter = dict()
        for word in words:
            term_counter[word] = term_counter.get(word, 0) + 1
        rows.append(term_counter)

    # one row per document, one column per unique word
    vector = pd.DataFrame(rows).fillna(0).astype(int)
    vector = vector[sorted(vector.columns)]
    return vector

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first ']

cv = CountVectorizer(stop_words='english')
cv_terms = cv.fit_transform(corpus)
df_cv = pd.DataFrame(data=cv_terms.toarray(),
                     columns=cv.get_feature_names_out())
print("CountVectorizer")
display(df_cv)

print("Custom CountVectorizer")
display(CountVectorizer_IL(corpus))

Output:

Output of dataset using sklearn and custom CountVectorizer
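
The resulting matrix can have far fewer columns than the corpus has distinct words, because everything on the stop word list is dropped. A quick sketch for checking which words of the sample corpus survive, assuming scikit-learn is installed (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first ']

# Every distinct lower-case word in the corpus.
vocabulary = sorted({word for line in corpus for word in line.lower().split()})

# Split the vocabulary by membership in sklearn's English stop word list.
kept = [word for word in vocabulary if word not in ENGLISH_STOP_WORDS]
dropped = [word for word in vocabulary if word in ENGLISH_STOP_WORDS]

print("kept:", kept)
print("dropped:", dropped)
```

Running a check like this before vectorizing helps explain a sparse-looking output, since the list contains more than just articles and pronouns.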

Implementation of TfidfVectorizer in Python

Steps

  1. Follow the steps described for CountVectorizer to preprocess the text and build the unique word list
  2. Count the word frequency (raw count) of each word in each document
  3. Calculate the inverse document frequency (IDF), with smooth_idf either enabled or disabled, and multiply it by the term frequency; normalization can be set to 'l1', 'l2' or None (this example uses norm=None)

import numpy as np
import pandas as pd
from IPython.display import display  # built in when running in Jupyter
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# split each line to words
# make it lower
# remove the stopwords
# make the union of wordlist
def TextPreProcessing(corpus, stop_words='english'):
    # only sklearn's built-in English stop word list is applied here
    wordlist = set()
    for line in corpus:
        wordlist = wordlist.union(set([word for word in line.lower().split()
                                       if word not in ENGLISH_STOP_WORDS]))
    return wordlist

# count the frequency (raw count) of each word occurrence
def calculateTF(corpus, wordlist, stop_words='english'):
    rows = []
    for line in corpus:
        # count whole words only, so that a word is not counted
        # when it merely appears inside a longer word
        tokens = line.lower().split()
        rows.append({word: tokens.count(word) for word in wordlist})
    vector = pd.DataFrame(rows).fillna(0).astype(int)
    vector = vector[sorted(vector.columns)]
    return vector


# calculate IDF for each word using the following:
# smooth_idf=True : idf = log((1 + No. of documents) /
#                             (1 + No. of documents containing the word)) + 1
# smooth_idf=False: idf = log(No. of documents /
#                             No. of documents containing the word) + 1

# smooth_idf -
# The purpose of setting smooth_idf to True is to add '1' to the document
# frequencies, as if an extra document contained every term in the collection
# exactly once. It mainly prevents zero division

def calculateIDF(corpus, tf, wordlist, smooth_idf=True, norm=None):
    # norm is kept in the signature but is not applied in this function
    _idf = int(smooth_idf)
    term_counter = dict()
    for word in wordlist:
        # number of documents in which the word occurs at least once
        doc_freq = int((tf[word] > 0).sum())
        term_counter[word] = np.log((_idf + len(corpus)) / (doc_freq + _idf)) + 1

    vector = pd.DataFrame([term_counter])
    vector = vector[sorted(vector.columns)]
    return vector

# tfidf = tf * idf
def calculateTFIDF(tf, idf):
    # the single idf row is broadcast across every document row of tf
    return pd.DataFrame(tf.to_numpy(dtype=float) * idf.to_numpy(dtype=float),
                        columns=tf.columns)

def TfidfVectorizer_custom(corpus, smooth_idf=True, norm=None, stop_words='english'):
    wordlist = TextPreProcessing(corpus, stop_words)
    tf = calculateTF(corpus, wordlist, stop_words)
    idf = calculateIDF(corpus, tf, wordlist, smooth_idf, norm)
    tfidf = calculateTFIDF(tf, idf)
    return tfidf


corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

vectorizer = TfidfVectorizer(smooth_idf=False, norm=None, stop_words='english')
vec_model = vectorizer.fit_transform(corpus)

print("TfidfVectorizer")
display(pd.DataFrame(vec_model.toarray(), columns=vectorizer.get_feature_names_out()))

print("Custom TfidfVectorizer")
display(TfidfVectorizer_custom(corpus, smooth_idf=False))

Output:

Output of dataset using sklearn and custom TfidfVectorizer
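
The normalization option mentioned in the steps rescales each document's row of tf-idf scores so that documents of different lengths become comparable. A minimal sketch of what 'l1' and 'l2' do to a single row, using made-up scores (the helper name normalize_row is hypothetical and not part of sklearn):

```python
import math


def normalize_row(row, norm="l2"):
    """Scale a list of tf-idf scores to unit l1 or l2 norm."""
    if norm is None:
        return list(row)
    if norm == "l1":
        length = sum(abs(x) for x in row)
    elif norm == "l2":
        length = math.sqrt(sum(x * x for x in row))
    else:
        raise ValueError("norm must be 'l1', 'l2' or None")
    # an all-zero row (a document with no remaining terms) is left unchanged
    if length == 0:
        return list(row)
    return [x / length for x in row]


row = [3.0, 4.0]  # made-up scores, for illustration only
l1_row = normalize_row(row, "l1")
l2_row = normalize_row(row, "l2")
print(l1_row)
print(l2_row)
```

With 'l1' the absolute values in a row sum to 1; with 'l2' the squares of the values sum to 1.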

Implementation of TfidfTransformer in Python

Steps

  1. Follow the steps described for CountVectorizer to preprocess the text, build the unique word list and obtain the count matrix
  2. Pass the count matrix and word list from step 1 to the IDF and TF-IDF steps described for TfidfVectorizer, with smooth_idf either enabled or disabled and normalization set to 'l1', 'l2' or None

Please note that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.

import pandas as pd
from IPython.display import display  # built in when running in Jupyter
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']


cntvec = CountVectorizer(stop_words='english')
words = cntvec.fit_transform(corpus)

# print(cntvec.get_feature_names_out())
vectorizer = TfidfTransformer(smooth_idf=False, norm=None)
vec_model = vectorizer.fit_transform(words)

print("TfidfTransformer")
display(pd.DataFrame(vec_model.toarray(), columns=cntvec.get_feature_names_out()))

# CountVectorizer_IL, calculateIDF and calculateTFIDF are the custom
# functions defined in the previous sections
print("TfidfTransformer_custom")
tf = CountVectorizer_IL(corpus, stop_words='english')
idf = calculateIDF(corpus, tf, tf.columns, smooth_idf=False, norm=None)
display(calculateTFIDF(tf, idf))

Output:

Output of dataset using sklearn and custom TfidfTransformer

Conclusion:

I’ve explained the concepts and working details behind CountVectorizer, TfidfVectorizer and TfidfTransformer, and how the TF-IDF calculation is done.

Reference:

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Annamalai Swamy

Senior Architect with Expertise in Automation / AI & ML