TF-IDF in Depth Understanding
Implementation in Python
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in text mining, NLP, and machine learning. It measures the importance of a word or term within a document relative to a collection of documents (a.k.a. a corpus).
TF-IDF is a type of text vectorization in which the words or terms within a document are transformed into numeric importance scores. The TF-IDF score of a word is calculated by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF).
TF (Term Frequency):
Term Frequency [tf(t,d)] is the relative frequency of a term (t) of interest within a document (d).
There are multiple ways of calculating term frequency:
- Raw count -> the number of times the word appears in the document
- Boolean frequency -> tf(t,d) = 1 if the word (t) occurs in the document (d), else 0
- Logarithmically scaled -> tf(t,d) = log(1 + raw count)
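- Augmented frequency -> the raw count divided by the raw count of the most frequent term in the document, to prevent bias towards longer documents
- Term frequency -> tf(t,d) = raw count of t / total number of terms in d
- Double normalization -> tf(t,d) = K + (1 - K) * (raw count of t / raw count of the most frequent term in d), commonly with K = 0.5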
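The variants listed above can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the function name `tf_variants` is hypothetical, tokenization is assumed to be a lowercase whitespace split, the natural logarithm is assumed for the log-scaled variant, and K = 0.5 is assumed for double normalization.

```python
import math
from collections import Counter

def tf_variants(term, document):
    """Return several term-frequency variants for `term` in `document`.

    Hypothetical helper for illustration only.
    """
    tokens = document.lower().split()
    counts = Counter(tokens)
    raw = counts[term]  # raw count of the term (0 if absent)
    max_count = max(counts.values()) if counts else 0
    return {
        "raw_count": raw,
        "boolean": 1 if raw > 0 else 0,
        "log_scaled": math.log(1 + raw),
        # relative frequency: raw count divided by document length
        "relative": raw / len(tokens) if tokens else 0.0,
        # double normalization with an assumed weight K = 0.5
        "double_norm_0.5": (0.5 + 0.5 * raw / max_count) if max_count else 0.0,
    }

result = tf_variants("document", "this document is the second document")
print(result)
```

Here "document" occurs twice in a six-word sentence and is also the most frequent word, so the relative frequency is 2/6 and the double-normalized value is 1.0.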
IDF (Inverse Document Frequency)
IDF [idf(t,D)] is a measure of how much information the term(t) provides across all documents(D)
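There are multiple ways of calculating inverse document frequency. The basic form is idf(t,D) = log(N / df(t)), where N is the number of documents in D and df(t) is the number of documents that contain t; smoothed variants add 1 to both counts to avoid division by zero.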
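To make the difference between IDF variants concrete, here is a minimal Python sketch that uses only the standard library. It compares the basic form ln(N / df) with the formula documented for scikit-learn's TfidfTransformer. The helper names (`document_frequency`, `idf_plain`, `idf_sklearn_style`) are hypothetical, and tokenization is assumed to be a lowercase whitespace split.

```python
import math

def document_frequency(term, corpus):
    """Number of documents in `corpus` that contain `term` as a whole token."""
    return sum(1 for doc in corpus if term in doc.lower().split())

def idf_plain(term, corpus):
    """Basic IDF: ln(N / df). Undefined if the term appears in no document."""
    return math.log(len(corpus) / document_frequency(term, corpus))

def idf_sklearn_style(term, corpus, smooth_idf=True):
    """IDF as documented for scikit-learn's TfidfTransformer.

    smooth_idf=True : ln((1 + N) / (1 + df)) + 1
    smooth_idf=False: ln(N / df) + 1
    """
    s = int(smooth_idf)
    n = len(corpus)
    df = document_frequency(term, corpus)
    return math.log((n + s) / (df + s)) + 1

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

print(idf_plain('second', corpus))                            # ln(4 / 1)
print(idf_sklearn_style('second', corpus))                    # ln(5 / 2) + 1
print(idf_sklearn_style('second', corpus, smooth_idf=False))  # ln(4 / 1) + 1
```

A rare word such as "second" (one document out of four) gets a higher IDF than a common word such as "document" (three out of four), which is the behaviour all of these variants share.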
Libraries to calculate Tf-idf
The sklearn library has built-in classes such as TfidfVectorizer, TfidfTransformer, and CountVectorizer for calculating TF-IDF:
CountVectorizer — Converts a collection of text documents to a matrix of token counts
TfidfVectorizer — Convert a collection of raw documents to a matrix of TF-IDF features
TfidfTransformer — Transform a count matrix to a normalized tf-idf representation
TfidfVectorizer Vs. TfidfTransformer Vs. CountVectorizer
TfidfVectorizer is used on sentences or documents, whereas TfidfTransformer is used on an existing count matrix, such as the one returned by CountVectorizer.
With TfidfTransformer, word counts are first computed using CountVectorizer, and then the Inverse Document Frequency (IDF) and TF-IDF scores are computed using TfidfTransformer.
With TfidfVectorizer, all three values are computed at once. This class computes the word counts, IDF, and TF-IDF scores in a single call.
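The relationship between the three classes can be checked directly. The sketch below assumes scikit-learn and NumPy are installed and uses default parameters for all three classes; it compares the two-step route (CountVectorizer, then TfidfTransformer) with the one-step TfidfVectorizer on a small corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

# Two-step route: token counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: counts, IDF, and TF-IDF scores in a single call.
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should produce the same matrix (up to float rounding).
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The comparison only holds when both routes are given matching options (for example the same stop_words, smooth_idf, and norm settings).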
Implementation of CountVectorizer in Python
Steps:
- Convert each word in the corpus to lowercase
- Remove the stop words from the corpus
- Build a unique wordlist (the union of the words in each document)
- Count how often each word occurs in each document
import pandas as pd
from IPython.display import display  # display() is available by default in Jupyter
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
# stop_words is kept for signature parity with sklearn;
# only sklearn's English stop word list is applied here
def CountVectorizer_IL(corpus, stop_words='english'):
    rows = []
    for line in corpus:
        # split each line into lowercase words
        tokens = line.lower().split()
        # remove the stop words and build the unique wordlist
        wordlist = set(word for word in tokens
                       if word not in ENGLISH_STOP_WORDS)
        # count whole-word occurrences of each word in the line
        # (str.count on the raw line would also match substrings)
        term_counter = {word: tokens.count(word) for word in wordlist}
        rows.append(term_counter)
    # one row per document; words missing from a document count as 0
    vector = pd.DataFrame(rows).fillna(0).astype(int)
    vector = vector[sorted(vector.columns)]
    return vector
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
cv = CountVectorizer(stop_words='english')
# fit the vectorizer and build the document-term count matrix
cv_terms = cv.fit_transform(corpus)
df_cv = pd.DataFrame(data=cv_terms.toarray(),
                     columns=cv.get_feature_names_out())
print("CountVectorizer")
display(df_cv)
print("Custom CountVectorizer")
display(CountVectorizer_IL(corpus))
Output:
Implementation of TfidfVectorizer in Python
Steps
- Follow the CountVectorizer steps above for text preprocessing and for building the unique wordlist
- Count the raw frequency of each word in each document
- Calculate the Inverse Document Frequency (IDF) of each word, with smooth_idf either enabled or disabled
- Multiply TF by IDF to get the TF-IDF scores; sklearn can then normalize each row when norm is set to ‘l1’ or ‘l2’ (None leaves the scores unnormalized, which is what the custom version below does)
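The normalization option mentioned in the steps rescales each document's row of TF-IDF scores. Below is a minimal standard-library sketch of what ‘l1’ and ‘l2’ normalization do to a single row. The function name `normalize_row` is hypothetical, and returning an all-zero row unchanged is an assumption made here to avoid dividing by zero.

```python
import math

def normalize_row(row, norm='l2'):
    """Scale one row of scores to unit l1 or l2 length (hypothetical helper)."""
    if norm is None:
        return list(row)
    if norm == 'l1':
        length = sum(abs(x) for x in row)          # sum of absolute values
    elif norm == 'l2':
        length = math.sqrt(sum(x * x for x in row))  # Euclidean length
    else:
        raise ValueError("norm must be 'l1', 'l2', or None")
    if length == 0:
        return list(row)  # assumption: leave all-zero rows untouched
    return [x / length for x in row]

row = [3.0, 4.0]
l2_row = normalize_row(row, 'l2')
l1_row = normalize_row(row, 'l1')
print(l2_row)  # Euclidean length of [3, 4] is 5
print(l1_row)  # absolute values of [3, 4] sum to 7
```

With ‘l2’ the squared scores in a row sum to 1, and with ‘l1’ the absolute scores sum to 1, so documents of different lengths become comparable.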
import numpy as np
import pandas as pd
from IPython.display import display  # display() is available by default in Jupyter
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
# lowercase each line, split it into words, remove the stop words,
# and return the union of the remaining words across the corpus
def TextPreProcessing(corpus, stop_words='english'):
    wordlist = set()
    for line in corpus:
        wordlist = wordlist.union(set([word for word in line.lower().split()
                                       if word not in ENGLISH_STOP_WORDS]))
    return wordlist
# count the frequency (raw count) of each word in each document
def calculateTF(corpus, wordlist, stop_words='english'):
    rows = []
    for line in corpus:
        tokens = line.lower().split()
        # count whole-word occurrences
        # (str.count on the raw line would also match substrings)
        rows.append({word: tokens.count(word) for word in wordlist})
    # one row per document, one column per word
    vector = pd.DataFrame(rows).fillna(0).astype(int)
    vector = vector[sorted(vector.columns)]
    return vector
# calculate the IDF of each word using sklearn's formula:
#   smooth_idf=True : idf = ln((1 + n) / (1 + df)) + 1
#   smooth_idf=False: idf = ln(n / df) + 1
# where n is the number of documents and df is the number of
# documents that contain the word
# smooth_idf -
# Enabling smooth_idf adds 1 to n and to every document frequency,
# as if an extra document contained every term in the collection
# exactly once. It mainly prevents division by zero.
# norm is accepted for signature parity but is not applied here
def calculateIDF(corpus, tf, wordlist, smooth_idf=True, norm=None):
    _idf = int(smooth_idf)
    term_counter = dict()
    for word in wordlist:
        # number of documents in which the word occurs at least once
        doc_freq = int((tf[word] > 0).sum())
        term_counter[word] = np.log((_idf + len(corpus)) / (doc_freq + _idf)) + 1
    # single-row frame: one IDF value per word
    vector = pd.DataFrame([term_counter])
    vector = vector[sorted(vector.columns)]
    return vector
# tfidf = tf * idf
def calculateTFIDF(tf, idf):
    return pd.DataFrame(tf.values * idf.values, columns=tf.columns)

def TfidfVectorizer_custom(corpus, smooth_idf=True, norm=None, stop_words='english'):
    wordlist = TextPreProcessing(corpus, stop_words)
    tf = calculateTF(corpus, wordlist, stop_words)
    idf = calculateIDF(corpus, tf, wordlist, smooth_idf, norm)
    tfidf = calculateTFIDF(tf, idf)
    return tfidf
corpus = ['this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document']
vectorizer = TfidfVectorizer(smooth_idf=False,norm=None,stop_words='english')
vec_model = vectorizer.fit_transform(corpus)
print("TfidfVectorizer")
display(pd.DataFrame(vec_model.toarray(),columns=vectorizer.get_feature_names_out()))
print("Custom TfidfVectorizer")
display(TfidfVectorizer_custom(corpus,smooth_idf=False))
Output:
Implementation of TfidfTransformer in Python
Steps
- Follow the CountVectorizer steps above for text preprocessing and for building the unique wordlist
- Pass the wordlist from step 1 through the TfidfVectorizer steps above to calculate the Inverse Document Frequency (IDF), with smooth_idf either enabled or disabled, and multiply it by the term counts to get the TF-IDF scores
Note that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
corpus = ['this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document']
cntvec = CountVectorizer(stop_words='english')
words = cntvec.fit_transform(corpus)
# print(cntvec.get_feature_names_out())
vectorizer = TfidfTransformer(smooth_idf=False,norm=None)
vec_model = vectorizer.fit_transform(words)
print("TfidfTransformer")
display(pd.DataFrame(vec_model.toarray(),columns=cntvec.get_feature_names_out()))
print("TfidfTransformer_custom")
tf = CountVectorizer_IL(corpus,stop_words='english')
idf = calculateIDF(corpus,tf,tf.columns,smooth_idf=False,norm=None)
display(calculateTFIDF(tf,idf))
Output:
Conclusion:
I’ve explained the concepts behind CountVectorizer, TfidfVectorizer, and TfidfTransformer, and how the TF-IDF calculation is done.