Understand TF-IDF by building it from scratch.

Gursewak Singh
Published in Analytics Vidhya · 4 min read · Mar 6, 2021

One of the first algorithms that people learn in the machine learning domain is TF-IDF vectorization. TF-IDF stands for Term Frequency and Inverse Document Frequency.

Here we will see exactly how TF-IDF works and compare our results to the sklearn library. We will not go into much detail about how it compares to other algorithms.

What is TF-IDF?

TF-IDF is a statistical measure of how relevant a word is to a document within a collection of documents.

It consists of two parts:
1. Term Frequency (TF): how many times a word appears in a document.
2. Inverse Document Frequency (IDF): how rare the word is across the set of documents.
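
Multiplying the two gives the score of a term t in a document d: tf-idf(t, d) = tf(t, d) × idf(t).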

Let's take an example that we will carry throughout the blog.

import pandas as pd

corpus = [('Document 1', 'Alot of people like to play football'),
          ('Document 2', 'many like to eat'),
          ('Document 3', 'According to data, many like to sing')]
data = pd.DataFrame(corpus, columns=['Document Number', 'text of Documents'])
sample data for TF-IDF

What is Term Frequency(TF)?

To calculate the term frequency, let's first find the unique words and their counts in each of the documents.
There are various ways to calculate the term frequency (a quick sketch of each follows the list):

  • Raw Count: tf(t,d) = f(t,d)
  • Boolean Frequency: tf(t,d) = 1 if t occurs in d and 0 otherwise.
  • Term frequency adjusted for document length: tf(t,d) = f(t,d) ÷ (number of words in d)
  • Logarithmically scaled frequency: tf(t,d) = log (1 + f(t,d))
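
As a minimal sketch of these variants for a single document (the document and term below are illustrative, not part of the corpus used later):

import math

doc = "a lot of people like to play football".split()
term = "like"

raw_count = doc.count(term)              # tf(t,d) = f(t,d)
boolean_tf = 1 if term in doc else 0     # 1 if t occurs in d, else 0
length_adjusted = raw_count / len(doc)   # f(t,d) / (number of words in d)
log_scaled = math.log(1 + raw_count)     # log(1 + f(t,d))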

Generally, the raw count (count vector) is used for calculating the Term Frequency.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text of Documents'])
# on sklearn versions older than 1.0, use get_feature_names() instead
cols = vectorizer.get_feature_names_out()
count = pd.DataFrame(X.toarray(), columns=cols)
Count of each word in the document.
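
With the corpus above, this produces the following count matrix (one row per document, columns in alphabetical order):

   according  alot  data  eat  football  like  many  of  people  play  sing  to
0          0     1     0    0         1     1     0   1       1     1     0   1
1          0     0     0    1         0     1     1   0       0     0     0   1
2          1     0     1    0         0     1     1   0       0     0     1   2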

What is Inverse Document Frequency (IDF)?

Term frequency tends to give more weight to common words, so we need something to reduce this effect: common words like "the", "is", and "am" are not very useful for determining what makes a document unique. If we have two documents, one on sports and one on a medical topic, rarer words like "football" and "hypertension" will do a much better job of distinguishing the documents.

Zipf's law states that within a corpus of documents, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.

IDF is incorporated to diminish the weight of terms that occur very frequently in the document set and increase the weight of terms that occur rarely.
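
With smoothing enabled (sklearn's default, smooth_idf=True), the inverse document frequency of a term t is computed as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The code below mirrors this: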

import numpy as np

# df is the document frequency (in how many documents each word occurs)
df = np.array(count.astype(bool).sum())
# df = [1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 3]

# number of documents
n_samples = len(data)
# n_samples = 3

# smooth_idf adds 1 to both counts, as if one extra document contained
# every term once; it also prevents division by zero
smooth_idf = True
df += int(smooth_idf)
n_samples += int(smooth_idf)

idf = np.log(n_samples / df) + 1
# idf ≈ [1.693, 1.693, 1.693, 1.693, 1.693, 1.0, 1.288, 1.693, 1.693, 1.693, 1.693, 1.0]
IDF for each word

Why do we use the log here? The log dampens the importance of terms with a high ratio of n to df: the larger that ratio is for a word, the more importance is given to it, and the log keeps this growth in check. For example, a word that appears in only 1 of 1,000 documents would get a weight of about 1,000 without the log, but roughly ln(1000) ≈ 6.9 with it.

TF*IDF before normalization

This is calculated by multiplying each row of the (3 × 12) count matrix element-wise by the (12,) IDF vector.

df_before_normalization = count*idf
TF-IDF before normalization

TF*IDF after normalization
Since the sklearn library also normalizes the results, we do the same here. L2 normalization is applied within each document, i.e. each row is divided by its L2 norm.

L2-norm formula: ‖x‖₂ = sqrt(x₁² + x₂² + … + xₙ²)
from math import sqrt

# For Document 1: the L2 norm of its un-normalized TF-IDF row
doc1_norm = sqrt(0.0**2 + 1.69315**2 + 0.0**2 + 0.0**2 + 1.69315**2 + 1.0**2
                 + 0.0**2 + 1.69315**2 + 1.69315**2 + 1.69315**2 + 0.0**2 + 1.0**2)
# doc1_norm = 4.041507
# Dividing each value in Document 1's row by 4.041507 gives:
# 0.0, 0.419, 0.0, 0.0, 0.419, 0.247, 0.0, 0.419, 0.419, 0.419, 0.0, 0.247
# Do the same for Document 2 and Document 3.

# If you want to do this using Python:
# method 1
from sklearn.preprocessing import normalize
tf_idf = normalize(df_before_normalization, norm='l2', axis=1)

# method 2
from numpy.linalg import norm
for idx, row in df_before_normalization.iterrows():
    print(row / norm(row))
TF-IDF after normalization

Sklearn Library:

Now, let's compare the same with the library version.
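
A minimal sketch of the library version (variable names here are illustrative), using sklearn's TfidfVectorizer with its default settings (smooth_idf=True and norm='l2'), which match the manual steps above:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(data['text of Documents'])
# on sklearn versions older than 1.0, use get_feature_names() instead
tfidf_output = pd.DataFrame(X_tfidf.toarray(),
                            columns=tfidf_vectorizer.get_feature_names_out())
print(tfidf_output.round(3))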

Code Snippet for TF-IDF

The output from the library:

Here we can see that the output is the same as what we calculated manually, since we followed the same steps used in the library. I hope reading this blog has given you some new insight into how TF-IDF works; if so, please show your appreciation.

Usage of TF-IDF:
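
In practice, TF-IDF features are widely used for search and information retrieval (ranking documents against a query), keyword extraction, measuring document similarity, and as input features for text classification models.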

