Creating a TF-IDF Model from Scratch in Python

Ashwin N
4 min read · Jul 26, 2022


Introduction to TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is an information retrieval technique used to rank the importance of words in a document. It is based on the idea that a word is more relevant to a document the more often it appears in that document and the less often it appears in the rest of the corpus.

TF-IDF is the product of Term Frequency and Inverse Document Frequency. Here’s the formula for TF-IDF calculation.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

1. What is Term Frequency?

It is the measure of the frequency of words in a document. It is the ratio of the number of times the word appears in a document compared to the total number of words in that document.
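
Expressed as a formula:

tf(t, d) = (number of times term t appears in document d) / (total number of terms in d)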

2. What is Inverse Document Frequency?

It is a measure of how much information the word provides about the topic of the document. It is the log of the ratio of the total number of documents to the number of documents containing the word.

We take the log of this ratio because, as the corpus grows, the raw ratio can become very large; taking the log dampens this effect. To avoid dividing by zero for a word that appears in no document, we smooth the value by adding 1 to the denominator.

idf(t) = log(N/(df + 1))
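
For example, with N = 3 documents and the natural log, a word found in just one document gets idf = log(3/(1 + 1)) ≈ 0.405, while a word found in all three documents gets idf = log(3/(3 + 1)) ≈ -0.288. With this form of smoothing, words that appear in every document can end up with slightly negative IDF values.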

Advantage of TF-IDF

This method removes the drawbacks of the Bag of Words model. It does not assign equal weight to every word: important words that occur in only a few documents receive high weights, while very common words receive low ones.

Bag of Words just creates a set of vectors containing the counts of word occurrences in each document, while the TF-IDF model also captures which words are more important and which are less so.

Step by Step Implementation of the TF-IDF Model

1. Preprocess the data

We will start by preprocessing the text data: building a vocabulary set from the words in our training data and assigning a unique index to each word in the set. But first, let us consider the sample text below.

sample_text = ['Topic sentences are similar to mini thesis statements. Like a thesis statement',
               'a topic sentence has a specific main point. Whereas the thesis is the main point of the essay, the topic sentence is the main point of the paragraph. Like the thesis statement, a topic sentence has a unifying function. But a thesis statement or topic sentence alone doesn’t guarantee unity.',
               'An essay is unified if all the paragraphs relate to the thesis, whereas a paragraph is unified if all the sentences relate to the topic sentence.']

Now tokenize the sample text and create a unique set of words.

import numpy as np
from nltk.tokenize import word_tokenize
# Note: word_tokenize needs the NLTK 'punkt' models (nltk.download('punkt'));
# NumPy is used further below to build the TF-IDF vectors.

sentences = []
word_set = []

for sent in sample_text:
    words = [word.lower() for word in word_tokenize(sent) if word.isalpha()]
    sentences.append(words)
    for word in words:
        if word not in word_set:
            word_set.append(word)

# Set of words
word_set = set(word_set)

# Total documents in our corpus
total_docs = len(sample_text)

print('Total documents: ', total_docs)
print('Total words: ', len(word_set))
-----------------------------------------------------------
Total documents: 3
Total words: 35

Now let us index each word in the vocabulary. These indices will later be used to map each word to its position in the TF-IDF vector. (Since Python sets are unordered, the exact index assigned to each word may vary between runs.)

word_index = {}
for i, word in enumerate(word_set):
    word_index[word] = i

Create a dictionary to keep a count for each vocabulary word across the corpus. Note that the loop below counts every occurrence of a word, not the number of documents containing it; a textbook document frequency would count each document at most once (for example, by iterating over set(sent) instead of sent). Because these occurrence counts can exceed the number of documents, some of the IDF values further below turn out negative.

def count_dict(sentences):
    # Initialise every vocabulary word to zero, then tally its occurrences
    counts = {}
    for word in word_set:
        counts[word] = 0
    for sent in sentences:
        for word in sent:
            counts[word] += 1
    return counts

word_count = count_dict(sentences)
print(word_count)
-----------------------------------------------------------
{'paragraphs': 1, 'topic': 6, 'thesis': 6, 'alone': 1, 'similar': 1, 'sentences': 2, 'if': 2, 'has': 2, 'of': 2, 'unity': 1, 'all': 2, 'main': 3, 'relate': 2, 'sentence': 5, 'paragraph': 1, 'doesn': 1, 'mini': 1, 'the': 11, 'unified': 2, 'statement': 3, 'specific': 1, 'to': 3, 'but': 1, 'or': 1, 'an': 1, 'a': 7, 'point': 3, 'unifying': 1, 'function': 1, 'whereas': 2, 'essay': 2, 'guarantee': 1, 'is': 4, 't': 1, 'are': 1}

Now let us calculate the term frequency (TF) of a given word within a document.

def term_frequency(document, word):
    # TF = (occurrences of the word in the document) / (total tokens in the document)
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence / N
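
As a quick illustration with the sample text as given above, 'thesis' appears twice among the twelve alphabetic tokens of the first document, so:

term_frequency(sentences[0], 'thesis')   # 2 / 12 ≈ 0.167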

Next, the inverse document frequency (IDF) for each word:

def inverse_document_frequency(word):
    # IDF = log(total documents / (count + 1)); unseen words fall back to a denominator of 1
    try:
        word_occurrence = word_count[word] + 1
    except KeyError:
        word_occurrence = 1
    return np.log(total_docs / word_occurrence)
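
For instance, taking word_count['thesis'] = 6 from the output above:

inverse_document_frequency('thesis')   # np.log(3 / (6 + 1)) ≈ -0.847

The negative value is a consequence of the occurrence-based counts discussed earlier, where the smoothed denominator can exceed the number of documents.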

The last part is to combine both TF and IDF:

def tf_idf(sentence):
    vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = term_frequency(sentence, word)
        idf = inverse_document_frequency(word)
        vec[word_index[word]] = tf * idf
    return vec

Now let us apply TF-IDF to our sample texts.

vectors = []
for sent in sentences:
    vectors.append(tf_idf(sent))

print(vectors[0])   # inspect the TF-IDF vector of the first document
-----------------------------------------------------------
[ 0.0144809 -0.01027436 -0.09078191 0. 0. 0.
0. -0.14853154 0.0144809 0. 0. 0.
0. 0.0144809 0. -0.06052128 0. 0.
-0.01027436 0.0144809 0. -0.02054872 -0.02475526 0.
0. 0. -0.10508885 0. 0. 0.
0. 0. -0.02054872 0. -0.01824377]

Note that if the model encounters a word outside the vocabulary, tf_idf will raise a KeyError at the word_index lookup, since we did not account for unknown tokens.
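
One simple way to handle this, shown here only as an illustrative sketch (tf_idf_safe is not part of the original code), is to skip out-of-vocabulary tokens:

def tf_idf_safe(sentence):
    # Illustrative variant: ignore words that are not in the vocabulary
    vec = np.zeros((len(word_set),))
    for word in sentence:
        if word not in word_index:
            continue  # skip unknown tokens instead of raising a KeyError
        vec[word_index[word]] = term_frequency(sentence, word) * inverse_document_frequency(word)
    return vec

Alternatively, a dedicated index for unknown tokens could be reserved when building the vocabulary.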

The purpose of this article is to demonstrate how TF-IDF actually works under the hood.

The implementation of the above code is available in the GitHub repository.

