TF-IDF vectorizer from scratch

Ponshriharini
Published in featurepreneur
2 min read · Mar 12, 2022

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a very common technique for transforming text into a meaningful numerical representation, which can then be used to fit a machine learning algorithm for prediction.

Let’s take a look at how a TF-IDF vectorizer works from scratch.

First, we’ll define a corpus. A corpus is a set of documents. We’ll take two documents for this example.

doc1 = "I love hamburgers and cheese"
doc2 = "I love to make hamburgers"

Now, we’ll count the number of times each word appears in its respective document. This count is the term frequency.
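The counting step can be sketched in a few lines of Python. This is only a minimal sketch: the helper name `term_frequency` is made up for illustration, and it assumes that words are lowercased and split on whitespace, which the example above does not specify.

```python
from collections import Counter

doc1 = "I love hamburgers and cheese"
doc2 = "I love to make hamburgers"
corpus = [doc1, doc2]


def term_frequency(document):
    """Count how often each word occurs in one document.

    Assumption: words are lowercased and separated by whitespace.
    """
    return Counter(document.lower().split())


# One Counter per document, mapping word -> raw count.
tf = [term_frequency(doc) for doc in corpus]

for counts in tf:
    print(dict(counts))
```

Every word occurs exactly once in its document here, so all the term frequencies are 1. A word that is absent from a document simply has a count of 0.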

After getting the term frequency, we’ll get the Inverse Document Frequency using the following formula

Here, n is the total number of documents, and d(t) is the number of documents in the document set that contain the term t.
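As a rough illustration of this step, the sketch below computes an IDF value for every word in the two documents. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + d(t))) + 1, which is the default in scikit-learn's TfidfVectorizer; that choice is an assumption and may differ from the formula used in this post, so the numbers need not match the tables here. The function name `inverse_document_frequency` is hypothetical.

```python
import math

corpus = [
    "I love hamburgers and cheese",
    "I love to make hamburgers",
]

# Assumption: lowercase, whitespace-separated words; a set per document
# because IDF only cares whether a term occurs, not how often.
tokenized = [set(doc.lower().split()) for doc in corpus]
n = len(tokenized)                          # total number of documents
vocabulary = sorted(set().union(*tokenized))


def inverse_document_frequency(term):
    """Smoothed IDF (assumed variant): ln((1 + n) / (1 + d(t))) + 1."""
    d_t = sum(1 for words in tokenized if term in words)
    return math.log((1 + n) / (1 + d_t)) + 1


idf = {term: inverse_document_frequency(term) for term in vocabulary}

for term, value in idf.items():
    print(f"{term:12s}{value:.6f}")
```

Under this assumed formula, a word that appears in both documents gets an IDF of exactly 1, while a word that appears in only one document gets a larger value, so rarer words are weighted more heavily.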

Using this formula, we get the following values,

After this, term frequencies and IDFs are multiplied and normalized.

After multiplying, we get,

We get the normalization value for this document as 5.7663813.

Now, we’ll divide each value in the table by this normalization value to normalize the data.
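The multiply-and-normalize steps can be put together as follows. This is a sketch under stated assumptions, not necessarily the exact computation behind the tables in this post: it assumes the smoothed IDF ln((1 + n) / (1 + d(t))) + 1 and Euclidean (L2) normalization, both of which are scikit-learn defaults, so the normalization value it produces may differ from the one quoted above. The helper names `idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

corpus = [
    "I love hamburgers and cheese",
    "I love to make hamburgers",
]

# Assumption: lowercase, whitespace-separated words.
tokenized = [doc.lower().split() for doc in corpus]
n = len(tokenized)


def idf(term):
    """Smoothed IDF (assumed variant): ln((1 + n) / (1 + d(t))) + 1."""
    d_t = sum(1 for words in tokenized if term in words)
    return math.log((1 + n) / (1 + d_t)) + 1


def tfidf_vector(words):
    """Multiply term frequency by IDF, then divide by the L2 norm."""
    counts = Counter(words)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(words) for words in tokenized]

# Print the weights for doc1, largest first.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s}{weight:.4f}")
```

With these assumptions, 'and' and 'cheese' still end up with the largest weights in doc1, because they are the only words in it that do not also appear in doc2.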

From this table, you can see that the words ‘and’ and ‘cheese’ are the most important features of doc1. This result might not be accurate, as we’ve used only two documents to find the important features. With a larger corpus of documents covering a wide variety of text, the results become more reliable.

Happy coding!
