An Introduction to TF-IDF

Roshan Kumar Gupta
Published in Analytics Vidhya
Nov 8, 2019

What is TF-IDF?

TF-IDF stands for "Term Frequency-Inverse Document Frequency." It's a way to score the importance of words (or "terms") in a document based on how frequently they appear in that document and how rarely they appear across the rest of the corpus.

Intuitively…

  1. If a word appears frequently in a document, it’s important. Give the word a high score.
  2. But if a word appears in many documents, it’s not a unique identifier. Give the word a low score.

Therefore, common words like “the” and “for,” which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

Let's look at what this means mathematically:

Term Frequency (tf): Gives us the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own tf.

tf(i, j) = (number of times term i appears in document j) / (total number of terms in document j)
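As a minimal sketch, term frequency for a single document might be computed like this (the function and variable names are my own, and tokenization is a naive whitespace split):

```python
from collections import Counter

def term_frequency(document):
    """Map each word to its count divided by the total word count."""
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# "the" appears 2 times out of 7 words, so tf("the") = 2/7
tf = term_frequency("The car is driven on the road")
```

A real pipeline would also strip punctuation and possibly remove stop words before counting.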

Inverse Document Frequency (idf): Used to calculate the weight of rare words across all documents in the corpus. Words that occur rarely in the corpus have a high IDF score. It is given by the equation below.

idf(i) = log(N / df(i))
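A sketch of the idf calculation, assuming a base-10 logarithm (other bases only rescale the scores) and assuming the term appears in at least one document so we never divide by zero:

```python
import math

def inverse_document_frequency(term, documents):
    """idf(t) = log10(N / df_t), where df_t is the number of
    documents that contain the term at least once."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log10(n / df)

docs = ["The car is driven on the road",
        "The truck is driven on the highway"]
inverse_document_frequency("the", docs)  # log10(2/2) = 0.0
inverse_document_frequency("car", docs)  # log10(2/1), about 0.301
```

Note that a term appearing in every document gets an idf of exactly zero, which is what drives common words toward a zero TF-IDF score.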

Combining these two, we arrive at the TF-IDF score (w) for a word in a document in the corpus. It is the product of tf and idf:

w(i, j) = tf(i, j) × log(N / df(i))

where:

tf(i, j) = term frequency of word i in document j (as defined above)

N = total number of documents in the corpus

df(i) = number of documents containing word i

Let’s take an example to get a clearer understanding.

Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which represent our corpus.
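The full calculation on this two-sentence corpus can be sketched as follows (function names are my own; tf is taken as count divided by document length, and idf uses a base-10 logarithm, both assumptions on my part):

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: tf-idf score} dict per document, using
    tf = count / doc length and idf = log10(N / df)."""
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    n = len(tokenized)
    vocabulary = {word for doc in tokenized for word in doc}
    # df: number of documents in which each term appears
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
    scores = []
    for doc in tokenized:
        counts = Counter(doc)
        scores.append({t: (counts[t] / len(doc)) * math.log10(n / df[t])
                       for t in counts})
    return scores

corpus = ["The car is driven on the road.",
          "The truck is driven on the highway."]
scores = tfidf(corpus)
# Words shared by both sentences score 0;
# "car" scores (1/7) * log10(2), about 0.043
```

Running this reproduces the pattern described below: every word common to both sentences gets a zero score, while the distinguishing words get small positive scores.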

Working through these calculations, we find that the TF-IDF of the common words ("the", "is", "driven", "on") is zero, which shows they are not significant. On the other hand, the TF-IDF scores of "car", "truck", "road", and "highway" are non-zero. These words have more significance.

Limitations of TF-IDF

1. It computes document similarity directly in the word-count space, which may be slow for large vocabularies.

2. It assumes that the counts of different words provide independent evidence of similarity.

3. It makes no use of semantic similarities between words.

Cheers!!

Thanks for reading the article. Be sure to share it if you find it helpful.

Also, let's get connected on Medium, GitHub, and LinkedIn.

For any questions, you can reach out to me on email (roshankg96 [at] gmail [dot] com).

Happy Learning!!
