Term Frequency (TF) and Inverse Document Frequency (IDF)

Charanraj Shetty
Published in Analytics Vidhya
Sep 6, 2020
https://pixabay.com/illustrations/wordcloud-tagcloud-cloud-text-tag-679949/

Term Frequency (TF) and Inverse Document Frequency (IDF) are two terms commonly encountered in Natural Language Processing. They are used to measure how often words occur and how much each word contributes to, or rather how important it is in, a given sentence of a document. These techniques are often used in sentiment classification. Retrieving the emotion conveyed by a text is easier when the machine knows the significance of each word, and classifying sentences as conveying a positive or negative message commonly relies on these techniques. We will follow a few steps to understand the concept better.

Suppose we are given the document below, which contains several sentences, and we want to use TF and IDF to work out what emotion or message those sentences convey.

Today morning the teams began their practice session. The boys Kabaddi team has gone through 1 round of practice. The boys football team has started practice. The boys cricket team has been doing the practice. The girls volleyball team is ready. The boys relay race team is up.

Step 1: Convert the sentences into a bag of words

https://www.123rf.com/photo_18625412_shopping-words-shape-of-shopping-bag.html

First we remove the stopwords (is, are, they, them, etc.): pronouns, auxiliaries and other words whose presence contributes little to the meaning of a sentence. Next we perform stemming, which means converting words (nouns, verbs, adjectives) to their base or root form. For example, the word training is converted to the verb train, which is its base form. The words that remain after this cleaning process are collected in a list, which represents the bag of words.

Bag_of_words = ['team', 'boys', 'girls', 'training', 'kabaddi', 'football', 'cricket', 'volleyball', 'practice', 'round', 'relay', 'race', 'session', 'today', 'begin', 'go', '1', 'start', 'ready']
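The cleaning step can be sketched in a few lines of Python. This is only an illustration: the stopword list and the base-form lookup below are hand-picked assumptions that cover just the example document, standing in for a real stopword list and stemmer from an NLP library.

```python
import re

# Assumption: a tiny hand-picked stopword list, just large enough for this
# example ("doing" is included so the output matches the cleaned sentences).
STOPWORDS = {"the", "their", "has", "been", "doing", "is", "of", "through", "up"}

# Assumption: a toy lookup standing in for a real stemmer. It only maps the
# inflected forms that occur in the example document to their base forms.
BASE_FORMS = {"teams": "team", "began": "begin", "gone": "go", "started": "start"}


def preprocess(sentence):
    """Lowercase, tokenize, drop stopwords, and map words to base forms."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return [BASE_FORMS.get(token, token) for token in kept]


print(preprocess("The boys football team has started practice."))
```

Collecting the distinct words returned for every sentence would give the bag of words for the document.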

Step 2: Select the top-frequency words

From the bag of words above, we take the 4 words that occur most frequently in the document and set them out in a table.
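One way to find the most frequent words is to count them with Python's collections.Counter. The sketch below uses the six cleaned sentences of the example document, typed in by hand, and assumes that "teams" in the first sentence has already been reduced to "team".

```python
from collections import Counter

# The six cleaned sentences of the example document, already tokenized.
# Assumption: "teams" in sentence 1 has been reduced to "team" by stemming.
sentences = [
    ["today", "morning", "team", "begin", "practice", "session"],
    ["boys", "kabaddi", "team", "go", "1", "round", "practice"],
    ["boys", "football", "team", "start", "practice"],
    ["boys", "cricket", "team", "practice"],
    ["girls", "volleyball", "team", "ready"],
    ["boys", "relay", "race", "team"],
]

# Count every word across the whole document.
word_counts = Counter(word for sentence in sentences for word in sentence)

# Print the four most frequent words with their counts.
for word, count in word_counts.most_common(4):
    print(word, count)
```

With these sentences, team, practice and boys clearly lead the count. Every other word occurs only once, so which word takes the fourth place is a judgment call.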

Step 3: Calculate the Term Frequency

Term frequency is the frequency of a particular word within a given sentence. A commonly used form of the formula is:

TF(word, sentence) = (number of times the word occurs in the sentence) / (total number of words in the sentence)

The document above has 6 sentences in total, and we calculate how often each of the top 4 high-frequency words occurs in each of these sentences.

sent 1 : Today morning teams begin practice session.

sent 2 : boys Kabaddi team go 1 round practice.

sent 3 : boys football team start practice.

sent 4 : boys cricket team practice.

sent 5 : girls volleyball team ready.

sent 6 : boys relay race team.
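The per-sentence term frequencies can be computed with a short script. This is a minimal sketch that assumes the common definition of TF (occurrences of the word divided by the number of words in the sentence) and assumes the four selected words are team, boys, practice and girls.

```python
def term_frequency(word, sentence):
    """Occurrences of `word` divided by the number of words in `sentence`."""
    return sentence.count(word) / len(sentence)


# The six cleaned sentences, tokenized ("teams" reduced to "team").
sentences = [
    ["today", "morning", "team", "begin", "practice", "session"],
    ["boys", "kabaddi", "team", "go", "1", "round", "practice"],
    ["boys", "football", "team", "start", "practice"],
    ["boys", "cricket", "team", "practice"],
    ["girls", "volleyball", "team", "ready"],
    ["boys", "relay", "race", "team"],
]

# Assumption: the four words picked in Step 2.
top_words = ["team", "boys", "practice", "girls"]

# One row per sentence, one TF value per selected word.
for number, sentence in enumerate(sentences, start=1):
    row = {word: round(term_frequency(word, sentence), 3) for word in top_words}
    print(f"sent {number}:", row)
```

For example, team occurs once among the 4 words of sent 4, so its term frequency there is 1/4 = 0.25.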

Step 4: Calculate the Inverse Document Frequency

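IDF measures how common or rare a particular word is across all the sentences in the document: the more sentences a word appears in, the lower its IDF. A commonly used form is IDF(word) = log(total number of sentences / number of sentences containing the word).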
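As a rough illustration, IDF can be computed as below. The sketch assumes the plain, unsmoothed form log(N / n) with the natural logarithm, where N is the number of sentences and n is the number of sentences containing the word. Other variants exist, so the values may differ from those produced by a particular library.

```python
import math


def inverse_document_frequency(word, sentences):
    """log(N / n): N sentences in total, n of them containing `word`.

    Assumes the word occurs in at least one sentence (otherwise n is 0).
    """
    containing = sum(1 for sentence in sentences if word in sentence)
    return math.log(len(sentences) / containing)


# The six cleaned sentences, tokenized ("teams" reduced to "team").
sentences = [
    ["today", "morning", "team", "begin", "practice", "session"],
    ["boys", "kabaddi", "team", "go", "1", "round", "practice"],
    ["boys", "football", "team", "start", "practice"],
    ["boys", "cricket", "team", "practice"],
    ["girls", "volleyball", "team", "ready"],
    ["boys", "relay", "race", "team"],
]

# Assumption: the four words picked in Step 2.
for word in ["team", "boys", "practice", "girls"]:
    print(word, round(inverse_document_frequency(word, sentences), 3))
```

Under this form, a word such as team that appears in every sentence gets an IDF of log(6/6) = 0, while a word that appears in only one sentence, such as girls, gets the largest value.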

Step 5: Calculate the weightage of each word in a sentence

In this step we evaluate the impact of each word in a sentence by multiplying the word's Term Frequency in that sentence by its IDF.
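The product can be sketched as follows. The helper names tf, idf and tf_idf are made up for this illustration, and the code assumes the common forms of TF (count divided by sentence length) and IDF (natural log of N / n), so the numbers it prints depend on those choices.

```python
import math

# The six cleaned sentences, tokenized ("teams" reduced to "team").
sentences = [
    ["today", "morning", "team", "begin", "practice", "session"],
    ["boys", "kabaddi", "team", "go", "1", "round", "practice"],
    ["boys", "football", "team", "start", "practice"],
    ["boys", "cricket", "team", "practice"],
    ["girls", "volleyball", "team", "ready"],
    ["boys", "relay", "race", "team"],
]

# Assumption: the four words picked in Step 2.
top_words = ["team", "boys", "practice", "girls"]


def tf(word, sentence):
    """Occurrences of the word divided by the sentence length."""
    return sentence.count(word) / len(sentence)


def idf(word, sentences):
    """log(N / n); assumes the word occurs in at least one sentence."""
    containing = sum(1 for sentence in sentences if word in sentence)
    return math.log(len(sentences) / containing)


def tf_idf(word, sentence, sentences):
    """Weight of a word in one sentence: its TF times its IDF."""
    return tf(word, sentence) * idf(word, sentences)


# One row per sentence, one weight per selected word.
for number, sentence in enumerate(sentences, start=1):
    weights = {w: round(tf_idf(w, sentence, sentences), 3) for w in top_words}
    print(f"sent {number}:", weights)
```

A word that is absent from a sentence gets weight 0 there, and with this unsmoothed IDF so does a word that appears in every sentence.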

From the resulting table we can draw the following conclusions.

Sentence 1: The word practice has the highest weightage, indicating that the college is putting effort into the practice session.

Sentence 2: The boys team is preparing for the game.

Sentence 3: The boys team is preparing for the game.

Sentence 4: The boys team is practicing for the game.

Sentence 5: The girls team is practicing for the game.

Sentence 6: The boys team is practicing hard for the game.

By calculating the total weightage of each word across the entire document, it can be observed that the word boys has more weightage than the others. Hence we can conclude that the college is focusing more on encouraging boys to compete in the upcoming competition.

In this way, TF and IDF helped us identify the contribution of words in individual sentences. We were also able to identify, from the given document, which area the college is focusing on more.

Hope this example helps you to understand things better!!

Thanks for reading :)
