TF-IDF Basic information and logic.

Published in Analytics Vidhya · 3 min read · Apr 24, 2020

Tf-idf stands for “term frequency-inverse document frequency”. It is a weighting scheme used mainly in text mining/analysis and information retrieval. In short, tf-idf assigns a weight, i.e. a measure of relevance, to each word in a sentence or document (a set of words). These weights are used to score and rank words or documents, for example when ranking search results against a query. A word's importance increases proportionally to the number of times it appears in the document but is offset by how frequently the word occurs across the corpus (data-set).

Procedure:

tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.

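Term Frequency:- There are several ways to calculate both tf and idf. The simple and basic formula for tf is

tf(document, word) = (number of times the word appears in the document) / (total number of words in the document)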
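As a minimal sketch, the basic term-frequency calculation (count of the word divided by the total number of words in the document) might look like this in Python. The function name `term_frequency` and the tokenization rule (lowercase, strip basic punctuation, split on whitespace) are assumptions made for illustration, not something the article prescribes.

```python
def term_frequency(word, document):
    """Return (occurrences of word) / (total words) for one document."""
    # Assumed tokenization: split on whitespace, strip basic punctuation, lowercase.
    tokens = [t.strip(".,!?").lower() for t in document.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0  # an empty document has no term frequencies
    return tokens.count(word.lower()) / len(tokens)


# "I" and "i" are counted as the same word: 2 occurrences out of 7 words.
print(round(term_frequency("I", "I like eating ice when i like."), 2))  # 0.29
```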

Inverse Document Frequency:- It is the log of the simple inverse of document frequency, i.e.

idf(word) = log(total number of documents / number of documents containing the word)

Log is used because, as the number of documents grows, the raw ratio can become very large and hard to work with; taking the log keeps the idf value in a manageable range. Sometimes “+1” is added to the denominator so that a word that appears in no document does not cause a division by zero, which can happen in exceptional cases.
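The idf calculation, with and without the “+1” in the denominator, can be sketched as follows. The function name `inverse_document_frequency`, the `smooth` flag, and the choice of the natural log are assumptions for illustration; other log bases and smoothing variants are also in common use.

```python
import math


def inverse_document_frequency(word, tokenized_docs, smooth=False):
    """Return log(N / df), or log(N / (1 + df)) when smooth=True."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain the word at least once.
    doc_freq = sum(1 for doc in tokenized_docs if word in doc)
    if smooth:
        return math.log(n_docs / (1 + doc_freq))
    if doc_freq == 0:
        raise ValueError(f"{word!r} appears in no document; use smooth=True")
    return math.log(n_docs / doc_freq)


docs = [
    ["i", "like", "eating", "ice", "when", "i", "like"],
    ["sherry", "is", "always", "eating", "pear"],
    ["chef", "is", "eating", "cream"],
]
print(round(inverse_document_frequency("is", docs), 3))  # log(3/2) -> 0.405
print(round(inverse_document_frequency("unseen", docs, smooth=True), 3))  # log(3/1) -> 1.099
```

Note that with this particular smoothing a word that occurs in every document gets a slightly negative idf (log(3/4) here), which is one reason other variants add 1 elsewhere in the formula.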

Example:- Consider 3 documents:

document 1- I like eating ice when i like.

document 2- Sherry is always eating pear.

document 3- Chef is eating cream.

Take the first document and compute tf. “I” occurs twice in the 1st document of 7 words (counting “I” and “i” as the same word), i.e. 2/7 = 0.29.

Next, compute idf for the example. Note that idf is computed over all the documents together. “is” appears in 2 documents (2 and 3) and the total number of documents is 3, i.e. idf = log(3/2).

Now calculate the final weight of each word, which gives the weighted bag of words, by using the formula:-

relevance = tf(document, word) * idf(word)

“I” has a tf of 0.29 in document 1 and, since it appears in only 1 of the 3 documents, an idf of log(3/1) ≈ 1.10 (natural log), so its weight is

0.29*1.10 = 0.319 ≈ 0.32.
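The whole example can be reproduced end to end with a short script. This is a sketch under stated assumptions: the helper names `tokenize` and `tf_idf`, the case-insensitive tokenization, and the use of the natural log are illustrative choices rather than a fixed standard.

```python
import math
from collections import Counter

documents = [
    "I like eating ice when i like.",
    "Sherry is always eating pear.",
    "Chef is eating cream.",
]


def tokenize(text):
    # Assumed tokenization: split on whitespace, strip basic punctuation, lowercase.
    return [t.strip(".,!?").lower() for t in text.split() if t.strip(".,!?")]


tokenized = [tokenize(d) for d in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))


def tf_idf(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for number, doc_weights in enumerate(weights, start=1):
    print(f"document {number}:", {w: round(v, 2) for w, v in doc_weights.items()})
```

Without rounding the intermediate values, the weight of “i” in document 1 comes out near 0.31 rather than 0.32. The word “eating” appears in all three documents, so its idf is log(3/3) = 0 and its weight is 0 everywhere, which is how tf-idf discounts words that do not help tell documents apart.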

Thus the weighted bag of words is calculated and can be used for further tasks such as classification. Tf-idf and other feature extraction methods can make classification more efficient and increase accuracy.


Written by Tedious_wings