TF-IDF: Basic Information and Logic

By Tedious_wings, published in Analytics Vidhya, Apr 24, 2020

Tf-idf stands for “term frequency-inverse document frequency”. It is a procedure used mainly in text mining/analysis and information retrieval. In short, tf-idf assigns a weight, i.e. a relevance score, to each word in a sentence or document (a set of words). These weights are used to score and rank words or documents, for example when ranking search results. The importance of a word increases proportionally to the number of times it appears in the document, but is offset by how frequently the word occurs across the corpus (data-set).

Procedure:

tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.

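Term Frequency: There are several ways to calculate tf and idf. The simple and basic formula for tf is

tf(document, word) = (number of times the word appears in the document) / (total number of words in the document)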

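Inverse Document Frequency: It is the log of the inverse of the document frequency (the fraction of documents that contain the word), i.e.

idf(word) = log(total number of documents / number of documents containing the word)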

The log is used because, as the number of documents grows, the raw ratio can become very large and dominate the score; taking the log dampens it to a manageable range. Sometimes “+1” is added to the denominator so that a word that appears in no document does not cause a division by zero, which can happen in exceptional cases.
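The idf definition and the “+1” variant described above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `inverse_document_frequency`, the `smooth` flag and the tiny corpus are all made up for illustration, and the natural log is assumed.

```python
import math

def inverse_document_frequency(word, documents, smooth=False):
    """idf = log(N / df); with smooth=True, 1 is added to the denominator."""
    n_docs = len(documents)
    # Document frequency: how many documents contain the word at least once.
    df = sum(1 for tokens in documents if word in tokens)
    if smooth:
        df += 1  # avoids division by zero for words that appear in no document
    return math.log(n_docs / df)

# Hypothetical, already tokenized and lowercased corpus (illustration only).
corpus = [["apple", "pie"], ["apple", "juice"], ["orange", "juice"], ["tea"]]

print(inverse_document_frequency("apple", corpus))              # log(4/2)
print(inverse_document_frequency("tea", corpus))                # log(4/1): the rarer word scores higher
print(inverse_document_frequency("kiwi", corpus, smooth=True))  # log(4/1): unseen word, no division by zero
```

Without `smooth=True`, asking for the idf of a word that appears in no document raises a `ZeroDivisionError` in this sketch, which is exactly the case the “+1” is meant to handle.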

Example:- Consider 3 documents:

document 1- I like eating ice when i like.

document 2- Sherry is always eating pear.

document 3- Chef is eating cream.

For the above documents, start with tf. Ignoring case, “I” occurs twice in the 1st document, which has 7 words, so its tf is 2/7 ≈ 0.29.
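The same count can be reproduced in Python. This is a small sketch that assumes a very simple tokenizer (lowercase everything, keep only runs of letters); the helper names `tokenize` and `term_frequency` are chosen here for illustration.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters, so "I" and "i" count as the same word.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(word, tokens):
    # Occurrences of the word divided by the total number of words in the document.
    return tokens.count(word) / len(tokens)

doc1 = tokenize("I like eating ice when i like.")
tf_i = term_frequency("i", doc1)
print(len(doc1), round(tf_i, 2))  # 7 words; tf of "i" rounds to 0.29
```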

[Figure: tf example]

Next, the idf for the example. Note that idf is computed over all the documents together. “is” appears in 2 documents (2 and 3) and the total number of documents is 3, so its idf is log(3/2) ≈ 0.41 (natural log).

[Figure: idf example]

Now calculate the final weight of each word, which gives a tf-idf weighted bag of words, using the formula:

relevance = tf(document, word) * idf(word)

“I” has a tf of 0.29 in document 1, and its idf is log(3/1) ≈ 1.10 (natural log, since “I” appears in only 1 of the 3 documents), so its relevance is

0.29 * 1.10 = 0.319 ≈ 0.32.
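Putting the pieces together, the weights for all three example documents can be computed as follows. This is a sketch under the same assumptions as the worked example: a simple lowercase, letters-only tokenizer, natural log, and no “+1” smoothing. The helper names are illustrative.

```python
import math
import re

documents = [
    "I like eating ice when i like.",
    "Sherry is always eating pear.",
    "Chef is eating cream.",
]

def tokenize(text):
    # Lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in documents]

def tf(word, tokens):
    # Share of the document's words that are this word.
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    # log(total documents / documents containing the word), natural log.
    df = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / df)

def tf_idf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

# One {word: weight} mapping per document.
weights = [
    {word: tf_idf(word, tokens, tokenized) for word in set(tokens)}
    for tokens in tokenized
]

for i, w in enumerate(weights, start=1):
    ranked = sorted(w.items(), key=lambda kv: kv[1], reverse=True)
    print(f"document {i}:", [(word, round(score, 2)) for word, score in ranked])
```

Two things are worth noticing in the output. “eating” appears in every document, so its idf is log(3/3) = 0 and its weight is 0 everywhere. And with unrounded intermediate values the weight of “i” in document 1 comes out at about 0.31; the 0.32 in the text results from rounding tf and idf before multiplying.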

[Figure: bag of words]

Thus the tf-idf weighted bag of words is calculated and can be used for further tasks such as classification. Tf-idf and other feature extraction methods can make classification more efficient and more accurate.
