tf-idf: Machine Learning through visuals. #2: What is “tf-idf”? How is it calculated? How to make sense of it?

Amey Naik
Machine Learning through visuals
4 min read · Jun 24, 2018

By Amey Naik & Arjun Jauhari

Welcome to “Machine Learning through visuals”. In this series, I want the reader to quickly recall and, more importantly, retain concepts through the simple visual cues shown in this article. A large body of research indicates that visual cues help us better retrieve and remember information. The concepts are not discussed in detail here; the assumption is that the reader already knows them and wants a quick recap.

Let’s get the suspense out of the equation.

tf-idf stands for term frequency — inverse document frequency.

Let’s learn (or recall) this concept with the help of the example shown below.

Here we have 4 documents.

Indexing the vocabulary
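A minimal sketch that reproduces the output below, assuming scikit-learn. Document 1 is quoted later in this article; documents 2 to 4 are assumed texts chosen to be consistent with the vocabulary and matrices shown.

from sklearn.feature_extraction.text import CountVectorizer

# Document 1 is quoted later in this article; documents 2-4 are assumed.
data_set = (
    "Bali is an island and not a country",
    "Peru is a country in South America",
    "Is a country a country",
    "Japan is an island country",
)

# Drop English stop words ('is', 'a', 'and', ...) so only content words remain.
count_vectorizer = CountVectorizer(stop_words="english")
count_vectorizer.fit(data_set)
print(count_vectorizer.vocabulary_)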
OUTPUT:
{'bali': 1, 'island': 3, 'country': 2, 'peru': 5, 'south': 6, 'america': 0, 'japan': 4}
VSM representation with term frequencies
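Continuing the sketch, transform turns each document into a row of term counts:

freq_term_matrix = count_vectorizer.transform(data_set)
print(freq_term_matrix.todense())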
OUTPUT:
matrix([[0, 1, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 1],
        [0, 0, 2, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 0, 0]])

Each row in the output matrix corresponds to a document in the data_set, and each column corresponds to a word from the vocabulary.
For example: column 0 is for ‘america’, column 1 is for ‘bali’, and so on.

Each element of the matrix is a ‘term frequency’, i.e. the number of times the word appears in that particular document.
For example: ‘country’ appears 2 times in document 3.

Finally, idf (inverse document frequency)

idf(t) = log[ (1 + |D|) / (1 + df(t)) ] + 1

|D|: the cardinality of our document space, i.e. the number of documents. In this example it is 4.
df(t): the number of documents that the term t appears in. The 1 added to the numerator and denominator avoids division by zero, as if an extra document contained every term exactly once. Here log is the natural logarithm, and the trailing + 1 ensures that terms occurring in every document (whose log term is 0) are not ignored entirely.
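As a sketch of how these values are obtained with scikit-learn, TfidfTransformer exposes them as idf_ after fitting on the term-frequency matrix from earlier:

from sklearn.feature_extraction.text import TfidfTransformer

# smooth_idf=True (the default) applies the +1 smoothing from the formula above.
tfidf = TfidfTransformer(norm="l2", smooth_idf=True)
tfidf.fit(freq_term_matrix)
print(tfidf.idf_)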

IDF: 
[ 1.916 1.916 1. 1.510 1.916 1.916 1.916]

IDF is calculated for each feature (in this case, each term t).

Example showing IDF calculated for term ‘island’
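In text form: ‘island’ appears in 2 of the 4 documents, so

idf(‘island’) = log[ (1 + 4) / (1 + 2) ] + 1 = log(5/3) + 1 ≈ 1.510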

Using the above IDF vector, tf-idf is calculated by multiplying each term frequency by the corresponding term’s idf, i.e. scaling each column of the term-frequency matrix by that term’s idf.

tf-idf calculation for document 1.
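In text form, multiplying document 1’s term frequencies element-wise by the idf vector:

tf(document 1) = [0, 1, 1, 1, 0, 0, 0]
tf-idf(document 1) = [0, 1 × 1.916, 1 × 1, 1 × 1.510, 0, 0, 0] = [0, 1.916, 1, 1.510, 0, 0, 0]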

The final step is to normalize each row of this matrix. This counters a bias towards long documents, where frequently repeated words would otherwise look more important than they are simply because the term counts are high.

Please note: TfidfTransformer performs the above calculation and L2 normalization (by default) on the matrix.

L2 normalization is the most commonly used normalization method. Below, the L2 norm calculation is worked out on row 1 of the above matrix.

L2 normalization applied on Document 1 tf-idf vector.
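In text form, for document 1’s un-normalized tf-idf vector v = [0, 1.916, 1, 1.510, 0, 0, 0]:

||v|| = sqrt(1.916² + 1² + 1.510²) = sqrt(6.951) ≈ 2.636
v / ||v|| ≈ [0, 0.726, 0.379, 0.572, 0, 0, 0]

which matches row 1 of the final matrix below.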

The final matrix after normalization is:
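Continuing the sketch, the fitted transformer from earlier produces it directly:

tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.todense())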

OUTPUT:
[[ 0.     0.726  0.379  0.572  0.     0.     0.   ]
 [ 0.55   0.     0.288  0.     0.     0.55   0.552]
 [ 0.     0.     1.     0.     0.     0.     0.   ]
 [ 0.     0.     0.379  0.572  0.72   0.     0.   ]]

Block diagram of process described above:
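data_set (raw documents) → CountVectorizer (tokenize, index vocabulary) → term-frequency matrix → multiply by idf → L2 normalization → tf-idf matrix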

Interpretation of tf-idf:

  • The larger the value, the more important the term is for that document.
  • If a particular term, say ‘t1’, exists in all the documents alongside terms unique to a particular document, then ‘t1’ is assigned less weight than those unique terms.
    For example, document 1 = ‘Bali is an island and not a country’.
    In row 1 of tf_idf_matrix, the term ‘country’ is assigned less weight (0.379) than the term ‘bali’ (0.726), which exists only in that document. The term ‘island’ is given a weight of 0.572, between those of ‘bali’ and ‘country’, as it also appears in document 4.
  • The tf-idf matrix can be used as a feature set for machine learning algorithms such as logistic regression or decision trees (a sketch follows this list).
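As a minimal sketch of that last point, with hypothetical labels invented purely for illustration:

from sklearn.linear_model import LogisticRegression

# Hypothetical labels for the four documents (made up for illustration).
labels = [0, 1, 0, 1]
clf = LogisticRegression()
clf.fit(tf_idf_matrix, labels)
print(clf.predict(tf_idf_matrix))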
