The quantitative value of text, tf-idf and more…

Varun
Published in Analytics Vidhya
5 min read · Aug 16, 2020

Introduction

tf-idf, which stands for term frequency-inverse document frequency, is used to calculate a quantitative digest of any document, which can then be used to find similar documents, classify documents, and so on.

This article will explain tf-idf, its variations, and how these variations affect the model output.

tf-idf is similar to Bag of Words (BoW), where documents are treated as a bag or collection of words/terms and converted to numerical form by counting the occurrences of every term. The whole idea is to assign a weight to each term occurring in the document.

tf-idf takes this one step further and also considers the relative importance of every term to a document within a collection of documents (normally called a corpus).

Wikipedia summarises it well,

term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Following is a sample collection of documents taken from Wikipedia, which will serve as the corpus (of size 4) for this article.

Document Corpus

Before going further, we have to perform some text cleansing and pre-processing on the corpus, such as removing special characters, removing stop words, and lemmatizing words. The text looks like this after these steps.

Processed text
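As a rough sketch of these pre-processing steps (the four short documents below are hypothetical stand-ins for the actual Wikipedia snippets used in the article, and the NLTK-based cleanup is just one reasonable way to do it):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

# Hypothetical corpus of 4 short Wikipedia-style documents.
corpus = [
    "Machine learning is the study of computer algorithms that improve through experience.",
    "Data mining is the process of extracting patterns from large data sets.",
    "A computer is a machine that can be programmed to carry out sequences of operations.",
    "An algorithm is a finite sequence of well-defined instructions, typically run on a machine.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(doc):
    doc = re.sub(r"[^a-z\s]", " ", doc.lower())                # remove special characters
    tokens = [t for t in doc.split() if t not in stop_words]   # remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)   # lemmatize

processed = [preprocess(d) for d in corpus]
print(processed)
```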

Calculations

tf, term frequency, is the simplest way of calculating weights. As the name suggests, the weight of a term t in document d is the number of occurrences of t in d, which is exactly the Bag of Words model.

Term Frequencies
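A minimal way to compute these term frequencies is scikit-learn's CountVectorizer, continuing with the hypothetical `processed` corpus from the previous sketch:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()
tf = count_vec.fit_transform(processed)   # sparse matrix of raw counts

# Rows are documents, columns are terms, values are term frequencies.
tf_df = pd.DataFrame(tf.toarray(), columns=count_vec.get_feature_names_out())
print(tf_df)
```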

In tf there is no notion of importance. idf, inverse document frequency, is used to introduce an importance factor for each term. This is needed because some terms have little or no discriminating power; for example, in a collection of documents about machine learning, the term machine would appear in almost all the documents and therefore would not help distinguish them.

Document frequency, df_t, is the number of documents in the corpus that contain the term t, and it is used to scale the weight (factor of importance) of term t. The idf of a term t in the document collection is defined as,

idf(t) = log(N / df_t)

where N is the number of documents in the collection and df_t is the document frequency of term t.

idf values
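A sketch of this calculation on the hypothetical counts from the previous snippet, using the plain idf(t) = log(N / df_t) form given above (natural log, no smoothing):

```python
import numpy as np

counts = tf_df.to_numpy()
N = counts.shape[0]                  # number of documents in the corpus
df_t = (counts > 0).sum(axis=0)      # document frequency of each term
idf = np.log(N / df_t)               # terms occurring in every document get idf = 0

print(pd.Series(idf, index=tf_df.columns).sort_values())
```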

Different variations of tf and idf

tf-idf,

tf-idf is the combination of tf and idf, i.e. the term frequency scaled by the importance factor. The tf-idf of a term t present in a document d from a corpus of documents D is defined as,

tf-idf(t, d) = tf(t, d) × idf(t)

tf-idf is highest for a term t when it occurs many times within a small number of documents

tf-idf is lower when t occurs fewer times in a document, or occurs in many documents

tf-idf is lowest when t occurs in all the documents

We can see that the weight of data is higher than that of computer even though their term frequencies are the same for document 0. This is because of idf: data occurs in a smaller number of documents.

Machine occurs in all the documents, hence it gets a weight of 0. Depending on the use case, we can add 1 to idf to avoid a zero weight for such terms.
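With scikit-learn this calculation is available out of the box via TfidfVectorizer; note that its idf always includes the +1 just mentioned (and smoothing by default), so the numbers differ slightly from the plain tf × idf product. A sketch on the hypothetical corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# smooth_idf=False and norm=None give tf * (ln(N / df_t) + 1) with no normalization.
tfidf_vec = TfidfVectorizer(smooth_idf=False, norm=None)
tfidf = tfidf_vec.fit_transform(processed)

tfidf_df = pd.DataFrame(tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())
print(tfidf_df.round(2))
```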

Sublinear tf-scaling,

It is not always true that multiple occurrences of a term in a document make that term more significant in proportion to the number of occurrences. Sublinear tf-scaling is a modification of term frequency which calculates the weight as follows,

wf(t, d) = 1 + log(tf(t, d)) if tf(t, d) > 0, and 0 otherwise

In this case tf-idf becomes,

wf-idf(t, d) = wf(t, d) × idf(t)

wf-idf

As intended, sublinear tf-scaling has scaled down the weight of the term algorithm, which occurs multiple times (the maximum tf) in the first document.
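In scikit-learn this is just the sublinear_tf flag; a sketch on the same hypothetical corpus, again without normalization so the effect of wf = 1 + log(tf) is easy to see:

```python
# sublinear_tf=True replaces tf with 1 + ln(tf) before multiplying by idf.
wf_vec = TfidfVectorizer(sublinear_tf=True, smooth_idf=False, norm=None)
wf_idf = wf_vec.fit_transform(processed)

wf_idf_df = pd.DataFrame(wf_idf.toarray(), columns=wf_vec.get_feature_names_out())
print(wf_idf_df.round(2))
```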

Maximum tf-normalization,

This is another modification of term frequency, where the tf of every term occurring in a document is normalized by the maximum tf in that document,

ntf(t, d) = a + (1 - a) × tf(t, d) / tf_max(d)

where a is a smoothing term ranging between 0 and 1 (generally set to 0.4) and tf_max(d) is the maximum term frequency in document d.

Maximum tf-normalization handles the case where a long document has higher term frequencies simply because of its length, with the same terms repeated again and again.

This approach falls short when a document has a term occurring an unusually high number of times.

High-frequency terms such as algorithm and computer are scaled down as they are normalized by the maximum frequency. Terms with zero frequency also get some weight because of the smoothing term.
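Maximum tf-normalization is not exposed as a flag in scikit-learn, so here is a manual sketch on the hypothetical counts and idf values from the earlier snippets, with the smoothing term a = 0.4:

```python
a = 0.4
max_tf = counts.max(axis=1, keepdims=True)   # maximum term frequency per document
ntf = a + (1 - a) * counts / max_tf          # normalized tf; zero-count terms get weight a
ntf_idf = ntf * idf                          # combine with idf as before

print(pd.DataFrame(ntf_idf, columns=tf_df.columns).round(2))
```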

Normalization,

We can normalize document vectors by either the L2 or the L1 norm. After L2 normalization, the sum of the squares of the elements of every document vector will be 1 (each vector has unit Euclidean length). In this case the cosine similarity between any two document vectors is just the dot product of the vectors. In the case of L1 normalization, the sum of the absolute values of the elements of every document vector becomes 1.
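A sketch of both normalizations using sklearn.preprocessing.normalize, reusing the hypothetical tf-idf matrix from the earlier snippet; after L2 normalization the cosine similarity of two documents reduces to a dot product:

```python
from sklearn.preprocessing import normalize

l2_vectors = normalize(tfidf.toarray(), norm="l2")   # unit Euclidean length per row
l1_vectors = normalize(tfidf.toarray(), norm="l1")   # absolute values of each row sum to 1

# Cosine similarity between document 0 and document 1 as a plain dot product.
print(l2_vectors[0] @ l2_vectors[1])
```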

Code used in the article can be found at this link — https://github.com/varun21290/medium/blob/master/tfidf/tfidf.ipynb

Scikit-Learn provides most of these calculations out of the box; check the links in the references.
