Converting Texts to Numeric Form with TfidfVectorizer: A Step-by-Step Guide
How to calculate TFIDF values manually and with sklearn
TFIDF (Term Frequency-Inverse Document Frequency) is a method to convert texts to numeric form for machine learning or AI models. In other words, TFIDF is a way to extract features from texts. It is more sophisticated than the CountVectorizer() method I discussed in my last article.
The TFIDF method gives each word a score that represents how useful or relevant that word is. The score weighs how often the word appears in a document against how common the word is across all the documents in the collection.
In this article, we will calculate the TFIDF scores manually so that the concept is clear. Toward the end, we will also see how to use the TFIDF vectorizer from the sklearn library.
TFIDF has two parts: TF and IDF. Let’s see how each part works.
TF
TF stands for ‘Term Frequency’. It can be calculated in either of two ways:
TF = # of occurrences of a word in a document
OR
TF = (# of occurrences of the word in a document) / (total # of words in that document)
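The two definitions above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `term_frequencies`, the `normalize` flag, and the sample sentence are assumptions made for this sketch, and the tokenization is a naive lowercase-and-split that ignores punctuation.

```python
from collections import Counter

def term_frequencies(document, normalize=False):
    """Return a {word: TF} mapping for a single document."""
    # Naive tokenization (an assumption for this sketch): lowercase, split on whitespace.
    words = document.lower().split()
    counts = Counter(words)
    if not normalize:
        # First definition: raw number of occurrences of each word.
        return dict(counts)
    # Second definition: occurrences divided by the total number of words.
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample document, not the one used in this article.
sample = "the cat sat on the mat"
raw_tf = term_frequencies(sample)
relative_tf = term_frequencies(sample, normalize=True)
print(raw_tf)
print(relative_tf)
```

With the raw count, ‘the’ gets a TF of 2 here; with the second definition it gets 2 divided by the 6 words in the sentence. The second form is often preferred when documents differ a lot in length, because a raw count favors longer documents.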
Let’s work on an example. We will find the TF for each word for this document: