TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Photo by Mohamed Nohassi on Unsplash


Converting Texts to Numeric Form with TfidfVectorizer: A Step-by-Step Guide

How to calculate Tfidf values manually and using sklearn

6 min read · Oct 25, 2023

TFIDF (term frequency-inverse document frequency) is a method for converting text to numeric form so that it can be used by machine learning or AI models. In other words, TFIDF is a method for extracting features from text. It is more sophisticated than the CountVectorizer() method I discussed in my last article.

The TFIDF method gives each word a score that represents how useful or relevant that word is. The score reflects how often the word is used in a document, weighed against how common the word is across the other documents.

This article calculates the TFIDF scores manually so that you understand the concept clearly. Toward the end, we will also see how to use the TFIDF vectorizer from the sklearn library.

There are two parts to it: TF and IDF. Let’s see how each part works.

TF

TF stands for ‘Term Frequency’. It can be calculated in either of two ways:

TF = number of occurrences of a word in a document

OR

TF = (number of occurrences of a word in a document) / (total number of words in the document)
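As a rough illustration of the two definitions above, here is a minimal Python sketch that computes both variants of TF. The sample sentence and the simple lowercase-and-split tokenization are assumptions made for this sketch only; they are not the document or the preprocessing used in this article.

```python
from collections import Counter

# Hypothetical sample document, chosen only for illustration.
document = "the cat sat on the mat"

# Assumption: lowercase the text and split on whitespace to get words.
tokens = document.lower().split()

# Variant 1: TF as the raw number of occurrences of each word.
raw_tf = Counter(tokens)

# Variant 2: TF as occurrences divided by the total number of words.
total_words = len(tokens)
normalized_tf = {word: count / total_words for word, count in raw_tf.items()}

for word in raw_tf:
    print(f"{word}: count={raw_tf[word]}, tf={normalized_tf[word]:.3f}")
```

In this sample, "the" appears twice among six words, so its raw TF is 2 and its normalized TF is 2/6, or about 0.333. Every other word appears once, giving 1/6.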

Let’s work on an example. We will find the TF for each word for this document:


Published in TDS Archive


Written by Rashida Nasrin Sucky

MS in Applied Data Analytics from Boston University. Read my blog: https://regenerativetoday.com/
