TF-IDF Basics of Information Retrieval

Uniqtech
Data Science Bootcamp

--

Title: TF-IDF (definition) tf–idf, tfidf, information retrieval, term frequency–inverse document frequency. Understanding the TF-IDF formula in minutes. Uniqtech Guide to TF-IDF.

Introduction

TF-IDF models how important a keyword is within a document, and also in the context of a collection of documents and texts known as a corpus. TF-IDF is a key algorithm in information retrieval and is widely used for document retrieval. For the definition of information retrieval, read our flash card on Information Retrieval (IR) Definition.

Importance factor explained in plain English: the importance factor is proportional to how frequently the keyword appears in the document, normalized by the length of the document (so long documents don't gain an advantage over short ones), and inversely proportional to how frequently the word appears in the other documents of the corpus (the importance factor is offset by how common the word is across the corpus). For the math formula, see the Wikipedia screenshot below. For the tf-idf function, scroll down to the tf-idf function section. Update, September 2023: thank you LangChain for linking to us. LOVE 🦜⛓️. We are honored. We added a section on using LangChain for TF-IDF, made the article flow better, and added clarifications.
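For readers who want the formula inline, here is a standard formulation that matches the plain-English description above (the notation is ours: t is a term, d a document, D the corpus, N the number of documents in D, and f_{t,d} the raw count of t in d):

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

The tf term captures "proportional to the frequency in the document, normalized by document length," and the idf term captures "inversely proportional to how many documents in the corpus contain the word."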

Intuition: let's explain the intuition behind the offset calculation in this section. With this discount formula, words that appear frequently across the corpus are naturally discounted, such as "economics" and "economy" in The Economist…
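To see this discounting in practice, here is a minimal sketch using scikit-learn's TfidfVectorizer (our choice of library for illustration, not the article's own implementation). A word that appears in every document of a toy corpus receives a lower idf weight than a rarer word:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "economy" appears in every document, so its idf is low;
# rarer words such as "inflation" or "growing" receive higher weights.
corpus = [
    "the economy is growing",
    "the economy is slowing",
    "inflation hits the economy",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix of tf-idf scores

# Inspect the learned idf values: corpus-wide frequent terms are discounted.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(f"{term}: idf={vectorizer.idf_[idx]:.3f}")
```

Running this prints the lowest idf for "economy" and "the" (present in all three documents) and higher idf for words that appear in only one document, which is exactly the offset behavior described above.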

--
