Understanding TF-IDF for Absolute Beginners

Liu Zuo Lin
Published in Analytics Vidhya · 5 min read · Aug 29, 2021


Term Frequency-Inverse Document Frequency (TF-IDF) measures how relevant a word is to the document that contains it, relative to the whole collection of documents.

Some Terms

  1. You can think of the term document as a sentence
  2. You can think of the term collection of documents as a list of sentences

Here’s a super simple example

documents = [
"apple apple",
"apple orange"
]

The variable documents is the collection of documents: a list that contains two sentences. Each item, “apple apple” and “apple orange”, is treated as a document here (I know they’re not real sentences), and each one is made up of individual words.

The Formula

The TF-IDF of a word in a document is found by multiplying two numbers: the TF (Term Frequency) and the IDF (Inverse Document Frequency).

Term Frequency (TF): how many times the word appears in a document (sentence)
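To make the TF definition concrete, here is a minimal sketch that counts a word in each of the example documents. The function name term_frequency is made up for illustration, and the sketch assumes words are separated by whitespace.

```python
from collections import Counter

documents = [
    "apple apple",
    "apple orange",
]

def term_frequency(word, document):
    """Return how many times `word` appears in `document` (a raw count)."""
    words = document.split()  # assumption: words are separated by whitespace
    return Counter(words)[word]  # Counter gives 0 for words that are absent

print(term_frequency("apple", documents[0]))   # 2
print(term_frequency("apple", documents[1]))   # 1
print(term_frequency("orange", documents[0]))  # 0
```

So "apple" has a TF of 2 in the first document and 1 in the second, while "orange" has a TF of 0 in the first document because it never appears there.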

Inverse Document Frequency (IDF): the natural logarithm of the total number of documents divided by the number of documents that contain the word + 1
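The "+ 1" in the IDF description can be read two ways: added after taking the logarithm, or added to the document count inside the division. The sketch below computes both readings on the example documents so you can compare them. All function names are made up for illustration, and which reading the formula intends is an assumption either way.

```python
import math

documents = [
    "apple apple",
    "apple orange",
]

def tf(word, document):
    # Raw count of the word in one document (assumes whitespace-separated words)
    return document.split().count(word)

def df(word, documents):
    # Number of documents that contain the word at least once
    return sum(1 for doc in documents if word in doc.split())

def idf_plus_one_outside(word, documents):
    # Reading A (assumption): ln(N / df) + 1
    # Raises ZeroDivisionError if the word appears in no document.
    return math.log(len(documents) / df(word, documents)) + 1

def idf_plus_one_inside(word, documents):
    # Reading B (assumption): ln(N / (df + 1))
    return math.log(len(documents) / (df(word, documents) + 1))

def tf_idf(word, document, documents, idf=idf_plus_one_outside):
    # TF-IDF is the product of the two numbers
    return tf(word, document) * idf(word, documents)

print(idf_plus_one_outside("apple", documents))   # 1.0
print(idf_plus_one_inside("orange", documents))   # 0.0
print(tf_idf("apple", documents[0], documents))   # 2.0
```

Under reading A, "apple" (which appears in every document) gets an IDF of exactly 1, while under reading B it comes out negative, ln(2/3). In both readings the rarer word "orange" gets a higher IDF than "apple", which is the point of the IDF term.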
