Understanding TF-IDF for Absolute Beginners
Term Frequency-Inverse Document Frequency (TF-IDF) measures how relevant a word is to the document that contains it, relative to the whole collection of documents.
Some Terms
- You can think of a "document" as a sentence
- You can think of a "collection of documents" as a list of sentences
Here’s a super simple example:
documents = [
"apple apple",
"apple orange"
]
The variable documents is the collection of documents: a list that contains two sentences. Each of the sentences “apple apple” and “apple orange” is treated as one document here (I know they’re not real sentences), and each document is made up of words.
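To make the document/collection distinction concrete, here is a minimal sketch that breaks each document into its words. It assumes plain whitespace splitting is good enough for this toy example, and the name tokenized is introduced here purely for illustration.

```python
# The same toy collection as in the example above.
documents = [
    "apple apple",
    "apple orange",
]

# Split each document on whitespace to get its list of words.
tokenized = [doc.split() for doc in documents]

for doc, words in zip(documents, tokenized):
    print(f"{doc!r} -> {words}")
```

Each entry of tokenized corresponds to one document, and the outer list corresponds to the collection.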
The Formula
The TF-IDF of a word in a document is found by multiplying two numbers: the TF (Term Frequency) and the IDF (Inverse Document Frequency).
Term Frequency (TF): how many times the word appears in a document (sentence)
Inverse Document Frequency (IDF): the natural logarithm of the total number of documents divided by the number of documents that contain the word + 1
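Here is a minimal sketch in plain Python that applies the formula to the two example documents. The function names (term_frequency, inverse_document_frequency, tf_idf) are made up for illustration, and whitespace splitting stands in for real tokenization. The IDF definition can be read two ways depending on where the "+ 1" goes. This sketch assumes ln(N / df) + 1, with the 1 added after taking the logarithm, which is the same form scikit-learn uses when smoothing is turned off. If ln(N / (df + 1)) is what's intended instead, only the return line of inverse_document_frequency changes.

```python
import math

documents = ["apple apple", "apple orange"]


def term_frequency(word, document):
    """Count how many times `word` appears in `document`."""
    return document.split().count(word)


def inverse_document_frequency(word, documents):
    """IDF under the assumed reading: ln(N / df) + 1."""
    n_docs = len(documents)
    # Number of documents that contain the word at least once.
    n_containing = sum(1 for doc in documents if word in doc.split())
    if n_containing == 0:
        raise ValueError(f"{word!r} does not appear in any document")
    return math.log(n_docs / n_containing) + 1


def tf_idf(word, document, documents):
    """TF-IDF is simply TF multiplied by IDF."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


for doc in documents:
    for word in ("apple", "orange"):
        print(f"{doc!r:16} {word:7} tf-idf = {tf_idf(word, doc, documents):.3f}")
```

Under this reading, "apple" gets an IDF of exactly 1 because it appears in every document, while "orange" gets ln(2) + 1 ≈ 1.693 because it appears in only one of the two. The rarer word is weighted more heavily.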