Published in Analytics Vidhya

Quick Introduction to Bag-of-Words (BoW) and TF-IDF for Creating Features from Text

The Challenge of Making Machines Understand Text

Image credit: Siobhán Grayson / CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)

Let’s Take an Example to Understand Bag-of-Words (BoW) and TF-IDF

I’ll take a popular example to explain Bag-of-Words (BoW) and TF-IDF in this article.

Image source: https://www.piqsels.com/en/public-domain-photo-stmmn
  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

Creating Vectors from Text

Can you think of some techniques we could use to vectorize the sentences in our example? The basic requirements would be:

  1. It should not result in a sparse matrix, since sparse matrices result in high computation cost
  2. We should be able to retain most of the linguistic information present in the sentence

In this article, we will look at two such techniques:

  1. BoW, which stands for Bag of Words
  2. TF-IDF, which stands for Term Frequency-Inverse Document Frequency

Bag of Words (BoW) Model

The Bag of Words (BoW) model is the simplest form of word representation. As the name suggests, we represent a sentence as a bag-of-words vector: a vector of word counts that ignores grammar and word order.

  • Review 1: This movie is very scary and long
  • Review 2: This movie is not scary and is slow
  • Review 3: This movie is spooky and good

The vocabulary across the three reviews consists of 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’. Taking the word counts in this order, each review becomes a vector of length 11:

  • Vector of Review 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  • Vector of Review 2: [1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0]
  • Vector of Review 3: [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

Drawbacks of using a Bag-of-Words (BoW) Model

In the above example, we can have vectors of length 11. However, we start facing issues when we come across new sentences:

  1. If the new sentences contain new words, our vocabulary would grow, and thereby the length of our vectors would increase too
  2. Alternatively, if we keep the vector size fixed, we would have to ignore the unknown words in new sentences
  3. Additionally, the vectors would contain many 0s, resulting in a sparse matrix (which is what we would like to avoid)
  4. We retain no information about the grammar of the sentences or the ordering of the words in the original documents

Term Frequency-Inverse Document Frequency (TF-IDF)

Let’s first put a formal definition around TF-IDF. Here’s how Wikipedia puts it: TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Term Frequency (TF)

Term Frequency (TF) is a measure of how frequently a term, t, appears in a document, d:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

Let’s compute the TF values for Review 2:

  • Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’
  • Number of words in Review 2 = 8
  • TF for the word ‘this’ = (number of times ‘this’ appears in review 2)/(number of terms in review 2) = 1/8
  • TF(‘movie’) = 1/8
  • TF(‘is’) = 2/8 = 1/4
  • TF(‘very’) = 0/8 = 0
  • TF(‘scary’) = 1/8
  • TF(‘and’) = 1/8
  • TF(‘long’) = 0/8 = 0
  • TF(‘not’) = 1/8
  • TF(‘slow’) = 1/8
  • TF(‘spooky’) = 0/8 = 0
  • TF(‘good’) = 0/8 = 0
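As a quick check, the TF values above can be reproduced in a few lines of plain Python (a minimal sketch, no libraries needed):

```python
# Recompute TF for Review 2: TF(t) = count of t in the review / total terms.
review_2 = "This movie is not scary and is slow".lower().split()

def tf(term: str) -> float:
    return review_2.count(term.lower()) / len(review_2)

print(tf("this"))   # 1/8 = 0.125
print(tf("is"))     # 2/8 = 0.25
print(tf("very"))   # 0/8 = 0.0
```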

Inverse Document Frequency (IDF)

This is a measure of how important a term is across the whole corpus. We need the IDF value because computing the TF alone is not sufficient to understand the importance of words: common words like ‘is’ and ‘and’ appear frequently in every document. Using a base-10 logarithm here:

IDF(t) = log(number of documents / number of documents containing the term t)

  • IDF(‘movie’) = log(3/3) = 0
  • IDF(‘is’) = log(3/3) = 0
  • IDF(‘not’) = log(3/1) = log(3) = 0.48
  • IDF(‘scary’) = log(3/2) = 0.18
  • IDF(‘and’) = log(3/3) = 0
  • IDF(‘slow’) = log(3/1) = 0.48
We can now compute the TF-IDF score for each word by multiplying its TF and IDF values. For Review 2:

  • TF-IDF(‘movie’, Review 2) = 1/8 * 0 = 0
  • TF-IDF(‘is’, Review 2) = 1/4 * 0 = 0
  • TF-IDF(‘not’, Review 2) = 1/8 * 0.48 = 0.06
  • TF-IDF(‘scary’, Review 2) = 1/8 * 0.18 = 0.023
  • TF-IDF(‘and’, Review 2) = 1/8 * 0 = 0
  • TF-IDF(‘slow’, Review 2) = 1/8 * 0.48 = 0.06
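The IDF and TF-IDF numbers above use a base-10 logarithm; here is a minimal sketch reproducing them for our three reviews:

```python
import math

# The three reviews as lower-cased token lists.
docs = [
    "this movie is very scary and long".split(),
    "this movie is not scary and is slow".split(),
    "this movie is spooky and good".split(),
]

def idf(term: str) -> float:
    # log base 10 of (total documents / documents containing the term)
    n_docs_with_term = sum(term in d for d in docs)
    return math.log10(len(docs) / n_docs_with_term)

def tf_idf(term: str, doc: list) -> float:
    return (doc.count(term) / len(doc)) * idf(term)

print(round(idf("not"), 2))              # 0.48
print(round(idf("scary"), 2))            # 0.18
print(round(tf_idf("not", docs[1]), 2))  # 0.06
```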

End Notes

Let me summarize what we’ve covered in the article:

  1. Bag of Words simply creates a set of vectors containing the counts of vocabulary words in each document, while the TF-IDF model additionally weights each word by how informative it is, distinguishing the more important words from the less important ones.
  2. The Bag of Words model can be used for simpler tasks since it is easy to understand and interpret. For more complex tasks, however, TF-IDF is usually the better choice.
References

  1. https://en.wikipedia.org/wiki/Tf%E2%80%93idf
  2. https://maelfabien.github.io/machinelearning/NLP_2/#2-term-frequency-inverse-document-frequency-tf-idf
  3. https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

Purva Huilgol

Data Science Product Manager at Analytics Vidhya. Masters in Data Science from University of Mumbai. Research Interest: NLP