An Introduction to Bag-of-Words in NLP

This post takes a deeper dive into Natural Language Processing. Before you move on, make sure you are clear on the basic NLP concepts I covered in my previous post, “A dive into Natural Language Processing”.

Let’s move on!

What is Bag-of-Words?

We need a way to represent text data for machine learning algorithms, and the bag-of-words model helps us do that. The model is simple to understand and implement. It extracts features from text by recording which words occur in a document and how often, ignoring grammar and word order.

In this approach, we tokenize each observation (document) and count how often each token occurs in it.
Let’s take an example to understand this concept in depth.

“It was the best of times”
“It was the worst of times”
“It was the age of wisdom”
“It was the age of foolishness”

We treat each sentence as a separate document and make a list of all the unique words across the four documents, ignoring case and punctuation. We get:

‘it’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’

The next step is to create vectors. A vector represents a piece of text as numbers that a machine learning algorithm can use.

We take the first document, “It was the best of times”, and count how many times each of the 10 unique words appears in it:
“it” = 1
“was” = 1
“the” = 1
“best” = 1
“of” = 1
“times” = 1
“worst” = 0
“age” = 0
“wisdom” = 0
“foolishness” = 0

Doing the same for every document gives the following vectors:
“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
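
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production tokenizer: the helper names `tokenize` and `to_count_vector` are made up for this example, and it assumes that lowercasing and stripping punctuation is enough preprocessing.

```python
import string

documents = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]

def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Build the vocabulary in order of first appearance across all documents.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

def to_count_vector(text):
    """Count how often each vocabulary word occurs in the given text."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

vectors = [to_count_vector(doc) for doc in documents]

for doc, vec in zip(documents, vectors):
    print(f"{doc!r} = {vec}")
```

Each vector has one position per vocabulary word, so every document ends up with the same length (10 here) regardless of how many words it contains.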

In this approach, each word or token is called a “gram”. The vocabulary above is made of single words (unigrams). Creating a vocabulary of pairs of adjacent words instead is called a bigram model.

For example, the bigrams in the first document, “It was the best of times”, are as follows:
“it was”
“was the”
“the best”
“best of”
“of times”
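
As a rough sketch, bigrams can be produced by pairing each token with the one that follows it. The function name `bigrams` is just an illustrative choice, and the example assumes simple whitespace tokenization.

```python
def bigrams(text):
    """Return every pair of adjacent tokens, each joined by a single space."""
    tokens = text.lower().split()
    # zip(tokens, tokens[1:]) pairs token i with token i + 1.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

first_document_bigrams = bigrams("It was the best of times")
print(first_document_bigrams)
# ['it was', 'was the', 'the best', 'best of', 'of times']
```

A document with n tokens yields n - 1 bigrams, which is why the six-word sentence above produces five pairs.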

The process of converting text into numbers is called vectorization. Two common ways to convert text into vectors are:

  • Counting the number of times each word appears in a document.
  • Calculating the frequency with which each word appears in a document, relative to the total number of words in that document.

CountVectorizer

scikit-learn's CountVectorizer works on term frequency, i.e. it counts the occurrences of each token and builds a sparse matrix of documents x tokens.

TF-IDF Vectorizer

TF-IDF stands for term frequency-inverse document frequency. The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by how frequently the word appears across the corpus.

  • Term Frequency (TF): a score of how often the word appears in the current document. Since documents differ in length, a term is likely to appear many more times in long documents than in short ones, so the term frequency is often divided by the document length to normalize it.
  • Inverse Document Frequency (IDF): a score of how rare the word is across documents. The rarer the term, the higher its IDF score.
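
To make the two scores concrete, here is a minimal sketch in plain Python. It assumes one common textbook variant, with TF as count divided by document length and IDF as the natural log of (number of documents / number of documents containing the term). The function names are illustrative, and libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math

documents = [
    "It was the best of times",
    "It was the worst of times",
    "It was the age of wisdom",
    "It was the age of foolishness",
]
tokenized = [doc.lower().split() for doc in documents]

def term_frequency(term, tokens):
    """Occurrences of the term, normalized by the document length."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(N / df). Assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# "the" appears in every document, so its IDF (and its TF-IDF) is zero.
score_the = tf_idf("the", tokenized[0], tokenized)
# "best" appears in only one document, so it gets a positive score.
score_best = tf_idf("best", tokenized[0], tokenized)

print(f"the:  {score_the:.3f}")
print(f"best: {score_best:.3f}")
```

Words shared by every document carry no weight under this scheme, while a word unique to one document stands out, which is the behaviour the two definitions describe.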

Thus,