Bag of Words in Machine Learning
How machine learning models understand text
Today we’ll cover the what and why of bags of words.
Machine learning models require numerical data as input. We call these numerical representations “vectors”. So if you’re working with text, you’ll need to convert it into a vector before feeding it to a model.
Building vectors from text typically involves one of two approaches:
1) bag of words
2) word embeddings
We’ll only cover #1 in this post.
Bag of words (BOW)
We call it a bag of words to emphasize the fact that the order of words is not taken into account.
When building a BOW, each unique word across your training documents is assigned a unique index (its own place) in a fixed-length array (list) of all unique words.
So if the sentences in your training dataset are:
1) “The quick brown fox”
2) “Jumped over the lazy dog”
3) “The dog barked quickly”
The BOW vocabulary (lowercasing each word) would look like:

“the”, “quick”, “brown”, “fox”, “jumped”, “over”, “lazy”, “dog”, “barked”, “quickly”
Note that each unique word is only listed once, and that we lost word order.
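To make this concrete, here’s a minimal sketch in plain Python (no libraries) that builds a vocabulary from the three example sentences and turns a new sentence into a count vector. The function name `vectorize` and the lowercase/whitespace tokenization are illustrative choices, not part of any particular library:

```python
# The three training sentences from the example above.
sentences = [
    "The quick brown fox",
    "Jumped over the lazy dog",
    "The dog barked quickly",
]

# Build the vocabulary: each unique (lowercased) word gets its own index.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)

def vectorize(sentence):
    """Count how many times each vocabulary word appears in the sentence."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

print(vocab)                      # 10 unique words, each with its own index
print(vectorize("The quick dog"))
```

Notice that “quick” and “quickly” get separate slots (they’re different strings), and that shuffling the words in a sentence produces the exact same vector, which is precisely the word-order information we lose.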