Bag of Words in Machine Learning

How machine learning models understand text

GreekDataGuy
The Startup


An overly simplified pipeline for training a model. We’re only covering converting text to vectors here.

Today we’ll cover the what and why of the bag of words approach.

Machine learning models require numerical data as input. We call these numerical representations “vectors”. So if you’re working with text, you’ll need to convert it into a vector before feeding it to a model.

Building vectors from text typically involves one of the following approaches:
1) bag of words
2) word embeddings

We’ll only cover #1 in this post.

Bag of words (BOW)

We call it a bag of words to emphasize that word order is not taken into account.

When building a BOW, each unique word in your set of training documents is assigned its own index (its own place) in a fixed-length array (list) covering the whole vocabulary.
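As a rough sketch of that index assignment (assuming we simply lowercase each document and split on whitespace, and using a hypothetical build_vocabulary helper):

```python
# A minimal sketch of assigning each unique word its own index.
# Assumes lowercasing and whitespace splitting as the tokenizer.
def build_vocabulary(documents):
    vocabulary = {}  # word -> index
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)  # next unused index
    return vocabulary

print(build_vocabulary(["The quick brown fox"]))
# {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3}
```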

So if the sentences in your training dataset are:
1) “The quick brown fox”
2) “Jumped over the lazy dog”
3) “The dog barked quickly”

The BOW vocabulary would look like this, with each unique word getting its own index:

the, quick, brown, fox, jumped, over, lazy, dog, barked, quickly

Note that each unique word is only listed once, and that we lost word order.
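
As a concrete sketch, here’s how you might build those vectors with scikit-learn’s CountVectorizer (assuming scikit-learn 1.0+ is installed; by default it lowercases the text and ignores word order, matching the description above):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "The quick brown fox",
    "Jumped over the lazy dog",
    "The dog barked quickly",
]

vectorizer = CountVectorizer()  # lowercases and tokenizes by default
vectors = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['barked' 'brown' 'dog' 'fox' 'jumped' 'lazy' 'over' 'quick' 'quickly' 'the']

print(vectors.toarray())
# [[0 1 0 1 0 0 0 1 0 1]
#  [0 0 1 0 1 1 1 0 0 1]
#  [1 0 1 0 0 0 0 0 1 1]]
```

Each row is the vector for one sentence and each column counts one vocabulary word, so “The dog barked quickly” becomes the last row: it contains “barked”, “dog”, “quickly”, and “the” once each, and nothing else.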
