Bag of Words - NLP

Deepak Rawat
Jun 1, 2022 · 4 min read


A quick guide to Bag of Words, an NLP technique used to vectorize text data.

Why do we need to convert text to vectors?
As mentioned in the previous article, in Data Science/Machine Learning, given any problem, if we can convert the data into its vector form, we can leverage Linear Algebra techniques to process the data and extract insights.

For example, take the simple case of classifying a review (text data) as positive or negative. If we can convert each review into a d-dimensional vector, we can plot those vectors/points in coordinate space. And once all the points are in coordinate space, we can find a hyperplane that separates the points belonging to the two classes (positive/negative).
A natural follow-up question is: how do we decide whether two points belong to the same class, or whether two points are similar?

There are various ways to calculate the similarity between two points; we will focus on a simple measure: distance.
In simpler words, two points are said to be “more” similar if the distance between them is “shorter” compared to that between another pair of points.

Example: r1, r2 and r3 are three vectors which may denote any data.
If similarity(r1, r2) > similarity(r1, r3), then distance(r1, r2) < distance(r1, r3).
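
To make this concrete, here is a minimal sketch using Euclidean distance (the vectors below are made-up placeholders, not the review vectors built later in this article):

```python
import numpy as np

def euclidean_distance(a, b):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

# Hypothetical 3-dimensional vectors (placeholders for any data)
r1 = [1.0, 0.0, 2.0]
r2 = [1.0, 1.0, 2.0]  # close to r1
r3 = [5.0, 4.0, 0.0]  # far from r1

# Smaller distance => more similar
print(euclidean_distance(r1, r2))  # 1.0
print(euclidean_distance(r1, r3))  # 6.0
```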

One of the common techniques in NLP for converting text to vectors is Bag of Words (BOW).
Bag of Words works on the principle that the more semantically similar two documents are, the closer their corresponding vectors will be in d-dimensional space (read on to see this in action).

We will consider a few reviews to explain BOW and how the BOW matrix is created.

review1 (r1): this coffee is very bad and is expensive
review2 (r2): this coffee is not bad and is cheap
review3 (r3): this coffee is not amazing and is not affordable

Let’s clarify a few more terms before moving further:
Document: a single text item. In this case, each review r1, r2 and r3 is a document.
Corpus: the collection of all the documents.
Dictionary: the set of all unique words across all the documents.
Sparse Vector: a vector in which most of the dimensions have a value of 0.
Dense Vector: the opposite of sparse, having more non-zero values than zeros.

Steps in BOW:
Step 1) Construct a dictionary with all the unique words across all the documents.
The dictionary constructed out of the above three reviews r1, r2 and r3 will be:

{this, coffee, is, very, bad, and, expensive, not, cheap, amazing, affordable}
Step 2) Construct a vector for every document by counting the number of times each word from the dictionary occurs in that document.
Keeping the dictionary order above (this, coffee, is, very, bad, and, expensive, not, cheap, amazing, affordable), the resulting vectors are:

r1 → (1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 0)
r2 → (1, 1, 2, 0, 1, 1, 0, 1, 1, 0, 0)
r3 → (1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 1)

For a real corpus with a large dictionary, the resulting Bag of Words matrix will be sparse, since each document contains only a small fraction of the dictionary’s words.
We can use this sparse matrix either to calculate various types of distances between the vectors/points or for other feature engineering techniques.

Implementing BOW in Python:
A BOW matrix can be created from a text corpus using scikit-learn’s CountVectorizer class.

Step 1) Import the packages and define the corpus.
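
A minimal sketch of this step (using the three reviews above as the corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "this coffee is very bad and is expensive",          # r1
    "this coffee is not bad and is cheap",               # r2
    "this coffee is not amazing and is not affordable",  # r3
]
```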

Step 2) Instantiate CountVectorizer(), fit it on the corpus, and check the learned dictionary using vectorizer.vocabulary_.
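
Continuing the sketch; note that vocabulary_ maps each word to its column index, and CountVectorizer assigns indices in alphabetical order:

```python
# Fit the vectorizer to learn the dictionary from the corpus
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

print(vectorizer.vocabulary_)
# {'this': 9, 'coffee': 5, 'is': 7, 'very': 10, 'bad': 3, 'and': 2,
#  'expensive': 6, 'not': 8, 'cheap': 4, 'amazing': 1, 'affordable': 0}
```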

Step 3) Transform all the documents in the corpus and check the resulting array.
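
And the final step of the sketch:

```python
# Transform the corpus into the BOW matrix (stored as a sparse matrix)
bow = vectorizer.transform(corpus)

print(bow.toarray())
# [[0 0 1 1 0 1 1 2 0 1 1]
#  [0 0 1 1 1 1 0 2 1 1 0]
#  [1 1 1 0 0 1 0 2 2 1 0]]
```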

You will see that the output array from CountVectorizer() matches the one calculated manually (with the columns reordered alphabetically by word).

GitHub link for the code

Disadvantage of Bag of Words:
One of the disadvantages of BOW is that, even after vectorization, the distance between two vectors can be large while their semantic meanings remain similar (and vice versa).
Example:
distance(r1, r2) = sqrt(4) = 2
distance(r1, r3) = sqrt(9) = 3
distance(r2, r3) = sqrt(5) ≈ 2.24
Although r1 and r3 are the farthest apart of the three pairs, their semantic meanings are quite similar (both are negative reviews).
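
To double-check these numbers, here is a short sketch using NumPy on the vectors from the BOW matrix (written in scikit-learn’s alphabetical column order; the distances do not depend on column order):

```python
import numpy as np

# Columns: affordable, amazing, and, bad, cheap, coffee,
#          expensive, is, not, this, very
r1 = np.array([0, 0, 1, 1, 0, 1, 1, 2, 0, 1, 1])
r2 = np.array([0, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0])
r3 = np.array([1, 1, 1, 0, 0, 1, 0, 2, 2, 1, 0])

print(np.linalg.norm(r1 - r2))  # 2.0
print(np.linalg.norm(r1 - r3))  # 3.0
print(np.linalg.norm(r2 - r3))  # ~2.24
```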

Variations of Bag of Words:
1) Count BOW: the approach that we covered above is called the Count Bag of Words, since we count the occurrences of each dictionary word in each document.

2) Binary BOW: in Binary BOW, instead of counting the occurrences of a word, we simply record whether the word occurs in the document: 1 if it does, 0 otherwise.
Choosing one approach over the other is problem specific; a sketch of Binary BOW follows below.
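
For Binary BOW, scikit-learn’s CountVectorizer exposes a binary parameter; a minimal sketch, reusing the corpus from above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True caps every non-zero count at 1
binary_vectorizer = CountVectorizer(binary=True)
binary_bow = binary_vectorizer.fit_transform(corpus)

print(binary_bow.toarray())
# [[0 0 1 1 0 1 1 1 0 1 1]
#  [0 0 1 1 1 1 0 1 1 1 0]
#  [1 1 1 0 0 1 0 1 1 1 0]]
```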

Read more about another important text vectorization technique, called TF-IDF, in the next article.

*Note: text pre-processing techniques were not applied before creating the BOW matrix here. I plan to explain various text pre-processing techniques in a separate article.
