Bag of Words — The easiest explanation of NLP using python

Rohit Madan
Published in Analytics Vidhya
4 min read · Dec 4, 2019

Today I am going to explain the Bag of Words technique to you.

If you’re here, you probably know why we use it, but if you don’t, I’ll show you with an example.

What does Bag of Words do? Here’s an example.

Go to your Gmail, open the priority inbox and watch Google work its magic by categorizing all of your emails into Important, Social, Spam and so on.

Remember now?

How does Google know some mails are important to you while others aren’t?

Although many factors come into play, one prime factor is that a machine reads your mails, understands what’s important to you, and then surfaces it for you. Voila.

The catch is, a machine does not understand English; it only understands numbers. So what it does is break each of your documents into words, something like this -

Email

Hi Mr Madan,

Congratulations, I loved your article as it was able to explain to me what bag of words does in a simple manner.

How the machine breaks it

Hi, Mr, Madan, Congratulations, I, loved, your, article, …………, in, a, simple, manner.

This list of words is then -

  1. Cleaned or preprocessed — Remove all unnecessary special characters; if there are words with accents from other languages like Polish, German, Spanish etc., remove or replace them, or add the right Unicode so the machine can read them.
  2. Normalized — Using the .lower() function, convert any capitalized words in the data to lowercase.
  3. Lemmatized and stemmed — Reduce all derived words to their root word, i.e. baked, baking and baker are all built on bake, so map all such words back to the root. Also remove all stop words, i.e. words that add no meaning or dimension to the features, such as a, the, etc. (a small code sketch of these steps follows this list).
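To make steps 1 to 3 concrete, here is a minimal sketch using NLTK (the article’s own code further down uses sklearn instead); the preprocess helper, its regex and the sample sentence are my own illustrative choices:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads on first run:
# import nltk; nltk.download("stopwords"); nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # 1. Clean: keep only letters and spaces
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # 2. Normalize: lowercase everything
    words = text.lower().split()
    # 3. Lemmatize and drop stop words (WordNet defaults to nouns,
    #    so verbs like "loved" are left as-is unless you pass a POS tag)
    return [lemmatizer.lemmatize(w) for w in words if w not in stop_words]

print(preprocess("Congratulations, I loved your article!"))
# ['congratulation', 'loved', 'article']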

Next, once we have all the words, we tokenise them, i.e. count how many times each word repeats in our document. Example — This cat is a fat cat that is cute >> This, cat, is, a, fat, cat, that, is, cute >>> This — 1, cat — 2, is — 2, a — 1, fat — 1, that — 1, cute — 1

or

This — 1
cat — 2
is — 2
a — 1
fat — 1
that — 1
cute — 1

This process is called tokenization.
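This counting is easy to reproduce yourself; here is a minimal sketch (my own, not the article’s code) using Python’s collections.Counter on the same sentence:

from collections import Counter

sentence = "This cat is a fat cat that is cute"
tokens = sentence.split()                      # ['This', 'cat', 'is', ...]
counts = Counter(w.lower() for w in tokens)    # count each lowercased word

print(counts)
# Counter({'cat': 2, 'is': 2, 'this': 1, 'a': 1, 'fat': 1, 'that': 1, 'cute': 1})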

Before we look at what to do next, I will tell you why we are doing all this.

The goal is to find features (words, in this case) in a document that either help us extract some meaning from the document or help in comparing it with similar and dissimilar documents.

Breaking Document 1 down into tokenized words helps us compare it against the tokenized words of Document 2, and hence helps in finding similar documents.

Now back to bag of words.

After tokenization, we move on to building a vocabulary, or finding the features of the documents.

vocab = All final features after cleaning, removing stop words etc.

So if a corpus has 3 documents and each document has 7 words, the vocab is the best choice of words across those documents, say 10 words out of the 21 words in our case.

Vocab count = count of all unique features or words, which is 10 for us (say).
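In code, building the vocab just means collecting the unique words that survive cleaning and stop-word removal. Here is a minimal sketch with two made-up documents and a tiny illustrative stop-word list (both my own, to keep the example short):

documents = [
    "This cat is a fat cat that is cute",
    "That dog is a lazy dog",
]

stop_words = {"this", "that", "is", "a"}   # tiny illustrative stop-word list

def tokens(doc):
    # lowercase and drop stop words
    return [w.lower() for w in doc.split() if w.lower() not in stop_words]

# vocab = all unique surviving words across the documents
vocab = sorted({w for doc in documents for w in tokens(doc)})
print(vocab)        # ['cat', 'cute', 'dog', 'fat', 'lazy']
print(len(vocab))   # vocab count = 5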

After we find the vocab, we convert the words we finalized into vectors. How, you ask?

Say our vocab for “This cat is a fat cat that is cute” is

cat — 2
is — 2
fat — 1
cute — 1

vocab count = 4

So vector of document “This cat is a fat cat that is cute” is

[0 2 2 0 1 0 0 0 1]

This vector is nothing but the representation you get when you compare the final vocab against the words of the document, i.e. upon comparing

This — 1, cat — 2, is — 2, a — 1, fat — 1, that — 1, cute — 1

with cat — 2, is — 2, fat — 1, cute — 1

We get [0 2 2 0 1 0 0 0 1]

This process is called vectorization.
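Another common way to write that vector, and the one sklearn’s CountVectorizer uses below, is to give it exactly one slot per vocab word and fill each slot with that word’s count. A minimal sketch of that convention (my own code, not the article’s longer representation above):

vocab = ["cat", "is", "fat", "cute"]             # the 4 vocab words from above
sentence = "This cat is a fat cat that is cute"

tokens = [w.lower() for w in sentence.split()]
vector = [tokens.count(word) for word in vocab]  # one count per vocab word

print(vector)   # [2, 2, 1, 1]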

Tada, these are all the concepts of Bag of Words. So now, like I promised, I am sharing my code with you, which is based upon sklearn; if you want to see how each step works in code, check this out.

#Part 1 — Declaring all documents
document1 = "This is my code on bag of words."
document2 = "Bag of words is a NLP technique."
document3 = "I will explain it to you in a simple way"

#Part 2 — Importing libraries and initializing CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

#Making a list (corpus) from all documents
Doc = [document1, document2, document3]

#Initializing CountVectorizer from sklearn, dropping English stop words
vectorizer = CountVectorizer(stop_words='english')

#Part 3 — Getting the feature names (final words) that we will use to tag these documents

X = vectorizer.fit_transform(Doc)

#The analyzer shows how each document is tokenized after preprocessing
analyze = vectorizer.build_analyzer()
analyze(document1)
analyze(document2)
analyze(document3)

#transform expects a list of documents, so wrap each single document in a list
vectorizer.transform([document1]).toarray()
vectorizer.transform([document2]).toarray()
vectorizer.transform([document3]).toarray()

print(vectorizer.get_feature_names())
#Note: on newer sklearn (>=1.2) use vectorizer.get_feature_names_out() instead

Output>>> ['bag', 'code', 'explain', 'nlp', 'simple', 'technique', 'way', 'words']

#Part 4 — Vectorizing or creating a matrix of all three documents

print(X.toarray())

Output>>> [[1 1 0 0 0 0 0 1]
           [1 0 0 1 0 1 0 1]
           [0 0 1 0 1 0 1 0]]

Or

Go check out my GitHub here > Check Bag of words code.

I hope this was simple and helps you understand the concept. If you have feedback that can help me improve the content or code, write to me at rohitmadan16@gmail.com

Peace.
