Text Preprocessing Part — 2

Sanjithkumar
5 min read · Sep 7, 2023


Photo by Max Chen on Unsplash

Text preprocessing is an integral part of Natural Language Processing, as no machine learning model can process textual data directly. In this part we will look in detail at the different text vectorization mechanisms the AI developer community has to offer. I would suggest reading my previous blog for a general, high-level overview of text preprocessing methods before the vectorization part, but it is not mandatory: you do not need it to understand this part of my text preprocessing series.

Text Preprocessing for NLP part — 1

Before starting with the actual vectorization, let us understand why we need to vectorize textual data at all, for readers who have no background in computer science or related fields. Computers simply cannot process text; they can only process numbers. So we have to take this into consideration and employ efficient ways to convert our text into numbers (generally into a vector).

  • Vectorization is a classic approach for converting input data from its raw format (i.e. text) into vectors of real numbers, the format that ML models support. The approach long predates NLP, has worked wonderfully across various domains, and is now standard in NLP.
  • In machine learning, vectorization is a feature extraction step. The idea is to get distinct features out of the text for the model to train on, by converting the text into numerical vectors.

There are many vectorization techniques; these are some important and widely used ones:

  1. Bag of Words / Count Vectorization
  2. Word Embeddings (Word2Vec, GloVe, FastText)
  3. TF-IDF (Term Frequency-Inverse Document Frequency)
  4. Trained Embeddings (Transformer-based Models, BERT Embeddings)

In this second part we will look at Bag of Words in depth; I will cover the remaining techniques in upcoming articles.

Bag of Words


Bag of Words is one of the simplest ways of vectorizing text for machine learning. It involves a simple mechanism, as follows:

fig-2. Count Vectorization

1. Take all unique words in a given text; each word corresponds to a token. For example, in “Hello I feel great” there are 4 tokens, one per word.

2. Given a large dataset of many sentences, or an entire paragraph, store all the unique words as tokens.

3. Convert each sentence into a vector representing the count of each unique token in that sentence. This can be hard to grasp at first, so let us make it easy with the following example (a short code sketch reproducing these counts appears after the example):

sentence 1: “Hey you look great and you did this”

sentence 2: “you are great”

sentence 3: “I am surprised that you did this”

Now let us parse the three given sentences:

i. Get all the unique tokens

[“Hey”, “you”, “look”, “great”, “and”, “did”, “this”, “are”, “I”, “am”, “surprised”, “that”]

ii. Count-vectorize each sentence according to the number of occurrences of each token

sentence 1: “Hey you look great and you did this” → [1,2,1,1,1,1,1,0,0,0,0,0]

This is because when you line the token vector [“Hey”, “you”, “look”, “great”, “and”, “did”, “this”, “are”, “I”, “am”, “surprised”, “that”] up against the sentence “Hey you look great and you did this”, the word “Hey” appears once, “you” appears twice, “look” appears once, and so on, until we reach “are” in the token vector: it does not appear in the sentence, so its count, and the count of every token after it, is 0.

similarly,

sentence 2: “you are great” → [0,1,0,1,0,0,0,1,0,0,0,0]

sentence 3: “I am surprised that you did this” → [0,1,0,0,0,1,1,0,1,1,1,1]
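
If you want to sanity-check these counts in code, here is a minimal pure-Python sketch; it assumes simple whitespace tokenization and uses the hand-built token order above (this ordering is ours, not a library's):

from collections import Counter

# the hand-built token list from step i
tokens = ["Hey", "you", "look", "great", "and", "did", "this",
          "are", "I", "am", "surprised", "that"]

sentences = ["Hey you look great and you did this",
             "you are great",
             "I am surprised that you did this"]

for sentence in sentences:
    counts = Counter(sentence.split())          # count each word in the sentence
    print([counts[token] for token in tokens])  # one slot per vocabulary token

This prints the three vectors shown above.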

I hope that after walking through the example you have a much clearer understanding.

If you still do not understand, have a look at fig-2, which shows a clear picture of how this process works.

Code Implementation of Count Vectorization:

Count Vectorization is a popular machine learning preprocessing technique for textual data that uses the Bag of Words concept as its base.

The scikit-learn library provides an excellent Count Vectorization mechanism.

from sklearn.feature_extraction.text import CountVectorizer # import the CountVectorizer class

vectorizer = CountVectorizer() # instantiate an object of the class

Once the above is done, define a list with a set of sentences and fit the vectorizer:

sentences = ["Hey you look great and you did this",
"you are great",
"I am surprised that you did this"]
vectorizer.fit(sentences)

Now that you have a fitted vectorizer, you can perform the transform operation to vectorize the text:

vec_sentences = vectorizer.transform(sentences) # count-vectorizes the text
print(vec_sentences.toarray()) # print out the vectorized text
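
As a side note, the fit and transform steps can be combined into a single call with scikit-learn's fit_transform:

vec_sentences = vectorizer.fit_transform(sentences) # fit the vocabulary and vectorize in one step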

Now you’ll get the following output:

[[0 1 0 1 1 1 1 0 0 1 2]
 [0 0 1 0 1 0 0 0 0 0 1]
 [1 0 0 1 0 0 0 1 1 1 1]]

Each row represents one of the sentences. Note that the columns do not follow the hand-built token order from our earlier example: CountVectorizer lowercases the text, sorts its vocabulary alphabetically, and its default token pattern drops single-character words such as “I”, so there are 11 columns instead of 12.
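
If you want to confirm the column order yourself, recent scikit-learn versions (1.0+) expose it via get_feature_names_out (older versions call it get_feature_names):

print(vectorizer.get_feature_names_out())
# ['am' 'and' 'are' 'did' 'great' 'hey' 'look' 'surprised' 'that' 'this' 'you']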

Here is the full code in one place:

from sklearn.feature_extraction.text import CountVectorizer # import the CountVectorizer class

vectorizer = CountVectorizer() # instantiate an object of the class
sentences = ["Hey you look great and you did this",
             "you are great",
             "I am surprised that you did this"]
vectorizer.fit(sentences)
vec_sentences = vectorizer.transform(sentences) # count-vectorizes the text
print(vec_sentences.toarray()) # print out the vectorized text

You can also get the token vocabulary by doing the following…

print(vectorizer.vocabulary_)

and you’ll see something like this as the output:

{'hey': 5, 'you': 10, 'look': 6, 'great': 4, 'and': 1, 'did': 3, 'this': 9, 'are': 2, 'am': 0, 'surprised': 7, 'that': 8}

The keys are lowercased, “I” is dropped by the default token pattern, and the values are the alphabetically assigned column indices, not counts.

Although we’ve used sklearn to build a Bag of Words model here, it can be implemented in a number of ways, with libraries like Keras, Gensim, and others. You can also write your own implementation of Bag of Words quite easily, as sketched below.
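
For instance, here is a minimal sketch of a do-it-yourself version, assuming whitespace tokenization and lowercasing (the class name SimpleCountVectorizer is mine and only loosely mimics sklearn's API):

class SimpleCountVectorizer:
    def fit(self, sentences):
        # build a sorted vocabulary of lowercased words, mirroring sklearn's alphabetical ordering
        words = {w for s in sentences for w in s.lower().split()}
        self.vocabulary_ = {w: i for i, w in enumerate(sorted(words))}
        return self

    def transform(self, sentences):
        vectors = []
        for s in sentences:
            vec = [0] * len(self.vocabulary_)
            for w in s.lower().split():
                if w in self.vocabulary_:  # unknown words are ignored
                    vec[self.vocabulary_[w]] += 1
            vectors.append(vec)
        return vectors

Note that, unlike sklearn's defaults, this keeps single-character words and does not strip punctuation.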

This is a simple yet effective text encoding technique that gets the job done in many situations.

Hope this was helpful…
