NLP-Stop Words And Count Vectorizer

Kamrahimanshu
3 min read · Mar 7, 2020


Social media generates a huge amount of text data on a daily basis, and a lot of useful analysis can be done on that data.

This article is aimed at beginners and explains how to remove stop words and convert sentences into vectors using the simplest technique, Count Vectorizer. Our aim is to convert sentences into vectors that can later be used as input to different models.

Sample Data Set

data = [
    'There is someone at the door.',
    'The crocodiles snapped at the boat.',
    'Data is the new oil.',
    'He is running towards the ball.',
]

data is a list of a few sentences that we will work with.

The first step is the removal of stop words. Stop words are words that occur frequently and don't provide any useful information.

We will use NLTK to remove stop words. Below is the code to remove stop words from data.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stop word list if it is not already present
stop_words = set(stopwords.words('english'))

corpus = []
for sentence in data:
    words = sentence.split(" ")
    # keep words that are not stop words, lowercase them and strip the trailing period
    filtered_words = [word.lower().strip('.') for word in words if word not in stop_words]
    corpus.append(" ".join(filtered_words))
corpus

In the first few lines we import the packages, download the stop word list, and select the English stop words. We then create an empty list corpus in which we will store the filtered sentences. Iterating over data, we split each sentence on spaces, keep only the words that are not stop words, lowercase them and strip the trailing punctuation, and finally join the filtered words back into a string and append it to corpus.

['there someone door',
'the crocodiles snapped boat',
'data new oil',
'he running towards ball']

Above is the output obtained after removing stop words.

Now we need to convert these sentences into numbers, as machines can only work with numbers.

There are many techniques to convert words into numbers, but in this article I am explaining the most basic one, Count Vectorizer, just to build a basic understanding.

Count Vectorizer

Step 1: Find all the unique words in the data and build a dictionary that gives each unique word a number. In our case the number of unique words is 14 and the dictionary is

unique_words = {'there': 12, 'someone': 10, 'door': 4, 'the': 11, 'crocodiles': 2,
                'snapped': 9, 'boat': 1, 'data': 3, 'new': 6, 'oil': 7, 'he': 5,
                'running': 8, 'towards': 13, 'ball': 0}
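Below is a minimal sketch of this step in plain Python (assuming the corpus list built above). Collecting the unique words and numbering them in alphabetical order reproduces the dictionary shown.

# Step 1 by hand: map every unique word in the corpus to an index.
unique_words = {word: index for index, word in enumerate(
    sorted({word for sentence in corpus for word in sentence.split()}))}
unique_words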

Step 2: For each sentence we create an array of zeros whose length equals the number of entries in unique_words, i.e. 14. For sentence one it will be

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Step 3: Now we go through each sentence, pick each word, count the number of times it appears in the sentence, and set the value at that word's index in the array to the count. For sentence one, "there someone door", the first word is "there" and it appears only once, so its count is 1. From the dictionary we can see that the index of "there" is 12, so we set the value at index 12 to 1. After doing this for every word, the array for sentence one will be

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

Doing the same for all the sentences, the output is

[[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
[0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]]
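Steps 2 and 3 can be written out as a short loop (a sketch assuming the corpus and unique_words defined above); it produces exactly the matrix shown.

# Steps 2 and 3 by hand: one row of counts per sentence in the corpus.
vectors = []
for sentence in corpus:
    counts = [0] * len(unique_words)      # Step 2: start from an array of zeros
    for word in sentence.split():
        counts[unique_words[word]] += 1   # Step 3: add the word's count at its index
    vectors.append(counts)
vectors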

Finally, we have successfully converted text into numbers. There is a Python library that does all of this for us. Below is the code to convert the list of sentences directly into vectors.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus_vector = vectorizer.fit_transform(corpus)  # learn the vocabulary and count the words
corpus_vector = corpus_vector.toarray()           # convert the sparse matrix to a dense array
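To confirm that the library agrees with the manual version, we can inspect the vocabulary it learned (CountVectorizer stores it in the vocabulary_ attribute) and the resulting array:

print(vectorizer.vocabulary_)  # same word-to-index mapping as unique_words above
print(corpus_vector)           # same matrix of counts as the manual version

Because CountVectorizer lowercases the text and orders its vocabulary alphabetically, the learned mapping and the counts line up with the manual arrays above.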

Summary

This article explained stop word removal and how to convert text data into numbers that can be used as input to our models.
