Understanding Count Vectorizer

Yashika Sharma
Published in The Startup · 6 min read · May 21, 2020

Whenever we work on an NLP-related problem, we process a lot of textual data, and after processing, that data needs to be fed into a model.

Since a model doesn’t accept raw text and only understands numbers, this data needs to be vectorized.


What do I mean by vectorized?

Before we use text for modeling we need to process it. The steps include stop-word removal, lemmatization, stemming, tokenization, and vectorization. Vectorization is the process of converting text data into a machine-readable form, in which words are represented as vectors.

However, our main focus in this article is on Count Vectorizer. Let’s get started by understanding the Bag of Words model:

Bag of Words(BoW)


As already mentioned, we cannot process text directly, so we need to convert it into numbers. The Bag of Words (BoW) model is a fundamental (and old) way of doing this.

The model is very simple: it discards all the grammar and word order of the text and only considers word occurrences. It converts each document into a fixed-length vector of numbers.

Each unique word is assigned a position in the vector, whose length equals the size of the vocabulary (the collection of all unique words), and the entry at that position holds the word’s frequency. This is the encoding of the words, in which we focus on the representation of a word rather than its order.
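To make this concrete, here is a minimal, library-free sketch of Bag of Words counting in plain Python; the two toy documents are made up purely for illustration:

from collections import Counter

# two toy documents (hypothetical, not part of the dataset used later)
docs = ["the sun is shining", "the day is a sunny day"]

# the vocabulary is the set of all unique words across the documents
vocab = sorted(set(word for doc in docs for word in doc.split()))

# each document becomes a fixed-length vector of word counts over that vocabulary
for doc in docs:
    counts = Counter(doc.split())
    print([counts[word] for word in vocab])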

There are multiple ways to define what this ‘encoding’ should be. Our focus in this post is on Count Vectorizer.

Count Vectorizer:

CountVectorizer tokenizes the text (tokenization means dividing sentences into words) and performs very basic preprocessing: it removes punctuation marks and converts all words to lowercase.

A vocabulary of known words is formed, which is also used to encode unseen text later.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document. The image below shows what I mean by the encoded vector.

Count Vectorizer sparse matrix representation of words. (a) is how you visually think about it. (b) is how it is really represented in practice.

Each row of the above matrix represents a document, and each column corresponds to a unique word, holding that word’s frequency in the document. If a word does not occur in a document, the corresponding entry is zero.

Think of each row as an extended one-hot style vector that stores counts; since most documents use only a small fraction of the vocabulary, you naturally end up with a sparse matrix containing a lot of zeros.
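To illustrate the two views from the figure, here is a minimal sketch using scikit-learn’s CountVectorizer (introduced properly below); the two toy documents are made up, and get_feature_names_out assumes a reasonably recent scikit-learn version (older versions expose get_feature_names instead):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["sunny day today", "sunny sunny weather"]

cv = CountVectorizer()
X = cv.fit_transform(toy_docs)     # X is a SciPy sparse matrix

print(cv.get_feature_names_out())  # the vocabulary, i.e. the column order
print(X.toarray())                 # (a) the dense document-term matrix
print(X)                           # (b) the sparse (row, column) -> count view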

The scikit-learn library provides an implementation of Count Vectorizer, so let’s check out the code examples.

Examples

In the code block below we have a list of texts, where each entry is a document. We are keeping it short to see how Count Vectorizer works.

First things first, let’s do the import and define document, the list of documents we are going to process:

from sklearn.feature_extraction.text import CountVectorizer

document = ["devastating social and economic consequences of COVID-19",
            "investment and initiatives already ongoing around the world to expedite deployment of innovative COVID-19",
            "We commit to the shared aim of equitable global access to innovative tools for COVID-19 for all",
            "We ask the global community and political leaders to support this landmark collaboration, and for donors",
            "In the fight against COVID-19, no one should be left behind"]

The second step is to initialize a CountVectorizer object cv_doc and fit it on our document:

cv_doc = CountVectorizer()
vocab = cv_doc.fit(document)

The text has been preprocessed and tokenized (word-level tokenization: each word is a separate token), and it can now be represented as a sparse matrix. The best part is that single characters like ‘I’ and ‘a’ are ignored during tokenization.

To see the complete vocabulary we can print vocab.vocabulary_.

Note that the numbers here are not counts; they are the positions of the words (column indices) in the sparse matrix.
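As a quick sketch, we can inspect the vocabulary and then transform the documents into the count matrix (the exact indices and counts depend on the fitted vocabulary, so the comments are illustrative only):

# dict mapping each word to its column index in the matrix
print(vocab.vocabulary_)

# transform the documents into the document-term count matrix
X = cv_doc.transform(document)
print(X.shape)          # (number of documents, vocabulary size)
print(X.toarray()[0])   # word counts for the first document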

Further, there are some additional parameters you can play with.

1. Stop words: You can pass a stop_words list as an argument. Stop words are words that are not significant and occur frequently. For example, ‘the’, ‘and’, ‘is’, and ‘in’ are stop words. The list can be custom as well as predefined.

Define your own list of stop words that you don’t want to see in your vocabulary.

cv1 = CountVectorizer(stop_words=['the', 'we', 'should', 'this', 'to'])
cv1.fit(document)

# check out the stop_words you specified
cv1.stop_words
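A quick sketch to confirm the effect, assuming cv1 has been fitted as above: none of the listed stop words make it into the learned vocabulary.

# should print an empty list, since the stop words are excluded
print([w for w in ['the', 'we', 'should', 'this', 'to'] if w in cv1.vocabulary_])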

2. min_df: Setting min_df to a number specifies how much importance you want to give to the less frequent words in the documents. There might be some words that appear only once or twice and may qualify as noise.

What does min_df do?

With min_df=2, only words that are present in at least 2 documents are kept. We can also pass a proportion instead of an absolute number.

For example, min_df=0.25 ignores words that are present in less than 25% of the documents.

cv2 = CountVectorizer(min_df=2)
cv2.fit(document)
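As a sketch, we can list which words survive the min_df=2 filter (the exact result depends on the documents above):

# only words that occur in at least two of the five documents remain
print(sorted(cv2.vocabulary_.keys()))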

3. max_df: Similar to min_df, there is max_df, which indicates the importance you want to give to the most frequent words. There might be some words that are very frequent and that you don’t want to include in your vocab; in that case, max_df is used.

It is the opposite of min_df: words are kept only if they appear in at most the specified number (or proportion) of documents.

Let’s test the proportion instead of the absolute number here. If words are present in more than 25% of the documents, they are ignored.

cv3 = CountVectorizer(max_df=0.25)
cv3.fit(document)
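With five documents, max_df=0.25 corresponds to a cap of 1.25 documents, so any word that appears in more than one document is dropped. A quick sketch, assuming cv3 has been fitted as above:

# only words unique to a single document remain in the vocabulary
print(sorted(cv3.vocabulary_.keys()))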

4. Tokenizer: If you want to specify a custom tokenizer, you can create a function and pass it to Count Vectorizer during initialization. We have used the NLTK library to tokenize our text.

from nltk.tokenize import word_tokenize

def tok(text):
    return word_tokenize(text)   # word-level tokenization with NLTK

cv4 = CountVectorizer(tokenizer=tok)

5. Custom Preprocessing: The same goes for preprocessing. If you want to include a stemmer and lemmatizer for preprocessing the text, you can define a custom function just like we did for the tokenizer. Although our data is clean in this post, real-world data is very messy, and if you want to clean it along with Count Vectorizer, you can pass your custom preprocessor as an argument to Count Vectorizer. Keeping the example simple, we just lowercase the text and remove special characters.

import re

def preprocess(text):
    return re.sub(r'[^a-z0-9\s]', '', text.lower())   # lowercase, then strip special characters

cv5 = CountVectorizer(preprocessor=preprocess)

6. n-grams: Combinations of words are sometimes more meaningful. Let’s say we have the words ‘sunny’ and ‘day’; combined, ‘sunny day’ makes more sense. This is a bigram. We can use character-level as well as word-level n-grams. ngram_range=(1,2) specifies that we want to consider both unigrams (single words) and bigrams (combinations of 2 words).

cv6 = CountVectorizer(ngram_range=(1, 2))
cv6.fit(document)
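As a sketch, the learned vocabulary now contains two-word phrases alongside the single words (assuming cv6 has been fitted as above):

# print a few of the learned bigrams, e.g. 'economic consequences'
print([term for term in cv6.vocabulary_ if ' ' in term][:5])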

7. Limiting Vocabulary size: We can specify the maximum vocabulary size we intend to keep using max_features, which retains only the most frequent terms. In this example we are going to limit the vocabulary size to 20.

cv7 = CountVectorizer(max_features=20)
cv7.fit(document)
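A quick check, assuming cv7 has been fitted as above, confirms the cap:

# the vocabulary is limited to the 20 most frequent terms
print(len(cv7.vocabulary_))   # 20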

Phew! That’s all for now. CountVectorizer is just one of many methods to deal with textual data. TF-IDF and embeddings are better methods to vectorize the data. More on that later.

To access the code used in this article, check out the repository here.

