Count Vectorizer

Karan Arya · Published in NLP Gurukool · Dec 28, 2018

Count Vectorizer implements the bag-of-words model, commonly used in document classification: the frequency (count) of each word serves as a feature for training a classifier.

We create two text documents as follows:

text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

Word Tokenization

words1 = text1.split(" ")
words2 = text2.split(" ")
print(words1)

Output:

['I', 'love', 'my', 'cat', 'but', 'the', 'cat', 'sat', 'on', 'my', 'face']

Combining the Words into a Single Set

corpus = set(words1).union(set(words2))
print(corpus)

Output:

{'dog', 'but', 'love', 'face', 'I', 'my', 'bed', 'cat', 'sat', 'the', 'on'}
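Before handing things over to scikit-learn, the same idea can be sketched by hand: sort the combined vocabulary so every document shares one column order, then count each word per document. The `count_vector` helper below is illustrative and not part of the original post's code.

```python
from collections import Counter

text1 = "I love my cat but the cat sat on my face"
text2 = "I love my dog but the dog sat on my bed"

# Sort the union of both vocabularies so every document's vector
# uses the same, fixed column order
vocab = sorted(set(text1.split(" ")) | set(text2.split(" ")))

def count_vector(text, vocab):
    # Count word occurrences, then read the counts off in vocab order
    counts = Counter(text.split(" "))
    return [counts[word] for word in vocab]

print(vocab)
print(count_vector(text1, vocab))  # 'cat' appears twice, 'dog' not at all
print(count_vector(text2, vocab))  # 'dog' appears twice, 'cat' not at all
```

This is essentially what CountVectorizer automates, plus tokenization and a sparse-matrix representation.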

Count Vectorization

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
"I love my cat but the cat sat on my face",
"I love my dog but the dog sat on my bed"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is the current equivalent
feature_names = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.T.todense(), index = feature_names, columns = corpus)
print(df)

The counts of all the words, used as features, are exported to a data frame which can be utilized for further analysis.

print(X.T.toarray())

Output:

[[0 1]
[1 1]
[2 0]
[0 2]
[1 0]
[1 1]
[2 2]
[1 1]
[1 1]
[1 1]]

The same output is seen in array format. Note that the transpose of the array is taken, so each row corresponds to a word (feature) and each column to a document. Also note that the array has 10 rows rather than 11: CountVectorizer's default tokenizer only keeps tokens of two or more characters, so "I" is dropped from the vocabulary.

Before you leave,

If you enjoyed this post, please make sure to follow the NLP Gurukool page and visit the publication for more exciting tutorials and blogs on machine learning, data science and NLP.

Please get in touch if you would like to contribute to our publication.
