CountVectorizer to Extract Features from Texts in Python, in Detail
Everything you need to know to use CountVectorizer efficiently in Sklearn
The most basic data processing step in any Natural Language Processing (NLP) project is converting text data to numeric data. As long as the data is in text form, we cannot perform any computation on it.
There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer class in the scikit-learn library.
The method is simple: it uses the number of times each word occurs in the text as that word's numeric value. An example will make it clear.
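Before turning to scikit-learn, the counting idea itself can be sketched in plain Python. This is only an illustration, not how CountVectorizer is implemented: the sample sentence and the variable names are hypothetical, and the regular expression is chosen to resemble CountVectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical sample sentence, used only for illustration.
sample = "I love my aunt. My aunt loves me."

# Lowercase the text and keep tokens of two or more word characters,
# so the single-letter word "i" is dropped.
tokens = re.findall(r"\b\w\w+\b", sample.lower())

# Count how often each token occurs.
counts = Counter(tokens)

# Sort the vocabulary alphabetically so each word gets a fixed position.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['aunt', 'love', 'loves', 'me', 'my']
print(vector)      # [2, 1, 1, 1, 2]
```

Each position in `vector` holds the count of the word at the same position in `vocabulary`, which is the same word-to-column idea a count-based vectorizer relies on.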
In the following code block:
- We will import the CountVectorizer class.
- Create a CountVectorizer instance.
- Fit the instance to the text data and convert the result to an array.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
        "I am trying to learn how to use count vectorizer."]
# Create the vectorizer with its default settings
cv = CountVectorizer()
count_matrix = cv.fit_transform(text)…