TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

CountVectorizer to Extract Features from Texts in Python, in Detail

Everything you need to know to use CountVectorizer efficiently in Sklearn

7 min read · Oct 21, 2023

The most basic processing step in any Natural Language Processing (NLP) project is converting text data to numeric data. As long as the data is in text form, we cannot perform any computation on it.

There are multiple methods available for this text-to-numeric conversion. This tutorial explains one of the most basic vectorizers, the CountVectorizer class in the scikit-learn library.

This approach is very simple: it uses the number of times each word occurs in the text as the numeric value. An example will make it clear.

In the following code block, we will:

  • Import the CountVectorizer class.
  • Create a CountVectorizer instance.
  • Fit the instance to the text data and convert the result to an array.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# This is the text to be vectorized
text = [
    "Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt. "
    "I am trying to learn how to use count vectorizer."
]

# Create the vectorizer with its default settings
cv = CountVectorizer()
# Learn the vocabulary and count each word in one step
count_matrix = cv.fit_transform(text)…

Written by Rashida Nasrin Sucky

MS in Applied Data Analytics from Boston University. Read my blog: https://regenerativetoday.com/