How to count occurrences of words using sklearn’s CountVectorizer

Crystal X · Published in Geek Culture · 5 min read · Aug 19, 2021

My last few posts have concerned natural language processing, or NLP. There are not many datasets that deal specifically with NLP, so I have not written a great deal about it, focusing my activities in other areas instead. Kaggle, the top data science website, has recently posted a competition concerning NLP, so I thought this would be a good time to delve into the subject. My most recent post on NLP can be found here:- https://medium.com/geekculture/different-ways-to-calculate-cosine-similarity-in-python-ae5bb28c372c

In this post I am going to broach the subject of sklearn’s CountVectorizer. CountVectorizer converts a collection of text documents into a matrix of token counts. The text documents, which are the raw data, are sequences of symbols that cannot be fed directly to machine learning algorithms, as most of them expect numerical feature vectors of a fixed size rather than text documents of variable length. In order to address this problem, sklearn provides utilities to tokenise, count and normalise text.

In this post, therefore, I will endeavour to focus on the counting mechanism of this process. In the counting part of the process, CountVectorizer counts the occurrences of each token in each document. Each individual occurrence frequency is treated…
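To see the per-document counting step in isolation, the learned `vocabulary_` mapping can be used to turn each row of the count matrix back into a word-frequency dictionary. This is a sketch with my own toy documents, not code from the original article:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data science is fun", "science is science"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# vocabulary_ maps each word to its column index in the matrix
for doc_idx, doc in enumerate(docs):
    row = counts[doc_idx].toarray().ravel()
    # keep only words that actually occur in this document
    freq = {word: int(row[col]) for word, col in vec.vocabulary_.items() if row[col]}
    print(doc, "->", freq)
```

For the second document this prints a count of 2 for "science" and 1 for "is", showing that CountVectorizer records raw occurrence frequencies per document rather than mere presence.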

