How sklearn’s CountVectorizer and TfidfTransformer compare with TfidfVectorizer

Tracyrenee
Published in Geek Culture
5 min read · Aug 20, 2021


In my most recent post I discussed sklearn’s CountVectorizer and how it is used, which is basically counting the occurrence of words in a corpus. In earlier posts I discussed TfidfVectorizer, which vectorises a corpus and prepares it to be input into an estimator. What some people don’t realise is that TfidfVectorizer performs the same tasks as CountVectorizer and TfidfTransformer combined.

The most recent post I have written on natural language processing, or NLP, can be found here: https://medium.com/geekculture/how-to-count-occurance-of-words-using-sklearns-countvectorizer-a9a65815b1e6

Sklearn provides facilities to extract numerical features from a text document by tokenizing, counting and normalising. CountVectorizer performs the task of tokenizing and counting, while TfidfTransformer normalizes the data. TfidfVectorizer, on the other hand, performs all three operations, thereby streamlining the process of natural language processing.
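The two-step route can be sketched as follows: CountVectorizer produces the raw count matrix, and TfidfTransformer then reweights and normalises it. The corpus here is a toy example of my own, not from the article.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Step 1: tokenize and count
counts = CountVectorizer().fit_transform(corpus)

# Step 2: apply the Tf-idf weighting and L2-normalise each row
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.toarray())
```

With the default settings, each row of the output is scaled to unit Euclidean length, which is the normalisation step that CountVectorizer alone does not perform.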

In this post I intend to show how all three functions work, and to demonstrate that CountVectorizer followed by TfidfTransformer produces the same results as TfidfVectorizer. Both approaches accomplish the same task: converting a collection of raw documents into a matrix of Tf-idf features.
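The equivalence can be checked directly: with default parameters, the matrix produced by the two-step pipeline matches the one produced by TfidfVectorizer. A minimal sketch, again using a toy corpus of my own:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Two-step approach: count, then normalise
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step approach: tokenize, count and normalise in one call
one_step = TfidfVectorizer().fit_transform(corpus)

# The two matrices agree to floating-point precision
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

This is expected because TfidfVectorizer is, by design, equivalent to CountVectorizer followed by TfidfTransformer when both use their default settings.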
