What is the difference between CountVectorizer, HashingVectorizer & TfidfVectorizer?
In my last several posts I have been discussing sklearn’s functions for natural language processing, or NLP, because NLP is a niche of machine learning that is not heavily represented in the Kaggle competitions I am able to enter. As a result, I have identified it as a weakness of mine that needs to be remedied. NLP covers various types of programs, such as classifying text, developing question and answer systems, developing recommendation engines, or even creating a chatbot. My most recent post on the subject of NLP can be found here: https://medium.com/geekculture/how-sklearns-countvectorizer-and-tfidftransformer-compares-with-tfidfvectorizer-a42a2d6d15a2
In my most recent post I discussed how sklearn’s TfidfVectorizer performs the same tasks as both CountVectorizer and TfidfTransformer together. In this post I will endeavour to discuss how HashingVectorizer can perform the same tasks as CountVectorizer.
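As a quick refresher on that earlier point, here is a minimal sketch of the equivalence, assuming sklearn's default settings for all three classes. The three short documents in `docs` are made up purely for illustration and do not come from any dataset used in these posts.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, invented for this illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Two-step route: raw token counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both jobs in a single call.
one_step = TfidfVectorizer().fit_transform(docs)

# With default parameters the two matrices should match numerically.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If you change a parameter on one route (for example `norm` or `smooth_idf`), the same parameter has to be changed on the other route for the results to stay identical.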
Although HashingVectorizer performs a similar role to CountVectorizer, there are some differences that need to be addressed. HashingVectorizer converts a collection of text documents to a matrix of token occurrences. This text vectorizer implementation uses the hashing trick to find the…