
Natural language processing with scikit-learn’s CountVectorizer, or with libraries such as spaCy and Gensim, can provide powerful insights into text data. It lets us extract topics that can then be added as features to a regression model to generate predictions.

However, what if we want to use the sparse matrix that CountVectorizer produces as a feature alongside the other categorical or numerical features in the dataset?

The answer is scikit-learn’s ColumnTransformer, and I’ll demonstrate its usage on some Yelp review data.
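To make the idea concrete, here is a minimal sketch of how ColumnTransformer can combine a bag-of-words matrix with categorical and numerical columns. The tiny DataFrame and its column names (`text`, `city`, `useful_votes`) are hypothetical stand-ins, not the actual Yelp data, and the choice of OneHotEncoder and StandardScaler for the other columns is an assumption for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for Yelp review data; column names are assumptions.
reviews = pd.DataFrame({
    "text": [
        "great food and friendly staff",
        "slow service and cold food",
        "friendly staff but slow service",
        "great pizza",
    ],
    "city": ["Phoenix", "Las Vegas", "Phoenix", "Toronto"],
    "useful_votes": [3, 0, 1, 7],
})

preprocessor = ColumnTransformer(
    transformers=[
        # A plain string (not a list) passes a 1-D column,
        # which is what text vectorizers expect.
        ("bow", CountVectorizer(), "text"),
        # Lists pass 2-D data, which these transformers expect.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("num", StandardScaler(), ["useful_votes"]),
    ]
)

# One combined feature matrix: word counts + one-hot cities + scaled votes.
features = preprocessor.fit_transform(reviews)
print(features.shape)
```

Note the difference between `"text"` and `["city"]`: the text column is selected with a bare string so the vectorizer receives a sequence of documents rather than a one-column table. The resulting `preprocessor` can be placed as the first step of a scikit-learn `Pipeline` ahead of a regressor or classifier.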

spaCy’s lemmas are now very clean and, once I join them back into strings, can be processed by TF-IDF into a sparse matrix. …
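The step of turning lemma lists back into strings and vectorizing them might look roughly like the sketch below. The `lemmatized_reviews` list is a hypothetical placeholder for the spaCy output (for example, `[token.lemma_ for token in doc]` per review), and `TfidfVectorizer` with default settings is an assumption rather than the exact configuration used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lemma lists, standing in for cleaned spaCy output
# with one list of lemmas per review.
lemmatized_reviews = [
    ["great", "food", "friendly", "staff"],
    ["slow", "service", "cold", "food"],
]

# Join each token list back into one whitespace-separated string,
# since TfidfVectorizer expects raw text documents by default.
lemma_strings = [" ".join(lemmas) for lemmas in lemmatized_reviews]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(lemma_strings)  # SciPy sparse matrix

print(lemma_strings[0])
print(tfidf_matrix.shape)
```

If the joined strings are stored in a DataFrame column, the same `TfidfVectorizer` can be used as the text transformer inside a ColumnTransformer.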







About

Ned H

A former textile worker weaves data into compelling stories.
