Stop One-Hot Encoding your Categorical Features — Avoid Curse of Dimensionality

Techniques to Encode Categorical Features with many Levels/Categories

Satyam Kumar
The Startup

Image by Gerd Altmann from Pixabay

Feature engineering is one of the most important parts of the data science model development pipeline. Data scientists spend much of their time on data processing and feature engineering in order to train a robust model. A dataset can contain various types of features, including categorical, numerical, text, and DateTime features.

Most machine learning models operate on numerical vectors, so every kind of feature needs to be converted to a numerical format. Text data can be transformed with encoding techniques such as Bag of Words and Tf-Idf vectorization, among many others. Categorical features can likewise be encoded numerically using several techniques, one-hot encoding being one of them.

What is One-Hot Encoding?

One-hot encoding, often also called dummy encoding, is a method to convert categorical variables into a numerical vector format. Each category gets its own column (feature), and every value is represented as a vector of 0’s and 1’s, with a 1 in the column of its category and 0 everywhere else.
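
To make the definition concrete, here is a minimal sketch in plain Python. The helper name `one_hot_encode` and the sample `colors` list are made up for illustration and are not from the original article; ordering the columns alphabetically is also just an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 vector per input value,
    with one column per distinct category.

    Hypothetical helper for illustration only.
    """
    # One column per distinct category, in sorted order (an arbitrary choice).
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[column_of[value]] = 1     # set the column of this value's category
        rows.append(row)
    return categories, rows


# Made-up sample data
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, so the encoded width grows with every new level of the feature. In practice this is usually done with a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than by hand.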

Why One-Hot Encoding is not feasible for categories…
