
An Easier Way to Encode Categorical Features

Using the Python category_encoders library to handle high-cardinality variables in machine learning

Rebecca Vickery
4 min readOct 12, 2019


I have recently been working on a machine learning project which had several categorical features. Many of these features were high cardinality, or in other words, had a high number of unique values. The simplest method of handling categorical variables is usually one-hot encoding, where each unique value is converted into a new column, with a 1 or a 0 denoting the presence or absence of that value. However, when the cardinality of a feature is high, this method often produces so many new features that model performance decreases.
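To see why this becomes a problem, here is a minimal sketch of one-hot encoding with pandas; the `city` column and its values are made up for illustration — imagine it holding thousands of unique values instead of three:

```python
import pandas as pd

# Toy data: "city" stands in for a high-cardinality feature.
df = pd.DataFrame({"city": ["London", "Paris", "Tokyo", "Paris", "London"]})

# One-hot encoding creates one new binary column per unique value,
# so a feature with thousands of values yields thousands of columns.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.columns.tolist())
```

With three unique values we get three columns; with a high-cardinality feature the column count explodes in the same way.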

I started to write my own encoders to try alternative methods, starting with something called weight of evidence. In a binary classification problem, weight of evidence compares the distribution of each unique value across the positive and negative classes, and derives a new numerical feature from the ratio of the two. Naturally, this took a while to code and then get working in my existing scikit-learn pipeline.

Then I stumbled across a library called category_encoders, which includes not only weight of evidence but pretty much every other way to encode categorical features already…


Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Written by Rebecca Vickery

Data Scientist | Writer | Speaker
