An Easier Way to Encode Categorical Features
Using the Python category_encoders library to handle high-cardinality variables in machine learning
I have recently been working on a machine learning project that had several categorical features. Many of these features were high cardinality, or in other words, had a large number of unique values. The simplest way to handle categorical variables is usually one-hot encoding, where each unique value is converted into a new column, with a 1 or a 0 denoting the presence or absence of that value. However, when the cardinality of a feature is high, this method often produces so many new features that model performance degrades.
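To make the problem concrete, here is a minimal sketch of one-hot encoding using pandas (the toy `city` column is my own example, not data from the project): a single column with three unique values becomes three 0/1 columns, and a column with thousands of unique values would become thousands of new features.

```python
import pandas as pd

# Toy example: one categorical column with three unique values.
df = pd.DataFrame({"city": ["london", "paris", "london", "berlin"]})

# One-hot encoding: one new 0/1 column per unique value.
one_hot = pd.get_dummies(df["city"], prefix="city")
print(one_hot.columns.tolist())  # one column per unique value
```

With high-cardinality features, the width of `one_hot` grows with the number of unique values, which is exactly the problem described above.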
I started to write my own encoders to try alternative encoding methods, starting with one called weight of evidence. In a binary classification problem, weight of evidence compares the distribution of each unique value across the positive and negative classes and creates a new numerical feature from that ratio. Naturally, this took a while to write and then get working in my existing scikit-learn pipeline.
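A minimal sketch of the idea, on hypothetical toy data of my own (the feature values, targets, and the small `eps` smoothing constant are all assumptions for illustration): each value's weight of evidence is the log of its share of positives divided by its share of negatives.

```python
import math
from collections import Counter

# Hypothetical toy data: one categorical feature and a binary target.
feature = ["a", "a", "b", "b", "b", "c"]
target = [1, 0, 1, 1, 0, 0]

# Count how often each value appears in the positive and negative class.
pos = Counter(f for f, t in zip(feature, target) if t == 1)
neg = Counter(f for f, t in zip(feature, target) if t == 0)
total_pos, total_neg = sum(pos.values()), sum(neg.values())

# WoE for a value = ln( share of positives / share of negatives ).
# A small constant guards against values seen in only one class.
eps = 1e-6
woe = {
    v: math.log(((pos[v] + eps) / total_pos) / ((neg[v] + eps) / total_neg))
    for v in set(feature)
}
```

Here value `"b"` appears twice among positives and once among negatives, so its weight of evidence is roughly ln(2); a value split evenly between the classes, like `"a"`, maps to roughly zero.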
Then I stumbled across a library called category_encoders, which has not only weight of evidence but pretty much every common way to encode categorical features already…