Yet another ML categorical variables encoding post

…and a “Robust” scikit-learn Label Encoder

Dimitris Leventis
Analytics Vidhya
3 min read · Dec 11, 2020


1. Introduction

Machine Learning algorithms/models take numerical features as input; real-world examples include age, income, days since the last transaction, and many others.

Why do we need categorical value encoding?
Logistic regression and Neural Nets are (more or less complex) nested numerical functions; Random Forests and GBMs find the “best” feature/value pair for the next split in a “greedy” manner by scanning numerical ranges; and so on.

What about categorical features like gender and postal code?
In such cases we need to encode the values of these features as numbers, and, as we will see, there are several ways to achieve this.

Photo by Katarzyna Pe on Unsplash

* Some libraries, like LightGBM, have built-in partitioning techniques for handling categorical features; you can find more info here: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features

2. Categorical Feature Encoders

Although there are many ways to encode categorical variables, in this post we will look at two of the most popular ones: Label Encoding and One-Hot Encoding.

Label Encoding
In label encoding we map each categorical feature value to an integer from 0 to cardinality-1, where cardinality is the number of distinct values of the feature. You can check the scikit-learn implementation here.
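
As a minimal sketch (the city names below are made up for illustration), this is how the scikit-learn LabelEncoder produces that integer mapping:

    from sklearn.preprocessing import LabelEncoder

    # Toy categorical column (made-up values for illustration).
    cities = ["Athens", "Berlin", "Athens", "Madrid"]

    le = LabelEncoder()
    encoded = le.fit_transform(cities)

    print(le.classes_)  # ['Athens' 'Berlin' 'Madrid'] -> codes 0, 1, 2
    print(encoded)      # [0 1 0 2]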

One-Hot encoding
In one-hot encoding we map each value of the feature to a distinct binary “dummy” feature. For example, if we have the feature gender with values “Male” and “Female”, we replace the single gender feature with two new features, is_Male and is_Female, which can take the values 0 or 1. Note that if the cardinality is high we will have to replace one feature with many more, which can lead to an explosion in memory consumption.
The scikit-learn One-Hot encoder can be found in this link.
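
A minimal sketch with the gender example, assuming scikit-learn 1.0 or later (for get_feature_names_out):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # The gender example from the text.
    X = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

    ohe = OneHotEncoder()            # returns a sparse matrix by default
    encoded = ohe.fit_transform(X)

    print(ohe.get_feature_names_out())  # ['gender_Female' 'gender_Male']
    print(encoded.toarray())
    # [[0. 1.]
    #  [1. 0.]
    #  [1. 0.]
    #  [0. 1.]]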

3. Pros and Cons

In the table below we can see some of the pros and cons of the two encoding methods:

Pros and cons of Label Encoding and One-Hot Encoding.

4. Handling future input data

One of the most important aspects of real-world/enterprise-level ML models is the ability to handle future input data, both in terms of generalization (i.e. keeping the same quality of predictions as on the training data set) and the ability to execute at all. For the second case especially, imagine that you have trained and evaluated your model and are ready to deploy it to production.
Suppose there is a categorical feature in your model called “postal_code” that has 150 distinct values (its cardinality) in your training data set, and that you have applied one of the two encoding techniques above.

What will happen if a new example arrives for prediction with a postal_code value different from the 150 values that exist in your training data set?

  • The scikit-learn Label Encoder will fail, because it was not fitted with this new value, and it will raise the error below (a minimal reproduction follows this list):
    “ValueError: y contains previously unseen labels: [‘new_value’]”
  • The One-Hot encoder, even worse, would have to create a new “dummy” binary feature, “postal_code_new_value”, and thus change the dimensionality of the predictor feature space (increasing it by one). Your ML model will not be able to handle this case and produce a prediction, unless you explicitly define the predictor features for both the training and scoring phases.
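
A minimal reproduction of the Label Encoder failure, using made-up postal codes:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit(["10115", "28001", "75001"])  # postal codes seen during training

    le.transform(["10115"])  # works: array([0])
    le.transform(["99999"])  # unseen at scoring time ->
    # ValueError: y contains previously unseen labels: ['99999']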

In the next section we will see how to overcome this problem in the case of scikit-learn Label Encoder.

5. A “Robust” scikit-learn Label Encoder

Below you can find the code modifications needed to get a version of the scikit-learn Label Encoder that can handle unseen input values.
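
A minimal sketch of one possible approach is to subclass LabelEncoder and map any label unseen at fit time to a reserved extra integer code; the class name RobustLabelEncoder and this fallback behaviour are illustrative assumptions, not the only way to do it:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    class RobustLabelEncoder(LabelEncoder):
        """LabelEncoder that maps labels unseen at fit time to a reserved
        extra integer code instead of raising a ValueError.
        (Illustrative sketch; name and fallback behaviour are assumptions.)"""

        def fit(self, y):
            super().fit(y)
            # Reserve the next integer code for any future unseen label.
            self.unseen_code_ = len(self.classes_)
            return self

        def transform(self, y):
            y = np.asarray(y, dtype=object)
            seen = np.isin(y, self.classes_)
            codes = np.full(len(y), self.unseen_code_, dtype=int)
            # Known labels are encoded by the parent implementation;
            # everything else keeps the reserved "unseen" code.
            codes[seen] = super().transform(y[seen])
            return codes

        def fit_transform(self, y):
            return self.fit(y).transform(y)

    # Usage with the postal_code example:
    enc = RobustLabelEncoder()
    enc.fit(["10115", "28001", "75001"])
    print(enc.transform(["10115", "99999"]))  # [0 3] -- 3 is the "unseen" code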

A “Robust” scikit-learn Label Encoder

6. References

[1] Scikit-learn pre-processing: https://scikit-learn.org/stable/modules/preprocessing.html
