Efficient One-hot encoding for categorical features with high cardinality

Subrat Sekhar Sahu
Walmart Global Tech Blog
4 min read · Apr 27, 2023
Photo credit: Erik Karits

Tags: Data Science, Python, Machine Learning, Big Data, Artificial Intelligence

Typical Machine Learning models cannot work with categorical data directly; such columns need to be converted into numeric form so that the algorithms can “understand” them. The simplest way of converting a categorical column into numeric is Label Encoding, which maps each unique label in the column to a number. For example, a Gender column is label encoded by mapping the Male and Female labels to 0 and 1 respectively. This, however, has a drawback: the ML algorithm might interpret the Female-labelled data as carrying more weight than the Male-labelled data simply because 1 > 0. One-hot encoding solves this problem for us.
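A quick sketch of label encoding (pandas assumed; the Gender column and the 0/1 mapping simply mirror the example above, and scikit-learn's LabelEncoder does the same kind of mapping while choosing the integers itself):

```python
import pandas as pd

# Hypothetical Gender column, used purely for illustration
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Label encoding: map each unique label to an integer
df["Gender_encoded"] = df["Gender"].map({"Male": 0, "Female": 1})
print(df["Gender_encoded"].tolist())  # [0, 1, 1, 0]
```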

One-hot encoding creates a separate column for each label it encounters in the categorical column and places a 1 in the column of the label that is true for the current row, with 0s everywhere else.
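One-hot encoding the same hypothetical column, again as a minimal pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One indicator column per label; exactly one of them is set per row
one_hot = pd.get_dummies(df["Gender"], dtype=int)
print(one_hot)
#    Female  Male
# 0       0     1
# 1       1     0
# 2       1     0
# 3       0     1
```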

But what happens when the dataset is fairly large and there are too many labels in a categorical column?

No. of columns created by One-hot encoding = No. of labels in categorical column

If there are 100 labels in a categorical column of a dataset of shape 10k X 10, then after applying One-hot encoding the shape of the dataset becomes 10k X 109 (OHE adds 100 columns, one for each label, and we remove the original categorical column). What happens when we have a bigger dataset with a categorical column of even higher cardinality? Will the cost and resource requirement to train models on this dataset scale linearly? Definitely not. We should expect compute time and cost to grow much faster than linearly in such circumstances.
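A small sketch with synthetic data (only the shapes matter here) makes the column explosion concrete:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 10k rows: 9 numeric columns plus 1 categorical column with 100 distinct labels
df = pd.DataFrame(rng.random((10_000, 9)),
                  columns=[f"num_{i}" for i in range(9)])
df["category"] = rng.integers(0, 100, size=10_000).astype(str)

print(df.shape)                                        # (10000, 10)
print(pd.get_dummies(df, columns=["category"]).shape)  # (10000, 109)
```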

Is there a viable solution to this problem?

Use Data Representation Design Pattern: Embeddings.

But first, what are Embeddings?

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models. (Google Machine Learning Crash Course on Embeddings)

(Back to) How to represent One-hot Encoded columns as Embeddings?

After One-hot encoding, we pass the input data through an embedding layer that has trainable weights. This will map the high-dimensional, categorical input variable to a real-valued vector in some low-dimensional space. The weights to create the dense representation are learned as part of the optimization of the model. In practice, these embeddings end up capturing closeness relationships in the input data too.
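A minimal sketch of that idea, assuming TensorFlow/Keras and an arbitrary embedding size: a one-hot vector multiplied by a trainable weight matrix is effectively an embedding lookup.

```python
import tensorflow as tf

num_labels = 100     # cardinality of the categorical column
embedding_dim = 8    # size of the dense, low-dimensional representation

one_hot_input = tf.keras.Input(shape=(num_labels,))

# A Dense layer with no bias computes one_hot @ W, i.e. it picks out the row of W
# for the active label. W is trained along with the rest of the model.
embedding = tf.keras.layers.Dense(embedding_dim, use_bias=False)(one_hot_input)
```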

For example, suppose we’re working with Plurality data (the natality example used in Machine Learning Design Patterns) that has the following six labels: Single(1), Multiple(2+), Twins(2), Triplets(3), Quadruplets(4), Quintuplets(5).

If we One-hot encode this data, we get one indicator column per label: Single(1) becomes [1, 0, 0, 0, 0, 0], Multiple(2+) becomes [0, 1, 0, 0, 0, 0], Twins(2) becomes [0, 0, 1, 0, 0, 0], and so on.

When encoded in this way, we need six dimensions to represent each of the different categories. Six dimensions may not be so bad, but what if we had many, many more categories to consider?

To reduce the dimensionality, we can generate embeddings for this data as follows:
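Here is a minimal sketch, assuming TensorFlow/Keras; the labels are integer-encoded 0 through 5, and the 2-dimensional vectors are random until the layer is trained as part of a model:

```python
import tensorflow as tf

plurality_labels = ["Single(1)", "Multiple(2+)", "Twins(2)",
                    "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]

# Map each of the 6 labels to a trainable 2-dimensional vector
embedding_layer = tf.keras.layers.Embedding(input_dim=len(plurality_labels),
                                            output_dim=2)

vectors = embedding_layer(tf.constant([0, 1, 2, 3, 4, 5]))
print(vectors.shape)  # (6, 2), one 2-d vector per label
```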

With this, we need just 2 dimensions to represent all the labels in Plurality.

Why does it work? The embedding layer is just another hidden layer of the neural network. Weights are associated with each of the high-cardinality input dimensions, and the output is passed through the rest of the network. The weights that create the embedding are therefore learned through gradient descent just like any other weights in the neural network. This means the resulting vector embeddings are an efficient low-dimensional representation of those feature values with respect to the learning task, which ultimately helps model performance.
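A sketch of what this looks like end to end, assuming TensorFlow/Keras; the data is random and is only meant to show that the embedding weights are updated by the same training loop as every other layer:

```python
import numpy as np
import tensorflow as tf

num_labels, embedding_dim = 100, 8

embedding_layer = tf.keras.layers.Embedding(input_dim=num_labels,
                                            output_dim=embedding_dim)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),      # one integer-encoded categorical feature
    embedding_layer,                 # high-cardinality label -> dense vector
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Random stand-in data: gradient descent updates the embedding weights
# exactly like the Dense-layer weights
X = np.random.randint(0, num_labels, size=(1_000, 1))
y = np.random.randint(0, 2, size=(1_000, 1))
model.fit(X, y, epochs=2, verbose=0)

learned_embeddings = embedding_layer.get_weights()[0]  # shape: (100, 8)
```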

Any demerits? This methodology suffers from some information loss when converting the high-dimensional data to low dimensions. Also, the explainability of a model trained on this data takes a hit. Of course, there are alternatives, but this is a simple and attractive one.

References

· https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

· https://mlinreallife.github.io/posts/ml-design-patterns/

· https://learning.oreilly.com/library/view/machine-learning-design/9781098115777/ch02.html
