Categorically: Don’t explode — encode!
Read this article before you start preprocessing your data for tree-based models!
Also, check out the linked notebook for code to download and preprocess datasets and replicate experiments.
Preprocessing data is a huge part of building a Machine Learning model. Data in its raw form often can’t be fed directly into an algorithm. Preprocessing enables ML algorithms to ingest and learn from data.
There are standard preprocessing techniques that are useful in a variety of contexts. Techniques such as standardization, one-hot-encoding, missing-value imputation, and more are a great way to shape your data up for feeding into an algorithm… unless they’re not!
One class of algorithms for which many of the standard preprocessing "rules" don't apply is tree-based models.
When you're working with tree-based models, these techniques range from useless to harmful: they waste time, increase the likelihood of errors, and can blow up memory and compute requirements while lowering model quality!
Let’s talk about one particular preprocessing technique to think twice about when you’re working with tree-based models, and what to do instead.
One-hot-encoding — not so hot?
When working with categorical data, the top go-to method for converting categories into something that an ML algorithm can ingest is one-hot-encoding. And it’s a great technique! However, there are multiple reasons why one-hot-encoding is a poor choice for tree-based algorithms.
- The top reason is that tree-based algorithms simply don't work very well with one-hot-encodings. Because trees work by partitioning, one-hot-encoding forces them to sequester data points by individual categorical values — there's no way for the model to say "if country == 'USA' or 'UK' then X" in a single split. If the algorithm wants to use "country", there will have to be a "USA only" branch in the tree, a "UK only" branch in the tree, and so on.
- It can either blow up memory consumption or force you to work with a sparse matrix.
- It alters the shape of your data so the columns no longer coincide 1:1 with your original data frame or table (see the sketch after this list).
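To make the shape and memory blow-up concrete, here's a minimal sketch using pandas with a made-up high-cardinality "country" column (the column name and sizes are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: one categorical column with 1,000 distinct values.
rng = np.random.default_rng(0)
countries = [f"country_{i}" for i in range(1_000)]
df = pd.DataFrame({"country": rng.choice(countries, size=100_000)})

# One-hot-encoding: 1 column explodes into 1,000 columns.
one_hot = pd.get_dummies(df["country"], prefix="country")
print(one_hot.shape)                                # (100000, 1000)
print(one_hot.memory_usage(deep=True).sum() / 1e6)  # ~100 MB of dummy columns

# A single-column encoding (here: frequency) keeps the original shape.
freq = df["country"].map(df["country"].value_counts())
print(df["country"].memory_usage(deep=True) / 1e6)  # a few MB for the raw column
print(freq.memory_usage(deep=True) / 1e6)           # under 1 MB for the encoding
```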
An obstacle to computing feature importance
Think about mapping feature importance back to the original features after one-hot-encoding! It becomes difficult enough that you might just opt not to do it, or you may be forced to look at your feature importance very differently than you otherwise would have.
For example, instead of a feature importance for “country”, you’ll have a feature importance for “country=USA”, another one for “country=UK”, and so on — from which it will be hard to determine the relative importance of “country” as a whole unless you do some extra math and transformations on top of this fractured feature importance.
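If you do go the one-hot route and still want per-feature importances, the "extra math" amounts to summing importances over each feature's dummy columns. Here's a minimal sketch, assuming dummy columns named with a "feature=value" pattern and importances taken from an already-fitted model (the names and numbers are made up):

```python
from collections import defaultdict

# Hypothetical importances keyed by one-hot column name, e.g. built from
# zip(encoded_column_names, model.feature_importances_).
importances = {
    "country=USA": 0.04, "country=UK": 0.01, "country=FR": 0.02,
    "age": 0.25,
}

aggregated = defaultdict(float)
for name, score in importances.items():
    original = name.split("=", 1)[0]  # strip the "=value" suffix back off
    aggregated[original] += score

print(dict(aggregated))  # {'country': 0.07, 'age': 0.25}
```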
What if you want to do feature selection using feature importance? Will you drop individual values of categorical variables from your data frame? These will be the first to go when you filter on feature importance if you’ve done one-hot-encoding.
In other words, one-hot-encoding makes one hot mess!
Fortunately, there’s a simpler, faster, more accurate method.
Another class of encoding
There are many methods for encoding categorical data. One-hot-encoding, while it has its drawbacks, is ideal in one sense: You can essentially feed it into any algorithm, and the algorithm will have a reasonable chance at extracting signal from the categorical feature. This is especially true for models that use dot products or matrix multiplications, such as linear models, neural networks, and more.
Tree models, however, are much more flexible in how they can extract information from a single feature. Because tree models work by chopping up a given feature into partitions, a tree model is capable of carving the data up into segments that are defined by the categorical vector without having a one-hot-encoding handed to it.
A whole class of encodings is defined by mapping a categorical feature to a single numeric vector. There are many strategies for doing this, and there’s an opportunity here for a Machine Learning scientist to imbue the categorical representation with some extra usefulness through some clever thinking while engineering features.
A good example of this is target encoding.
Target encoding
With target encoding, we map each categorical value to the mean of the target for that value. For example, if the mean of the target when "country == 'USA'" is 2.4, we replace 'USA' with 2.4. This can be an especially useful encoding method for converting the categorical data to something ordinal that is clearly relevant to the predictive task.
A caveat here is that it’s easy to overfit if you’re not careful. Some ways around this include using older historical data to compute the encoding (rather than your current training set) and doing stacking, among others.
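Here's a minimal sketch of target encoding in pandas, using a made-up "country" column (the toy numbers are chosen so that 'USA' maps to 2.4, as in the example above):

```python
import pandas as pd

# Toy training data (made-up example).
train = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK", "FR"],
    "target":  [3.0,   1.8,   0.5,  1.1,  2.0],
})

# Mean of the target per categorical value.
means = train.groupby("country")["target"].mean()
print(means["USA"])  # 2.4

# Replace each category with its target mean.
train["country_te"] = train["country"].map(means)

# Caveat from above: to limit overfitting, compute `means` on older
# historical data or out-of-fold rather than on the rows being encoded.
```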
An encoding like this, which is informed by the data, makes sense. However, you might be surprised to find out that a tree-based model can work exceedingly well with encodings that may seem at first glance like they wouldn’t be any good.
Unexpectedly good encodings
Here are some examples of these types of encodings:
- Mapping categorical values to random floats.
- Sorting categorical values alphabetically, then mapping each to its index in that sorted list (ordinal encoding).
- Mapping categorical values to their frequency in the data (sketched in code after this list).
- Mapping categorical values to their frequency rank in the data.
- Or … try making up your own!
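For instance, here's a minimal sketch of a few of these encodings in pandas (the categories are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series(["USA", "UK", "USA", "FR", "USA", "UK"])  # made-up categories

# Frequency (count) encoding: category -> how often it appears.
freq = s.map(s.value_counts())  # USA -> 3, UK -> 2, FR -> 1

# Random-float encoding: category -> an arbitrary random number.
rng = np.random.default_rng(42)
random_map = {cat: rng.random() for cat in s.unique()}
rand_enc = s.map(random_map)

# Ordinal encoding: sort alphabetically, map to position in that list.
ordinal_map = {cat: i for i, cat in enumerate(sorted(s.unique()))}
ord_enc = s.map(ordinal_map)

print(pd.DataFrame({"raw": s, "freq": freq, "rand": rand_enc, "ord": ord_enc}))
```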
Amazingly, tree-based models work remarkably well with encodings like these, as has been demonstrated repeatedly in data science competitions and in professional applications.
When do these encodings not work?
Here are some reasons these encodings may not work:
- Collisions in the map. For example, frequency encoding may map some categorical values to the same exact number, rendering the model unable to make separate use of them.
- Too many categorical values. If the information contained in the categorical feature is trapped “deeply” within the vector (i.e., it would require too many partitions to access), the tree may not find it.
- Model not tuned properly to allow partitioning fine enough to zero in on specific categorical values. The model may need to be allowed to split very deeply to access information in the categorical feature, which may cause it to overfit to other features.
- Luck. The encoding might simply — for whatever reason — obscure information from the splitting criterion and may prevent the tree from deciding to use the feature even though it has information content.
Other encodings
In order to circumvent some of these issues, there are yet other ways of encoding categorical data to help ensure that information can be extracted.
There is binary encoding (not the same as one-hot-encoding, even though one-hot-encoding does produce binary features), which is a somewhat strange encoding that maps a categorical feature to an integer and then represents that integer as a binary string; each binary digit then becomes its own 0/1 column. It's a more efficient representation than one-hot-encoding, as it uses roughly log2(n) columns as opposed to n columns (where n is the cardinality of the categorical feature).
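Here's a minimal sketch of the idea by hand (in practice, a library implementation such as the BinaryEncoder in the Category Encoders package used later handles this for you):

```python
import numpy as np
import pandas as pd

s = pd.Series(["USA", "UK", "FR", "DE", "JP"])  # made-up categories

# Step 1: categorical value -> integer code (0..n-1).
codes = s.astype("category").cat.codes.to_numpy()

# Step 2: integer -> its binary digits, one 0/1 column per bit.
n_bits = max(int(codes.max()).bit_length(), 1)
bits = pd.DataFrame({f"bit_{i}": (codes >> i) & 1 for i in range(n_bits)})
print(bits)  # 5 categories fit in 3 bit-columns instead of 5 one-hot columns
```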
Dimensionality reduction applied to the one-hot-encoding — like PCA or random projection — is another alternative that may retain much of the information in the categorical feature while keeping the representation skinny and dense. You may be able to encode a fairly large sparse matrix as a small number of dense columns this way. (Let me say that this tends not to yield great results in my experience — it works, but it’s easily beaten by other methods.)
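A rough sketch of that approach using scikit-learn's OneHotEncoder and TruncatedSVD (an illustration under assumed column names, not the notebook's exact code):

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"country": ["USA", "UK", "FR", "DE", "JP", "USA", "UK"]})

# Sparse one-hot matrix, then project it down to a few dense columns.
one_hot = OneHotEncoder().fit_transform(X)   # sparse, one column per category
svd = TruncatedSVD(n_components=3, random_state=0)
dense = svd.fit_transform(one_hot)           # dense, 3 columns
print(dense.shape)                           # (7, 3)
```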
Binary encoding and projection-based methods do have the undesirable attribute that they may destroy the alignment between the raw data and the transformed input data, but they do retain density and so they can still be a good compromise if the other encodings aren’t working for some reason.
Experiments: Let’s compare encoding techniques
Let’s run some quick experiments to demonstrate the effectiveness of some of these methods.
First, let’s get our data together. I’d like to use several datasets so that we can check for consistency of results.
The five I’ve chosen are all classics.
- abalone.csv: This regression dataset involves predicting the age of abalone based on physical measurements.
- adult.csv: This classification dataset focuses on predicting whether a person’s income exceeds a certain threshold, based on various demographic and employment features.
- bank.csv: In this classification dataset, known as the “Bank Marketing” dataset, the goal is to predict whether a client will subscribe to a term deposit with a bank.
- mushroom.csv: This classification dataset involves classifying mushrooms as edible or poisonous based on various attributes.
- titanic.csv: This classification dataset revolves around predicting survival on the Titanic based on passenger characteristics.
Loading up this data and formatting it consistently takes a little bit of work. Check out the notebook for code that takes care of this for you here.
For the encoders, I make use of the “category encoders” package, which has lots of great stuff in it.
From the documentation, you can see we have a ton of methods at our disposal. All use the same scikit-learn–style API for fitting and transforming our data (and in fact are scikit-learn transformers).
For now, we’ll test the following five methods on a variety of datasets:
- One-hot-encoding
- Ordinal encoding
- Binary encoding
- Target encoding
- Count encoding (frequency encoding)
Note: The Category Encoders package’s ordinal encoder (which we’re using here) maps categorical features to randomly selected integers. Scikit-learn also has an OrdinalEncoder class, but it maps categories to their rank in the alphabetically sorted list of categories.
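For reference, here's roughly how those five encoders can be constructed with the Category Encoders package (the column names are assumptions; see the notebook for the actual setup):

```python
import category_encoders as ce

cat_cols = ["country", "occupation"]  # hypothetical categorical columns

encoders = {
    "one-hot": ce.OneHotEncoder(cols=cat_cols),
    "ordinal": ce.OrdinalEncoder(cols=cat_cols),
    "binary":  ce.BinaryEncoder(cols=cat_cols),
    "target":  ce.TargetEncoder(cols=cat_cols),
    "count":   ce.CountEncoder(cols=cat_cols),
}

# All follow the scikit-learn transformer API, e.g.:
# X_train_enc = encoders["count"].fit_transform(X_train, y_train)
# X_test_enc  = encoders["count"].transform(X_test)
```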
On each dataset, we do a mini random hyperparameter optimization on XGBoost to attempt to find good parameters for each method. We do this because each encoding may be different enough that it requires different hyperparameter values to squeeze out all of the signal it has to offer; we don’t want to bias the results based on a set of hyperparameters that happens to work well for one encoding but not the others.
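A minimal sketch of what such a search might look like, using scikit-learn's RandomizedSearchCV with XGBoost's scikit-learn wrapper (the parameter ranges and budget are illustrative, not the notebook's exact values):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative search space; the notebook's ranges may differ.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.5, 0.5),
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions=param_distributions,
    n_iter=20,           # a "mini" search: only a handful of candidates
    scoring="accuracy",  # use e.g. "neg_mean_squared_error" for the regression dataset
    cv=3,
    random_state=0,
)
# search.fit(X_train_encoded, y_train)
# score_for_this_encoding = search.best_score_
```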
Once we iterate through all combinations of dataset and encoding method, we arrive at the following table of output scores with which we can compare the results.
Analysis
To interpret the results in this table, note the following:
- For regression problems, we’re computing the mean squared error; lower is better.
- For classification, the focus is on accuracy, so higher is better.
Best results among the datasets:
- abalone: count encoding is best, with the lowest mean squared error, in this case 4.717.
- adult: count encoding is best, with the highest accuracy, in this case 87.6 percent.
- bank: count encoding again wins, with the highest accuracy, in this case 91.5 percent.
- mushroom: this dataset is too easy; every method reaches 100 percent accuracy.
- titanic: all methods are tied again.
According to my experience and that of many others as well, “Count Encoding” a.k.a. “Frequency Encoding” is a very strong and robust technique. You can see that here: When the results aren’t tied, Count Encoding is slightly better than the other methods auditioned.
The differences in accuracy here are very small. On other datasets, they may be larger. But the point isn’t really about the magnitude of the difference. I’d say that these methods all can lead to models of nearly equivalent accuracy. The main differences are in the other aspects of these methods.
Specifically, count encoding:
- Doesn’t force your matrix to explode in shape or memory requirements.
- Doesn’t force you to use a sparse matrix.
- Enables you to keep the shape of your matrix intact, and keeps things like “feature importance” meaningful and easy to map back to your original features.
Count encoding is typically about as good as (or better than) one-hot-encoding, and it has the advantages listed above as well as simplicity of representation.
As we’ve seen, with tree-based models there’s just no reason not to choose a lower-dimensional encoding! It’s the way to go!
Conclusion
I’d like to repeat these experiments with more datasets and improve the depth of comparisons between methods. There’s a lot more to explore here.
I highly recommend that you use count/frequency encoding instead of one-hot-encoding when working with tree-based models unless you have a very good reason!
Thanks for reading!
Phillip Adkins is a Kaggle Master and an experienced data science leader. Find him on LinkedIn.