Visiting: Categorical Features and Encoding in Decision Trees

Published in

Data Science & Design

10 min readApr 23, 2017

Full source code for the experiment and plots can be found here: https://github.com/Laurae2/CategoricalAnalysis

When you have categorical features and you are using decision trees, you often have a major issue: how to deal with categorical features?

Often you see the following…:

Postponing the problem: use a machine learning model which handle categorical features, the greatest of solutions!
Deal with the problem now: design matrix, one-hot encoding, binary encoding…

Usually, you WILL want to deal with the problem now, because if you postpone the problem, it means you already found the solution:

You do not postpone problems because in Data Science, they accumulate quickly like hell (good luck remembering every problem encountered, then come back 1 month later without thinking about them and recite them each).
You do not postpone problems without knowing the potential remedy afterwards (otherwise, you might have a working pipeline but no solution to solve it!).

So what is the matter? Let’s go back to the basics of decision trees and encoding, then we can test some good stuff… in three parts:

Machine Learning Implementations: specifications differ
Example ways of Encoding categorical features
Benchmarking Encodings versus vanilla Categorical features

Decision Trees and Encoding

Machine Learning Specification

When using decision tree models and categorical features, you mostly have three types of models:

Models handling categorical features CORRECTLY. You just throw the categorical features at the model in the appropriate format (ex: as factors in R), AND the machine learning model processes categorical features correctly as categoricals. BEST CASE because it fits your needs.
Models handling categorical features INCORRECTLY. You just throw the categorical features at the model in the appropriate format (ex: as factors in R), BUT the machine learning model processes categorical features incorrectly by doing wizardry processing to transform them into something usable (like one-hot encoding), unless you are aware of it. WORST CASE EVER because it does not do what you expected to do.
Models NOT handling categorical features at all. You have to preprocess manually the categorical features to have them in an appropriate format for the machine learning model (usually: numeric features). But how do you transform (aka ENCODE) them?

We will target specifically the third type of model, because it is what we want to assess. There are many methods to encode categorical features. We are going to check three of them: numeric encoding, one-hot encoding, and binary encoding.

Categorical Encoding Specification

Categorical Encoding refers to transforming a categorical feature into one or multiple numeric features. You can use any mathematical method or logical method you wish to transform the categorical feature, the sky is the limit for this task.

Numeric Encoding

Are you transforming your categorical features by hand or are you doing the work with a computer?

Numerical Encoding is very simple: assign an arbitrary number to each category.

There is no rocket science for the transformation, except perhaps… how do you assign the arbitrary number? Is there a simple way?

The typical case is to let your favorite programming language do the work.

For instance, you might do like this in R…

my_data$cat_feature <- as.numeric(as.factor(my_data$cat_feature))

Such as this:

as.numeric(as.factor(c("Louise",
                       "Gabriel",
                       "Emma",
                       "Adam",
                       "Alice",
                       "Raphael",
                       "Chloe",
                       "Louis",
                       "Jeanne",
                       "Arthur")))

This works, this is not brainer, and it encodes the way it wants deterministically (check the ordering and you will see).

One-Hot Encoding

One-Hot Encoding is just a design matrix with the first factor kept. A design matrix removes the first factor to avoid the matrix inversion problem in linear regressions.

Ever heard about One-Hot Encoding and its magic? Here you have it: this is a design matrix where you keep the first factor instead of removing it (how simple!).

To put it clear, just check the picture as it talks for itself better than 1,000 words.

In addition to thinking about what One-Hot Encoding does, you will notice something very quickly:

You have as many columns as you have cardinalities (values) in the categorical variable.
You have a bunch of zeroes and only few 1s! (one 1 per new feature)

Therefore, you have to choose between two representations of One-Hot Encoding:

Dense Representation: 0s are stored in memory, which ballons the RAM usage a LOT if you have many cardinalities. But at least, the support for such representation is typically… worldwide.
Sparse Repsentation: 0s are not stored in memory, which makes RAM efficiency a LOT better even if you have millions of cardinalities. However, good luck finding support for sparse matrices for machine learning, because it is not widespread (think: xgboost, LightGBM, etc.).

Again, you usually let your favorite programming language doing the work. Do not loop through each categorical value and assign a column, because this is NOT an efficient at all. It is not difficult, right?

Example in R, “one line”!:

model.matrix(~ cat + 0,
             data = data.frame(
               cat = as.factor(c("Louise",
                                 "Gabriel",
                                 "Emma",
                                 "Adam",
                                 "Alice",
                                 "Raphael",
                                 "Chloe",
                                 "Louis",
                                 "Jeanne",
                                 "Arthur"))))

Dense One-Hot Encoding in R example. As usual, the specific order is identical to the numeric version due to as.factor choosing the order arbitrarily!

If you are running out of available memory, what about working with sparse matrices? Doing it in R is no brainer in “one line”!

library(Matrix)
sparse.model.matrix(~ cat + 0,
                    data = data.frame(
                      cat = as.factor(c("Louise",
                                        "Gabriel",
                                        "Emma",
                                        "Adam",
                                        "Alice",
                                        "Raphael",
                                        "Chloe",
                                        "Louis",
                                        "Jeanne",
                                        "Arthur"))))

Sparse One-Hot Encoding in R. There is no difference to the Dense version, except we end up with a sparse matrix (dgCMatrix: sparse column compressed matrix).

Binary Encoding

The objective of Binary Encoding… is to use binary encoding to hash the cardinalities into binary values.

By using the power law of binary encoding, we are able to store N cardinalities using ceil(log(N+1)/log(2)) features.

It means we can store 4294967295 cardinalities using only 32 features with Binary Encoding! Isn’t it awesome to not have those 4294697295 features from One-Hot Encoding? (how are you going to learn 4 billion features in a decision tree…? you need a depth of 32 and it is not readable…)

Still as easy in (base) R, you just need to think you are limited to a specified number of bits (will you ever reach 4294967296 cardinalities? If yes, get rid of some categories because you got too many of them…):

my_data <- c("Louise",
             "Gabriel",
             "Emma",
             "Adam",
             "Alice",
             "Raphael",
             "Chloe",
             "Louis",
             "Jeanne",
             "Arthur")
matrix(
  as.integer(intToBits(as.integer(as.factor(my_data)))),
  ncol = 32,
  nrow = length(my_data),
  byrow = TRUE
)[, 1:ceiling(log(length(unique(my_data)) + 1)/log(2))]

Ugh, the formula is a bit larger than expected. But you get the idea:

Three key operations to perform for binary encoding.

Operation 1: convert my_data to factor, then to integer (“numeric”), then to numeric binary representation (as a vector of length 32 for each observation), then to integer (“numeric”).
Operation 2: convert the “numeric” to a matrix with 32 columns and the same number of rows as the number of original observations.
Operation 3: using the inverse of the binary power property (ceil(log(N+1)/log(2))), remove all the unused columns (the columns with zeroes).

There are, obviously, easier ways to do this. But I am doing this example to show you can do this in base R. No need fancy package stuff.

Benchmarking Performance of Encoding

We are going to benchmark the performance of four types of encoding:

Categorical Encoding (raw, as is)
Numeric Encoding
One-Hot Encoding
Binary Encoding

We will use rpart as the decision tree learning model, as it is also independent to random seeds.

The experimental design is the following:

We create datasets of one categorical feature with 8 to 8,192 cardinalities (steps of power of 2).
We use 25% or 50% of cardinalities as positive labels to assess performance of the decision tree. This means a ratio of 1:3 or 1:1.
We run 250 times each combination of cardinalities and percentage of positive labels to get a better expected value (mean) of performance.
To speed up the computations, we are using 6 parallel threads as One-Hot Encoding is computationally intensive.
The rpart function is limited to a maximum depth of 30 for practical usage, and used with the following parameters:

rpart(label ~ .,
      data = my_data,
      method = "class",
      parms = list(split = "information"),
      control = rpart.control(minsplit = 1,
                              minbucket = 1,
                              cp = 1e-15,
                              maxcompete = 1,
                              maxsurrogate = 1,
                              usesurrogate = 0,
                              xval = 1,
                              surrogatestyle = 1,
                              maxdepth = 30))

Warning: remember we are doing this on a synthetic dataset with perfect rules. You may get contradictory performance on real world datasets.

Sneak Peak: how are the decision trees looking?

For the sake of example, with 1024 categories and 25% positive labels.

Categorical Encoding

Numeric Encoding

One-Hot Encoding

Binary Encoding

One can understand it as the following:

Categorical encoding well… equality rules so it’s easy to split.
Numeric encoding requires splitting itself, if it splits 30 times in a row on the same branch then it’s over. If a split is frozen afterwards, then it is also over for that branch. Therefore, lot of RNG shaping!
One hot encoding requires.. as many splits as there are categories, which means a crazy lot and so much it stops very quickly because if you do one split, one part of the split will be frozen (because it is a perfect rule dataset) => it gives this escalator-like shape, thus very poorly performing.
Binary encoding has less than 30 features in all my cases, therefore each tree should be able to depict all the rules (theory is true, practice is wrong because you need splits to not close on themselves, which is not possible in theory, but possible in practice) => it gives this right-tailed tree shape. When the unbalancing increases, the performance increases because it requires a lower expected value (mean) of splits to perform a perfect split than a perfectly balanced case.

General Accuracy of Encodings

Without looking in depth into the accuracy, we are going to look quickly at the general accuracy of the four encodings we have.

On the picture, we can clearly notice a trend:

Categorical Encoding is the clear winner, with an exact 100% Accuracy at any time.
Numeric Encoding is doing an excellent job as long as the number of cardinalities is not too large: from 1,024 cardinalities, its accuracy falls off drastically. Its accuracy over all the tests is around 92%.
One-Hot Encoding is doing an excellent job like Numeric Encoding, except it falls down very quickly: from 128 cardinalities, performing consistency worse than Binary Encoding. Its accuracy over all the runs is around 80%.
Binary Encoding is very consistent in performance but not perfect, even with 8,192 cardinalities. Its exact accuracy is approximately 91%.

Balancing Accuracy of Encodings

To look further at the details, we are splitting the balancing ratio (25% and 50%) of the positive label to check for discrepancies.

We are clearly noticing one major trend:

The more the dataset is unbalanced, the more accurate the encodings become.
Inversely: the more the dataset is balanced, the less accurate the encodings become.

We can also notice our Numeric Encoding is getting worse faster versus Binary Encoding when the dataset becomes more balanced: it becomes worse from 1024 cardinalities on a perfectly balanced dataset, while it becomes worse only from 2048 cardinalities on a 1:3 unbalanced dataset.

For One-Hot Encoding, it is even more pronounced: worse 256 cardinalities on a perfectly balanced dataset, to 128 cardinalities on a 1:3 unbalanced dataset.

In addition, there seems to be no reason to use One-Hot Encoding over Numeric Encoding according to the picture.

In addition, there are three specific trends we can capture:

Numeric Encoding is not giving consistent results, as the results are flying around a bit (compare the box plot sizes and you will notice it). It seems it requires a lot more cardinalities to converge to a stable performance. It depicts also more consistency “as the balancing becomes more unbalanced”.
One-Hot Encoding is extremely consistent. So much consistent there is not much to say about if you want to approximate the predictive power of a categorical feature when alone.
Binary Encoding gets consistent when the cardinality increases, while maintaining a stable performance. It depicts also more consistency “as the balancing becomes more unbalanced”. Perhaps someone has a mathematical proof of convergence towards a specific limit, given the cardinality and the balancing ratio?