Nice read.
Peter Fennell
51

> However, your experiment has some major flaws which make me really doubt the conclusion — “no reason to choose one hot encoding”.

cf “The experimental design is the following” (…)

Reality is more or less it applies only on this specific experimental design, and over-generalization must be taken with caution.

> you define tree complexity solely in terms of tree depth

cf “The experimental design is the following” (…)

Actually, the choice is mainly due to the fact maximum depth is the most used and most important hyperparameter for boosted trees.

Adding more hyperparameters = making stuff more complicated. If used on this experimental design, plots would end up 6D instead of 5D, very difficult to read.

> your categorical variables are perfectly predictive. When is this ever the case in data science that one feature is perfectly predictive?

cf “The experimental design is the following” (…)

This will also happen when you find a physics law inside the dataset (which happens a lot). Same for experimental designs intended to allow it (such as this one).

> One-hot is obviously an inefficient representation for high cardinality.

It depends on the dataset and on the task. Sparse datasets enjoy easily such high cardinality, especially when used for linear models (or decision trees + regression per leaf).

> However, there is no evidence here to say that One-hot is a worse predictor (my experience is that Numerical encoding is the worse).

cf “The experimental design is the following” (…)

Each case must be tested one by one depending on the dataset, because every dataset will not react the same way to a specific encoding.

> I had never considered binary encoding, sounds pretty cool

Binary encoding is a strange (working) encoding in the sense it’s a compromise between one hot encoding and numeric encoding. It seems to work very well sometimes on Kaggle competitions. From my experience:

  • Numeric encoding is speedy but usually causes lower performance (associates with your past experience on that encoding)
  • One hot encoding is slow but usually provides better performance
  • Binary encoding is a compromise of both

Obviously, derivative encodings can be generated instead of using only binary encoding (such as all the contrasts available in design matrices, polynomial encodings, etc.).