Simplifying Encoders: Choosing the Right One

Gia Ranjan
6 min read · Aug 15, 2023


I found an amazing article about encoders back when I was new to machine learning and the thought of dealing with categorical variables was nothing short of a nightmare. The article (which you should really check out here if you are a beginner) covered 15 different ways to convert a categorical variable into a number for building a machine learning model.

And it was a good, comprehensive guide too. Except that my brain looked at the number 15 and went ‘nope, too many, too confusing’.

So I worked through the article above and created my own (very simple) notes, at a kid-friendly difficulty level, on the common cases where each encoder is a good choice. This made the concepts much easier to absorb; check out this article, where I explain the ‘kid-friendly’ method in a bit more detail. I hope this helps anyone else who might be struggling!

I will be covering the following encoding methods:

  1. One Hot Encoding
  2. Label Encoding
  3. Ordinal Encoding
  4. Helmert Encoding
  5. Binary Encoding
  6. Frequency Encoding
  7. Mean Encoding
  8. Weight of Evidence Encoding
  9. Probability Ratio Encoding
  10. Hashing Encoding
  11. Backward Difference Encoding
  12. Leave One Out Encoding
  13. James-Stein Encoding
  14. M-estimator Encoding
  15. Thermometer Encoder

(1) One Hot Encoding: Use when the categorical variable has no ordinal relationship, and you want to create a binary column for each category. It’s common but can lead to dimensionality problems if the category count is high.

Kid-friendly explanation with examples

  • Use when: You have categories like animal types, and there’s no order to them.
  • Simple explanation: Imagine you have three animals: Cat, Dog, and Bird. One Hot Encoding makes three buckets, and if you see a Cat, you put a ball in the Cat bucket only. The Dog and Bird buckets stay empty.
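
A minimal sketch with pandas (the animal column is invented for illustration; scikit-learn’s OneHotEncoder does the same job inside a pipeline):

```python
import pandas as pd

# Toy data: three animal types with no natural order
df = pd.DataFrame({"animal": ["Cat", "Dog", "Bird", "Cat"]})

# One binary column per category; each row gets exactly one 1
one_hot = pd.get_dummies(df["animal"], prefix="animal")
print(one_hot)
```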

(2) Label Encoding: Apply when the categorical variable has an ordinal relationship, meaning there is a clear ranking between the categories. Because it assigns plain integers, the model can misread them as an order when no such relationship exists.

Kid-friendly explanation with examples

  • Use when: You have categories that have a clear order, like Small, Medium, Large.
  • Simple explanation: If you have three sizes of shirts, you can give numbers like 1 for Small, 2 for Medium, and 3 for Large.
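
A minimal sketch, assuming the sizes and their ranking are known up front. Note that scikit-learn’s LabelEncoder assigns codes alphabetically, so an explicit mapping is the simplest way to get the ranking the example describes:

```python
import pandas as pd

df = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small"]})

# Explicit mapping so the numbers follow the intended ranking
size_rank = {"Small": 1, "Medium": 2, "Large": 3}
df["size_label"] = df["size"].map(size_rank)
print(df)
```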

(3) Ordinal Encoding: Similar to Label Encoding but explicitly for ordinal categories. Use it when the categorical feature’s order matters.

Kid-friendly explanation with examples

  • Use when: You have things that can be put in order, like grades in school.
  • Simple explanation: Just like getting grades A, B, and C, you give numbers 1 for A, 2 for B, and 3 for C.
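
A minimal sketch with scikit-learn’s OrdinalEncoder (the grade column is invented; note that the codes it produces are 0-based):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"grade": ["A", "C", "B", "A"]})

# Pass the categories in rank order so the codes respect it
# (codes are 0-based here: A -> 0, B -> 1, C -> 2)
enc = OrdinalEncoder(categories=[["A", "B", "C"]])
df["grade_code"] = enc.fit_transform(df[["grade"]])
print(df)
```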

(4) Helmert Encoding: Helpful when comparing each level of a categorical variable to the mean of the subsequent levels. It can provide insight into contrasts between different levels.

Kid-friendly explanation with examples

  • Use when: You want to compare one thing to others.
  • Simple explanation: Imagine lining up your toys and comparing each one to the rest.
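
A minimal sketch using the category_encoders package (the toy column is invented for illustration):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

df = pd.DataFrame({"toy": ["car", "ball", "doll", "car", "ball"]})

# Each level is contrasted against the mean of the levels that follow it
encoder = ce.HelmertEncoder(cols=["toy"])
print(encoder.fit_transform(df))
```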

(5) Binary Encoding: Good for reducing dimensionality in a category with numerous classes. It represents categories in binary code, thus utilizing fewer dimensions than One Hot Encoding.

Kid-friendly explanation with examples

  • Use when: You have many different things like colors, and you want to name them simply.
  • Simple explanation: Like using a secret code to name your colors. Red can be 001, Blue 010, and so on.
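
A minimal sketch with category_encoders (the colors are invented):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

df = pd.DataFrame({"color": ["red", "blue", "green", "yellow", "red"]})

# Categories become integer codes, which are then written out in binary,
# so k categories need roughly log2(k) columns instead of k
encoder = ce.BinaryEncoder(cols=["color"])
print(encoder.fit_transform(df))
```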

(6) Frequency Encoding: Use when the frequency of categories can be a useful feature. It replaces each category with its count or relative frequency.

Kid-friendly explanation with examples

  • Use when: You want to know how many times something appears, like your favorite fruit.
  • Simple explanation: If you like apples most and eat 5, and bananas only 2 times, then apples get the number 5, and bananas get 2.
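
A minimal sketch with plain pandas, mirroring the fruit example above:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple"] * 5 + ["banana"] * 2})

# Replace each category with how many times it appears
counts = df["fruit"].value_counts()
df["fruit_count"] = df["fruit"].map(counts)
print(df)  # apple -> 5, banana -> 2
```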

(7) Mean Encoding: Can be used when you want to replace categories with the mean target value for that category. Be cautious as it can lead to data leakage; proper validation is required.

Kid-friendly explanation with examples

  • Use when: You want to give something a score based on how often it wins or loses.
  • Simple explanation: Imagine playing a game with your friends. If you win 8 games out of 10, you give winning a score of 0.8, and losing gets a score of 0.2.
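
A minimal sketch with pandas (the player/won columns are invented). In real use, compute the means on the training split only, or the target leaks into the features:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["ann", "ann", "bob", "bob", "bob"],
    "won":    [1,     0,     1,     1,     0],
})

# Replace each category with the mean target value for that category
means = df.groupby("player")["won"].mean()  # ann: 0.5, bob: ~0.67
df["player_enc"] = df["player"].map(means)
print(df)
```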

(8) Weight of Evidence Encoding: Helpful in binary classification problems. It measures the predictive power of a category with respect to the dependent variable, i.e., how much that category supports or goes against a conclusion.

Kid-friendly explanation with examples

  • Use when: You want to see how much something helps or hurts your chances, like predicting rain.
  • Simple explanation: If you have 100 days, and on 70 of them, it rains when clouds are grey, but doesn’t rain on 30 of those days, the grey clouds get a number that shows they mostly mean rain but sometimes don’t.
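
A minimal by-hand sketch with pandas (the sky/rain data is invented; category_encoders also ships a WOEEncoder):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sky":  ["grey", "grey", "grey", "clear", "clear", "clear"],
    "rain": [1,      1,      0,      0,       0,       1],
})

# WoE = ln( P(category | event) / P(category | non-event) )
event     = df[df["rain"] == 1].groupby("sky").size() / (df["rain"] == 1).sum()
non_event = df[df["rain"] == 0].groupby("sky").size() / (df["rain"] == 0).sum()
woe = np.log(event / non_event)  # real code needs smoothing for zero counts
df["sky_woe"] = df["sky"].map(woe)
print(df)  # grey gets a positive WoE (mostly rain), clear a negative one
```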

(9) Probability Ratio Encoding: Used in binary classification, where each category is replaced with the odds of the dependent event, P(event) / P(no event).

Kid-friendly explanation with examples

  • Use when: You want to see the chance of something happening, like winning a game.
  • Simple explanation: If you win 3 out of 5 games, the winning gets the number 3/2 = 1.5, which means you win 3 games for every 2 games you lose.
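
A minimal sketch with pandas, mirroring the 3-out-of-5 example (the opponent column is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "opponent": ["sam"] * 5,
    "won":      [1, 1, 1, 0, 0],  # 3 wins out of 5 games
})

# Replace the category with p / (1 - p), the odds of the event
p = df.groupby("opponent")["won"].mean()
ratio = p / (1 - p)  # 0.6 / 0.4 = 1.5, i.e. 3 wins for every 2 losses
df["opponent_enc"] = df["opponent"].map(ratio)
print(df)
```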

(10) Hashing Encoding: Good for handling large datasets with high cardinality features. It uses a hash function to represent categories.

Kid-friendly explanation with examples

  • Use when: You have lots and lots of things like toy cars, and you want to put them into fewer groups.
  • Simple explanation: You can put all red toys in one box, all blue in another, and so on.
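
A minimal sketch with category_encoders (the car names and the number of output columns are invented):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

df = pd.DataFrame({"car": ["mustang", "beetle", "mini", "civic", "mustang"]})

# A hash function maps any number of categories into a fixed set of columns;
# unrelated categories may collide in the same column, which is the trade-off
encoder = ce.HashingEncoder(cols=["car"], n_components=4)
print(encoder.fit_transform(df))
```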

(11) Backward Difference Encoding: Can be applied when you want to compare the mean of the dependent variable for one category to the mean of the preceding category.

Kid-friendly explanation with examples

  • Use when: You want to compare one thing to the one before it.
  • Simple explanation: Like comparing your new toy with the last one you got.
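
A minimal sketch with category_encoders (the toy levels are invented and assumed to arrive in order):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

df = pd.DataFrame({"toy": ["first", "second", "third", "second"]})

# Each level is contrasted against the level immediately before it
encoder = ce.BackwardDifferenceEncoder(cols=["toy"])
print(encoder.fit_transform(df))
```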

(12) Leave One Out Encoding: Works well when you want to replace each row’s category with the mean of the dependent variable over all other rows of that category, leaving the current row out. That built-in exclusion helps keep a row’s own target value from leaking into its encoding during training.

Kid-friendly explanation with examples

  • Use when: You want to find out something, but you don’t count the one you’re looking at.
  • Simple explanation: Like guessing your average score, but without counting the game you’re playing now.
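
A minimal sketch with category_encoders (players and outcomes are invented):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"player": ["ann", "ann", "ann", "bob", "bob"]})
y = pd.Series([1, 0, 1, 1, 0])

# During fit_transform, each row's code is the target mean over all OTHER
# rows of the same category, so a row never counts its own target
encoder = ce.LeaveOneOutEncoder(cols=["player"])
print(encoder.fit_transform(X, y))
```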

(13) James-Stein Encoding: Useful for reducing overfitting on small datasets or on categorical variables with many levels.

Kid-friendly explanation with examples

  • Use when: You have many things to pick from but don’t want to get confused.
  • Simple explanation: Like picking the best flavor of ice cream but only tasting a few.
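
A minimal sketch with category_encoders (flavors and ratings are invented):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"flavor": ["vanilla", "vanilla", "mint", "choc", "choc"]})
y = pd.Series([1, 0, 1, 1, 1])

# Category means are shrunk toward the global mean; rare categories
# (like "mint", seen only once) are pulled in hardest, limiting overfitting
encoder = ce.JamesSteinEncoder(cols=["flavor"])
print(encoder.fit_transform(X, y))
```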

(14) M-estimator Encoding: Robust to outliers and suitable for regression tasks, especially when the categorical variable has many categories.

Kid-friendly explanation with examples

  • Use when: You want to find the average but don’t let the very high or low numbers trick you.
  • Simple explanation: Like counting your jumping score but ignoring the time you jumped very high or fell.
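
A minimal sketch with category_encoders (the athletes, scores, and the value of m are invented):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

X = pd.DataFrame({"athlete": ["kim", "kim", "kim", "lee"]})
y = pd.Series([2.0, 2.1, 9.5, 2.0])  # one outlier jump score

# m controls how strongly category means are pulled toward the global mean;
# a larger m keeps rare or noisy categories closer to the overall average
encoder = ce.MEstimateEncoder(cols=["athlete"], m=5.0)
print(encoder.fit_transform(X, y))
```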

(15) Thermometer Encoder: Used to encode ordinal data by creating a binary column for each level and cumulatively marking ones up to the rank of the observed level, like the mercury rising in a thermometer. It captures the ordinal information without adding more features than necessary (the first column is always 1 and is often dropped).

Kid-friendly explanation with examples

  • Use when: You have things in order, like temperature levels.
  • Simple explanation: Imagine a thermometer with four levels: Cold, Cool, Warm, and Hot. If the temperature is at the “Warm” level, you color in the areas for Cold, Cool, and Warm. If it’s “Cool,” you color in the areas for Cold and Cool only. It’s like coloring in each level up to the one you’re at, just like how a real thermometer’s mercury rises to show the temperature.
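
This one isn’t in the common libraries, so here is a minimal hand-rolled sketch with numpy and pandas (the levels and observations are invented; the always-1 first column can be dropped afterwards):

```python
import numpy as np
import pandas as pd

levels = ["Cold", "Cool", "Warm", "Hot"]  # the ordered categories
rank = {lvl: i for i, lvl in enumerate(levels)}

temps = pd.Series(["Warm", "Cool", "Hot"])

# Set a 1 for every level up to and including the observed one,
# like the mercury rising in a thermometer
codes = np.array([[int(i <= rank[t]) for i in range(len(levels))]
                  for t in temps])
print(pd.DataFrame(codes, columns=levels, index=temps))
```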

Keep in mind that these explanations are, of course, very simplified, but I hope they provide a more tangible way of understanding these concepts.
