Categorical Encoding: Key Insights

Summarizing 54 Medium Stories

Vadim Arzamasov
8 min read · Jan 9, 2024
Image created by author with recraft.ai

Categorical feature encoding converts categorical variables into a numerical format and is a common step in preparing data for machine learning. If you search for this topic on Medium, you will find a pile of articles, each shedding some light on it, and it is quite a stack to wade through. Here I synthesize these pieces into a coherent summary, with the goal of distilling the wealth of available information and giving readers a shortcut to the insights they need. For anyone trying to navigate the various approaches to encoding categories, this summary is meant as a curated compass, pointing you to the reading that best suits your needs and sparing you a deep dive into the maze of Medium articles. It also serves as a review of related work in anticipation of a comprehensive benchmark of categorical encoding methods that I plan to publish in one of my upcoming posts.

In what follows, I review the coverage of encoders by various Medium stories. I then summarize the different features of these stories, such as whether they provide code snippets or set up an experiment to evaluate the performance of encoders in ML tasks. Next, I summarize the recommendations given regarding which encoder to prefer. The last section concludes.

Table 1. Feature encoding methods covered by selected Medium stories

Categorical encoder coverage by different stories

Table 1 is an overview of 54 Medium stories. The top part, down to the Label Count encoder, shows how the various Medium stories cover different encoders, 27 in total. The table sorts the encoders in rows by popularity, based on the number of stories discussing them, with the most popular at the top. The columns represent individual Medium stories; I provide their links at the end. The stories are arranged so that those covering a wider range of encoders come first. The value "+" or "!" in a cell means that the story covers the encoder; I will explain the difference between "+" and "!" shortly. To keep the width of the first column reasonable, I abbreviated some of the encoders and chose a single name for encoders that appear under different names in different stories. Table 2 below shows my choices.

Table 2. Encoders and their full or alternative names

Some encoders are often confused with each other. For instance, authors often use Label and Ordinal encoders as synonyms; in fact, the only difference between the two is that the latter requires an explicit specification of the order of the categories. Similarly, the Count encoder is often confused with the Frequency encoder. In such cases, when a story actually describes "encoder A" but names it "encoder B", I put "!" instead of "+" in the cell corresponding to "encoder B". I have treated Mean and Target encoders as synonyms, as most of the reviewed stories do. Note, however, that the Target encoder does something slightly different from the Mean encoder and is closer to a smoothed or regularized Mean; see [10].
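
To make these distinctions concrete, here is a minimal sketch of the four encoders that are most often mixed up. It assumes scikit-learn and pandas and is my own illustration rather than code from any of the reviewed stories; the exact behavior of the Label and Ordinal encoders, in particular, depends on the library used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small", "large"]})

# Label encoder: assigns integer codes in alphabetical order of the categories,
# with no way to say which category should come first.
label_codes = LabelEncoder().fit_transform(df["size"])
# -> [2, 0, 1, 2, 0]   (large=0, medium=1, small=2)

# Ordinal encoder: the caller specifies the order of the categories explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(df[["size"]])
# -> [[0.], [2.], [1.], [0.], [2.]]   (small=0, medium=1, large=2)

# Count encoder: replaces each category with its absolute count ...
counts = df["size"].map(df["size"].value_counts())
# -> small=2, large=2, medium=1

# ... while the Frequency encoder uses the relative frequency instead.
freqs = df["size"].map(df["size"].value_counts(normalize=True))
# -> small=0.4, large=0.4, medium=0.2
```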

By a wide margin, the One-Hot and Label encoders are the most frequently mentioned, appearing in 51 and 45 Medium stories, respectively. The runners-up, the Mean and Ordinal encoders, are much less popular, with 28 and 25 mentions. No other encoder is mentioned more than 20 times, and all encoders below Dummy are mentioned in fewer than 9 stories.

Distinctive features of categorical encoding stories

The bottom part of Table 1, below the Label Count encoder, provides a deeper insight into the content of individual stories as follows.

  • The Paywall row indicates whether the story is behind a paywall at the time of this writing. As a result, some of the cells below it in the same column may contain "?", meaning that I did not have access to the story to fill them in. If you have a paid account, you may want to visit [67]-[70] (see the links at the end), which are not part of my summary.
  • The Code row indicates whether a story contains code examples in a particular programming language. Most of the stories include code written in Python (the value "p" in the Code row); two stories include code written in R and SAS (the values "R" and "S", respectively).
  • Sometimes the code appears as an image, making it difficult to copy parts of it. If the code appears as text, the Can copy row contains a “+”.
  • Twelve stories complement the code with a notebook on GitHub. In these cases, I put the citation in the Repository row. The links corresponding to these citations are at the end of this text.
  • Some stories apply the encoders to one or more datasets. The Datasets row contains the number of datasets used. Here I use the convention that several small synthetic example datasets (e.g., a single feature with a handful of values) count as one. The value "D" in this row means that multiple datasets were synthesized with a data-generating process.
  • Also, if an ML model is trained on the encoded data, I show this with a "+" in the Train model row. These stories should be of particular interest to those looking for quickly deployable code snippets (when LLM-produced code is not sufficient for some reason); a minimal sketch of such a snippet follows this list.
  • If an author discusses the pros and cons of encoders, I indicate this with a "+" in the Pro/Contra row. This judgment is quite subjective, as there may be stories where only a single disadvantage of one encoder is mentioned.
  • Finally, you may be interested to know if there are recommendations that favor one or more encoders over the others. I mark such cases with a “+” in the Winner(s) row and discuss them below.
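
For readers looking for such end-to-end snippets, the typical pattern in the stories with a "+" in the Train model row is roughly the following. This is my own minimal sketch on a toy dataset, assuming scikit-learn, not code taken from any particular story: one-hot encode the categorical column and fit a simple classifier on the result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: one categorical feature, one numeric feature, binary target.
X = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "price": [10.0, 12.5, 7.0, 11.0, 9.5, 8.0],
})
y = [1, 0, 0, 0, 1, 1]

# One-hot encode the categorical column, pass the numeric column through,
# then fit a classifier on the encoded matrix.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)
print(model.predict(X))
```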

Recommendations from the stories

Most stories postulate that there is no one-size-fits-all encoder and that the choice should be made by trial and error. Nevertheless, some stories make certain recommendations, as follows.

  • [4] states that the most stable and accurate encoders are target-based encoders with double validation: CatBoost encoder, James-Stein encoder, and Mean encoder;
  • according to [10], Label encoding, Mean encoding, WOE encoding, and model encoding (such as CatBoost) are commonly used;
  • according to [13] (and [15]), OHE is the most commonly used method;
  • according to [16], the most commonly used are OHE and Label encoding. These are also the most frequently mentioned encoders in the stories I looked at, see Table 1. [18] also recommends using these encoders for nominal and ordinal variables, respectively;
  • in [25]’s experiments, Label and k-fold Mean encoders worked well (see the sketch below Table 3);
  • [26] recommends trying OHE, Hashing, LOO, and Mean for nominal variables; Ordinal, Binary, OHE, LOO, and Mean for ordinal variables; and avoiding Mean and LOO for regression tasks;
  • [35] supports Label and Binary encoders, but does not recommend OHE;
  • [38] states that OHE and Mean encoding are preferable solutions, and advises to avoid Label encoding;
  • in the analysis of [39], Mean encoding showed superior performance. They also defend Label and Frequency encoding;
  • the experiments of [40] support Frequency encoding with XGBoost;
  • finally, [45] suggests that 90% of the work can be done with Label encoding or OHE.
Table 3. Recommendations of some stories

Table 3 summarizes these statements: "o" means the encoder is recommended, "c" means the story says the encoder is commonly used, and "x" means the story advises avoiding the encoder. As you can see, there does not seem to be much agreement between the different recommendations. The most frequently reviewed encoders also tend to appear as recommended, so the recommendations may simply be biased toward the more popular encoders.
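
Several of these recommendations refer to regularized variants of Mean encoding, such as the "double validation" in [4] and the k-fold Mean encoder in [25]. The sketch below illustrates the out-of-fold idea; it is my own illustration (including the helper name kfold_mean_encode), assuming pandas and scikit-learn, not code from those stories. Each row is encoded with the target mean of its category computed on the other folds, shrunk toward the global mean to stabilize rare categories.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(cat: pd.Series, target: pd.Series,
                      n_splits: int = 5, smoothing: float = 10.0) -> pd.Series:
    """Out-of-fold Mean (Target) encoding with simple smoothing."""
    global_mean = target.mean()
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(cat):
        # Per-category target mean and count, computed on the training folds only
        stats = target.iloc[train_idx].groupby(cat.iloc[train_idx]).agg(["mean", "count"])
        # Shrink category means toward the global mean; rare categories move more
        smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
        # Categories unseen in the training folds fall back to the global mean
        encoded.iloc[val_idx] = cat.iloc[val_idx].map(smoothed).fillna(global_mean).values
    return encoded

# Tiny usage example
df = pd.DataFrame({"city":   ["A", "A", "B", "B", "B", "C", "A", "C"],
                   "bought": [1,   0,   1,   1,   0,   0,   1,   1]})
df["city_enc"] = kfold_mean_encode(df["city"], df["bought"], n_splits=4)
print(df)
```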

Conclusion

To summarize:

  • One-hot encoder and Label encoder are the most common encoders in the stories reviewed.
  • Many stories provide code in Python, but only a few run an experiment to see the effect of encoding on ML model performance.
  • There is a lack of consistency in the recommendations made regarding which encoder to try first.

In one of the next stories, I will invite you to take a look at what the existing research says on this topic, and present a recent comprehensive categorical encoder benchmark.

References

[1] Feature Encoding Basic to Advance-Part 3 (also the links to Part 1 and Part 2 there)
[2] Simplifying Encoders: Choosing the Right One
[3] All about Categorical Variable Encoding
[4] Benchmarking Categorical Encoders
[5] Feature Encoding Techniques in Machine Learning with Python Implementation
[6] Encoding Techniques for Categorical Attributes
[7] 6 Ways to Encode Features for Machine Learning Algorithms
[8] How to Handle Categorical Features
[9] All you need to know about encoding techniques!
[10] A summary of the encoding methods of Category Features
[11] Hands-on Categorical Feature Encoding in Machine Learning
[12] Categorical Features in Machine Learning
[13] Categorical Data Encoding Techniques
[14] A Comprehensive Guide to Categorical Variable Encoding: Hands-on Examples and Visualizations with the Titanic Dataset
[15] Encoding Categorial Variables
[16] Data Encoding: A Brief Overview of Different Methods in the Literature
[17] Encoding Categories
[18] Categorical Variable Encoding Techniques
[19] 5 Categorical Feature Encoding Techniques in SAS
[20] Feature Engineering — deep dive into Encoding and Binning techniques
[21] Different Type of Feature Engineering Encoding Techniques for Categorical Variable Encoding
[22] How to Encode Categorical Data
[23] Categorical Features Encoding Advanced Techniques: An Overview
[24] Encoding Categorical features with high cardinality
[25] Getting Deeper into Categorical Encodings for Machine Learning
[26] Smarter Ways to Encode Categorical Data for Machine Learning
[27] Different Categorical Encoding Techniques
[28] Effective Categorical Encoding For Different Use Cases Using Python
[29] Advanced categoric encoding like a pro: OneHot, MeanTarget, WOE, Frequency, Factorization
[30] Guide to Encoding Categorical Values in Python
[31] Category Encoders: A Powerful Tool for Data Scientists
[32] TensorFlow 2.0 Tutorial on Categorical Features Embedding
[33] Categorical Feature Encoding
[34] Categorical Encoding
[35] Visiting: Categorical Features and Encoding in Decision Trees
[36] How to do Categorical feature encoding in Machine Learning
[37] Categorical Embedder: Encoding Categorical Variables via Neural Networks
[38] Handling Categorical Data, The Right Way
[39] The Power of Encoding Techniques: A Comparative Analysis of Label Encoding, Frequency Encoding, and Target Encoding
[40] Handling Categorical Variables. The basics
[41] How to handle Categorical variables?
[42] Categorical Encoding with Pandas: get_dummies
[43] Handling Categorical Features using Encoding Techniques in Python
[44] Categorical Encoding (One Hot Encoding) in Feature Engineering.
[45] Categorical Encoding in Machine Learning: A Guide to Label Encoding and One-Hot Encoding
[46] Understanding Categorical Encoding Techniques: Ordinal, One-Hot, and Label Encoding
[47] One Hot Encoding VS Label Encoding
[48] Encoding Categorical Features
[49] Guide to Encoding Categorical Features Using Scikit-Learn For Machine Learning
[50] Yet another ML categorical variables encoding post
[51] Encoding Categorical Variables: One-hot vs Dummy Encoding
[52] A Deep Dive into Types of Categorical Encoding in Machine Learning
[53] Data Preprocessing: Part IV- Categorical Encoding
[54] CATEGORICAL ENCODING

[55] code
[56] code
[57] code
[58] code
[59] code
[60] code
[61] code
[62] code
[63] code
[64] code
[65] code
[66] code

[67] Dealing with Categorical Data
[68] An Easier Way to Encode Categorical Features
[69] Stop One-Hot Encoding Your Categorical Variables.
[70] Categorical Data Encoding Techniques in Python: A Complete Guide


Vadim Arzamasov

Former machine learning researcher at KIT, with a background in math, physics and economics (financial markets). https://www.linkedin.com/in/vadim-arzamasov/