Data Encoding: A Brief Overview of Different Methods in the Literature

Joaquin Caqueo
5 min read · May 5, 2023


Image from the data set, source: Kaggle⁴

Almost all machine learning algorithms require numerical inputs. For instance, consider the task of predicting a student’s performance from features such as gender, ethnicity, and parental level of education. Let’s analyze the race/ethnicity feature. The possible values are “group A”, “group B”, “group C”, “group D”, and “group E”. Before a machine learning algorithm can process the data, you must convert these values into numeric form, since models require all input and output variables to be numeric.

If your dataframe contains categorical data, you need to encode it into numeric values before fitting and evaluating a model. There are various methods available for this task, including One-Hot, Ordinal, and Binary encoding. In the next sections, we review the most commonly used ones.

One-Hot Encoding

We use the One-Hot Encoding technique for categorical data when the features are nominal, meaning they have no intrinsic order. For each level of a categorical feature, we create a new variable, and each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 represents the presence of that category.

Continuing with the previous example, the ‘ethnicity’ column consists of categorical data. Therefore, this encoding method will create n new columns, where n is the number of categories, each representing the absence or presence of a specific category in the original column.

One-hot encoding, source: Author

As you can see, a disadvantage of this method is that it increases the complexity of the data by adding multiple columns in the process.
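
To make this concrete, here is a minimal sketch using pandas (the toy rows below are invented for illustration; pd.get_dummies is the standard pandas helper for this):

```python
import pandas as pd

# Toy data standing in for the ethnicity feature from the example above
# (the rows are invented for illustration).
df = pd.DataFrame({"ethnicity": ["group A", "group B", "group C", "group A"]})

# One new column per category; 1 marks presence, 0 marks absence.
one_hot = pd.get_dummies(df["ethnicity"], prefix="ethnicity")
print(one_hot)
```

Each row of the output contains a single 1 in the column of its category and 0 everywhere else.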

Label Encoding (Ordinal Encoding)

In cases where retaining the order of categorical data is important, it is necessary to use an encoding technique that reflects the sequence of values. Label Encoding is one such technique where each label is converted into an integer value. However, a disadvantage of Label Encoding is that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy or order in them, even though there may not be any.

Label encoding, source: Author
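
As a sketch of how this looks in practice, scikit-learn’s OrdinalEncoder maps each category to an integer (the toy data is again invented; note that the default integer order is alphabetical, not semantic):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"ethnicity": ["group A", "group C", "group B", "group E"]})

# Each category is replaced by an integer code; the default ordering is
# alphabetical, so any apparent hierarchy is an artifact of the encoding.
encoder = OrdinalEncoder()
df["ethnicity_encoded"] = encoder.fit_transform(df[["ethnicity"]]).ravel()
print(df)
```

If the feature has a genuine order (e.g., education levels), you can pass that order explicitly via the categories parameter so the integer codes reflect it.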

Dummy Encoding

Dummy Encoding is similar to One-Hot Encoding, with the difference that it creates N−1 new features for N categories: one category serves as the base and is represented by all zeros. This slightly reduces the dimensionality of the data compared to One-Hot Encoding.

Dummy encoding; source: Towards Data Science⁵

Even so, like one-hot encoding, dummy encoding still has the disadvantage of increasing the dimensionality and complexity of the dataset.
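
In pandas, dummy encoding is the same call as one-hot encoding with drop_first=True (toy data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["group A", "group B", "group C", "group A"]})

# drop_first=True drops the base category ("group A" here), leaving N-1
# columns; the base category shows up as a row of all zeros.
dummies = pd.get_dummies(df["ethnicity"], prefix="ethnicity", drop_first=True)
print(dummies)
```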

Hash Encoding

Hash Encoding involves applying a hash function to the categorical values, which generates a fixed-length representation for each category. The resulting hash values are used as the encoded features for the categorical variables.

Hash method, source: Wikipedia

One of the advantages of hash encoding is that it can handle high-cardinality categorical variables, i.e., variables with a large number of distinct categories, without generating a very high number of output features. However, different categories may be mapped to the same hash value (a collision), which can lead to information loss and reduced accuracy in some cases.
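
Here is a minimal sketch with scikit-learn’s FeatureHasher, hashing the five ethnicity groups into four columns (n_features=4 is an arbitrary choice for illustration; using fewer columns than categories makes collisions more likely):

```python
from sklearn.feature_extraction import FeatureHasher

categories = ["group A", "group B", "group C", "group D", "group E"]

# input_type="string" means each sample is a list of raw strings.
# The hasher maps them into a fixed number of columns regardless of
# how many distinct categories exist; FeatureHasher uses a signed
# hash, so entries can be -1 as well as +1.
hasher = FeatureHasher(n_features=4, input_type="string")
hashed = hasher.transform([[c] for c in categories])
print(hashed.toarray())
```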

Binary Encoding

Binary encoding is a technique used in machine learning for encoding categorical variables. It involves representing each unique category as a binary string of fixed length, where each bit of the string represents a feature or column in the encoded dataset. Each category is assigned a unique binary code, which is obtained by converting the integer representation of the category into its binary equivalent.

For example, we can return to our categorical variable “ethnicity” with its categories “group A”, “group B”, “group C”, “group D”, and “group E”. To encode this variable using binary encoding, we first assign each category a unique integer code: “group A” = 0, “group B” = 1, “group C” = 2, “group D” = 3, and “group E” = 4. Then we convert these integer codes into their binary representation: 0 = 000, 1 = 001, 2 = 010, 3 = 011, and 4 = 100. Finally, we represent each category as a binary string of length 3, where the i-th bit of the string corresponds to the i-th digit in the binary code of the category.
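
The worked example above can be reproduced in a few lines of pandas. This is a hand-rolled sketch of the idea (libraries such as category_encoders offer a ready-made BinaryEncoder that automates the same steps):

```python
import pandas as pd

categories = ["group A", "group B", "group C", "group D", "group E"]

# Step 1: assign each category a unique integer code (group A = 0, ...).
codes = {cat: i for i, cat in enumerate(categories)}

# Step 2: write each code as a fixed-length binary string; 5 categories
# need 3 bits, since the largest code is 4 = 100 in binary.
n_bits = max(codes.values()).bit_length()
rows = {cat: [int(b) for b in format(code, f"0{n_bits}b")]
        for cat, code in codes.items()}

binary = pd.DataFrame.from_dict(
    rows, orient="index", columns=[f"bit_{i}" for i in range(n_bits)])
print(binary)
```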

The resulting binary-encoded dataset has far fewer columns than a one-hot encoded dataset (⌈log₂ N⌉ columns for N categories), which can reduce memory usage and computation time, and which makes binary encoding a practical option even for high-cardinality variables. However, the bit columns are shared among unrelated categories, which can make the encoded features harder for a model to interpret. Additionally, binary encoding implicitly imposes an ordering on the categories through the integer assignment, which may not reflect any real relationship between them.

Target Encoding

Target Encoding consists of calculating the mean of the target variable for each category. We then replace each category in both the training and validation sets with its corresponding mean value. This results in a new encoded feature that represents the average target value for each category.

One potential issue with target encoding is that it can lead to overfitting if the categories in the training set are not representative of the categories in the validation or test sets. To address this, techniques such as smoothing or regularization can be applied to the target encoding process to reduce overfitting and improve generalization performance.
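
Here is a minimal sketch of both the plain and the smoothed variants, assuming a numeric target; the math_score column and all values are invented for illustration:

```python
import pandas as pd

train = pd.DataFrame({
    "ethnicity":  ["group A", "group A", "group B", "group B", "group C"],
    "math_score": [70, 80, 60, 65, 90],  # hypothetical target values
})

# Plain target encoding: the mean target per category.
means = train.groupby("ethnicity")["math_score"].mean()

# Smoothed variant: blend each category mean with the global mean,
# weighted by the category count; m controls the smoothing strength
# (m = 10 is an arbitrary choice for illustration).
global_mean = train["math_score"].mean()
counts = train.groupby("ethnicity")["math_score"].count()
m = 10
smoothed = (counts * means + m * global_mean) / (counts + m)

train["ethnicity_encoded"] = train["ethnicity"].map(smoothed)
print(train)
```

With smoothing, rare categories are pulled toward the global mean, which is exactly what protects against overfitting on categories with few observations.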

Conclusion

We have reviewed a few methods for encoding categorical data into numerical values. It’s worth noting that many other methods exist for this task, but the most commonly used are One-Hot Encoding and Label Encoding. When deciding which method to use, it’s important to weigh the advantages and disadvantages of each; the right choice depends on the data at hand and on the expertise and experience of the data scientist.

References

  • [1] Seger, C. (2018). An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing.
  • [2] Dahouda, M. K., & Joe, I. (2021). A deep-learned embedding technique for categorical features encoding. IEEE Access, 9, 114381–114391.
  • [3] Potdar, K., Pardawala, T. S., & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers. International journal of computer applications, 175(4), 7–9.
  • [4] Kiattisak Rattanaporn. (2023, March). Student performance prediction, Version 1. Retrieved May 20, 2023, from https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics.
  • [5] Pramoditha, R. (2022, January 4). Encoding Categorical Variables: One-hot vs Dummy Encoding. Medium. https://towardsdatascience.com/encoding-categorical-variables-one-hot-vs-dummy-encoding-6d5b9c46e2db
