A Deep Dive into Types of Categorical Encoding in Machine Learning

Regan Muthomi
Mar 28, 2023


Real-world data comes in all shapes, forms, and types. However, most machine learning algorithms only accept numerical inputs. To build predictive models using these algorithms, we need to convert the categorical variables in our data to a numeric format. This is achieved through categorical encoding.


Categorical encoding is the process of converting categorical variables into a numerical format that machine learning algorithms can process.

In this article, we are going to discuss two of the most common encoding methods:

  1. Ordinal Encoding.
  2. One-hot Encoding.

Before discussing these encoding methods, let’s take a minute to understand the different types of categorical variables that we might encounter in a dataset. Generally, there are two types of categorical variables: nominal variables and ordinal variables.

What’s the difference between these two types?

Ordinal variables have a natural hierarchy or order. For example, income bracket (high, middle, low) has a natural order: a person in the high income bracket earns more than people in the middle and low brackets, and a person in the middle bracket earns more than a person in the low bracket.

Nominal variables, on the other hand, do not have a natural hierarchy and all categories are considered equal. Examples of nominal variables include gender (male, female), country (Nigeria, Kenya, Ghana), and so on.

Nominal and Ordinal variables
In the example above Income_class is an ordinal variable while Gender is a nominal variable.

When encoding ordinal variables, it is important to retain the natural order between the categories. On the other hand, for nominal variables, we want to avoid introducing any hierarchy that would suggest unequal categories. Therefore, different encoding methods should be used for nominal and ordinal categorical variables.

Now let’s discuss encoding methods and when to use them.

1. Ordinal Encoding

As the name suggests, ordinal encoding is used to encode ordinal variables. It retains the natural hierarchy in the variable’s categories by converting them into ordered integers (from 0 to n_categories - 1).

Ordinal Encoding
Income_class variable has been ordinal encoded to Income_class(Encoded).

The illustration above shows that the Income_class variable has been encoded to produce Income_class(Encoded) while maintaining the natural hierarchy between the categories.
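To make the mapping concrete, here is a minimal plain-Python sketch of the idea. The income values and the `ordinal_encode` helper are hypothetical, made up for illustration; they are not taken from the dataset in the illustration.

```python
# Hypothetical income brackets, listed from lowest to highest.
INCOME_ORDER = ["low", "middle", "high"]


def ordinal_encode(values, ordered_categories):
    """Map each value to its 0-based position in ordered_categories."""
    index_of = {category: i for i, category in enumerate(ordered_categories)}
    return [index_of[value] for value in values]


# Hypothetical Income_class column.
income_class = ["high", "low", "middle", "low"]
income_class_encoded = ordinal_encode(income_class, INCOME_ORDER)

print(income_class_encoded)  # [2, 0, 1, 0]
```

Because the order is supplied explicitly, a larger integer always means a higher income bracket. With three categories the codes run from 0 to 2.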

2. One-Hot Encoding

One-hot encoding, also referred to as dummy variable encoding, is a method used for encoding nominal variables. It works by creating one binary column, or dummy variable, for each distinct category. If a variable has n distinct categories, n binary columns are created. (Sometimes one of the dummy variables is dropped, leaving n-1 columns, because the dropped category is implied whenever all the remaining columns are 0.)

Using a one-hot encoder, the Gender variable produces two binary columns: Gender_female and Gender_male.

The Gender column has two distinct categories, male and female. Therefore, the one-hot encoder has created two binary columns: Gender_male and Gender_female, each with only two distinct values, 0.0 or 1.0.

Let’s focus on the first row to understand what is really happening here.

Row 1

In one-hot encoding, we check for the presence or absence of a category in each row. Since our first row has a value of ‘Male’ for the ‘Gender’ column, it means that ‘Female’ is absent in that row. Therefore, the ‘Gender_female’ dummy variable will have a value of 0.0 (absent) and the ‘Gender_male’ variable will have a value of 1.0 (present).
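The presence/absence check described above can be sketched in a few lines of plain Python. The `one_hot_encode` helper and the sample Gender values are hypothetical and only meant to mirror the Row 1 example, where the first row is male.

```python
def one_hot_encode(values, prefix):
    """Return one binary column (list of 0.0/1.0) per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1.0 if value == category else 0.0 for value in values]
        for category in categories
    }


# Hypothetical Gender column; the first row is "male".
gender = ["male", "female", "female", "male"]
encoded = one_hot_encode(gender, prefix="Gender")

# Look at the first row only: the category that is present gets 1.0.
first_row = {column: flags[0] for column, flags in encoded.items()}
print(first_row)  # {'Gender_female': 0.0, 'Gender_male': 1.0}
```

Each row has exactly one 1.0 across the dummy columns, because every row belongs to exactly one category.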

However, be careful when using one-hot encoding on variables with many distinct categories (high cardinality). Every distinct category adds a column, so the dimensionality of your inputs can grow significantly, which can make models slower to train and more prone to overfitting.
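A quick way to see the effect is to count how many binary columns a variable would produce before encoding it. The sketch below uses two hypothetical columns of 1,000 rows each; `one_hot_width` is an illustrative helper, not a library function.

```python
def one_hot_width(values):
    """Number of binary columns one-hot encoding would create for a column."""
    return len(set(values))


# Hypothetical columns with 1,000 rows each.
gender = ["male", "female"] * 500              # low cardinality
user_id = [f"user_{i}" for i in range(1000)]   # high cardinality: every value is unique

print(one_hot_width(gender))   # 2
print(one_hot_width(user_id))  # 1000
```

Here the low-cardinality column adds 2 features, while the high-cardinality column adds as many features as there are rows.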

Conclusion

In this article, we have looked at two of the most common categorical encoding methods and their use cases. Of course, there are several other encoding methods that we will discuss in a future blog.

In scikit-learn, we can use the OrdinalEncoder class for ordinal encoding and the OneHotEncoder class for one-hot encoding.
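As a rough sketch of how the two scikit-learn classes might be used, assuming scikit-learn is installed. The toy data is hypothetical, and the category order passed to `OrdinalEncoder` is an assumption about what "lowest to highest" means for this example.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: each inner list is one row with a single column.
income_class = [["low"], ["high"], ["middle"], ["low"]]
gender = [["male"], ["female"], ["female"], ["male"]]

# Pass the categories explicitly, lowest first. Without this, the encoder
# sorts them alphabetically (high, low, middle), which loses the real order.
ordinal_encoder = OrdinalEncoder(categories=[["low", "middle", "high"]])
income_encoded = ordinal_encoder.fit_transform(income_class)

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense.
one_hot_encoder = OneHotEncoder()
gender_encoded = one_hot_encoder.fit_transform(gender).toarray()

print(income_encoded.ravel().tolist())          # [0.0, 2.0, 1.0, 0.0]
print(one_hot_encoder.categories_[0].tolist())  # ['female', 'male']
print(gender_encoded.tolist())
```

The columns of `gender_encoded` follow the order in `categories_`, so the first column corresponds to female and the second to male.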

Thank you very much for reading this article. I hope you enjoyed reading it as much as I enjoyed writing it for you. I would appreciate any feedback you may have in the comments section or on my LinkedIn page.

Also, feel free to say hi ✋ on Twitter.

Gracias!


Regan Muthomi

Regan is a Data Scientist and an AI enthusiast, skilled in ML. I want to make learning data science and machine learning easy and accessible for all.