Categorical Encoding in Machine Learning: A Guide to Label Encoding and One-Hot Encoding

Luiz Gabriel Bongiolo
Mar 2, 2023

Most of the work in machine learning is done before we actually fit the model. I would count categorical encoding among the main Feature Engineering techniques that help improve the results of a machine learning model. We convert categorical variables into numbers to make them “understandable” to machine learning models.

Machine learning models rely heavily on numerical data, but datasets often include categorical variables that represent qualitative features such as colors, regions, or types of products. Most machine learning algorithms are designed to work with numerical values and cannot use categorical data directly. The process of transforming categorical data into numerical form is called categorical encoding. In this article, we will list the common types of categorical encoding and walk through the two main methods, Label Encoding and One-Hot Encoding.

The Issue

One of the significant challenges in machine learning is handling categorical data, especially when the number of categories is large. Algorithms that expect numerical inputs may interpret encoded categories as ordinal or numeric quantities, which can hurt the performance and accuracy of the model.

Categorical Encoding

Categorical encoding is the process of converting categorical data into numerical values. There are several methods of categorical encoding, each with its own advantages and disadvantages. The most common types of categorical encoding are:

  • Label Encoding
  • One-Hot Encoding
  • Ordinal Encoding
  • Count Encoding
  • Target Encoding

I would guess that 90% of the work can be done using the first two from the list. The other 10% are used in very specific cases or if you really want to dig deep into every possibility you can explore in your model.

Label Encoding

This is a categorical encoding method that assigns a unique integer value to each category. It is best suited to categories that have an inherent order or rank. For instance, “small,” “medium,” and “large” can be encoded as 1, 2, and 3.
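As a minimal sketch of the “small,” “medium,” “large” example, an explicit mapping keeps the natural order of the categories. The toy data frame and the column name "size" are assumptions made up for illustration; they are not part of the credit card dataset used later.

```python
import pandas as pd

# Hypothetical toy data; the column name "size" is an assumption for illustration.
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Explicit mapping that preserves the natural order of the categories.
size_order = {"small": 1, "medium": 2, "large": 3}
sizes["size_encoded"] = sizes["size"].map(size_order)

print(sizes)
```

Any value missing from the mapping would come out as NaN, which makes unexpected categories easy to spot.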

Before writing any code, always check the data for missing values, inconsistencies, and so on.

Label Encoding using Pandas:

import pandas as pd 

df = pd.read_csv("creditcard.csv")

With the data frame loaded, let’s check which variables are categorical using dtypes.

df.dtypes 

We can see here that [“age”] is being considered an object, but we see numeric values in the data frame. What’s going on?

df["age"] = pd.to_numeric(df["age"])

ValueError: Unable to parse string “?” at position 32

So Pandas is treating these values as object because missing data was replaced with a question mark “?”. Fix that and move on.
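One possible way to fix it, sketched on a small made-up sample rather than the real file, is to turn the “?” placeholder into a proper missing value and then convert the column. The sample values are hypothetical; only the column name “age” and the “?” marker come from the dataset described here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the "age" column read from the CSV.
ages = pd.DataFrame({"age": ["34", "51", "?", "27"]})

# Turn the "?" placeholder into a proper missing value, then convert to numeric.
ages["age"] = pd.to_numeric(ages["age"].replace("?", np.nan))

print(ages.dtypes)
print(ages["age"].isna().sum())  # number of missing ages
```

Another option would be to pass na_values="?" to pd.read_csv so the placeholder is treated as missing when the file is loaded. How to handle the missing ages afterwards (drop or impute) depends on the dataset.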

Let’s convert the column [“education_level”] to numeric codes using Label Encoding, first with Pandas and then with Scikit-Learn.

To perform the task in just one line of code, we first need to convert the object column to the “category” dtype using the astype method; then we can use the .cat.codes accessor to encode the data.

df["education_level"] = df["education_level"].astype("category")
df.dtypes

Now we are ready to convert [“education_level”] using Pandas:

df["education_level_cat"] = df["education_level"].cat.codes

Note that we have created a new column for the encoded data. You can perform this task directly on the original column, but it’s always good to check that the encoding worked by comparing the new column with the original one.

Pandas assigns the codes based on the alphabetical order of the categories, as seen below.

df.education_level.unique()
df.education_level_cat.unique()
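To see exactly which code belongs to which category, one option is to build a small lookup from the category index. This is a sketch on a made-up sample; the level names are borrowed from the manual mapping shown later in this article.

```python
import pandas as pd

# Hypothetical sample; the level names follow the manual mapping later in the article.
edu = pd.DataFrame({"education_level": ["phd", "college", "high school", "college"]})
edu["education_level"] = edu["education_level"].astype("category")
edu["education_level_cat"] = edu["education_level"].cat.codes

# Categories are stored in sorted order; a category's position is its code.
code_to_label = dict(enumerate(edu["education_level"].cat.categories))
print(code_to_label)
```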

The hierarchy assigned by Pandas does not necessarily reflect the real-life hierarchy of education levels. In the next article, we will look at whether this has an impact on the model’s performance.

If you wish to attribute the hierarchy manually, here’s one way of doing it:

import numpy as np

# Nested np.where: each level gets its rank; anything unmatched falls back to 0
df["education_level"] = np.where(df.education_level == 'high school', 1,
    np.where(df.education_level == 'college', 2,
    np.where(df.education_level == 'graduate', 3,
    np.where(df.education_level == 'post-graduate', 4,
    np.where(df.education_level == 'phd', 5, 0)))))
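An alternative that avoids the nesting, offered here as a sketch rather than as the method used in this article, is an ordered pd.Categorical with the levels listed in their real-life order. The sample data frame and the column name "education_level_num" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample using the same level names as the np.where version.
edu = pd.DataFrame({"education_level": ["college", "phd", "high school", "graduate"]})

# List the levels from lowest to highest so the codes follow the real hierarchy.
levels = ["high school", "college", "graduate", "post-graduate", "phd"]
ordered = pd.Categorical(edu["education_level"], categories=levels, ordered=True)

# Codes start at 0 and are -1 for values outside `levels`; adding 1 gives
# the same scheme as the np.where version (1 to 5, with 0 for unknown values).
edu["education_level_num"] = ordered.codes + 1

print(edu)
```

Keeping the order in a single list makes it easier to add or reorder levels than editing a chain of nested conditions.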

Label Encoding with Scikit-Learn:

Let’s perform the same task using Scikit-Learn, this time directly on the original column, and compare the results with what we got using Pandas.

from sklearn.preprocessing import LabelEncoder

label_or = LabelEncoder()
df["education_level"] = label_or.fit_transform(df["education_level"])

df

Both methods produced the same results: the [“education_level”] column was transformed using Scikit-Learn, and [“education_level_cat”] was transformed using Pandas.
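The two approaches agree because both sort the categories before numbering them. The check below is a small sketch on made-up data, not on the credit card file, and the variable names are chosen only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with the same kind of string categories.
edu = pd.DataFrame({"education_level": ["phd", "college", "high school", "college"]})

# Pandas: category codes follow the sorted categories.
pandas_codes = edu["education_level"].astype("category").cat.codes

# Scikit-Learn: LabelEncoder also sorts the classes it finds.
encoder = LabelEncoder()
sklearn_codes = encoder.fit_transform(edu["education_level"])

print(list(encoder.classes_))
print((pandas_codes.to_numpy() == sklearn_codes).all())
```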

One-Hot Encoding

One-Hot Encoding, also known as Dummy Encoding, creates a binary column for each category, and each observation or row is assigned a 1 or 0 in each category’s column, indicating the presence or absence of the category. For instance, with three categories, “red,” “green,” and “blue,” One-Hot Encoding creates three columns, and each row has a 1 in the column of its own color and a 0 in the other two.
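The color example can be sketched in a few lines with pd.get_dummies. The toy data frame and the column name "color" are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data for the red/green/blue example.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new column per category; dtype=int gives 1/0 instead of True/False.
encoded = pd.get_dummies(colors, columns=["color"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the three new columns, which is where the name “one-hot” comes from.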

In this scenario, the card provider doesn’t necessarily have a hierarchy, so One-Hot Encoding is a better fit than Label Encoding. We must check how many unique values there are in the column we target, because this method will create a new column for each category.

First let’s check how many unique values the column [“card_provider”] has:

df.card_provider.unique()

One-Hot Encoding can be implemented in Python using the Pandas library as shown below:

pd.get_dummies(df, columns=["card_provider"]).head(10)

Keep in mind that if we are working with a dataset that has a large number of categories, we can end up creating more columns than there are rows in our data frame. This can be an issue for machine learning models.
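One way to guard against this, sketched here under assumptions, is to count the unique values per column first and only one-hot encode the columns below a chosen limit. The sample values, the "customer_id" column, and the threshold of 3 are all hypothetical; a sensible limit depends on the dataset and the model.

```python
import pandas as pd

# Hypothetical sample; "customer_id" stands in for a high-cardinality column.
data = pd.DataFrame({
    "card_provider": ["visa", "mastercard", "visa", "amex"],
    "customer_id": ["a1", "b2", "c3", "d4"],
})

max_categories = 3  # hypothetical threshold, tune it for your own data

# Count unique values per column and keep only the low-cardinality ones.
cardinality = data.nunique()
low_card_cols = cardinality[cardinality <= max_categories].index.tolist()

encoded = pd.get_dummies(data, columns=low_card_cols)
print(encoded.columns.tolist())
```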

Bonus Trick

If you wish to transform all columns containing string classes into numeric values at once, you can use a lambda function along with the factorize function as shown below:

# df1 is assumed to be a data frame that still holds the original string columns
cols = ["education_level","marital_status","income_category","card_provider"]
df1[cols] = df1[cols].apply(lambda x: pd.factorize(x)[0] + 1)

This method works like the label encoding method, with two differences: the codes start at 1 instead of 0 (because of the + 1), and factorize numbers the categories in the order they first appear in the column rather than alphabetically.
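A quick comparison on a made-up series shows how the numbering from factorize relates to the one from .cat.codes. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of education levels.
levels = pd.Series(["phd", "college", "phd", "high school"])

# factorize numbers categories in order of first appearance (here shifted by 1).
factorize_codes = pd.factorize(levels)[0] + 1

# .cat.codes numbers categories in sorted (alphabetical) order, starting at 0.
category_codes = levels.astype("category").cat.codes

print(factorize_codes.tolist())
print(category_codes.tolist())
```

Note that factorize marks missing values with -1 by default, so with the + 1 shift they end up as 0.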

Conclusion

Categorical encoding is an essential step in preprocessing categorical data before feeding it into a machine learning model. It is crucial to choose the appropriate encoding method based on the data’s nature and the model’s requirements. Label Encoding and One-Hot Encoding are two popular encoding methods used to convert categorical data into numerical form. By understanding these methods and how to implement them, data scientists can improve the performance and accuracy of their machine learning models.
