# Dealing with Categorical Data

**What is Categorical Data?**

The variable is considered categorical, when it describes some qualitative property and there is a limited number of option for its values.

*Here are some examples:*

- Weather conditions: “sun”, “rain”, “overcast”, “snow”, etc.
- Car manufacturers: “Toyota”, “Ford”, “Mercedes”, etc.
- Exam grades: “A”, “B”, “C”, “D” and “F”

Example 3 can be classified as *ordinal* variable meaning you can order its values in a meaningful way: “A” is better than “B” and so on. In other words, *ordinal* variable allows you to compare its values with each other.

On the other hand, we cannot order (or compare) values from examples 1 and 2 in a similar way. Such variables are called *nominal*.

**Encoding Categorical Data**

Many machine learning algorithms cannot handle categorical variables as they require all inputs to be numerical. For this reason, we need to find a way for transforming values of categorical variable to numbers. The general idea behind the most popular methods for encoding categorical data is to come up with a mapping of each category to numbers.

In the Figure 1 the mapping is completely random. However, we want to give to our machine learning algorithm a better sense of our data in order to make more accurate results. Thus, let’s discuss some more smart approaches than random encoding.

# 1. Encoding Ordinal Data

Ordinal data is a piece of cake to encode. Let’s recall the example 3 from the above (exam grades). We can keep the same relationship between different grades if we assign larger numbers to better grades. One of the possible mappings for this case is: “A” → 5; “B” → 4; “C” → 3; “D” → 2; “F” → 1.

Note, you can also assign smaller numbers for better grades and it basically doesn’t matter, as long as you are consistent with the encodings.

# 2. Encoding Nominal Data

Nominal data is more challenging to encode. We will discuss several methods in the order of increasing complexity, so feel free to stop whenever you feel like :)

## 2.1 One-Hot Encoding

One-Hot encoding is one of the most popular methods for encoding categorical data. For each unique category it creates a separate column with binary values (0 or 1). In each of these new columns there is a value of 1 if the corresponding category is present in this row and 0 otherwise. Too convoluted, right? Let’s jump to example!

These method has several disadvantages for machine learning algorithms:

- Data matrix of dummy variables is sparse (much more zeros than ones)
- Categorical features with high cardinality (number of unique categories) can blow your PC memory (or at least slow down your code), because for each category a separate column will be created.

## 2.2 Label Encoding

This method maps a single categorical column to just one numeric column which is a great advantage in comparison to One-Hot encoding.

In label encoding every distinctive category is mapped to some arbitrary number. Thus, this approach does not preserve any relationship between values in categorical variable. The simplest way to implement it is to enumerate all unique categories and use these numbers as encoded values.

You can also use *sklearn.preprocessing* package or pandas *factorize* method to achieve the same encoding. The only difference can be seen when there are NaN values presented in the data:

- Aforementioned example with “Manual Way” will preserve NaN values in the encoded data.
- Pandas factorize method will encode NaN values as -1.
- sklearn.preprocessing package will break when encounter a NaN value.

Approaches 1 and 2 give you a chance to fix NaN values later e.g. via imputing them, assigning it to a separate category (e.g. “missing”) or simply dropping it.

Notice that the numbers assigned to different categories in Snippet 4 and Snippet 5 are slightly different. But conceptually they are the same, because as we mentioned above, the goal of label encoding is to map every distinctive category to some **arbitrary** number.

## 2.3 Target Encoding

Numbers obtained by label encoding can mislead machine learning model. Let’s take a look at the Snippet 5. The encodings have values from 0 to 2 (remember, this assignment was random). There is a trivial relationship between them: 2>1>0. But what does it mean for Mercedes to be “greater” than Ford? Well, you may argue that Mercedes is on average more expensive than Ford, but what if we had “apple” and “orange” categories instead? Right, this makes no sense. Thus, we want to encode our categories in a more meaningful for machine learning model way.

As we cannot generate additional information from the thin air (ohhh, right, besides *frequency encoding*…), we will need a target (or response) variable that needs to be predicted.

Let’s say we have a dataset describing various neighborhoods across US and our goal is to predict an average household income. One of the features in the dataset is State (which is a categorical variable). And our target variable is numeric. The general idea behind target encoding is that categories which correspond to larger values of target variable will be encoded with larger numbers.

Wow, there were 3 paragraphs of text without examples… Let’s fix it!

As you can see in Snippet 6, after examining the data it looks like households in California (CA) have the highest average household income and in Texas (TX) it is the lowest among the states presented in our dataset. As you can see from the encodings, they indicate similar relationship: CA category gets the highest encoding value, TX gets the lowest, and NY gets the intermediate one.

For other algorithms similar to target encoding check out this link (there are quite a few of them, and I would recommend to check at least the CatBoost encoder).

# Final Notes

While encoding categorical data, you should always keep in mind the following issues that may arise:

- There is a chance that in test set you will face with a category that was not present in the train set.
- Target encoding can lead to data leakage. To avoid that, fit the encoder only on your train set, and then use its transform method to encode categories in both train and test sets.

You may be wondering, what should you do if you want to use target encoding for classification problem. There are approaches similar to aforementioned ones for classification tasks as well. The easiest one is to transform the target variable to numeric format (potentially using one of the methods described in this article) and then use the target encoding as you would do for regression problem.

Thanks for reading! Feel free to reach out to me on LinkedIn.