How to Handle Categorical Features

Ashutosh Sahu
Published in Analytics Vidhya
Mar 20, 2021

Hello everyone! Welcome to my blog. If you work with data, I am 110% certain that you have seen columns that hold categories, for instance Gender (Male, Female) or Education (Ph.D., Master’s, Bachelor’s). Since machine learning models are mathematical, we need to convert these categories into numbers before using them to train a model.

In this blog, we’ll look at what categorical variables are and the various types of them, as well as different approaches to handling categorical data with code samples.

All the code samples and datasets are available here.

Categorical Data and Its Types

A categorical or discrete variable is one that has two or more categories (values). There are two different types of categorical variables:

Nominal

A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (Male and Female) with no inherent ordering between them. Another example is Country (India, Australia, America, and so forth).

Ordinal

An ordinal variable has a clear ordering within its categories. For example, consider temperature as a variable with three distinct (but related) categories (low, medium, high). Another example is an education degree (Ph.D., Master’s, or Bachelor’s).

Different Approaches to Handle Categorical Data

· One Hot Encoding

· One Hot Encoding with multiple categories

· Ordinal Number Encoding

· Count or Frequency Encoding

· Target guided Ordinal Encoding

· Mean Ordinal Encoding

· Probability Ratio Encoding

One Hot Encoding

This technique is applied to nominal categorical features.

In the one-hot encoding method, each category value is converted into a new column, and the column is assigned a value of 1 or 0.

This can be done using the pandas get_dummies() function, after which we drop the first column to avoid the dummy variable trap (the dropped column is fully determined by the remaining ones).
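A minimal sketch of this step, assuming a hypothetical single-column DataFrame (the column name and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# drop_first=True drops the first category in sorted order ("Female"),
# so the single remaining column is enough to tell the two categories apart.
encoded = pd.get_dummies(df["Gender"], drop_first=True, dtype=int)
print(encoded)
```

With only two categories, one column remains: a 1 means Male and a 0 means Female.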

Advantages :

· Simple to use and fits well for data with few categories.

Disadvantages:

· High cardinality (a large number of categories) greatly increases the feature space, resulting in the curse of dimensionality.

One Hot Encoding with Multiple Categories

This technique was picked up from the KDD Cup Orange challenge, where it appeared in a solution based on ensemble selection. The authors made a slight modification to one-hot encoding: instead of creating a new column for every category, they create new columns only for the 10 most frequent categories. Sounds like jargon? It did to me too 😊

Let’s look at the code below to understand it better:
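A minimal sketch of the idea, assuming a hypothetical `country` column. The helper name `one_hot_top_n` is made up for this example and is not part of pandas:

```python
import pandas as pd

def one_hot_top_n(df, column, n=10):
    """Add a 0/1 column for each of the n most frequent categories of `column`."""
    # value_counts() sorts by frequency, most frequent first.
    top = df[column].value_counts().head(n).index
    out = df.copy()
    for category in top:
        out[f"{column}_{category}"] = (out[column] == category).astype(int)
    # The original text column is no longer needed once it is encoded.
    return out.drop(columns=[column])

# Hypothetical toy data; n=2 keeps the example small (the technique uses 10).
df = pd.DataFrame(
    {"country": ["India", "India", "India", "America", "America", "Australia"]}
)
encoded = one_hot_top_n(df, "country", n=2)
print(encoded)
```

Rows whose category falls outside the top n (Australia here) get 0 in every new column.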

Advantages:

· Easy to implement

· Does not massively expand the feature space

Disadvantages :

· Does not retain any information about the categories that are left out.

Ordinal Number Encoding

As the name implies, this technique is used for ordinal categorical features.

In this technique, each unique category value is given an integer value. For instance, “red” equals 1, “green” equals 2 and “blue” equals 3.

Domain knowledge can be used to determine the order of the integer values. For example, most people love Saturdays and Sundays and hate Mondays. In this scenario, the weekday mapping is ‘Monday’ is 1, ‘Tuesday’ is 2, ‘Wednesday’ is 3, ‘Thursday’ is 4, ‘Friday’ is 5, ‘Saturday’ is 6, ‘Sunday’ is 7.

Take a look at the code below for a better understanding:
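A minimal sketch using the weekday mapping described above, assuming a hypothetical `day` column:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"day": ["Monday", "Saturday", "Wednesday", "Sunday"]})

# The order is chosen from domain knowledge, as described above.
weekday_order = {
    "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
    "Friday": 5, "Saturday": 6, "Sunday": 7,
}

# map() looks up each value in the dictionary; unknown values become NaN.
df["day_encoded"] = df["day"].map(weekday_order)
print(df)
```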

Advantages :

· Easy and straightforward to implement

· Widely used in survey and research data encoding.

Disadvantages:

· Does not have a standardized interval scale.

Count or Frequency Encoding

As the name implies, in this technique we replace each category with the number of observations in the dataset that belong to it.

For example, if India appears 56 times in the country column and America appears 49 times, we replace India with 56 and America with 49 in the country column.
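A minimal sketch of count encoding, assuming a hypothetical `country` column with much smaller counts than in the example above:

```python
import pandas as pd

# Hypothetical toy data: India appears 3 times, America 2 times.
df = pd.DataFrame({"country": ["India"] * 3 + ["America"] * 2})

# Count how often each category occurs in the column.
counts = df["country"].value_counts()

# Replace each category with its count.
df["country_count"] = df["country"].map(counts)
print(df)
```

For frequency encoding, `value_counts(normalize=True)` gives proportions instead of raw counts.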

Advantages:

· Easy to implement

· There will be no increase in feature space.

· Works well with tree-based algorithms.

Disadvantages:

· Categories with the same frequency receive the same value, so the model can no longer tell them apart.

Target guided Ordinal Encoding

In this technique, we will transform our categorical variable by comparing it to the target or output variable.

Steps:

1) Choose a categorical variable.

2) Group the data by the categorical variable and take the mean of the target variable for each category.

3) Rank the categories by that mean, assigning higher integer values to the categories with higher means.
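The steps above can be sketched as follows, assuming hypothetical `city` and `target` columns:

```python
import pandas as pd

# Hypothetical toy data with a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0],
})

# Step 2: mean of the target for each category (A: 1.0, B: 0.5, C: 0.0).
means = df.groupby("city")["target"].mean()

# Step 3: sort categories by mean and number them, lowest mean first.
ordered = means.sort_values().index
rank = {category: i for i, category in enumerate(ordered)}

df["city_encoded"] = df["city"].map(rank)
print(df)
```

In practice the ranks would be computed on the training data only and then applied to validation and test data, which limits how much of the target leaks into the feature.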

Advantages:

· Establishes a monotonic relationship between the variable and the target.

· Helps in faster learning

Disadvantages:

· Because of the close relationship to the target variable, it often leads to overfitting.

Mean Ordinal Encoding

It’s a slight variant of target-guided ordinal encoding and is very popular among data scientists. We replace the category with the obtained mean value instead of assigning integer values to it.
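A minimal sketch of mean encoding, assuming hypothetical `city` and `target` columns:

```python
import pandas as pd

# Hypothetical toy data with a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0],
})

# Mean of the target for each category (A: 1.0, B: 0.5, C: 0.0).
means = df.groupby("city")["target"].mean()

# Replace each category with its mean directly, with no ranking step.
df["city_mean"] = df["city"].map(means)
print(df)
```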

Advantages:

· Improves classification model efficiency.

· Fast acquisition of information

Disadvantages:

· Leads to overfitting

· May lose information if two categories have the same mean, since they receive the same value

Probability Ratio Encoding

This technique is suitable only for classification problems where the target variable is binary (either 1 or 0, or True or False).

In this technique, we will substitute the category value with the probability ratio i.e. P(1)/P(0).

Steps :

1) For each category of the categorical variable, calculate the probability that the target variable is True or 1.

2) For the same category, calculate the probability that the target variable is False or 0.

3) Calculate the probability ratio i.e. P(True or 1) / P(False or 0).

4) Replace the category with its probability ratio.
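The steps above can be sketched as follows, assuming hypothetical `cabin` and `survived` columns:

```python
import pandas as pd

# Hypothetical toy data with a binary target.
df = pd.DataFrame({
    "cabin": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Step 1: P(1) per category is the mean of a 0/1 target (A: 0.75, B: 0.25).
p1 = df.groupby("cabin")["survived"].mean()

# Step 2: P(0) is the complement.
p0 = 1 - p1

# Step 3: probability ratio P(1) / P(0).
ratio = p1 / p0

# Step 4: replace each category with its ratio.
df["cabin_ratio"] = df["cabin"].map(ratio)
print(df)
```

If a category has no 0 outcomes at all, P(0) is 0 and this division yields infinity in pandas, so such categories need separate handling (for example, replacing the zero with a small constant).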

Advantages:

· Does not expand the feature space.

· Captures information from within the category, resulting in more predictive features.

Disadvantages:

· Not defined when the denominator is 0.

· It sometimes results in overfitting.

Let’s wrap up by discussing which technique is best suited to a given problem statement or model.

To be honest, there is no one-size-fits-all solution for any dataset or problem statement. We need to test a few cases to see which ones produce the best results.

References :

· https://www.youtube.com/watch?v=uWD-r7GZppg

· https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

Thank you for taking the time to read this post. If you liked this read, hit the 👏 button and share it with others. If you have any questions, please leave them in the comments section and I will do my best to reply.

You can connect with me on LinkedIn, Facebook, and Instagram.

Until next time, Adios Amigo !!!!!
