How to Handle Categorical Variables?

Huda · Published in Geek Culture · 6 min read · May 24, 2021
Photo by Daniel Lanner on Unsplash

Data Science is the art and science of solving real-world problems and making data-driven decisions. It deals with all kinds of structured and unstructured data. Broadly, data can be divided into two types: numerical and categorical. Most data science models are equipped to work with numerical data; however, things get interesting when we have to deal with categorical data.

What is Categorical data?

Categorical data is a form of data that takes on values within a finite set of discrete classes. It cannot be meaningfully counted or measured with numbers, so it is divided into categories instead. An example of categorical data would be the gender of a person, which can only take one of the values Male, Female, or Other.

There are two types of categorical variables:

I. Ordinal Variables

II. Nominal Variables

Photo by Martin Sanchez on Unsplash

I. Ordinal Variables:

These variables maintain a natural order in their class of values. If we consider level of education, we can easily sort the values in the order High School < Under-Graduate < Post-Graduate < PhD. A review rating system can also be considered an ordinal data type, where 5 stars is definitely better than 1 star.

II. Nominal Variables:

These variables do not maintain any natural/logical order. The color of a car can be considered a nominal variable, as colors cannot be compared with each other. It is impossible to state objectively that "Red" is better than "Blue". Similarly, gender is a nominal variable, as there is no way to rank Male, Female, and Other.
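The distinction between ordinal and nominal variables can be made explicit in pandas with the `Categorical` type, whose `ordered` flag marks whether comparisons are meaningful. The values below are hypothetical, for illustration only:

```python
import pandas as pd

# Ordinal: an explicit category order makes comparisons meaningful
education = pd.Categorical(
    ["HS", "PhD", "UG", "PG"],
    categories=["HS", "UG", "PG", "PhD"],
    ordered=True,
)

# Nominal: no order, so comparisons like color > "Red" would raise an error
color = pd.Categorical(["Red", "Blue", "Red"], ordered=False)

print(education.min(), education.max())  # HS PhD
print(list(education > "UG"))            # [False, True, False, True]
```

Marking a column as ordered also lets pandas sort it by rank rather than alphabetically.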

Encoding Categorical Data:

Most machine learning algorithms are designed to work with numeric data. Hence, we need to convert categorical (text) data into numerical data for model building. There are multiple encoding techniques for this conversion; let's look at some of them.

The most common techniques to deal with categorical variables are:

· Ordinal Encoding

· One Hot Encoding

· Label Encoding

· Count/Frequency Encoding

· Leave One Out Encoding

· Mean Encoding

I will focus on three of these techniques for handling categorical data.

Here is a preview of our data (see the GitHub code repo):

df.head()

1. Ordinal Encoding:

Ordinal encoding maps each unique category to a numerical value based on its order or rank, so it is primarily used for ordinal categories. Consider the education column in the given data frame. When creating an ordinal encoder with sklearn, we define the ordering of the categories as a list in ascending order: first High School, followed by Associate and Master's, with Ph.D. at the end.


from sklearn.preprocessing import OrdinalEncoder

df['Education'].unique()  # inspect the distinct categories first

# The category list defines the rank: HS=0, AS=1, M.S=2, Ph.D=3
ordinal = OrdinalEncoder(categories=[['HS', 'AS', 'M.S', 'Ph.D']])
df['Education'] = ordinal.fit_transform(df[['Education']])
df.head()

The ordinal encoder is the most suitable option for encoding ordinal variables. It helps the machine learning model establish a relationship between a categorical column and the target column. For example, if we want to predict an employee's salary, it would depend on different features, and education level would be one of them. Logically, the employee with a Ph.D. will tend to have a better salary than the one with a high school degree. The model will learn that a Ph.D. (encoded as 3 in the data frame) weighs more than a high school degree (encoded as 0), and therefore that salary increases as the level of education goes up, and vice versa.
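Since the article's `df` comes from the linked repo, here is a minimal self-contained sketch of the same steps on a made-up frame (the values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data standing in for the article's dataset
df_demo = pd.DataFrame({"Education": ["HS", "Ph.D", "AS", "M.S"]})

# The category list defines the rank: HS=0, AS=1, M.S=2, Ph.D=3
ordinal = OrdinalEncoder(categories=[["HS", "AS", "M.S", "Ph.D"]])
df_demo["Education"] = ordinal.fit_transform(df_demo[["Education"]])

print(df_demo["Education"].tolist())  # [0.0, 3.0, 1.0, 2.0]
```

Note that `fit_transform` expects a 2-D input, which is why the column is selected with double brackets.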

2. One Hot Encoding:

If there is no ordinal relationship between the categories, then ordinal encoding can mislead the model: the ordinal encoder forces a natural ordering onto the variable, so the model will assume a relationship that does not exist, resulting in poor performance.

In this case, the one-hot encoder should be used to treat our categorical variables. It creates dummy variables by converting N categories into N features/columns. Consider the gender column again: one-hot encoding creates a Male column and a Female column, and in each row the column matching that row's category is set to 1 while the others are 0. We can one-hot encode categorical variables in two ways: with get_dummies in pandas, or with OneHotEncoder from sklearn.

pd.get_dummies(df['Gender']).head()

Another example is Marital Status. Here, we have three different categories: Married (M), Divorced (D), and Single (S). We can reduce the dimensionality by one column by passing drop_first=True, meaning the number of columns will be one less than the number of categories.

pd.get_dummies(df['Marital Status'],drop_first=True).head()

In the second row of the table above, we have zero for both Married and Single, which effectively means that row is Divorced.

If we assign drop_first=False, then we keep all three columns: Divorced, Married, and Single.

pd.get_dummies(df['Marital Status'],drop_first=False).head()

As mentioned, we can also implement one-hot encoding through OneHotEncoder from sklearn.

from sklearn.preprocessing import OneHotEncoder

# Note: in scikit-learn >= 1.2 this parameter is named sparse_output instead of sparse
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Marital Status']])

If a column has a high number of distinct categories, we should avoid one-hot encoding it. The resulting explosion in the number of columns gives rise to a problem called the "Curse of Dimensionality".
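The column explosion is easy to see on a hypothetical high-cardinality column; the city names below are made up:

```python
import pandas as pd

# Hypothetical column: 1,000 rows drawn from 500 distinct city names
df_demo = pd.DataFrame({"City": [f"city_{i % 500}" for i in range(1000)]})

dummies = pd.get_dummies(df_demo["City"])
print(dummies.shape)  # (1000, 500): one new column per category
```

One column has become 500, most of them almost entirely zeros, which is exactly the dimensionality problem described above.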

3. Label Encoding:

The label encoder converts each category into a unique numerical value. In sklearn, this encoder is meant for encoding the target values, i.e. y, and not the input X. It is similar to the ordinal encoder, except that the numeric values are assigned automatically without following any natural order; generally, the alphabetical order of the categorical values determines which numerical value comes first. Our target column "Employment Status" has four different categories, which label encoding maps to the integers 0: Full Time, 1: Intern, 2: Part-Time, and 3: Unemployed. A model may then treat Unemployed as ranking higher than Part-Time, Full Time, and Intern during training, even though no such order exists between these statuses. We cannot define the order of the labels with the label encoding technique.

from sklearn.preprocessing import LabelEncoder 
lbe = LabelEncoder()
df['Employment Status']= lbe.fit_transform(df['Employment Status'])
df.head()

The disadvantage of label encoding is that it imposes an order on the categorical values, which may not suit machine learning algorithms such as linear regression that are sensitive to the magnitude of the values; in such cases, one-hot encoding provides better results.
On the other hand, label encoding works well with decision tree and random forest algorithms, because tree splits do not depend on the magnitude of the encoded values.
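The alphabetical assignment described above can be checked directly via the encoder's `classes_` attribute. The rows here are hypothetical values for the article's "Employment Status" column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; LabelEncoder sorts classes alphabetically before mapping
status = pd.Series(["Part-Time", "Full Time", "Unemployed", "Intern"])

lbe = LabelEncoder()
codes = lbe.fit_transform(status)

print(list(lbe.classes_))  # ['Full Time', 'Intern', 'Part-Time', 'Unemployed']
print(list(codes))         # [2, 0, 3, 1]
```

The mapping 0: Full Time through 3: Unemployed falls out of the alphabetical sort, not out of any meaning in the data.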

Conclusion

We learned about categorical data and how we must treat it before feeding it to a model. We saw the different types of categorical data and multiple encoding techniques to convert categorical features into numerical features. It is important to understand which encoding technique to use and when: a good encoding choice proves vital to the performance of your model.

Thanks for reading!


Data Scientist with recent experience in data acquisition and data modeling, statistical analysis, machine learning, deep learning and NLP