Data Preprocessing: Part IV - Categorical Encoding

Photo by Milena Trifonova on Unsplash

“There are two types of statistics, the kind you look up and the kind you make up.” — Rex Stout

In the past few articles we discussed data types, data transformations and prediction of missing values for numerical data. Continuing further, this article focuses on transforming categorical data into a numerical format through encoders, which makes it suitable for filling in missing values. We use the “Mental Health Survey” dataset to discuss these concepts. The dataset can be downloaded from here.

For example, as shown in the above image, the features Country, state, self_employed, family_history, treatment and work_interfere hold observations in categorical format. To use such observations, one needs to convert them into numerical form so that they become suitable input for the underlying machine learning algorithms.

Label encoder

This is a simple approach to conversion in which each category is assigned a sequential number. For example, the categories of the work_interfere column from the above image will be assigned 0, 1, 2, 3, and so on.

This can be achieved with pandas as well as the scikit-learn library in Python.
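As a minimal sketch of both routes, the snippet below label-encodes a small toy column. The sample values are illustrative stand-ins for work_interfere, not rows from the actual dataset, and the new column names (wi_pandas, wi_sklearn) are made up for this example. Note that scikit-learn documents LabelEncoder as intended for target labels; it is used here only to illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the work_interfere column (illustrative values only).
df = pd.DataFrame(
    {"work_interfere": ["Often", "Rarely", "Never", "Sometimes", "Often"]}
)

# pandas: convert to a categorical dtype and read off the integer codes.
# Categories are sorted alphabetically before codes are assigned.
df["wi_pandas"] = df["work_interfere"].astype("category").cat.codes

# scikit-learn: LabelEncoder learns the sorted set of classes,
# then maps each class to an integer from 0 to n_classes - 1.
encoder = LabelEncoder()
df["wi_sklearn"] = encoder.fit_transform(df["work_interfere"])

print(df)
print(list(encoder.classes_))  # position in this list = assigned code
```

Both approaches assign codes in alphabetical order of the category names here, so the two encoded columns agree.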

With sequential numbers, a problem of precedence among the values can occur. For instance, if Never is encoded as 3 and Often as 2, the model might treat Never as having higher precedence than Often simply because 3 > 2. Therefore, we need a more robust approach to overcome this limitation.
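The precedence issue can be made concrete with a small sketch. The exact numbers depend on how the codes are assigned; the example below assumes alphabetical assignment (the pandas default for string categories) on four illustrative levels, and shows that the resulting integers imply an ordering that has nothing to do with the meaning of the categories.

```python
import pandas as pd

# Illustrative levels, listed from least to most frequent interference.
levels = pd.Series(["Never", "Rarely", "Sometimes", "Often"])

# Alphabetical codes: the integers follow spelling, not meaning.
codes = {
    level: int(code)
    for level, code in zip(levels, levels.astype("category").cat.codes)
}
print(codes)

# A model that reads these as magnitudes sees Rarely ranked above Often.
print(codes["Rarely"] > codes["Often"])
```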

One-hot encoding

This method overcomes the limitation of label encoding with the concept of dummy variables. In this method, each category is treated as a new variable (dummy), and the observation is assigned as 0 or 1 depending upon the category it belongs to.

Let’s consider the same example as above with a one-hot encoder.

As we can see, the categorical values are converted to numerical values and each category carries equal weight. This removes the precedence problem, but it increases the number of columns in the dataset.

Both pandas and the scikit-learn library can be used for one-hot encoding.
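A minimal sketch of both options follows, again on a toy column whose values are illustrative rather than taken from the dataset. The prefix passed to get_dummies is a choice made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the work_interfere column (illustrative values only).
df = pd.DataFrame(
    {"work_interfere": ["Often", "Rarely", "Never", "Sometimes", "Often"]}
)

# pandas: get_dummies creates one 0/1 column per category.
dummies = pd.get_dummies(df["work_interfere"], prefix="work_interfere", dtype=int)

# scikit-learn: OneHotEncoder expects 2-D input (hence the double brackets).
# Its output is sparse by default, so convert to a dense array for display.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df[["work_interfere"]]).toarray()

print(dummies)
print(encoder.categories_[0])  # column order of the encoded array
print(onehot)
```

One categorical column with four distinct values becomes four indicator columns, and every row has exactly one of them set to 1. This is the column growth mentioned above.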

The entire code for both the approaches can be found here.

Get in touch!

Reach out to us at perspectivesondatascience@gmail.com with any questions and we will be happy to answer!

Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS| Lero| UL, Ireland), Gauri Vaidya (Intern|BDS). Each concept is accompanied by sample datasets and Python code.