DATA PREPROCESSING: Decreasing Categories in Categorical Data

Raghuvansh Tahlan
Published in Analytics Vidhya
Nov 2, 2020

This article looks at the various types of data, focuses on categorical data, answers why and how to reduce categories, and ends with a hands-on example in Python. It also covers a method that can be used both for missing-value imputation and for reducing categories.

Data Preprocessing and various types of Data

Data preprocessing enhances the quality of data to promote the extraction of meaningful insights. In machine learning, it refers to the transformation of raw data to make it viable for machine learning models.

Data can be continuous or discrete. Discrete data can only take particular values. There may potentially be an infinite number of those values, but each is distinct, and there’s no grey area in between. Discrete data can be numeric (like numbers of apples) but can also be categorical, like red or blue, or good or bad. Continuous data are not restricted to defined distinct values but can take any value over a continuous range. Between any two continuous data values, there may be an infinite number of others. Continuous data are always essentially numeric.

Categorical Data

Categorical data can be classified as nominal or ordinal. Nominal data has two or more categories with no intrinsic ordering between them, whereas ordinal data does have such an ordering. For example, gender is a categorical variable with two classes (male and female) and no intrinsic ordering between them. In contrast, a happiness scale (0–10) has a natural order: 10 indicates more happiness than 0.

Why reduce Categories in a Categorical Variable?

Training an ML/DL model on a dataset and achieving satisfactory results on the testing and validation sets may be the end goal for a personal project. In an organisation, though, those results also have to make sense to management. Say a reasonably accurate ML/DL model indicates that a particular categorical variable containing 15–20 categories is significantly important. This can be overwhelming, because it is difficult to understand how the different values of the variable affect the outcome; with 4–5 categories, this could be figured out quickly.

On what basis can categories be reduced?

Consider two cases: first, a variable that is constant, i.e. all its values are the same; and second, a variable whose values are all different, something like an “ID” for each row.

In both cases, it is recommended to remove the variable before training the model, because the first case offers no variation and the second offers too much. An ML/DL model can learn the patterns in the data, but it fails when variability is extremely low or extremely high. In layman’s terms: how can anyone predict anything if all values are different, and what is the need to predict when all values are the same?

By the same logic, all categories whose occurrence is less than 1% can be ignored/reduced.
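With pandas, such rare categories can be found from the relative frequencies of a column. A minimal sketch (the data here is made up for illustration):

```python
import pandas as pd

# Illustrative column: 'C' and 'D' each occur in well under 1% of rows.
s = pd.Series(['A'] * 600 + ['B'] * 395 + ['C'] * 3 + ['D'] * 2)

# Relative frequency of each category.
freq = s.value_counts(normalize=True)

# Categories whose share of the column is below 1%.
rare = freq[freq < 0.01].index.tolist()
print(rare)  # ['C', 'D']
```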

How to reduce Categories?

Once the categories to be removed are decided, it’s pretty simple: replace those values with NULL values and handle them during missing-value treatment.

The various methods for missing values treatment include:

1. Central Tendency — Mean/Median/Mode

2. Most Frequent or Constant value imputation.

3. Multivariate Imputation by Chained Equation (MICE).

4. Imputation using KNN

5. Imputation using Machine Learning/Deep Learning.
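As a minimal illustration of the first two options, mode (most frequent value) imputation with pandas looks like this (the column values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Salaried', 'Salaried', 'Business', np.nan, np.nan])

# Mode imputation: fill every NaN with the most frequent category.
filled = s.fillna(s.mode()[0])
print(filled.tolist())
# ['Salaried', 'Salaried', 'Business', 'Salaried', 'Salaried']
```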

Refer to an article by Will Badr.

Example

The following section showcases a data preprocessing code walkthrough and some examples of how to reduce the categories in a categorical column using Python.

1. Importing the Data
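In the original post this step is a code screenshot. The usual pattern is `pd.read_csv("...")` with the path to the dataset; to keep the sketch self-contained, it reads the same kind of data from a string instead (the column values below are placeholders, not the author’s actual file):

```python
from io import StringIO

import pandas as pd

# In practice: df = pd.read_csv("<path-to-your-dataset>.csv")
csv_data = StringIO(
    "Applicant_Marital_Status,Applicant_Occupation\n"
    "M,Salaried\n"
    "S,Business\n"
    ",Salaried\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (3, 2)
```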

2. Checking for NULL Values and finding the categorical columns.
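This step might look roughly as follows (the frame below is a small made-up stand-in for the article’s dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Applicant_Marital_Status": ["M", "S", np.nan, "M"],
    "Applicant_Occupation": ["Salaried", "Business", "Salaried", np.nan],
    "Loan_Amount": [100, 250, 175, 300],
})

# Count of NULL (NaN) values per column.
print(df.isnull().sum())

# Object-dtype columns are the usual suspects for categorical data.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(categorical_cols)  # ['Applicant_Marital_Status', 'Applicant_Occupation']
```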

3. Let’s examine these columns one by one starting by ‘Applicant_Marital_Status’.
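The screenshot in the original shows the column’s value counts. A tiny stand-in (the real column has 5045 records; the counts below are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series(["M"] * 6 + ["S"] * 3 + ["W", "D", np.nan])

# dropna=False makes the NaN count visible alongside the categories.
vc = s.value_counts(dropna=False)
print(vc)
```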


There are a total of 5045 records, of which 30 are missing. Two categories, ‘W’ and ‘D’ (corresponding to Widowed and Divorced respectively), have very few occurrences and can be reduced.

This can be done by replacing ‘W’ and ‘D’ by NULL values.

The count of NULL values also indicates our code was executed correctly.
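In pandas this replacement and the check might look like the following sketch (same small stand-in data as above, not the article’s real counts):

```python
import numpy as np
import pandas as pd

s = pd.Series(["M"] * 6 + ["S"] * 3 + ["W", "D", np.nan])

print(s.isnull().sum())  # 1 missing value before the replacement

# Turn the rare categories into NaN so missing-value treatment handles them.
s = s.replace(["W", "D"], np.nan)

print(s.isnull().sum())  # 3: the two rare records are now NULL as well
```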

4. Now, these values can be imputed by any of the missing value imputations techniques.

Are Imputations by Central Tendency always a good choice?

Let’s look at an example of another column ‘Applicant_Occupation’.

As we can clearly see, Student has only 19 records, which corresponds to 0.38% of total values, so it seems viable to remove this category.

Everything works, but there are 1023 records with missing values, which is a concern because that corresponds to almost 20% of the total records.

Now let’s consider what happens if we impute these values with the central tendency, the mode.

Now the percentage of salaried people has increased from 51 to 71, which is not convincing. Comparing the highest- and lowest-occurring categories, their ratio has increased from 8 to 11.
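The jump in the Salaried share follows directly from the counts quoted in the article:

```python
# Counts from the article: 2560 'Salaried' out of 4022 non-null values,
# with 1023 records missing (5045 rows in total).
total, salaried, missing = 5045, 2560, 1023

share_before = salaried / total
share_after = (salaried + missing) / total  # mode imputation adds all 1023

print(round(share_before * 100))  # 51
print(round(share_after * 100))   # 71
```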

This problem can be avoided if we use other imputations techniques like KNN, DL, MICE etc.

My Method to tackle this problem

This method assumes that the percentage of each category in the population remains constant even after imputation. So the missing values to be imputed are divided among the categories according to their share of the population. To demonstrate this, we take the above example:

Initially, the Salaried category has 2560 of the 4022 non-null values, roughly 63.6%. Keeping this share constant after imputation, there should be about 3211 Salaried values out of the 5045 total. This leaves 651 values to be imputed as Salaried. The values for all the other categories are calculated similarly.
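The arithmetic for the Salaried category, using the article’s figures:

```python
# 2560 of the 4022 non-null values are 'Salaried'; the column has 5045 rows.
non_null, total_rows = 4022, 5045
salaried = 2560

# Keep the category's population share constant after imputation.
target = round(salaried / non_null * total_rows)
to_impute = target - salaried

print(target)     # 3211
print(to_impute)  # 651
```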

This leaves us with the task of automating the process: imputing a specified number of NULL values in the column, category by category.

Automation

The function ‘find_index’ gives the index, counted from the start of the column, up to which a specified number of NULL values are present.

The ‘replace_cat_list’ function uses ‘find_index’ to automate filling the NULL values according to the population shares of the categories provided. Using the same example:

‘replace_cat_list’ takes the dataframe, the name of the column to be imputed, and the list of categories to be taken into consideration.
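The author’s actual implementation is linked at the end of the article; going only by the description above, the two helpers might look roughly like this (the column name and demo data are made up):

```python
import numpy as np
import pandas as pd

def find_index(series, n_nulls):
    """Positional index of the n_nulls-th NULL value from the start of the column."""
    null_positions = np.where(series.isnull())[0]
    return null_positions[n_nulls - 1]

def replace_cat_list(df, column, categories):
    """Fill NULLs so every listed category keeps its share of the population."""
    s = df[column]
    total = len(s)
    non_null = s.notna().sum()
    for cat in categories:
        count = (s == cat).sum()
        # Extra rows this category needs for its population share to stay constant.
        quota = round(count / non_null * total) - count
        if quota <= 0:
            continue
        last = find_index(s, quota)  # position of the quota-th remaining NULL
        fill_mask = s.isnull() & (np.arange(total) <= last)
        s = s.mask(fill_mask, cat)
    df[column] = s
    return df

# Demo: 'A' holds 2/3 and 'B' 1/3 of the non-null values, so the three
# NULLs are split 2:1 between them.
df = pd.DataFrame({"occ": ["A", "A", np.nan, "B", np.nan, np.nan]})
replace_cat_list(df, "occ", ["A", "B"])
print(df["occ"].tolist())  # ['A', 'A', 'A', 'B', 'A', 'B']
```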

Suggestions and comments are welcomed. Connect with me on LinkedIn. All source codes for this article are available at Github.


Passionate about Data Science. Stock Market and Sports Analytics is what keeps me going. Writer at Analytics Vidhya Publication. https://github.com/rvt123