Ways To Handle Categorical Column Missing Data & Its Implementations

Ganesh Dhasade
Published in Analytics Vidhya · Sep 1, 2020 · 4 min read

In my last blog (Link), I explained different ways to handle missing data in continuous columns and their implementations.

In this blog, I will explain how to handle missing values in categorical columns of a dataset, with implementations in Python.

Image from: 365datascience.com

Discrete/Categorical Data: discrete data is quantitative data that can be counted and has a finite number of possible values, or data that can be divided into groups, e.g. days of the week, months of the year, sex (Male/Female/Other), grades (High/Medium/Low), etc.

The dataset used for the examples is the Titanic dataset (from Kaggle):

import pandas as pd
import numpy as np

DataFrame = pd.read_csv("train.csv")

# Count missing values per column and check column data types
DataFrame.isnull().sum()
DataFrame.dtypes

The Cabin_Serial, Cabin and Embarked categorical variables have NAN values.
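Beyond isnull().sum() and dtypes, a quick way to list all the categorical (object-dtype) columns at once — a minimal sketch added here, using the DataFrame loaded above:

# List the object-dtype (categorical) columns of the DataFrame
categorical_cols = DataFrame.select_dtypes(include="object").columns.tolist()
print(categorical_cols)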

The number of categories in each column:

# Code to get the number of categories in the missing-value columns
print("Number of Categories in: ")
for ColName in DataFrame[['Embarked','Cabin_Serial','Cabin']]:
    print("{} = {}".format(ColName, len(DataFrame[ColName].unique())))
1. Frequent Categorical Imputation

Assumptions: Data is Missing At Random (MAR), and the missing values look like the majority (most frequent) category.

Description: Replace NAN values with the most frequently occurring category in the variable/column.

Implementation:

Step 1: Find which category occurs most often in each column using mode().

Step 2: Replace all NAN values in that column with that category.

Step 3: Drop original columns and keep newly imputed columns.

#1. Function to replace NAN values with the mode value
def impute_nan_most_frequent_category(DataFrame, ColName):
    # .mode()[0] - gives the first (most frequent) category name
    most_frequent_category = DataFrame[ColName].mode()[0]

    # Replace NaN values with the most frequently occurring category
    DataFrame[ColName + "_Imputed"] = DataFrame[ColName]
    DataFrame[ColName + "_Imputed"].fillna(most_frequent_category, inplace=True)

#2. Call the function to impute the most frequent category
for Columns in ['Embarked','Cabin_Serial','Cabin']:
    impute_nan_most_frequent_category(DataFrame, Columns)

# Display the imputed result
DataFrame[['Embarked','Embarked_Imputed','Cabin_Serial','Cabin_Serial_Imputed','Cabin','Cabin_Imputed']].head(10)

#3. Drop the original columns and keep the imputed ones
DataFrame = DataFrame.drop(['Embarked','Cabin_Serial','Cabin'], axis=1)
The most frequent category imputed in place of NAN values

Advantage: Simple and easy to implement for categorical variables/columns.

Disadvantage:

  • Features with a large number of null values may bias the prediction if they are filled with the most frequent category.
  • It distorts the distribution of the most frequent label, as the sketch below illustrates.
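A quick way to see that distortion in practice (a sketch I am adding, not part of the original notebook): compare each column's category distribution before and after imputation. Run it before step 3 drops the original columns.

# Sanity check: how much does the most frequent label get inflated?
# (Run before the original columns are dropped.)
for col in ['Embarked', 'Cabin_Serial', 'Cabin']:
    print(col)
    print("  before:", DataFrame[col].value_counts(normalize=True).head(3).to_dict())
    print("  after :", DataFrame[col + "_Imputed"].value_counts(normalize=True).head(3).to_dict())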

2. Adding a Variable To Capture NAN

Assumptions: None; it works with all types of categorical columns.

Description: Replace NAN values with the most frequent category, and add a new feature that marks which observations were imputed, so the model can weight imputed and non-imputed observations differently.

Implementation:

Step 1. Create a new column that is 1 where the category is NAN and 0 otherwise. This column captures the importance (missingness) of the imputed category.

Step 2. Replace NAN value with most occurred category in the actual column.

# Function to impute the most frequent category and add an importance variable
# Note: the previous section dropped the original columns, so run this on a fresh copy of the data.
def impute_nan_add_variable(DataFrame, ColName):
    #1. Add a new column: 1 if the category is null, else 0
    DataFrame[ColName + "_Imputed"] = np.where(DataFrame[ColName].isnull(), 1, 0)

    #2. Take the most frequently occurring category in that variable (.mode())
    Mode_Category = DataFrame[ColName].mode()[0]

    #2.1 Replace NaN values with the most frequent category in the actual variable
    DataFrame[ColName].fillna(Mode_Category, inplace=True)

# Call the function to impute NAN values and add the new importance feature
for Columns in ['Embarked','Cabin_Serial','Cabin']:
    impute_nan_add_variable(DataFrame, Columns)

# Display the top 10 rows to see the result of the imputation
DataFrame[['Embarked','Embarked_Imputed','Cabin_Serial','Cabin_Serial_Imputed','Cabin','Cabin_Imputed']].head(10)
NAN imputed with the most frequent category, plus a new importance variable/column

Advantage: Captures the importance of missingness.

Disadvantage:

  • Creates additional features (curse of dimensionality), e.g. if 10 columns have null values, 10 extra columns are needed.
  • The indicator can be misleading unless the number of missing values is large enough to carry a real signal.
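For reference, scikit-learn can do the mode imputation and the indicator columns in one step. This is a hedged sketch (not the author's code, and it assumes the selected columns still contain NAN values): SimpleImputer with strategy="most_frequent" imputes the mode, and add_indicator=True appends a missingness indicator for every column that had NaNs.

# Sketch: most-frequent imputation plus missing-value indicators in one step
from sklearn.impute import SimpleImputer
import pandas as pd

cols = ['Embarked', 'Cabin_Serial', 'Cabin']
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
result = imputer.fit_transform(DataFrame[cols])   # the columns must still contain NaNs here

# Output layout: imputed columns first, then one indicator per column that had NaNs
indicator_names = [cols[i] + "_was_missing" for i in imputer.indicator_.features_]
result_df = pd.DataFrame(result, columns=cols + indicator_names, index=DataFrame.index)
result_df.head(10)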

3. Create a New Category (Random Category) for NAN Values

Assumptions: No assumption

Description: Create a new, dedicated category for NAN values (e.g. an "Unknown" label).

Implementation:

Step 1. Replace NAN values with a new category name (here, Unknown).

Step 2. Display result

#1. Function to impute null values with a new category
def impute_nan_create_category(DataFrame, ColName):
    DataFrame[ColName] = np.where(DataFrame[ColName].isnull(), "Unknown", DataFrame[ColName])

# Call the function to create the new category for each variable
# (again, run on a fresh copy of the data that still contains NAN values)
for Columns in ['Embarked','Cabin_Serial','Cabin']:
    impute_nan_create_category(DataFrame, Columns)

#2. Display the result
DataFrame[['Embarked','Cabin_Serial','Cabin']].head(10)
The new Unknown category added in place of NAN values

Advantage: Simple and easy to implement for categorical variables/columns and preserves the variance.

Disadvantage:

  • May add noise, since the new category is essentially arbitrary data.
  • Doesn’t give good results when missing values make up a high percentage of the data.
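One reason the new category is convenient downstream (a small follow-on sketch, not from the original post): once NAN is its own label, standard one-hot encoding treats it like any other category.

# Sketch: the Unknown label simply becomes another dummy column during one-hot encoding
import pandas as pd

dummies = pd.get_dummies(DataFrame['Embarked'], prefix='Embarked')
dummies.head(10)   # includes an Embarked_Unknown column if any rows were missing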

Conclusion:

The implementations above illustrate different ways to handle missing categorical data. The most widely used methods are creating a new category for NAN values and most-frequent category imputation.

For reference: Jupyter notebook — code available at GitHub: https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Categorical_Missing_Data.ipynb

