Ways To Handle Missing Data in Categorical Columns & Their Implementations
In my last blog (link), I explained different ways to handle missing data in continuous columns, with implementations.
In this blog, I will explain how to handle missing values in the categorical columns of a dataset, with implementations in Python.
Discrete/Categorical Data: discrete data is quantitative data that can be counted and has a finite number of possible values; categorical data is data that can be divided into groups, e.g. days of the week, months of the year, sex (Male/Female/Other), grades (High/Medium/Low), etc.
The dataset used in the examples is the Titanic dataset (Kaggle):
import pandas as pd
import numpy as np
DataFrame = pd.read_csv("train.csv")
DataFrame.isnull().sum()  # count of missing values per column
DataFrame.dtypes          # note: .dtypes is an attribute, not a method
The number of categories in each column:
# Code to get the number of categories in the missing-value columns
print("Number of Categories in: ")
for ColName in DataFrame[['Embarked', 'Cabin_Serial', 'Cabin']]:
    print("{} = {}".format(ColName, len(DataFrame[ColName].unique())))
1. Frequent Category Imputation
Assumptions: Data is Missing At Random (MAR), and the missing values look like the majority of the observations.
Description: Replace NaN values with the most frequently occurring category in the variable/column.
Implementation:
Step 1: Find the most frequent category in each column using mode().
Step 2: Replace all NaN values in that column with that category.
Step 3: Drop the original columns and keep the newly imputed columns.
#1. Function to replace NaN values with the mode value
def impute_nan_most_frequent_category(DataFrame, ColName):
    # .mode()[0] - gives the first (most frequent) category name
    most_frequent_category = DataFrame[ColName].mode()[0]
    # replace NaN values with the most frequently occurring category
    DataFrame[ColName + "_Imputed"] = DataFrame[ColName]
    DataFrame[ColName + "_Imputed"].fillna(most_frequent_category, inplace=True)

#2. Call the function to impute the most frequent category
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_most_frequent_category(DataFrame, Columns)

# Display the imputed result
DataFrame[['Embarked', 'Embarked_Imputed', 'Cabin_Serial', 'Cabin_Serial_Imputed', 'Cabin', 'Cabin_Imputed']].head(10)

#3. Drop the actual columns
DataFrame = DataFrame.drop(['Embarked', 'Cabin_Serial', 'Cabin'], axis=1)
Advantage: Simple and easy to implement for categorical variables/columns.
Disadvantage:
- Features with a large number of null values can bias the prediction if they are filled with the most frequent category.
- It distorts the relative frequency of the most frequent label.
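The distortion is easy to see on a small synthetic column (hypothetical values, not the actual Titanic data): after mode imputation, the share of the most frequent category grows at the expense of every other category.

```python
import pandas as pd

# Hypothetical categorical column with 4 of 10 values missing
s = pd.Series(["S", "S", "C", None, None, "Q", "S", None, "C", None])

before = s.value_counts(normalize=True, dropna=True)
imputed = s.fillna(s.mode()[0])  # mode()[0] -> "S"
after = imputed.value_counts(normalize=True)

# "S" rises from 50% of the observed values to 70% of all values
print(before["S"], after["S"])
```

The more values are missing, the stronger this over-representation of the mode becomes.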
2. Adding a Variable To Capture NaN
Assumptions: No assumptions; works with all types of categorical columns.
Description: Replace NaN categories with the most frequent value, and add a new feature that gives some weight/importance to the distinction between imputed and non-imputed observations.
Implementation:
Step 1. Create a new column that holds 1 if the category is NaN, else 0. This column marks the importance of the imputed category.
Step 2. Replace the NaN values with the most frequent category in the actual column.
# Function to impute the most frequent category and add an importance variable
def impute_nan_add_variable(DataFrame, ColName):
    #1. add a new column: 1 if the category is null, else 0
    DataFrame[ColName + "_Imputed"] = np.where(DataFrame[ColName].isnull(), 1, 0)
    #2. take the most frequent category in that variable (.mode())
    Mode_Category = DataFrame[ColName].mode()[0]
    ##2.1 replace NaN values with the most frequent category in the actual variable
    DataFrame[ColName].fillna(Mode_Category, inplace=True)

# Call the function to impute NaN values and add the new importance feature
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_add_variable(DataFrame, Columns)

# Display the top 10 rows to see the result of imputation
DataFrame[['Embarked', 'Embarked_Imputed', 'Cabin_Serial', 'Cabin_Serial_Imputed', 'Cabin', 'Cabin_Imputed']].head(10)
Advantage: Captures the importance of missingness.
Disadvantage:
- It creates additional features (curse of dimensionality): e.g. if 10 columns have null values, 10 extra columns need to be created.
- The data can be misinterpreted, and the number of missing values should be large enough for the indicator to carry a signal.
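The dimensionality concern can be sketched on a small synthetic frame (hypothetical column names, not the Titanic data): every column that contains nulls gains one extra indicator column, so the column count grows with the number of incomplete features.

```python
import numpy as np
import pandas as pd

# Synthetic frame: three columns, each containing a null
df = pd.DataFrame({
    "A": ["x", None, "y"],
    "B": [None, "p", "q"],
    "C": ["m", "n", None],
})

# One indicator column per column that has missing values
for col in df.columns[df.isnull().any()]:
    df[col + "_Imputed"] = np.where(df[col].isnull(), 1, 0)

print(df.shape)  # column count doubled from 3 to 6
```

On a wide dataset with many incomplete columns, this doubling adds up quickly.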
3. Create a New Category (Random Category) for NaN Values
Assumptions: No assumptions.
Description: Create a new category for NaN values, i.e. a random/placeholder category.
Implementation:
Step 1. Replace the NaN values with a new category name (here we create a new category called Unknown).
Step 2. Display the result.
#1. Function to impute null values with a new category
def impute_nan_create_category(DataFrame, ColName):
    DataFrame[ColName] = np.where(DataFrame[ColName].isnull(), "Unknown", DataFrame[ColName])

## Call the function to create a new category for the variables
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_create_category(DataFrame, Columns)

#2. Display the result
DataFrame[['Embarked', 'Cabin_Serial', 'Cabin']].head(10)
Advantage: Simple and easy to implement for categorical variables/columns, and it preserves the variance of the observed categories.
Disadvantage:
- It may create an essentially arbitrary category if the number of missing values is large.
- It doesn't give good results when missing data makes up a high percentage of the dataset.
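A small synthetic sketch of the high-missingness disadvantage (hypothetical values, roughly mimicking the Cabin column's missing rate): when most of a column is missing, the new Unknown category simply dominates the column.

```python
import numpy as np
import pandas as pd

# Hypothetical column where 9 of 12 values are missing (75%)
cabin = pd.Series(["C85", None, None, "E46", None, None,
                   None, "G6", None, None, None, None])

filled = np.where(cabin.isnull(), "Unknown", cabin)
unknown_share = (filled == "Unknown").mean()

print(unknown_share)  # 0.75 - "Unknown" dominates the column
```

A category that covers three quarters of the rows tells a model little beyond "this value was missing", which is why the indicator-variable approach above can be preferable at high missing rates.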
Conclusion:
The implementations above illustrate different ways to handle missing categorical data. The most widely used methods are creating a new category for NaN values and most-frequent-category imputation.
For reference: Jupyter notebook — code available at GitHub: https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Categorical_Missing_Data.ipynb