Ways To Handle Missing Data in Categorical Columns & Their Implementations
In my last blog (link), I explained different ways to handle missing data in continuous columns, with implementations.
In this blog, I will explain how to handle missing values in the categorical columns of a dataset, with implementations in Python.
Discrete/Categorical Data: discrete data is quantitative data that can be counted and has a finite number of possible values; categorical data is data that can be divided into groups, e.g. days of the week, months of the year, sex (Male/Female/Other), grades (High/Medium/Low), etc.
The dataset used in the examples is the Titanic dataset (Kaggle):
import pandas as pd
import numpy as np
DataFrame = pd.read_csv("train.csv")
DataFrame.isnull().sum()  # count of missing values per column
DataFrame.dtypes          # note: .dtypes is an attribute, not a method
The number of categories in each column:
# Code to get the number of categories in the missing-value columns
print("Number of Categories in: ")
for ColName in DataFrame[['Embarked', 'Cabin_Serial', 'Cabin']]:
    print("{} = {}".format(ColName, len(DataFrame[ColName].unique())))
1. Frequent Category Imputation
Assumptions: Data is Missing At Random (MAR), and the missing values look like the majority of the observations.
Description: Replace NaN values with the most frequently occurring category in the variable/column.
Implementation:
Step 1: Find the most frequent category in each column using mode().
Step 2: Replace all NaN values in that column with that category.
Step 3: Drop the original columns and keep the newly imputed columns.
#1. Function to replace NaN values with the mode value
def impute_nan_most_frequent_category(DataFrame, ColName):
    # .mode()[0] - gives the first (most frequent) category name
    most_frequent_category = DataFrame[ColName].mode()[0]
    # replace NaN values with the most frequently occurring category
    DataFrame[ColName + "_Imputed"] = DataFrame[ColName]
    DataFrame[ColName + "_Imputed"].fillna(most_frequent_category, inplace=True)

#2. Call the function to impute the most frequent category
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_most_frequent_category(DataFrame, Columns)

# Display the imputed result
DataFrame[['Embarked', 'Embarked_Imputed', 'Cabin_Serial', 'Cabin_Serial_Imputed', 'Cabin', 'Cabin_Imputed']].head(10)

#3. Drop the actual columns
DataFrame = DataFrame.drop(['Embarked', 'Cabin_Serial', 'Cabin'], axis=1)
Advantage: Simple and easy to implement for categorical variables/columns.
Disadvantage:
- Features with a large number of null values can bias the prediction if they are filled with the most frequent category.
- It distorts the relative frequency of the most frequent label.
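The distortion is easy to see on a small synthetic column (hypothetical values, not the actual Titanic data): after mode imputation, the share of the most frequent category grows at the expense of every other category.

```python
import pandas as pd

# Hypothetical categorical column with 4 of 10 values missing
s = pd.Series(["S", "S", "C", None, None, "Q", "S", None, "C", None])

before = s.value_counts(normalize=True, dropna=True)
imputed = s.fillna(s.mode()[0])  # mode()[0] -> "S"
after = imputed.value_counts(normalize=True)

# "S" rises from 50% of the observed values to 70% of all values
print(before["S"], after["S"])
```

The more values are missing, the stronger this over-representation of the mode becomes.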
2. Adding a Variable To Capture NaN
Assumptions: No assumptions; works with all types of categorical columns.
Description: Replace NaN categories with the most frequent value, and add a new feature that gives some weight/importance to the distinction between imputed and non-imputed observations.
Implementation:
Step 1. Create a new column that holds 1 if the category is NaN, else 0. This column marks the importance of the imputed category.
Step 2. Replace the NaN values with the most frequent category in the actual column.
# Function to impute the most frequent category and add an importance variable
def impute_nan_add_variable(DataFrame, ColName):
    #1. add a new column: 1 if the category is null, else 0
    DataFrame[ColName + "_Imputed"] = np.where(DataFrame[ColName].isnull(), 1, 0)
    #2. take the most frequent category in that variable (.mode())
    Mode_Category = DataFrame[ColName].mode()[0]
    ##2.1 replace NaN values with the most frequent category in the actual variable
    DataFrame[ColName].fillna(Mode_Category, inplace=True)

# Call the function to impute NaN values and add the new importance feature
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_add_variable(DataFrame, Columns)

# Display the top 10 rows to see the result of imputation
DataFrame[['Embarked', 'Embarked_Imputed', 'Cabin_Serial', 'Cabin_Serial_Imputed', 'Cabin', 'Cabin_Imputed']].head(10)
Advantage: Captures the importance of missingness.
Disadvantage:
- It creates additional features (curse of dimensionality): e.g. if 10 columns have null values, 10 extra columns need to be created.
- The data can be misinterpreted, and the number of missing values should be large enough for the indicator to carry a signal.
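The dimensionality concern can be sketched on a small synthetic frame (hypothetical column names, not the Titanic data): every column that contains nulls gains one extra indicator column, so the column count grows with the number of incomplete features.

```python
import numpy as np
import pandas as pd

# Synthetic frame: three columns, each containing a null
df = pd.DataFrame({
    "A": ["x", None, "y"],
    "B": [None, "p", "q"],
    "C": ["m", "n", None],
})

# One indicator column per column that has missing values
for col in df.columns[df.isnull().any()]:
    df[col + "_Imputed"] = np.where(df[col].isnull(), 1, 0)

print(df.shape)  # column count doubled from 3 to 6
```

On a wide dataset with many incomplete columns, this doubling adds up quickly.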
3. Create a New Category (Random Category) for NaN Values
Assumptions: No assumptions.
Description: Create a new category for NaN values, i.e. a random/placeholder category.
Implementation:
Step 1. Replace the NaN values with a new category name (here we create a new category called Unknown).
Step 2. Display the result.
#1. Function to impute null values with a new category
def impute_nan_create_category(DataFrame, ColName):
    DataFrame[ColName] = np.where(DataFrame[ColName].isnull(), "Unknown", DataFrame[ColName])

## Call the function to create a new category for the variables
for Columns in ['Embarked', 'Cabin_Serial', 'Cabin']:
    impute_nan_create_category(DataFrame, Columns)

#2. Display the result
DataFrame[['Embarked', 'Cabin_Serial', 'Cabin']].head(10)
Advantage: Simple and easy to implement for categorical variables/columns, and it preserves the variance of the observed categories.
Disadvantage:
- It may create an essentially arbitrary category if the number of missing values is large.
- It doesn't give good results when missing data makes up a high percentage of the dataset.
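A small synthetic sketch of the high-missingness disadvantage (hypothetical values, roughly mimicking the Cabin column's missing rate): when most of a column is missing, the new Unknown category simply dominates the column.

```python
import numpy as np
import pandas as pd

# Hypothetical column where 9 of 12 values are missing (75%)
cabin = pd.Series(["C85", None, None, "E46", None, None,
                   None, "G6", None, None, None, None])

filled = np.where(cabin.isnull(), "Unknown", cabin)
unknown_share = (filled == "Unknown").mean()

print(unknown_share)  # 0.75 - "Unknown" dominates the column
```

A category that covers three quarters of the rows tells a model little beyond "this value was missing", which is why the indicator-variable approach above can be preferable at high missing rates.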
Conclusion:
The implementations above illustrate different ways to handle missing categorical data. The most widely used methods are creating a new category for NaN values and most-frequent-category imputation.
For reference: Jupyter notebook — code available at GitHub: https://github.com/GDhasade/Medium.com_Contents/blob/master/Handle_Categorical_Missing_Data.ipynb