Crash Course in Data: Imputation techniques for Categorical features

Akhilesh Dongre · Published in AI Skunks · 8 min read · Mar 9, 2023

Discussing Techniques for Imputation for Categorical Variables using Python

Process of Imputation

ABSTRACT

Non-data is an integral part of any dataset you work with: it is undefined by nature, not merely inaccurate. At the same time, missing data raises a question of actionability: should the noise be kept in the data or not?

Missing values are common in real-world data. To train a model or do insightful analysis, you will typically need to deal with these missing values first. Here are a few Python methods that may be used to impute (fill in) missing values for categorical and numeric data.

Mean, median, or zero imputation is not particularly useful for categorical predictors. Here, I’ll generate a sample dataset with categorical features and demonstrate two imputation techniques that are appropriate for this kind of data.

A great study and explanation of the importance of missing data: Study on Missing Data

Importance and Relevance of Null Data (NaN Values)

Example 1:

Real-life Data from NYPD with NaN Values

Each entry in the table above relates to an incident that the New York City Police Department (NYPD) received a complaint about and attended to at a specific location. But, as this example shows, some entries are entirely null.

Given that these are reported accidents, a null location renders the meaning of the whole entry in the dataset arbitrary.

This is the point where domain knowledge, or context, comes into the picture, as we now have to think about how to tackle these null values and which technique would be useful.

A collision location is recorded as a combination of ON STREET and CROSS STREET in incidents in which cars collide mid-street or mid-avenue. But not all collisions occur on streets; some occur OFF STREET, like, say, in parking lots.

In those cases, the street columns carry no informational content at all. They exemplify non-data.

Example 2:

A second illustrative example — this time a map of all single-family home property sales in New York City:

Note that some locations, like Central Park, are grayed out. Simply put, there are no single-family home sales in these public spaces. Much as division by zero is undefined, these regions are once again examples of non-data.

This means that there are really three alternative responses to the query “Is this data entry filled?”: “Yes,” “No, but it may be,” and “No, and it cannot be.”

TYPES OF MISSING DATA

When dealing with missing data in Machine Learning, one of the most important steps is imputation. Imputation is the process of replacing missing values with estimated values. This can be done through different techniques, and the choice of the technique will depend on the type of data you are working with. In this article, we will focus on techniques for imputing missing values in categorical variables.
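Before choosing a technique, it helps to quantify how much is missing and where. A minimal sketch (the toy DataFrame and column names here are illustrative assumptions, not the article’s data):

import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two categorical columns
df = pd.DataFrame({
    "color": ["Red", np.nan, "Blue", "Red", np.nan],
    "size": ["S", "M", np.nan, "L", "M"],
})

print(df.isnull().sum())   # missing values per column
print(df.isnull().mean())  # fraction of rows missing per column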

Before we dive into specific coding techniques, it is important to understand the different types of missing data. There are three types of missing data:

Missing Completely At Random (MCAR): the probability that a value is missing is unrelated to any variable in the dataset, observed or unobserved.

Missing At Random (MAR): the probability that a value is missing depends on other observed variables in the dataset, but not on the missing value itself.

Missing Not At Random (MNAR): the probability that a value is missing depends on the (unobserved) value itself. The small simulation below makes these distinctions concrete.
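Here is a minimal sketch that simulates the three mechanisms on a toy DataFrame (the columns and probabilities are illustrative assumptions, not from the article’s data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 65, n),
    "salary": rng.normal(50_000, 10_000, n),
})

# MCAR: every salary has the same 10% chance of being missing
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "salary"] = np.nan

# MAR: salaries of people under 30 are more likely to be missing
# (missingness depends on the observed age, not on salary)
mar = df.copy()
mar.loc[rng.random(n) < 0.30 * (mar["age"] < 30), "salary"] = np.nan

# MNAR: high salaries are more likely to be missing
# (missingness depends on the unobserved value itself)
mnar = df.copy()
mnar.loc[(mnar["salary"] > 60_000) & (rng.random(n) < 0.5), "salary"] = np.nan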

Imputation Method 1: Most Frequent Class

Replace the nulls in each column with that column’s most frequent class.

df = {"X1": [np.nan, " Blue " , "Blue", "Red", np.nan, "Red", "Green", np.nan, "Red", "Red"], "X2": ["Green", " Blue", "Green", "Blue", "Green" , "Blue" , np.nan, "Red", " Blue", np.nan ]} 
colors = pd.DataFrame(df)
print(colors)
# for each column, get value counts in decreasing order and take the index (value) of most frequent predictor value

df_imputed = colors.apply(lambda x: x.fillna(x.value_counts().index[0]))
df_imputed
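An equivalent route (an alternative I am adding here, not part of the original snippet) is scikit-learn’s SimpleImputer with the most_frequent strategy:

from sklearn.impute import SimpleImputer

# strategy="most_frequent" works on string/categorical columns as well as numeric ones
imputer = SimpleImputer(strategy="most_frequent")
df_imputed_sk = pd.DataFrame(imputer.fit_transform(colors), columns=colors.columns)
df_imputed_sk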

Imputation Method 2: Fill NA with “Unknown” and keep a count of the unknowns

Keeping the unknowns as their own class lets you analyse and assess them separately.

df_unknown_imputed = colors.fillna("Unknown")
df_unknown_imputed

df_unknown_imputed.value_counts()
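Note that value_counts() on a DataFrame counts unique rows, not values per column. For a per-column tally of the placeholder (a small addition, assuming the “Unknown” label above):

# Count the "Unknown" placeholder in each column
print((df_unknown_imputed == "Unknown").sum())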

About the dataset used from here on: information on job candidates with some demographic attributes.

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully complete courses the company conducts. Many people sign up for this training. The company wants to know which of these candidates really want to work for it after training and which are merely looking for new employment, because knowing this reduces cost and time and improves the quality and planning of the courses and the categorization of candidates. Demographic, education, and experience information is available from the candidates’ signup and enrollment.

The dataset is also designed for HR research into the factors that lead a person to leave their current job. With models built on the current credentials, demographics, and experience data, you can predict the probability that a candidate will look for a new job or stay with the company, and interpret the factors affecting that decision.

Imputation Method 3: KNN from the fancyimpute library

Note: fancyimpute’s KNN operates on numeric matrices, so the snippet below first encodes the string categories as integer codes (an adjustment to make the code run end to end). A more careful pipeline would one-hot encode instead, or round the imputed codes back to valid categories afterwards.

# Install fancyimpute
!pip install fancyimpute

import numpy as np
import pandas as pd
from fancyimpute import KNN

# Load the data
url_train = 'https://raw.githubusercontent.com/akh-04/AED_1/main/aug_train.csv'
train = pd.read_csv(url_train)

url_test = 'https://raw.githubusercontent.com/akh-04/AED_1/main/aug_test.csv'
test = pd.read_csv(url_test)

# Relevant columns
columns = ['city',
           'city_development_index',
           'gender',
           'relevent_experience',
           'enrolled_university',
           'education_level',
           'major_discipline',
           'experience',
           'company_size',
           'company_type',
           'last_new_job',
           'training_hours']

# Remove the index column, which carries no information
train.drop('enrollee_id', inplace=True, axis=1)
test.drop('enrollee_id', inplace=True, axis=1)

train_ = train[columns]
fullset = pd.concat([train_, test[columns]], ignore_index=True)

# KNN needs numeric input: encode string categories as integer codes
# (pandas assigns -1 to NaN, which we turn back into NaN for the imputer)
for col in fullset.select_dtypes(include='object').columns:
    fullset[col] = fullset[col].astype('category').cat.codes.replace(-1, np.nan)

# fancyimpute returns a bare array and drops the column names
imputer = KNN(k=3)
fullset = pd.DataFrame(imputer.fit_transform(fullset), columns=columns)
Before and After Imputation
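Because KNN averages neighbours, the encoded categorical columns come back as fractional codes. A hedged post-processing sketch (my addition, assuming the integer-code encoding above):

# Round the imputed fractional codes back to the nearest valid integer code;
# mapping back to labels additionally requires keeping each column's categories
categorical_cols = ['city', 'gender', 'relevent_experience', 'enrolled_university',
                    'education_level', 'major_discipline', 'experience',
                    'company_size', 'company_type', 'last_new_job']
fullset[categorical_cols] = fullset[categorical_cols].round()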

Other fancyimpute methods >> https://pypi.org/project/fancyimpute/

SimpleFill: Replaces missing entries with the mean or median of each column.

KNN: Nearest neighbour imputations which weight samples using the mean squared difference on features for which two rows both have observed data.

SoftImpute: Matrix completion by iterative soft thresholding of SVD decompositions. Inspired by the softImpute package for R, which is based on Spectral Regularization Algorithms for Learning Large Incomplete Matrices by Mazumder et al.

IterativeImputer: A strategy for imputing missing values by modelling each feature with missing values as a function of other features in a round-robin fashion. A stub that links to scikit-learn’s IterativeImputer.

IterativeSVD: Matrix completion by iterative low-rank SVD decomposition. Should be similar to SVDimpute from Missing value estimation methods for DNA microarrays by Troyanskaya et al.

MatrixFactorization: Direct factorization of the incomplete matrix into low-rank U and V, with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V. Solved by gradient descent.

NuclearNormMinimization: Simple implementation of Exact Matrix Completion via Convex Optimization by Emmanuel Candes and Benjamin Recht using cvxpy. Too slow for large matrices.

BiScaler: Iterative estimation of row/column means and standard deviations to get doubly normalized matrix. Not guaranteed to converge but works well in practice. Taken from Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
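All of these solvers share the same fit_transform-style interface. A minimal sketch with SoftImpute on a random numeric matrix (illustrative data, not the HR dataset):

import numpy as np
from fancyimpute import SoftImpute

# Toy numeric matrix with roughly 20% of entries knocked out
X = np.random.randn(100, 5)
X[np.random.rand(100, 5) < 0.2] = np.nan

# Each fancyimpute solver fills the matrix in a single fit_transform call
X_filled = SoftImpute().fit_transform(X)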

Imputation Method 4: Multivariate imputation of null values using Python with the pandas and sklearn libraries

In this example, we reuse the train dataset loaded earlier. We create a copy of it to use for imputation (df_imputed) and specify the columns with null values that we want to impute (cols_to_impute). We create an instance of the IterativeImputer class from sklearn and use it to impute the null values in the specified columns. Finally, we check whether any null values remain in the imputed dataset with the isnull().sum() method; if none remain, the imputation was successful. As with KNN above, IterativeImputer works on numeric input, so the string columns are encoded as integer codes first.

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer

# Create a copy of the dataset to use for imputation
df_imputed = train.copy()

# Specify the columns with null values to impute
cols_to_impute = ['city',
                  'city_development_index',
                  'gender',
                  'relevent_experience',
                  'enrolled_university',
                  'education_level',
                  'major_discipline',
                  'experience',
                  'company_size',
                  'company_type',
                  'last_new_job',
                  'training_hours']

# IterativeImputer needs numeric input: encode string categories as integer
# codes, turning pandas' -1 code for NaN back into NaN
for col in df_imputed[cols_to_impute].select_dtypes(include='object').columns:
    df_imputed[col] = df_imputed[col].astype('category').cat.codes.replace(-1, np.nan)

# Create an instance of the IterativeImputer class
imputer = IterativeImputer()

# Use the imputer to impute the null values in the specified columns
df_imputed[cols_to_impute] = imputer.fit_transform(df_imputed[cols_to_impute])

# Check if any null values remain in the dataset
print(df_imputed.isnull().sum())

# Round the imputed codes back to integer categories before plotting
df_imputed.enrolled_university = round(df_imputed.enrolled_university)
sns.catplot(data=df_imputed, x="enrolled_university", y="city_development_index")
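One caveat worth adding (my note, not from the original): the imputed columns now hold numeric codes, so reporting results in the original labels means keeping each column’s categories and mapping the rounded codes back. A sketch with the gender column, assuming the encoding above:

# Recover the category order used by the encoding loop
gender_cats = train['gender'].astype('category').cat.categories

# Clip out-of-range codes, round, and map back to the original labels
codes = df_imputed['gender'].round().clip(0, len(gender_cats) - 1).astype(int)
df_imputed['gender'] = codes.map(dict(enumerate(gender_cats)))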

Conclusion

This article has discussed why imputation is necessary and illustrated four different imputation techniques for categorical variables with code.

License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

REFERENCES

  1. Medium Article on encoding followed by KNN
  2. Article on Various Imputation techniques
  3. Importance of Null Values
  4. Fancy Impute Technique
