Know This Before Dropping The Missing Values

6 min read · Nov 22, 2022

What is the proper way to handle missing values in a dataset? When should I drop them, and when should I impute them?

Photo from Unsplash uploaded by Brett Jordan

“You can have data without information, but you cannot have information without data.”
(Daniel Keys Moran, American Fiction Writer)

Most machine learning algorithms cannot handle missing values directly. As data scientists, it is our responsibility to handle them properly.

There are many ways to handle missing values. In this blog, I will cover one technique, complete case analysis (CCA), in detail.

Table of Contents:

∘ Various ways of handling the missing data
∘ Complete Case Analysis
∘ When to use CCA
∘ Advantages of CCA
∘ Disadvantages of CCA
∘ Let us look at the practical example of CCA

Various ways of handling the missing data

Either remove the rows with missing values altogether (complete case analysis) or drop the whole column (if a large percentage of its values is missing).
Impute the missing values with an appropriate technique. There are univariate and multivariate techniques.
Numerical data can be imputed with the mean or median, with a random value, or with an end-of-distribution value.
Categorical data can be imputed with the mode or by creating a new category such as ‘missing’.
Multivariate techniques include the KNN imputer and the iterative imputer.
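
As a rough illustration of the univariate options above, here is a minimal pandas sketch. The toy DataFrame and its column names (`age`, `city`) are made up for the example. KNN or iterative imputation would need an additional library such as scikit-learn and is not shown here.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

imputed = toy.copy()

# Numerical column: fill with the median of the observed values
imputed["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column, option 1: fill with the mode (most frequent value)
imputed["city_mode"] = toy["city"].fillna(toy["city"].mode()[0])

# Categorical column, option 2: treat missingness as its own category
imputed["city_flag"] = toy["city"].fillna("missing")

print(imputed)
```

Which option is appropriate depends on the column and on why the values are missing, so treat this as a starting point only.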

Let us look at the first technique in detail.

Complete Case Analysis

Complete case analysis (CCA), also called ‘list-wise deletion’, consists of discarding observations in which the value of any variable is missing.
In other words, CCA means analyzing only those observations that have information for all of the variables in the dataset.
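
In pandas, list-wise deletion corresponds to `DataFrame.dropna()` with its default arguments. The sketch below uses a small invented table (the column names echo the dataset used later, but the values are made up) to show how a single missing value anywhere in a row removes the whole row.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration
toy = pd.DataFrame({
    "experience": [2.0, np.nan, 7.0, 4.0],
    "training_hours": [36.0, 47.0, np.nan, 8.0],
    "education_level": ["Graduate", "Masters", "Graduate", None],
})

# List-wise deletion: keep only rows with no missing value in any column
complete_cases = toy.dropna()

print(len(toy), len(complete_cases))
```

Here three of the four rows each miss one value in a different column, so only one complete case survives.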

When to use CCA

1. When the data is missing completely at random (MCAR).
2. As a rule of thumb, apply this technique only when less than 5% of the data is missing.

If we have a dataset with 1000 values, 50 of which are missing, we can drop them only when we are sure that those 50 values are missing at random. We should also check that there is no significant difference in the distribution of the data before and after dropping the missing values.

When both conditions hold, dropping them is as good as dropping 50 random rows from the dataset, and the distribution of the data remains the same.
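
The claim can be checked with a small simulation. The sketch below uses synthetic, normally distributed data (an assumption made only for the demonstration), removes 50 of 1000 values completely at random, and compares summary statistics before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed, arbitrary choice

# 1000 synthetic values standing in for a numerical column
values = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Make 50 of them missing completely at random (5% of the data)
missing_positions = rng.choice(len(values), size=50, replace=False)
with_missing = values.copy()
with_missing.iloc[missing_positions] = np.nan

# Complete case analysis: drop the missing entries
after_cca = with_missing.dropna()

summary = pd.DataFrame({
    "original": values.describe(),
    "after_cca": after_cca.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

If the missing values were concentrated in one part of the distribution instead (for example, only the largest values), the two columns of the summary would drift apart and CCA would bias the data.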

Advantages of CCA

It is easy to implement: no data manipulation is required.
It preserves the data distribution (if the data is missing completely at random), since there is no major difference between the distribution before and after dropping the values.

Disadvantages of CCA

If there are many missing values, we lose a large share of the observations.
When we drop a row, we also lose the information it holds in the other columns.
In production, the model will not know how to handle missing values, because it was trained only on complete rows.
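
To see how quickly list-wise deletion can shrink a dataset, consider a back-of-the-envelope sketch. It assumes that every column is missing independently at the same rate, which is a simplification and not something measured on the dataset used below. The function name is hypothetical.

```python
def expected_retained_fraction(missing_rate: float, n_columns: int) -> float:
    """Expected share of complete rows when every column is missing
    independently at the same rate (a simplifying assumption)."""
    return (1.0 - missing_rate) ** n_columns


for n_columns in (1, 5, 10, 20):
    share = expected_retained_fraction(0.04, n_columns)
    print(f"{n_columns:>2} columns at 4% missing -> {share:.1%} of rows kept")
```

Under these assumptions, ten columns with 4% missing each would leave only about two-thirds of the rows, even though every single column satisfies the 5% rule of thumb.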

Let us look at the practical example of CCA

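Let us read the data and check the percentage of missing values in each column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used for the density plots below

df = pd.read_csv('data_science_job_data.csv')

df.shape
(19158, 13)

# percentage of missing values per column
df.isnull().sum()/len(df)*100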

The output shows that 5 columns have fewer than 5% of their values missing, so they are eligible for CCA.

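# columns with some missing values, but fewer than 5%
# (this selection is an assumption: the original listing does not show how the variable was built)
missing_columns_5percent = [col for col in df.columns if 0 < df[col].isnull().mean() < 0.05]
len(df[missing_columns_5percent].dropna())/len(df)*100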

89.68577095730244

new_df = df[missing_columns_5percent].dropna()
df.shape, new_df.shape

((19158, 13), (17182, 5))

After applying CCA, we retain about 89.7% of the data, that is, 17,182 of the 19,158 observations.

Let us look at the data distribution, before and after dropping the missing rows for each eligible column.

1. Dropping the missing rows in the column ‘city_development_index’.

plt.figure(figsize=(10,6))
df['city_development_index'].plot.density(color = 'red')
new_df['city_development_index'].plot.density(color = 'green')
plt.show()

In the plot, the red curve is the original column and the green curve is the column after CCA. There is no significant difference between the two distributions of ‘city_development_index’, so we can apply this technique.

2. Dropping the missing rows in the column ‘experience’.

plt.figure(figsize=(10,6))
df['experience'].plot.density(color = 'red')
new_df['experience'].plot.density(color = 'green')
plt.show()

The above plots show that there is no significant difference between the distribution of data before and after removing the missing values in the column ‘experience’. So we can apply this technique.

3. Dropping the missing rows in the column ‘training_hours’.

plt.figure(figsize=(10,6))
df['training_hours'].plot.density(color = 'red')
new_df['training_hours'].plot.density(color = 'green')
plt.show()

The above plots show that there is no significant difference between the distribution of data before and after removing the missing values in the column ‘training_hours’. So we can apply this technique.
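
Density plots are judged by eye, so it can help to put numbers next to them. The sketch below compares the mean, median, and standard deviation of a numerical column before and after CCA. The helper `compare_numeric_summary` is hypothetical and the toy data is invented; on the real data it could be called with `df`, `new_df`, and a column name.

```python
import numpy as np
import pandas as pd


def compare_numeric_summary(original: pd.DataFrame,
                            reduced: pd.DataFrame,
                            column: str) -> pd.DataFrame:
    """Mean, median and standard deviation of one column,
    before and after dropping rows (hypothetical helper)."""
    stats = ["mean", "median", "std"]
    return pd.DataFrame({
        "original": original[column].agg(stats),
        "cca": reduced[column].agg(stats),
    })


# Toy data, invented for this illustration
toy = pd.DataFrame({
    "training_hours": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "experience": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
})
toy_cca = toy.dropna()

result = compare_numeric_summary(toy, toy_cca, "training_hours")
print(result)
```

With only six rows, dropping one visibly shifts the statistics. On a dataset where less than 5% is missing at random, the two columns should be close.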

4. Dropping the missing rows in the column ‘enrolled_university’.

temp = pd.concat([
            # percentage of observations per category, original data
            df['enrolled_university'].value_counts() / len(df)*100,

            # percentage of observations per category, cca data
            new_df['enrolled_university'].value_counts() / len(new_df)*100
        ],
        axis=1)

# add column names
temp.columns = ['original', 'cca']

temp

The above values show that there is no significant difference between the categories of data before and after removing the missing values in the column ‘enrolled_university’. So we can apply this technique.

5. Dropping the missing rows in the column ‘education_level’.

temp = pd.concat([
            # percentage of observations per category, original data
            df['education_level'].value_counts() / len(df)*100,

            # percentage of observations per category, cca data
            new_df['education_level'].value_counts() / len(new_df)*100
        ],
        axis=1)

# add column names
temp.columns = ['original', 'cca']

temp

The above values show that there is no significant difference between the categories of data before and after removing the missing values in the column ‘education_level’. So we can apply this technique.
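
The comparison for the two categorical columns repeats the same pattern, so it could be wrapped in a small function. The sketch below is one way to do that. The name `compare_category_share` is hypothetical and the toy data is invented for the example.

```python
import pandas as pd


def compare_category_share(original: pd.DataFrame,
                           reduced: pd.DataFrame,
                           column: str) -> pd.DataFrame:
    """Percentage of rows per category, before and after dropping rows
    (hypothetical helper, not part of the original code)."""
    return pd.DataFrame({
        "original": original[column].value_counts() / len(original) * 100,
        "cca": reduced[column].value_counts() / len(reduced) * 100,
    })


# Toy data, invented for this illustration
toy = pd.DataFrame({
    "education_level": ["Graduate", "Graduate", "Masters", None, "Graduate",
                        "Masters", "Graduate", "Graduate", "Masters", "Graduate"],
})
toy_cca = toy.dropna()

shares = compare_category_share(toy, toy_cca, "education_level")
print(shares.round(1))
```

Because the original shares are divided by all rows, including those with a missing value, they sum to slightly less than 100%. A small upward shift in every category after CCA is therefore expected and does not by itself indicate a change in the distribution.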

My LinkedIn profile

You can find the full code on my GitHub page

Written by Gowtham S R