Know This Before Dropping The Missing Values

Gowtham S R
6 min read · Nov 22, 2022


What is the proper way to handle missing values in a dataset? When should I drop them? When should I impute them?

Photo from Unsplash uploaded by Brett Jordan

“You can have data without information, but you cannot have information without data.”

(Daniel Keys Moran, American Fiction Writer)

Most machine learning algorithms cannot handle missing values out of the box. As data scientists, it is our responsibility to handle them properly.

There are many ways to handle missing values. In this blog, I will cover one technique, complete case analysis (CCA), in detail.

Table of Contents:

Various ways of handling the missing data
Complete Case Analysis
When to use CCA
Advantages of CCA
Disadvantages of CCA
Let us look at the practical example of CCA

Photo from the author

Various ways of handling the missing data

Either remove the rows with missing values altogether (complete case analysis) or drop the whole column (if the percentage of missing values in that column is high).

Impute the missing values with the appropriate technique. We have univariate and multivariate techniques.

We can impute numerical data with the mean or median, with a random value, or with end-of-distribution values.

Categorical data can be imputed with either mode or by creating a new category like ‘missing’.

We have multivariate techniques like KNN Imputer and iterative imputer.
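The univariate options listed above can be sketched with plain pandas. This is only a minimal illustration on a small made-up DataFrame (the `toy` data and the new column names are hypothetical, not taken from the dataset used later), showing median imputation for a numerical column and two options for a categorical column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), with one numerical and one categorical column.
toy = pd.DataFrame({
    "training_hours": [10.0, 20.0, np.nan, 40.0, 130.0],
    "education_level": ["Graduate", np.nan, "Graduate", "Masters", np.nan],
})

imputed = toy.copy()

# Numerical: replace missing values with the median of the observed values.
median_hours = toy["training_hours"].median()
imputed["training_hours"] = toy["training_hours"].fillna(median_hours)

# Categorical, option 1: replace missing values with the most frequent category.
mode_level = toy["education_level"].mode()[0]
imputed["education_level_mode"] = toy["education_level"].fillna(mode_level)

# Categorical, option 2: treat missingness as its own category.
imputed["education_level_missing"] = toy["education_level"].fillna("Missing")

print(imputed)
```

The median is used here instead of the mean because the toy column contains one large value; which statistic is appropriate depends on the distribution of the real data.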

Let us look at the first technique in detail.

Complete Case Analysis

Complete Case Analysis (CCA), also called ‘list-wise deletion’ of cases, consists of discarding every observation in which the value of any variable is missing.

In other words, complete case analysis means literally analyzing only those observations for which there is information in all of the variables in the dataset.
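In pandas, this definition corresponds to calling `dropna()` with its default settings, which drops every row that has at least one missing value. A minimal sketch on a made-up DataFrame (the values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): rows 1 and 2 each have one missing value.
toy = pd.DataFrame({
    "city_development_index": [0.92, 0.62, np.nan, 0.79],
    "experience": [5.0, np.nan, 3.0, 10.0],
    "training_hours": [36.0, 47.0, 83.0, 52.0],
})

# Complete case analysis: keep only rows with no missing value in any column.
complete_cases = toy.dropna()

print(complete_cases)
print(len(complete_cases), "of", len(toy), "rows kept")
```

Note that row 1 is dropped even though its `city_development_index` and `training_hours` are observed; only fully observed rows survive.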

When to use CCA

1. When the data is missing completely at random (MCAR).

2. As a rule of thumb, apply this technique only when less than 5% of the data is missing.

For example, if a dataset has 1000 rows and 50 of them contain missing values, we can drop those rows only when we are confident that the values are missing at random. We should also verify that the distribution of the data does not change significantly after dropping them.

In that case, dropping them is as good as dropping 50 random rows from the dataset, and the distribution of the data remains the same.
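The 1000-versus-50 example can be checked with a small simulation. The sketch below uses simulated data (not the dataset from this article) and removes 50 values at positions chosen uniformly at random, which makes them MCAR by construction. In a simulation we can compare against the fully observed series, something that is not possible with real missing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# 1000 simulated observations (hypothetical data).
values = pd.Series(rng.normal(loc=50.0, scale=10.0, size=1000))

# Blank out 50 positions chosen uniformly at random: MCAR by construction.
missing_positions = rng.choice(len(values), size=50, replace=False)
with_missing = values.copy()
with_missing.iloc[missing_positions] = np.nan

# Complete case analysis on the single column.
after_cca = with_missing.dropna()

# Compare summary statistics before and after dropping.
summary = pd.DataFrame({
    "original": values.describe(),
    "after_cca": after_cca.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

The mean and standard deviation should be nearly identical in both columns. If the 50 values were instead removed in a way that depends on their size (for example, only the largest ones), the two columns would drift apart.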

Advantages of CCA

  • Easy to implement; no data manipulation is needed beyond dropping the rows.
  • It preserves the data distribution (if the data is missing completely at random), since there is no major difference between the distribution of the data before and after dropping the values.

Disadvantages of CCA

  • If there are many missing values, we can lose a large share of the observations.
  • We lose the information in the other columns of every dropped row.
  • In production, a model trained only on complete cases will not know how to handle missing values.
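To get a feel for the first disadvantage, here is a rough back-of-the-envelope calculation. It assumes that values are missing independently in each column, which is a simplifying assumption of this sketch and not a claim about any real dataset. The function name `expected_fraction_retained` is hypothetical.

```python
def expected_fraction_retained(missing_rate: float, n_columns: int) -> float:
    """Expected share of rows left after CCA.

    Assumes each of the n_columns has the same missing rate and that
    missingness is independent across columns.
    """
    return (1.0 - missing_rate) ** n_columns


# Even a small missing rate per column adds up as more columns are involved.
for k in (1, 5, 10, 20):
    print(k, "columns ->", round(expected_fraction_retained(0.05, k), 3))
```

Under these assumptions, 5% missing values in each of 5 columns already leaves only about 77% of the rows.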

Let us look at the practical example of CCA

Let us read the data and check the percentage of missing values in each column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the density plots further down

df = pd.read_csv('data_science_job_data.csv')

df.shape
(19158, 13)
df.isnull().sum()/len(df)*100

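From the output above, we can see that 5 columns are eligible for applying the CCA technique.

# Assumed selection rule (the definition is not shown here):
# columns with some, but less than 5%, missing values
missing_columns_5percent = [col for col in df.columns if 0 < df[col].isnull().mean() < 0.05]
len(df[missing_columns_5percent].dropna())/len(df)*100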

89.68577095730244

new_df = df[missing_columns_5percent].dropna()
df.shape, new_df.shape

((19158, 13), (17182, 5))

After applying CCA, we retain about 89.7% of the data, that is, 17182 of the 19158 observations.
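The same steps can be wrapped in a small helper and tried on made-up data, without needing the CSV file. This is only a sketch: the function name `complete_case_analysis`, the `threshold` parameter, and the rule "more than 0% and less than 5% missing" are assumptions inferred from the text above, and the `toy` DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd


def complete_case_analysis(data: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Keep the columns with some, but less than `threshold`, missing values,
    then drop every row that has a missing value in any of those columns."""
    missing_share = data.isnull().mean()
    eligible = missing_share[(missing_share > 0) & (missing_share < threshold)].index
    return data[eligible].dropna()


# Toy data (hypothetical) with 40 rows.
toy = pd.DataFrame({
    "a": [np.nan] + [1.0] * 39,        # 2.5% missing: eligible
    "b": [np.nan] * 10 + [2.0] * 30,   # 25% missing: not eligible
    "c": [3.0] * 40,                   # nothing missing: not selected
})

result = complete_case_analysis(toy)
print(result.shape)
print(f"{len(result) / len(toy) * 100:.1f}% of the rows retained")
```

As in the article's code, the result contains only the eligible columns. Columns with no missing values could be added back by selecting them alongside the eligible ones before calling `dropna()`.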

Let us look at the data distribution, before and after dropping the missing rows for each eligible column.

1. Dropping the missing rows in the column ‘city_development_index’.

plt.figure(figsize=(10,6))
# red: original data, green: data after CCA
df['city_development_index'].plot.density(color = 'red', label = 'original')
new_df['city_development_index'].plot.density(color = 'green', label = 'after CCA')
plt.legend()
plt.show()

The above plots show that there is no significant difference between the distribution of data before and after removing the missing values in the column ‘city_development_index’. So we can apply this technique.

2. Dropping the missing rows in the column ‘experience’.

plt.figure(figsize=(10,6))
# red: original data, green: data after CCA
df['experience'].plot.density(color = 'red', label = 'original')
new_df['experience'].plot.density(color = 'green', label = 'after CCA')
plt.legend()
plt.show()

The above plots show that there is no significant difference between the distribution of data before and after removing the missing values in the column ‘experience’. So we can apply this technique.

3. Dropping the missing rows in the column ‘training_hours’.

plt.figure(figsize=(10,6))
# red: original data, green: data after CCA
df['training_hours'].plot.density(color = 'red', label = 'original')
new_df['training_hours'].plot.density(color = 'green', label = 'after CCA')
plt.legend()
plt.show()

The above plots show that there is no significant difference between the distribution of data before and after removing the missing values in the column ‘training_hours’. So we can apply this technique.

4. Dropping the missing rows in the column ‘enrolled_university’.

temp = pd.concat([
    # percentage of observations per category, original data
    df['enrolled_university'].value_counts() / len(df)*100,

    # percentage of observations per category, cca data
    new_df['enrolled_university'].value_counts() / len(new_df)*100
],
axis=1)

# add column names
temp.columns = ['original', 'cca']

temp

The above values show that there is no significant difference between the categories of data before and after removing the missing values in the column ‘enrolled_university’. So we can apply this technique.

5. Dropping the missing rows in the column ‘education_level’.

temp = pd.concat([
    # percentage of observations per category, original data
    df['education_level'].value_counts() / len(df)*100,

    # percentage of observations per category, cca data
    new_df['education_level'].value_counts() / len(new_df)*100
],
axis=1)

# add column names
temp.columns = ['original', 'cca']

temp

The above values show that there is no significant difference between the categories of data before and after removing the missing values in the column ‘education_level’. So we can apply this technique.

My LinkedIn profile

You can find the full code on my GitHub page
