An Efficient Approach to Handling Missing Data

Published in Analytics Vidhya, Mar 20, 2020

Most real-world datasets have missing entries. To apply machine learning algorithms to such datasets, we first need to handle the missing values in them.

In many data science projects, engineers spend a large share of their time preprocessing data, and handling missing values is a crucial step in preparing the data for an algorithm.

Missing values mostly arise from manual data entry procedures, equipment errors, and incorrect measurements.

First, let us look more closely at these missing values. There are three kinds of missing values, namely:

MISSING COMPLETELY AT RANDOM (MCAR) - Values in a dataset are missing completely at random if the events that lead to any particular data item being missing are independent of both observable variables and unobservable parameters of interest, and occur entirely at random. In other words, there is no relationship between the missingness of the data and any values, observed or missing. This happens, for example, when someone simply forgets to fill in a particular cell during data entry.
MISSING AT RANDOM (MAR) - Occurs when the missingness is related to another, observed variable, but not to the value of the variable that has missing data. In other words, the missingness can be predicted from other features in the dataset. For example, if older respondents skip an income question more often than younger ones, regardless of how much they earn, the missing incomes are MAR given age.
MISSING NOT AT RANDOM (MNAR) - This is data that is missing for a specific reason: the value of the variable that is missing is related to the reason it is missing. An example is a question on a questionnaire that participants with certain characteristics tend to skip deliberately. Here the missingness is related to the unobserved data, that is, to the values we don’t have or to factors we didn’t account for.
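
The three mechanisms above can be made concrete with a small simulation. This is only a sketch: the survey columns (`age`, `income`), the missingness probabilities, and the age and income thresholds are all made-up assumptions chosen for illustration, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Hypothetical survey data: age is always observed, income may go missing.
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every row has the same 20% chance of losing its income value.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on an observed column (age), not on income itself.
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.4, 0.05)

# MNAR: the chance depends on the very value that goes missing (high incomes are hidden).
mnar_mask = rng.random(n) < np.where(df["income"] > df["income"].median(), 0.4, 0.05)

print("true mean income:", round(df["income"].mean(), 1))
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["income"].mask(mask)  # rows where mask is True become NaN
    print(name, "missing share:", round(observed.isna().mean(), 3),
          "observed mean:", round(observed.mean(), 1))
```

In this toy setup the MCAR mask removes rows without regard to any value, while the MAR and MNAR masks remove more of the high-income rows (directly for MNAR, and through age for MAR), so the mean of the remaining incomes drifts below the true mean.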

Now let us discuss the most common ways of handling missing values and the problems associated with them.

Do nothing: This is the simplest option. Some algorithms (such as XGBoost) handle missing values themselves, while many others throw errors when data is missing. If you are using an algorithm that takes care of missing values by itself, you can go ahead without imputing anything.
Replace the missing values with the mean of the column in which they occur.
Replace the missing values with the median of the column in which they occur.
Replace the missing values with the mode of the column in which they occur.
Impute zero or another constant value in place of the missing values.
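
The column-level strategies listed above can be sketched with pandas `fillna`. The toy DataFrame and its column names (`age`, `city`) are hypothetical and exist only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan, 31],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi", "Pune"],
})

# Mean and median are computed from the observed values only (NaN is skipped).
mean_filled = df["age"].fillna(df["age"].mean())
median_filled = df["age"].fillna(df["age"].median())

# The mode (most frequent value) also works for categorical columns.
mode_filled = df["city"].fillna(df["city"].mode()[0])

# A constant such as zero ignores the data entirely.
zero_filled = df["age"].fillna(0)

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
print(zero_filled.tolist())
```

If you prefer to impute inside a scikit-learn pipeline, `SimpleImputer` offers the same strategies through its `strategy` parameter (`"mean"`, `"median"`, `"most_frequent"`, `"constant"`).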

These methods seem easy to use, but they are probably not the best choice for accurate predictions.

While all these methods are easy to implement and work well with a small dataset, they have some drawbacks when the dataset is large:

They do not consider the correlation between features in the dataset and work only at the column level.
They can bias the data (especially imputation with zero or a constant value).
They may not give accurate results.
They may not preserve the form of the data (e.g. a column of integers can end up with float values if imputed using the mean).
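
A quick experiment can illustrate some of these drawbacks for mean imputation. This is a sketch on synthetic data: the `rooms` and `price` columns, the linear relationship between them, and the 30% missing rate are all assumptions made up for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 1_000

# Hypothetical data: an integer-valued feature that strongly drives the target.
rooms = rng.integers(1, 6, size=n)  # whole numbers from 1 to 5
price = 50 * rooms + rng.normal(0, 10, size=n)
df = pd.DataFrame({"rooms": rooms.astype(float), "price": price})

# Statistics on the complete data, before anything goes missing.
corr_before = df["rooms"].corr(df["price"])
std_before = df["rooms"].std()

# Remove about 30% of the rooms values completely at random, then mean-impute.
missing = rng.random(n) < 0.3
df.loc[missing, "rooms"] = np.nan
imputed = df["rooms"].fillna(df["rooms"].mean())

# The filled-in value is the column mean, which is typically not a whole number.
print("imputed value:", imputed[missing].iloc[0])
# Every imputed row sits at the mean, so the spread of the column shrinks.
print("std before / after:", round(std_before, 3), round(imputed.std(), 3))
# The imputed rows carry no information about price, so the correlation weakens.
print("corr before / after:", round(corr_before, 3), round(imputed.corr(df["price"]), 3))
```

Because each missing entry is replaced by the same column-level number regardless of its `price`, the relationship between the two features is diluted, which is the column-level limitation described in the first point above.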

Now let us discuss what needs to be done to get relatively more accurate results when a dataset has missing values. To show this, we consider the following dataset.