An Efficient Approach to Handle Missing Data

Nikhil Vyas
Published in Analytics Vidhya
4 min read · Mar 20, 2020

Most real-world datasets have missing entries. To apply machine learning algorithms to such datasets, we first need to handle those missing values.

In many data science projects, engineers spend plenty of time preprocessing the data, and handling missing values is a crucial step in preparing the data for the algorithm.

Missing values mostly occur as a result of manual data entry procedures, equipment errors, and incorrect measurements.

First, let us learn more about these missing values. There are three kinds of missing values:

  • MISSING COMPLETELY AT RANDOM (MCAR): Values in a dataset are missing completely at random if the events that lead to any particular data item being missing are independent of both the observable variables and the unobservable parameters of interest, and occur entirely at random. This happens, for example, when someone simply forgets to enter the data in a particular cell. MCAR means there is no relationship between the missingness of the data and any values, observed or missing.
  • MISSING AT RANDOM (MAR): Occurs when the missingness is related to another, observed variable, but not to the value of the variable that has missing data. For example, if respondents of one gender are less likely to answer a questionnaire item regardless of what their answer would have been, the missingness depends on gender and not on the missing answer itself. This means that the missingness of the data can be predicted from other features in the dataset.
  • MISSING NOT AT RANDOM (MNAR): The data is missing for a specific reason, i.e. the value of the variable that is missing is related to the reason it is missing. An example is a question on a questionnaire that tends to be skipped deliberately by participants with certain characteristics. The fact that the data is missing is related to the unobserved data, i.e. the data that we don’t have; the missingness is related to factors that we didn’t account for.

Now let us discuss the most common ways of handling missing values and the problems associated with them.

  • Do nothing: This is the simplest way. Some algorithms (like XGBoost) handle missing values themselves, while many others throw errors if there is missing data. So if you are using an algorithm that takes care of missing values by itself, go ahead.
  • Replace the missing values with the mean of the column in which they occur.
  • Replace the missing values with the median of the column in which they occur.
  • Replace the missing values with the mode of the column in which they occur.
  • Impute zero or another constant value in place of the missing values.
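
As a minimal sketch, the column-level replacements listed above could look like this in pandas. The toy DataFrame and its column names (`age`, `city`) are made up for illustration and are not from the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Delhi", "Pune", None, "Delhi"],
})

# Mean and median suit numeric columns.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Mode (the most frequent value) also works for categorical columns.
city_mode = df["city"].fillna(df["city"].mode()[0])

# A constant such as zero is the crudest option.
age_zero = df["age"].fillna(0)

print(age_mean.tolist(), age_median.tolist(), city_mode.tolist(), age_zero.tolist())
```

Each call returns a new Series, so the original column stays untouched and the different strategies can be compared side by side.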

These methods are easy to implement and work well with a small dataset, but they are probably not the best choice for accurate predictions, and there are some repercussions if the dataset is large:

  • They do not consider the correlation between features in the dataset and work only at the column level.
  • They can bias the data (especially when imputing zero or a constant value).
  • They may not give accurate results.
  • They may not preserve the correct data type (e.g. a column of integers can end up with a float value if imputed using the mean).
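
To make these drawbacks concrete, here is a small sketch on hypothetical data (the `area` and `rooms` columns are assumptions, not the article's dataset). It fills a whole-number column with its mean and then looks at what happened.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'rooms' holds whole numbers and grows with 'area'.
df = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90, 100],
    "rooms": [2, np.nan, 3, np.nan, 4, 5],
})

observed = df["rooms"].dropna()
imputed = df["rooms"].fillna(observed.mean())

# The column of whole numbers now contains a fractional fill value.
print(imputed.tolist())

# Every gap gets the same value, so the spread of the column shrinks.
print(observed.var(), imputed.var())

# The fill ignores 'area': the row with area 60 ends up with more rooms
# than the larger row with area 70.
print(imputed.iloc[1], df["rooms"].iloc[2])
```

Because the fill value is computed from the `rooms` column alone, it cannot use the fact that smaller areas tend to have fewer rooms.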

Now let us discuss what needs to be done to get relatively more accurate results when there are missing values in the dataset. To show this, we consider the following dataset.

First, let us import the libraries and the dataset.
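
The original dataset is not reproduced here, so the sketch below builds a small hypothetical DataFrame in its place (the columns `area`, `rooms` and `price` are assumptions). With your own data you would load a file instead, for example with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Stand-in for loading the dataset; columns and values are hypothetical.
df = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90, 100, 110, 120],
    "rooms": [2, 2, 3, 3, 4, 4, 5, 5],
    "price": [100.0, 120.0, np.nan, 160.0, 180.0, np.nan, 220.0, 240.0],
})

# Count the missing entries in each column to see where the gaps are.
missing_per_column = df.isna().sum()
print(missing_per_column)
```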

Now let’s divide the dataset as follows:

‘X’: contains all the data entries (rows) that have no missing values.

‘Y’: contains the columns that have the missing (NaN) values.

‘X_missing_rows’: contains the data entries (rows) that contain NaN.
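
One possible way to build these three pieces with pandas is sketched below. It assumes a hypothetical toy dataset in which a single column, `price`, has the missing values. The names `X`, `Y` and `X_missing_rows` follow the text, while everything else is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values only in 'price'.
df = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90, 100, 110, 120],
    "rooms": [2, 2, 3, 3, 4, 4, 5, 5],
    "price": [100.0, 120.0, np.nan, 160.0, 180.0, np.nan, 220.0, 240.0],
})

target = "price"                 # the column that has the NaN values
has_nan = df[target].isna()      # True for rows where the target is missing

# Complete rows: their features and their known target values.
X = df.loc[~has_nan].drop(columns=[target])
Y = df.loc[~has_nan, target]

# Rows whose target is missing: only their features are available.
X_missing_rows = df.loc[has_nan].drop(columns=[target])

print(X.shape, Y.shape, X_missing_rows.shape)
```

Splitting this way turns the imputation into an ordinary supervised learning problem: train on `X` and `Y`, then predict for `X_missing_rows`.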

In this example, we apply random forest regression, as it gives pretty good results, but you can try any other algorithm you like.
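
A minimal sketch of this step, assuming scikit-learn's `RandomForestRegressor` and a hypothetical toy dataset, might look like the following. The column names, the data and the hyperparameters are assumptions for illustration, not the article's original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data with missing values only in 'price'.
df = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90, 100, 110, 120],
    "rooms": [2, 2, 3, 3, 4, 4, 5, 5],
    "price": [100.0, 120.0, np.nan, 160.0, 180.0, np.nan, 220.0, 240.0],
})

target = "price"
has_nan = df[target].isna()

# Train on the complete rows, then predict for the rows with a missing target.
X = df.loc[~has_nan].drop(columns=[target])
Y = df.loc[~has_nan, target]
X_missing_rows = df.loc[has_nan].drop(columns=[target])

# random_state is fixed only to make the sketch reproducible.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, Y)

# Write the predictions into a copy so the original data is preserved.
df_filled = df.copy()
df_filled.loc[has_nan, target] = model.predict(X_missing_rows)

print(df_filled)
```

Unlike a column-level mean, each filled value here depends on the other features of its own row.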

The missing values in our case are -

I hope that you find this article useful and now know the problems faced in handling missing data and how to resolve them efficiently.
