Pump it up — How to deal with missing data?

5 min read · Dec 27, 2021

This is the second article in a series of four describing my workflow in the DrivenData Pump it Up competition. Click here to read the first article on EDA.

MCAR, MAR and MNAR

Missing data is something we have to deal with in virtually any dataset we come across. Data can be missing for a multitude of reasons. The reason behind this missingness is relevant, because it influences your imputation options. Data can be Missing Completely at Random (MCAR). This means that there is no systematic mechanism that can explain why certain data points are missing. In the case of GPS height, this would mean that missingness cannot be explained by factors like the location of the water point, the person who recorded the GPS height, the time of year or any other factor. It is simply missing completely at random.

Now consider that the GPS height is not missing completely at random. If the GPS height is less likely to be recorded during the rainy season, the data would be Missing at Random (MAR): the missingness can be explained by other information we did record, here the time of year. Why is this relevant? Well, the missing and available data may no longer come from the same distribution, so imputing the missing GPS heights with the mean of the available data can introduce bias.

Even worse, in cases where the GPS height is only recorded for pumps up to an altitude of 1,000 meters, the data would be Missing Not at Random (MNAR): whether a value is missing depends on the missing value itself. If this were the case, we would have no information in our dataset to impute these missing values.
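The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up numbers: the relationship between season and height, the missingness probabilities and the 1,000 meter cut-off are all assumptions chosen for illustration, not values from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumption for illustration: pumps surveyed in the rainy season sit lower.
rainy = rng.random(n) < 0.5
height = np.where(rainy, 600.0, 1200.0) + rng.normal(0, 200, n)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.3

# MAR: missingness depends on an observed feature (the season).
mar_missing = rng.random(n) < np.where(rainy, 0.5, 0.1)

# MNAR: missingness depends on the unobserved value itself.
mnar_missing = height > 1000

true_mean = height.mean()
observed_means = {
    "MCAR": height[~mcar_missing].mean(),
    "MAR": height[~mar_missing].mean(),
    "MNAR": height[~mnar_missing].mean(),
}

# Under MAR, the bias shrinks once we condition on the observed season.
mar_rainy_gap = height[rainy & ~mar_missing].mean() - height[rainy].mean()

print(f"true mean: {true_mean:.0f}")
for mechanism, mean in observed_means.items():
    print(f"{mechanism}: mean of available data = {mean:.0f}")
```

In this toy setup the mean of the available data stays close to the true mean under MCAR, drifts upward under MAR (rainy-season, low-lying pumps are underrepresented) and is far too low under MNAR. Within a single season, the MAR gap largely disappears, which is why imputation that uses the other features can work for MAR data.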

Is our data missing at random?

So, is our data MCAR, MAR or MNAR? To answer this question, I use a Python package called Missingno. This library has some great options to visualize the missingness of your dataset and to see how missingness in one feature correlates with that in another feature.

I created a Missingno matrix to see if there are any patterns in my missingness and wow, some of my data is certainly not missing completely at random. There is a pretty strong correlation between the missingness in construction year, population and GPS height.
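As a rough sketch of what such a check looks like, the snippet below computes the correlation between missingness indicators with plain pandas, which is the kind of quantity Missingno's heatmap visualizes. The DataFrame is toy data and the column names are assumptions; it also assumes missing values are already encoded as NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN marks a missing entry.
df = pd.DataFrame({
    "gps_height": [1390.0, np.nan, 686.0, np.nan, 263.0, np.nan],
    "population": [109.0, np.nan, 250.0, np.nan, 58.0, 1.0],
    "construction_year": [1999.0, np.nan, 2009.0, np.nan, 1986.0, np.nan],
    "region": ["Iringa", "Mara", "Manyara", "Mwanza", "Mtwara", "Kagera"],
})

# True where a value is missing.
is_missing = df.isnull()

# Keep only columns where some, but not all, values are missing;
# a constant indicator has no defined correlation.
varying = is_missing.loc[:, is_missing.nunique() > 1]

# Pairwise correlation of the missingness indicators.
nullity_corr = varying.astype(int).corr()
print(nullity_corr.round(2))

# With Missingno installed, the equivalent plots would be:
#   import missingno as msno
#   msno.matrix(df)   # row-by-row view of present vs. missing values
#   msno.heatmap(df)  # correlation of missingness between columns
```

In this toy example the missingness of GPS height and construction year is perfectly correlated, because the two are always missing in the same rows.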

What imputation technique should you choose?

There are many techniques you can use to impute missing data and each comes with its own assumptions, advantages and disadvantages. A popular technique is to impute missing data by the mean, mode or median of the available data. It is easy to perform and works well when your data is missing completely at random.

A major limitation of this method is that it tends to reduce the variance in your data, especially when a large percentage of it is missing. This is problematic, because variance is what your model needs to detect patterns hidden in your data. An improvement on simple mean imputation is to use, for example, the mean GPS height of the region or of another relevant feature in your dataset.
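A minimal sketch of the difference, using a toy DataFrame with two hypothetical regions (the column names and values are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: region A lies low, region B lies high.
df = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "gps_height": [100.0, 120.0, np.nan, 1500.0, np.nan, 1540.0],
})

# Strategy 1: replace every gap with the overall mean.
overall = df["gps_height"].fillna(df["gps_height"].mean())

# Strategy 2: replace each gap with the mean of its own region.
region_mean = df.groupby("region")["gps_height"].transform("mean")
by_region = df["gps_height"].fillna(region_mean)

print("variance of available data:", round(df["gps_height"].var()))
print("variance after overall mean:", round(overall.var()))
print("variance after region mean:", round(by_region.var()))
```

Here the overall mean (815) lands between the two regions and describes neither, while the region means (110 and 1520) keep the imputed values close to their neighbours. Both strategies lower the variance compared to the available data, but the region mean does so much less.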

Another popular imputation strategy is MICE, short for Multiple Imputation by Chained Equations. MICE runs several iterations of regression models on your dataset and replaces missing values in a feature with predictions learned from your available data before it moves on to the next feature. MICE can be a bit more computationally expensive, but it can produce great results when data is MAR.
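One way to try this in Python is scikit-learn's IterativeImputer, which is inspired by MICE but returns a single imputed dataset by default. The sketch below is an assumption about tooling, not necessarily the implementation used for this article, and it runs on synthetic data in which the GPS height depends on longitude.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 200

# Synthetic data (assumption): height rises with longitude, plus noise.
longitude = rng.uniform(30, 40, n)
latitude = rng.uniform(-11, -1, n)
gps_height = 100 * (longitude - 30) + rng.normal(0, 50, n)

df = pd.DataFrame({
    "longitude": longitude,
    "latitude": latitude,
    "gps_height": gps_height,
})

# Remove roughly 30% of the heights so there is something to impute.
mask = rng.random(n) < 0.3
df.loc[mask, "gps_height"] = np.nan

# Each feature with gaps is regressed on the other features, round-robin.
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Compare the imputed values with the true values we removed.
error = np.abs(imputed.loc[mask, "gps_height"] - gps_height[mask]).mean()
print(f"mean absolute error on imputed values: {error:.1f}")
```

Because the imputer can use longitude to predict the height, the imputed values follow the underlying pattern instead of collapsing onto one number. Only numeric columns are passed in here; categorical features would need to be encoded first.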

Comparing imputation strategies

Let’s compare three imputation strategies and see how they impact the distribution and variance of the GPS height.

1. Impute by the overall mean of the GPS height

2. Impute by the mean of the GPS height by subvillage, ward, lga and region (i.e. work our way up from the subvillage level to the region level based on available data on the location of the water point)

3. Impute using MICE
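The second strategy can be sketched as a fallback chain: try the most local group mean first and only move up a level when that group has no recorded heights. The helper `impute_by_location` below is hypothetical, the column names are assumptions, and the toy data uses only three location levels to keep it short.

```python
import numpy as np
import pandas as pd

def impute_by_location(df, column, levels):
    """Fill gaps in `column` with group means, from finest to coarsest level."""
    filled = df[column].copy()
    for level in levels:
        # Mean of the recorded values per group; NaN if the group has none.
        group_mean = df.groupby(level)[column].transform("mean")
        filled = filled.fillna(group_mean)
    # Last resort: the overall mean of the recorded values.
    return filled.fillna(df[column].mean())

# Toy data (hypothetical locations and heights).
df = pd.DataFrame({
    "region":     ["R1", "R1", "R1", "R1", "R1", "R2", "R3"],
    "ward":       ["W1", "W1", "W2", "W2", "W4", "W3", "W5"],
    "subvillage": ["S1", "S1", "S2", "S3", "S5", "S4", "S6"],
    "gps_height": [200.0, np.nan, np.nan, 400.0, np.nan, np.nan, 900.0],
})

result = impute_by_location(df, "gps_height", ["subvillage", "ward", "region"])
print(result.tolist())
```

In this example the second row is filled from its subvillage (200), the third from its ward (400), the fifth from its region (300) and the sixth, which has no recorded neighbours at any level, from the overall mean (500).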

Which imputation strategy worked best?

MICE preserves the shape of the GPS height distribution and the amount of variance better than the other two strategies. Mean imputing has the most dramatic impact on the distribution of the GPS height, since all missing data is replaced by a single value. The variance after mean imputing is therefore the lowest. Manual imputing by location (strategy 2) scores somewhere in the middle. It uses a wider range of values to impute the missing data, but it only uses location-based information, whereas MICE uses all available information to impute the GPS height.

The main takeaway here is that it is worth playing around with different imputation strategies to see which one works best for your data.

All code used in this article can be found on GitHub.

In the next article we will discuss feature selection and feature engineering.

References and further reading

· Akinfaderin, W. 2017. Missing data conundrum: exploration and imputation techniques. https://medium.com/@WalePhenomenon/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87

· Brownlee, J. 2020. Iterative imputation for missing values in machine learning. https://machinelearningmastery.com/iterative-imputation-for-missing-values-in-machine-learning/

· Hashi, Z. 2018. Difference between MAR, MCAR and MNAR missing data. https://www.linkedin.com/pulse/difference-between-mar-mcar-mnar-missing-data-zakarie-a-hashi/

Written by Brenda Loznik