Handling Missing Values with Mean & Median Imputation in R

Published in Analytics Vidhya · 5 min read · Jun 26, 2021

Imagine for a moment that you have to cross that river. Would you swim through it? Or would you take the bridge? I’ll give you 1 minute to choose.

Ok, time’s up!

If, like me, you are not a professional swimmer, we both agree that the best way to reach our destination is via the bridge. But as you might have noticed, there are some gaps that threaten our safety.

That’s the way Machine Learning algorithms work. Like a bridge, they take you from point A (observation) to point B (prediction). However, missing values resemble those gaps in your path that do not allow you to advance.

The previous analogy is not only figurative, it’s real! Internally, the algorithms work with vectors. And like an old movie’s bank vault, when there are no values stored, there are serious problems…

Why Do Missing Values Exist?

There are many answers to that question. However, they can be grouped into 3 types:

Missing Completely at Random (MCAR)

As the name implies, the probability that a value is missing is the same for every observation. In other words, there’s no relation between the missing value for column A at row 35 and the missing value for column 11 at row 1,385, nor between the missingness and any value in the data. They are independent.

Some examples are API errors, missing laboratory values because of bad samples, a random subset of citizens that didn’t vote, etc.

Missing at Random (MAR)

In contrast to the previous type, here there’s a relation between the missing values and some other observed features in the data set.

For example, in a survey some women might prefer to skip the question about their weight. The missing values in weight (the feature) are then related to another recorded feature, sex.

Missing Not at Random (MNAR)

Finally, this category applies when the missingness depends on the missing value itself, or on factors that were never recorded. For example, in a socioeconomic study it’s highly probable that low- and high-income people will not fill in the survey.

Now that you are familiar with the topic, which one do you think applies to the scary bridge we want to cross?

Enough theory. How can we handle them?
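To make the three types more concrete, here is a minimal simulation sketch in base R. Everything in it is an assumption for illustration: the variables (sex, weight, income), the sample size and the missingness probabilities are made up and do not come from any real survey.

```r
set.seed(42)
n <- 1000

# Hypothetical survey data (illustrative values only)
sex    <- sample(c("F", "M"), n, replace = TRUE)
weight <- rnorm(n, mean = 70, sd = 12)
income <- rlnorm(n, meanlog = 10, sdlog = 0.5)

# MCAR: every weight has the same 10% chance of being missing
weight_mcar <- ifelse(runif(n) < 0.10, NA, weight)

# MAR: the chance of a missing weight depends on an observed feature (sex)
p_mar      <- ifelse(sex == "F", 0.30, 0.05)
weight_mar <- ifelse(runif(n) < p_mar, NA, weight)

# MNAR: the chance of a missing income depends on the income itself
extreme     <- income < quantile(income, 0.10) | income > quantile(income, 0.90)
p_mnar      <- ifelse(extreme, 0.60, 0.05)
income_mnar <- ifelse(runif(n) < p_mnar, NA, income)

# Share of missing values per group
rate_mar  <- tapply(is.na(weight_mar), sex, mean)
rate_mnar <- tapply(is.na(income_mnar), extreme, mean)
print(mean(is.na(weight_mcar)))
print(rate_mar)
print(rate_mnar)
```

In the MAR case the gap between groups can be seen from recorded data (sex). In the MNAR case the cause is the very value that went missing, so it cannot be checked from the observed data alone.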

The answer is simple. Imputation.

I know. The bridge is not the same. But the logic is!

As the guy did, we can use similar values in our favor in order to achieve our goal. In our hypothetical situation we can reuse planks from earlier in the bridge to “fill the gaps” and reach our destination. In Machine Learning we can replicate this action by taking some statistical measures you may already know and using them to fill the missing values.

The following are some of the most commonly used methods:

Mean
Median
Mode
Min or Max values
KNN
Regression

For the scope of this introductory post, we will be covering the mean and median methods.

Let’s see an example with code

For the following example, we will be using the House Prices dataset from the Kaggle Competition.

First, we import the dplyr and ggplot2 libraries for data analysis and visualization, respectively. Also, we import the dataset. For this example we will use the train_HP dataframe.

Image 1: Importing libraries and dataset.
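Since the code in the image is not reproduced here, below is a minimal base R sketch of the loading step. It reads a tiny inline CSV instead of the real file, so the rows are made up; only the column names mimic the House Prices data. With the real data you would call read.csv on the path of the downloaded file, and that path is an assumption that depends on where you saved it. The dplyr and ggplot2 libraries mentioned above are not needed for this sketch.

```r
# Tiny inline stand-in for the Kaggle file (hypothetical rows)
csv_text <- "Id,LotFrontage,Alley,SalePrice
1,65,NA,208500
2,NA,Pave,181500
3,68,NA,223500"

# With the real data this would be something like read.csv("train.csv"),
# where the path is an assumption about where the file was saved.
train_HP <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# By default read.csv turns the string "NA" into a missing value,
# for numeric and for character columns alike
str(train_HP)
print(colSums(is.na(train_HP)))
```

Keep this default in mind: any column where the text “NA” appears in the file will be counted as having missing values.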

Our first step is to identify the features with missing values. For that, we create the colsNa vector with the complete.cases function. Note that the exclamation mark (logical NOT) inverts its result, so we select the incomplete cases instead of the complete ones. According to the result, 19 features contain missing values.

That will help us to create the incompleteData data frame that contains only the features with missing values.

Image 2: Identifying the features with missing values.
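As a rough sketch of this step, the snippet below finds the columns with missing values in a toy data frame. Note the assumptions: the rows are invented, and instead of complete.cases it uses colSums(is.na(...)), a different but common base R way to get the same list of columns. The names colsNa and incompleteData follow the text above.

```r
# Toy stand-in for train_HP (hypothetical values, not the Kaggle data)
train_HP <- data.frame(
  Id          = 1:5,
  LotFrontage = c(65, NA, 68, NA, 84),
  Alley       = c(NA, "Pave", NA, NA, NA),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000),
  stringsAsFactors = FALSE
)

# TRUE for every column that contains at least one NA
hasNa  <- colSums(is.na(train_HP)) > 0
colsNa <- names(train_HP)[hasNa]

# Keep only the features with missing values
incompleteData <- train_HP[, colsNa, drop = FALSE]

print(colsNa)
print(length(colsNa))
```

The drop = FALSE argument keeps the result a data frame even when only one column has missing values.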

Our second step is to split the features with missing values using a threshold. To assign a value to each feature, we calculate the mean with the apply function. Note that the input is the result of the is.na function for each column: a vector of TRUE and FALSE values whose mean is the proportion of missing values in that column.

The next step is to store those proportions in a data frame.

Image 3: Computing the proportion of missing values for each feature.
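The idea can be sketched as follows. The data frame is a toy example with invented values, and propNa and propDf are hypothetical names chosen for this sketch. The key point is that is.na returns TRUE/FALSE values, and the mean of those is the share of TRUE.

```r
# Toy stand-in for incompleteData (hypothetical values)
incompleteData <- data.frame(
  LotFrontage = c(65, NA, 68, NA, 84, 70, 60, 75, 80, 66),
  Alley       = c("Pave", "Grvl", rep(NA, 8)),
  Electrical  = c(NA, rep("SBrkr", 9)),
  stringsAsFactors = FALSE
)

# Mean of the TRUE/FALSE matrix, column by column (MARGIN = 2)
propNa <- apply(is.na(incompleteData), 2, mean)

# Store the proportions in a data frame, highest first
propDf <- data.frame(
  feature    = names(propNa),
  proportion = as.numeric(propNa),
  stringsAsFactors = FALSE
)
propDf <- propDf[order(-propDf$proportion), ]
print(propDf)
```

colMeans(is.na(incompleteData)) gives the same numbers in a single call, if you prefer it over apply.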

As you can see, there are variables with high proportions of missing values. You should always make sure that those variables don’t have an effect on the model before you remove them. In this case, we will leave them out.

It’s time to set our threshold. It is common to set it at 5%, but you can use the value that fits your needs. The resulting data frame contains the features we will work with in our imputation.

Image 4: Setting our threshold of 5% and final data frame.
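Here is a small sketch of how a 5% threshold divides the features into two groups. The proportions are invented for illustration, and the names propDf, aboveThreshold and belowThreshold are hypothetical. Which group you keep for imputation is your decision; this sketch only shows the split.

```r
# Hypothetical proportions of missing values per feature
propDf <- data.frame(
  feature    = c("LotFrontage", "Alley", "Electrical"),
  proportion = c(0.18, 0.94, 0.01),
  stringsAsFactors = FALSE
)

threshold <- 0.05  # 5%, adjust to your needs

# Divide the features by the threshold
aboveThreshold <- propDf[propDf$proportion >  threshold, ]
belowThreshold <- propDf[propDf$proportion <= threshold, ]

print(aboveThreshold)
print(belowThreshold)
```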

In this entry we will work with just one variable: LotFrontage. We replace all its missing values with the mean (first statement) and with the median (second statement), and then compare the results.

Image 5: Comparison between mean and median imputation.
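A minimal sketch of both imputations, assuming a toy LotFrontage column with invented values. The column names LotFrontage_mean and LotFrontage_median are hypothetical; keeping the imputed versions in separate columns makes the comparison easier.

```r
# Toy stand-in for the LotFrontage column (hypothetical values)
train_HP <- data.frame(
  LotFrontage = c(65, 80, NA, 60, 84, NA, 75, 51, 50, 70)
)

# na.rm = TRUE is required; otherwise both results would be NA
meanValue   <- mean(train_HP$LotFrontage, na.rm = TRUE)
medianValue <- median(train_HP$LotFrontage, na.rm = TRUE)

# Replace only the missing entries, keep the observed ones untouched
missing <- is.na(train_HP$LotFrontage)
train_HP$LotFrontage_mean   <- ifelse(missing, meanValue, train_HP$LotFrontage)
train_HP$LotFrontage_median <- ifelse(missing, medianValue, train_HP$LotFrontage)

print(c(mean = meanValue, median = medianValue))
print(train_HP)
```

For these toy values the mean is 66.875 and the median is 67.5, so the two filled columns differ only in the rows that were missing.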

Did you see the results? The mean value is 70.04996, while the median is 69. Let’s check this in a graph.

Image 6: Line graph of the mean and median imputation.

Ok, it’s difficult to distinguish. But the idea is that both imputation methods helped us to fill those gaps that we had at the beginning.

Notice that what seemed to be a “normal distribution” before no longer is one: every imputed value lands on the same point, which creates a spike at the mean or the median. So choose your method carefully in order to preserve the original distribution as much as possible.
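One way to see the distortion in numbers rather than in a plot is to compare the spread before and after imputing. The sketch below uses simulated data (an assumption, not the House Prices values): 500 normal draws with 90 of them removed at random.

```r
set.seed(1)

# Simulated feature (illustrative only) with 90 values removed at random
x <- rnorm(500, mean = 70, sd = 24)
x[sample(500, 90)] <- NA

observed      <- x[!is.na(x)]
imputedMean   <- ifelse(is.na(x), mean(observed), x)
imputedMedian <- ifelse(is.na(x), median(observed), x)

# The standard deviation shrinks because the filled values add no spread
spread <- c(
  observed = sd(observed),
  mean     = sd(imputedMean),
  median   = sd(imputedMedian)
)
print(round(spread, 2))

# Number of points stacked on a single value after mean imputation
print(sum(imputedMean == mean(observed)))
```

Mean imputation leaves the average unchanged but always reduces the standard deviation, because the filled values sit exactly at the center.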

Pros and Cons

Pros

Easy and fast to use

Cons

Distorts the distribution

In conclusion

Hurray! We crossed the river via the gappy bridge :)

Just to summarize: Machine Learning algorithms are bridges that take you from point A to point B. However, sometimes those bridges are full of gaps. Or in our jargon, missing values.

In order to “fill those gaps” you should take advantage of the data you know. How? By using statistical tools that we all know, such as the mean, median, mode, regression, KNN and others.

In this blog post we used the mean and median imputation methods. As we noticed, they are very fast and easy to implement, but sometimes they distort our data’s distribution. So always proceed with care.

Friend’s advice: try as many methods as possible. It’d be like filling those gaps with wood, metal, leaves, vines, etc. Eventually you’ll find the method that best fits the bridge you are trying to cross.