Types of Missing Data

Dan Berdikulov
4 min read · Apr 1, 2019

Most of the effort in a data project goes into data preparation; by some estimates it can take up to 90 percent of the total project time. Dealing with missing data is one of the hardest parts of this phase, partly because there is no single best way to handle missing values.

To decide what to do with the missing values in your dataset, you first need to understand what type of missing values you have. When I first faced the problem of missing data, I found the types hard to tell apart, so in this article I will try to explain them with simple and clear examples.

Three kinds of missing data:

  • Missing at Random (MAR)
  • Missing Completely at Random (MCAR)
  • Missing Not at Random (MNAR)

Let’s imagine that we are trying to predict the price of a car listed for sale, for example on eBay. The data may look like this:

This example will help us understand the different types of missing values.
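As a stand-in for such a table, here is a minimal sketch of a hypothetical dataset built with pandas. The column names (`year`, `mileage`, `color`, `price`) and every value are assumptions invented for illustration; they are not the data from the original project.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; every value here is invented for illustration.
cars = pd.DataFrame(
    {
        "year": [2016, 2014, 2009, 2003, 2001, 2012],
        "mileage": [42_000, 61_000, 150_000, np.nan, np.nan, 98_000],
        "color": ["red", "black", None, "blue", "white", "silver"],
        "price": [15_500, 11_200, 4_900, 2_100, 1_800, 7_400],
    }
)

# Count the missing values in each column.
missing_per_column = cars.isna().sum()
print(missing_per_column)
```

Counting missing values per column is usually the first step, because it tells you which columns need a closer look.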

Missing at Random (MAR)

MAR means there is a systematic relationship between the propensity of a value to be missing and the observed data, but not the missing data itself. In other words, the missingness can be predicted from other features in the dataset. Still confused? Fear not. In the table above you can see that the mileage column has a few missing values. If we analyze the dataset, we can see that the cars with missing mileage tend to have an earlier manufacturing year than the others.

Percentage of missing values in the mileage column depending on the year of the car.

Above, you can see an illustration made from real data in one of my previous projects. It shows a clear correlation between the percentage of missing values in the mileage column and the manufacturing year of the car: the older the car, the higher the probability that the seller did not provide the mileage. There can be many reasons not to provide the mileage of an old car, but the most important one is that old cars usually have more mileage than new ones, and the higher the mileage, the more the car has been used, which greatly affects its price. Long story short, we can predict the missingness of the mileage from the car’s manufacturing year. Strictly speaking, the data is MAR only if, among cars of the same year, the missingness does not also depend on the mileage value itself.
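A chart like this can be computed with a simple group-by. The sketch below is only an illustration: the `cars` DataFrame and its values are hypothetical, chosen so that older cars lack mileage more often.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: the older the model year, the more mileage values are missing.
cars = pd.DataFrame(
    {
        "year": [2001, 2001, 2001, 2001,
                 2008, 2008, 2008, 2008,
                 2015, 2015, 2015, 2015],
        "mileage": [np.nan, np.nan, np.nan, 210_000,
                    np.nan, 140_000, 155_000, 120_000,
                    60_000, 45_000, 52_000, 38_000],
    }
)

# Share of missing mileage values per manufacturing year, as a percentage.
missing_pct_by_year = (
    cars["mileage"].isna().groupby(cars["year"]).mean().mul(100)
)
print(missing_pct_by_year)
```

If the percentage changes clearly from one year to the next, the missingness is related to an observed feature, which is the pattern described above.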

Missing Completely at Random (MCAR)

MCAR means there is no relationship between the missingness of the data and any values, observed or missing. This kind of missing data is the easiest to understand: the fact that a value is missing has nothing to do with either the observed or the unobserved data. It is just missing, and there is no logic in it. In the table above, there is a missing value in the color column. There can be many reasons for it, but none of them is systematic: someone simply forgot to mention the color, someone was too lazy to do it, or maybe there was a problem with the system. Nothing serious here.
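To see what "no logic" looks like in numbers, here is a small simulation on synthetic data. The 10% missing rate, the column names, and the year range are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 100_000

# Synthetic listings with a random year and a random color.
cars = pd.DataFrame(
    {
        "year": rng.integers(2000, 2020, size=n),
        "color": rng.choice(["red", "black", "white"], size=n),
    }
)

# MCAR: every row has the same 10% chance of losing its color,
# independent of the year and of the color itself.
mcar_mask = rng.random(n) < 0.10
cars.loc[mcar_mask, "color"] = np.nan

# Under MCAR the missing rate should be roughly equal in every group.
is_old = cars["year"] < 2010
rate_old = cars.loc[is_old, "color"].isna().mean()
rate_new = cars.loc[~is_old, "color"].isna().mean()
print(f"old cars: {rate_old:.3f}, new cars: {rate_new:.3f}")
```

Both rates should land close to 0.10, because the year of the car plays no role in whether the color is missing.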

Missing Not at Random (MNAR)

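MNAR data is the most complicated type, both to detect and to deal with. The fact that a value is missing is related to unobserved data, i.e. data that we don’t have: the missingness depends on the missing value itself or on factors that we didn’t account for. For example, if sellers hid the mileage precisely because it was high, regardless of the car’s year, the mileage would be MNAR. I know, that’s sad.

The easiest way to find out why data is missing is to understand the data collection process. There are also statistical methods, such as Little’s MCAR test, that help check whether the data is MCAR; distinguishing MAR from MNAR is harder, because it depends on values that we cannot observe.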
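A short simulation shows why MNAR is troublesome. In this sketch the "true" mileage values are synthetic, and the rule that higher mileage is hidden more often (up to 80% of the time) is an assumption chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Hypothetical true mileage values (in km), before anything goes missing.
true_mileage = rng.uniform(0, 300_000, size=n)

# MNAR: the chance of hiding the mileage grows with the mileage itself,
# from 0% for an unused car up to 80% at 300,000 km.
p_missing = 0.8 * true_mileage / 300_000
hidden = rng.random(n) < p_missing

# What the analyst actually gets to see.
observed = true_mileage[~hidden]
print(f"true mean:     {true_mileage.mean():,.0f}")
print(f"observed mean: {observed.mean():,.0f}")
```

The observed mean comes out noticeably lower than the true mean, and nothing in the observed data alone reveals this bias, because the values that would expose it are exactly the ones that are missing.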
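One informal check, sketched below, is to compare an observed feature between rows with and without the missing value. This is not a formal statistical test. The data is synthetic, and the missing rates (30% for cars made before 2010, 5% otherwise) are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 50_000

year = rng.integers(2000, 2020, size=n)
mileage = rng.uniform(0, 300_000, size=n)

# Assumed mechanism: older cars lose their mileage value more often.
p_missing = np.where(year < 2010, 0.30, 0.05)
mileage[rng.random(n) < p_missing] = np.nan

cars = pd.DataFrame({"year": year, "mileage": mileage})

# Compare the average year of rows with and without a mileage value.
is_missing = cars["mileage"].isna()
mean_year_missing = cars.loc[is_missing, "year"].mean()
mean_year_present = cars.loc[~is_missing, "year"].mean()
print(f"mean year, mileage missing: {mean_year_missing:.1f}")
print(f"mean year, mileage present: {mean_year_present:.1f}")
```

A clear gap between the two averages suggests that the data is not MCAR. It cannot tell MAR from MNAR, since that depends on the values that were never recorded.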

Wrapping it up

Missing data is complicated, but it is also fun to work with. Each type of missing value calls for a different approach, and the choice can strongly affect the overall result of your project. The only goal of this article was to cover the different types of missing values; dealing with them is the subject of the next article.

Dan Berdikulov

A data scientist with excellent expertise in machine learning, with previous experience in statistics and developing new products.