Don’t be mean — I’m missing data

Andre Marques Leite
Blue Harvest Tech Blog
10 min read · Sep 12, 2022

You have probably heard the term ‘missing data’ almost as often as your own full name, and for some people it inspires the same dread as hearing a partner use that full name. Some think they can navigate out of this anxious situation easily, but is it always that straightforward?

When you get a dataset for the first time, going back to basics is always a good idea. Does your dataset have missing values? What does that mean? Do you have to worry about them? Can you just remove them, or should you try to fill them in? Which method should you use to fill the gaps, and how do you know when it is safe to remove them?

These are all questions that should go through your head when you receive data for the first time.

Why the reason behind missing data is important

During the data cleaning process, some techniques help us decide how to deal with the most common scenarios in datasets. However, before we get into the techniques, let’s go over some common questions about data cleaning.

Why do we care about why data is missing?

Even with today’s large volumes of data, missing data can lead to incorrect conclusions during analysis or to models with lower accuracy. This is especially relevant when the missing data sits in important columns used for decision-making.

Even if, after consideration, the decision is to take no action on the missing data, it is important to have an idea of how much data is missing and what vulnerabilities this can cause. The outcome of analyzing the missing data can be used to alert others who use the dataset, and to better understand how the gaps in some records affect the results.

Why should we spend time figuring out if there is missing data?

Figuring out how the missing values are distributed across the data has many consequences that can be split into three main areas:

  1. Time:
    Depending on what data is missing, the cleanup process can be very simple or even eliminated. Therefore, figuring out why information is missing can provide insight into how long it will take to process the dataset and what options are available.
  2. Quality:
    Knowing where the gaps in the data are, and measuring the effect of a poor-quality source, gives us insight into the data. These insights can then be used to decide on actions. Once the sample has been analyzed and mapped, it becomes clear to all stakeholders what errors are possible.
  3. Action:
    After identifying what the pattern of missing values in the dataset looks like, you can find a clear path forward with solid techniques and alternatives. Actions taken after this discovery are therefore more likely to lead to a better solution.

The time spent examining the dataset will contribute to improved data quality and better decision-making.

Can’t we just always fill in the missing values with the mean?

In cases where only a small number of observations are missing, filling the gaps with the mean is an option. However, the more values you fill this way, the more variation in the data is lost, because every filled value sits exactly at the center of the distribution.

The fact is that a few extra steps can give you a better understanding of how to proceed. The right approach can even be easier than filling in the gaps with the mean or median.
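To make the loss of variation concrete, here is a minimal sketch using pandas and NumPy. The weights are synthetic numbers generated for this illustration, not the dataset discussed in this article, and the 30% missing rate is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic cat weights in kg (made-up values, not the article's dataset).
rng = np.random.default_rng(0)
weights = pd.Series(rng.normal(loc=4.5, scale=1.0, size=200), name="weight")

# Remove roughly 30% of the values completely at random.
missing_mask = rng.random(len(weights)) < 0.3
observed = weights.mask(missing_mask)

# Fill every gap with the mean of the values that are still present.
filled = observed.fillna(observed.mean())

std_observed = observed.std()  # spread of the values we actually have
std_filled = filled.std()      # spread after mean imputation

print(f"std of observed values: {std_observed:.3f}")
print(f"std after mean filling: {std_filled:.3f}")
```

The mean stays the same, but the standard deviation after filling is smaller than that of the observed values, because the filled rows add no spread at all.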

Categories of missing data

Now that we have a better understanding of why it is important to know the reason data is missing, we can look at the three categories of missing data:

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing Not at Random (MNAR)

This link explains missing data very well.

A quick summary of the types of missing data is:

  1. “Missing Completely at Random” is where every observation has the same chance of having a missing value. Data is missing due to external factors and the missingness is not related to any value in the observation.
    In the sample dataset about cats, which we will use to demonstrate how to handle missing data, some weights are missing because of external factors, such as the scale’s battery being flat so that a cat could not be weighed. The scale’s battery being flat is not one of the observed variables in the dataset.
  2. “Missing at Random” is where data on a partly missing variable is related to some other wholly observed variables in the analysis model but not to the values of the partially missing variable itself.
    If the cat is sick and could not attend his appointment at the vet then the weight measurement for that day will be missing. The missingness is related to the cat being sick. The cat being sick is an observed variable in the dataset.
  3. “Missing Not at Random” is where the missingness is specifically related to the value that is missing.
    A cat’s weight is not filled in on a questionnaire because the owner is embarrassed about his overweight cat.

Some of the causes of missing data include accidental deletion of data, data that was not collected due to human error, or data that doesn’t exist.

We need to know whether data is MCAR, MAR, or MNAR because the category determines whether the gaps can safely be deleted and how they should be filled.
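The three categories are easier to tell apart when you generate them yourself. The sketch below is a simulation under assumed numbers: the `is_sick` and `weight` columns, the probabilities, and the 5.5 kg threshold are all invented for illustration and only loosely mirror the cat examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Hypothetical cats table; column names and values are invented.
cats = pd.DataFrame({
    "is_sick": rng.random(n) < 0.2,
    "weight": rng.normal(4.5, 1.0, n),
})

# MCAR: every weight has the same 10% chance of being lost (flat battery).
mcar = cats["weight"].mask(rng.random(n) < 0.10)

# MAR: sick cats miss the vet more often; this depends only on the
# fully observed is_sick column, not on the weight itself.
p_mar = np.where(cats["is_sick"], 0.60, 0.05)
mar = cats["weight"].mask(rng.random(n) < p_mar)

# MNAR: heavy cats are reported less often; this depends on the very
# value that ends up missing.
p_mnar = np.where(cats["weight"] > 5.5, 0.70, 0.05)
mnar = cats["weight"].mask(rng.random(n) < p_mnar)

true_mean = cats["weight"].mean()
for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name}: {col.isna().mean():.1%} missing, "
          f"observed mean {col.mean():.2f} vs true mean {true_mean:.2f}")
```

In the MNAR case the observed mean drifts below the true mean, because the heavy cats are exactly the ones that disappear. That is why simply deleting or mean-filling MNAR gaps is risky.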

Visualizing missing data

An easy way to find and identify missing data is to visualize the dataset. Visualizing the data gives you a good idea of what is happening in it. A good guide to visualizing your missing data can be found here.

Here are some of the very useful visualizations that are covered in the link above:

  • Visualize the completeness of the data.

> Records that have filled values are marked in grey. The number above each bar also indicates the number of records that have values.

> Here it is visible that the Weight and Age columns have many missing records.

> By looking at the numbers at the top you can see that the Hair Type column has a few records with missing values.
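If you do not have a plotting library at hand, the same completeness numbers can be computed directly. This is a minimal pandas sketch on a tiny made-up table; only the column names are borrowed from the chart described above, and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up cats table; the values are invented for illustration.
cats = pd.DataFrame({
    "Name": ["Tom", "Misty", "Felix", "Luna", "Simba", "Cleo"],
    "Hair Type": ["short", "long", None, "short", "long", "short"],
    "Weight": [4.2, np.nan, np.nan, 5.1, np.nan, 3.9],
    "Age": [3, np.nan, 7, np.nan, np.nan, 2],
})

# One row per column: how many records are filled, how many are
# missing, and what share of the records is filled.
completeness = pd.DataFrame({
    "filled": cats.notna().sum(),
    "missing": cats.isna().sum(),
    "filled_share": cats.notna().mean().round(2),
})
print(completeness)
```

The `filled` column corresponds to the number shown above each bar in a completeness chart.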

  • Visualize the position of the missing values

> The white lines show where missing values are.

> The Hair Type column has very few missing values, and they do not line up with the missing values in any other column. This suggests that the missing values in the Hair Type column are Missing Completely at Random (MCAR).

> The Weight and Age columns both have a lot of missing values. We cannot directly see the reason for the missingness from this graph.

  • Sorting the data for one of the missing columns and visualizing missing data

> We have now sorted the dataset on the Weight column.

> Here we can see that there is no relationship between the missingness of Weight and Age. Not all the missing values in the Age column are also missing in the Weight column.

> The data for Weight and Age therefore appears to be Missing Completely at Random.

  • If the data was MAR, the graph could look like the following:

Here you can see that the missing values in the Weight column and the missing values in the Age column are correlated. For data that is MAR, the partly missing column is related to a completely observed variable, so Weight and Age may both be related to some completely observed variable. More tests are needed to determine the reason for the missing data, but a pattern like this is a good indication that you should look closely at your missing data.
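Whether two columns tend to be missing together can also be checked numerically. The sketch below assumes a small made-up table with `Weight` and `Age` columns and turns each column into a 0/1 missingness indicator before comparing them.

```python
import numpy as np
import pandas as pd

# Made-up values; only the pattern of gaps matters here.
cats = pd.DataFrame({
    "Weight": [4.2, np.nan, np.nan, 5.1, np.nan, 3.9, 4.8, np.nan],
    "Age":    [3,   np.nan, np.nan, 6,   np.nan, 2,   np.nan, 9],
})

# 1 where a value is missing, 0 where it is present.
is_missing = cats.isna().astype(int)

# How often are both, only one, or neither of the values missing?
overlap = pd.crosstab(is_missing["Weight"], is_missing["Age"])
print(overlap)

# Correlation between the two indicators: close to 0 means the gaps are
# unrelated, close to 1 means the columns tend to be missing together.
missing_corr = is_missing["Weight"].corr(is_missing["Age"])
print(f"missingness correlation: {missing_corr:.2f}")
```

A high correlation does not tell you why the values are missing. It only tells you that the two columns deserve a closer look together.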

  • Non-visual test for MCAR vs MAR

It is very difficult to say with 100% certainty why data is missing. It is therefore good to back up what you see in the visualizations with extra tests such as Little’s Test. Little’s Test checks the hypothesis that the data is MCAR by comparing the observed values of the other variables across the different patterns of missingness. If the test finds no significant differences, the MCAR assumption is plausible. If it does find differences, the data is not MCAR and may be MAR, although the test cannot rule out MNAR.
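The idea behind such tests can be tried out informally before reaching for a statistics package. The sketch below is not Little’s Test. It is a simpler check that compares the share of missing weights across the levels of a fully observed column. The `health` column, the probabilities, and the data are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2000

# Synthetic data: health is fully observed, weight is partly missing.
cats = pd.DataFrame({
    "health": np.where(rng.random(n) < 0.25, "sick", "healthy"),
    "weight": rng.normal(4.5, 1.0, n),
})

# Assumed MAR mechanism: sick cats miss their weigh-in far more often.
p_missing = np.where(cats["health"] == "sick", 0.5, 0.1)
cats["weight"] = cats["weight"].mask(rng.random(n) < p_missing)

# Share of missing weights within each level of the observed variable.
missing_rate = cats["weight"].isna().groupby(cats["health"]).mean()
print(missing_rate)

# A large gap between the groups speaks against MCAR.
rate_gap = missing_rate.loc["sick"] - missing_rate.loc["healthy"]
print(f"difference in missing rate: {rate_gap:.2f}")
```

If the missing rate were roughly the same in every group, MCAR would remain plausible. A clear difference, as in this simulated case, means the missingness is related to an observed variable.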

  • Visualizing MNAR:

It is very difficult to visualize MNAR data. It is good to have domain knowledge about the data you are investigating to determine whether or not the data is MNAR.

You need to specifically look at the rows that have missing values.

Imagine you have data about cats:

The column called num_grey_hairs has missing values.

If you take the rows where num_grey_hairs is missing and plot the gender values, you get the following:

Only male cats have missing values in the num_grey_hairs column.

The average age of the cats whose number of grey hairs is missing is 16.5.

The average age for all males is 10.

The number of grey hairs for male cats ranges from 0 to 13.

The number of grey hairs for female cats ranges from 0 to 30.

When you look at these details, you can see that the missing values belong to the older male cats, which are likely to have the most grey hair: cats get more grey hair as they age, and that holds for males as well as females. The missingness therefore depends on the missing value itself, which makes the data MNAR.
If you fill the gaps with the average number of grey hairs for male cats, the filled values will be too low and the model will be inaccurate. A more advanced method is needed to fill the gaps, or you will have to go back to the cats’ owners for the correct data, although it is usually not possible to recover the actual values.
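The investigation above boils down to a few group comparisons on the rows that have gaps. Here is a minimal pandas sketch. The table is made up, with values chosen only to echo the numbers quoted above. The `gender` and `age` column names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows chosen to echo the figures mentioned in the text.
cats = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "M", "F", "F", "F", "F"],
    "age":    [2,   5,   10,  16,  17,  1,   9,   14,  18],
    "num_grey_hairs": [0, 4, 13, np.nan, np.nan, 0, 8, 19, 30],
})

missing = cats["num_grey_hairs"].isna()

# Who has the gaps? Count the missing values per gender.
missing_by_gender = missing.groupby(cats["gender"]).sum()

# Do the rows with gaps differ from the rest on an observed variable?
mean_age_missing = cats.loc[missing, "age"].mean()
mean_age_males = cats.loc[cats["gender"] == "M", "age"].mean()

# Range of the values that were actually reported, per gender.
reported_range = cats.groupby("gender")["num_grey_hairs"].agg(["min", "max"])

print(missing_by_gender)
print(f"mean age where value is missing: {mean_age_missing}")
print(f"mean age of all males: {mean_age_males}")
print(reported_range)
```

None of these numbers proves that the data is MNAR on its own. Together with domain knowledge (older cats have more grey hair) they make it the most plausible explanation.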

Why it is essential to know that data is missing

Delivering accurate data that produces reliable results requires attention at every step of the data flow. How much the gaps in a dataset matter depends on the role of the person working with the data. Here we consider the needs of developers, data analysts, and data scientists, although the same reasons apply across the wide variety of roles in data teams. The needs of the different team members are described briefly below:

Developer

During the development of an application, you need to know if there is data that can cause errors in the execution, to avoid malfunction of the code. Some use cases have critical columns that need to be filled and if these values are left blank, it will cause failures.
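One way to guard against such failures is to check the critical columns before the rest of the code runs. The sketch below is an illustration only: `check_required` is a hypothetical helper written for this article, and the column names are assumptions.

```python
import pandas as pd


def check_required(df: pd.DataFrame, required: list[str]) -> None:
    """Raise ValueError if a required column is absent or contains gaps."""
    absent = [col for col in required if col not in df.columns]
    if absent:
        raise ValueError(f"missing columns: {absent}")

    # Count the gaps per required column and keep only the problem columns.
    gaps = df[required].isna().sum()
    gaps = gaps[gaps > 0]
    if not gaps.empty:
        raise ValueError(f"missing values in required columns: {gaps.to_dict()}")


# Made-up example data: weight has one gap, name has none.
cats = pd.DataFrame({
    "name": ["Tom", "Misty", "Felix"],
    "weight": [4.2, None, 5.1],
})

check_required(cats, ["name"])  # passes silently

try:
    check_required(cats, ["name", "weight"])
except ValueError as err:
    print(f"validation failed: {err}")
```

Failing early with a clear message is usually easier to debug than an error deep inside a calculation that received an empty value.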

Data analyst

In data analytics, it is very important to know the specifics of a dataset. If the first thing you do to understand the data is find out whether it is incomplete, that knowledge can steer the analysis from the very beginning. It can provide new insights from the moment the missing data is detected. For example, if survey data is missing in certain categories, you might suspect that there were technical difficulties, or that the user was expected to fill in that field but did not for some reason. In this way, it is possible to provide insights to the relevant teams even before an in-depth analysis is done.

Data scientist

The reliability of a model relies heavily on having complete critical information available. Using datasets that contain large numbers of empty fields may invalidate the model, because it has not been trained on a reliable set.

When empty fields are discovered, the data scientist can take steps to increase the accuracy of the final model. For example, this is a procedure that produced a much better result for the data sets used in Kaggle for the Titanic case.

To use a dataset for all of the above tasks, it is important to take these steps to ensure a reliable data source:

  • Correction of deviating values and outliers.
  • Filling in missing information.
  • Creating new characteristics for analysis.
  • Conversion of fields to the correct format for calculations and presentations.

If you apply these four steps when using a dataset, the code that runs on it, the analyses, and the insights gained from the data or from the resulting model will be much more reliable. The data cleaning process itself often produces additional insights along the way.
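As a rough illustration of the four steps, here is a minimal pandas sketch on a made-up table. The column names, the plausible weight range of 0.5 to 15 kg, and the reference date are assumptions. The median fill is only a placeholder, since the right filling method depends on the category of missing data.

```python
import pandas as pd

# Made-up raw data: weights stored as text, one implausible value, gaps.
raw = pd.DataFrame({
    "weight_kg": ["4.2", "5.0", None, "45.0", "3.8"],
    "birth_date": ["2019-05-01", "2015-03-12", "2020-11-30", None, "2012-01-20"],
})
clean = raw.copy()

# Conversion of fields to the correct format (done first so that the
# other steps can calculate with the values).
clean["weight_kg"] = pd.to_numeric(clean["weight_kg"], errors="coerce")
clean["birth_date"] = pd.to_datetime(clean["birth_date"], errors="coerce")

# Correction of outliers: treat implausible weights as missing.
# The 0.5 to 15 kg range is an assumption for this example.
plausible = clean["weight_kg"].between(0.5, 15)
clean["weight_kg"] = clean["weight_kg"].where(plausible)

# Filling in missing information: remember which rows were filled,
# then use the median as a simple placeholder.
clean["weight_was_missing"] = clean["weight_kg"].isna()
clean["weight_kg"] = clean["weight_kg"].fillna(clean["weight_kg"].median())

# Creating a new characteristic: age in years at a fixed reference date.
reference = pd.Timestamp("2022-09-12")
clean["age_years"] = (reference - clean["birth_date"]).dt.days / 365.25

print(clean)
```

Keeping the `weight_was_missing` flag means the information about where the gaps were is not lost after filling.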

Conclusion

It is not necessary to spend days looking at the gaps in your data. A quick investigation can show whether or not there is missing data. If there are gaps, spending just a little more time determining the reason for them will help you decide which method to use to fill them. Knowing why data is missing points you to the path you should take next to get your dataset ready for use.

As stated in this article, correctly identifying the category of missing data and treating it in the appropriate manner leads to an increase in the accuracy of data models. As such, it is important to take the (little) time needed to identify the category of missing data before blindly inserting the mean or median. The next article will go into detail about how to fill or remove these gaps.

Authors
Frances Dreyer & Andre Marques Leite
