How Handling Missing Data Inappropriately Leads to Biased ML Models

Identify the types of missing data to build unbiased ML models

Najia Gul
Geek Culture
5 min read · Mar 7, 2021


Dealing with data where some attribute values are missing is a very common challenge when building real-world ML models. It is therefore crucial to study the missing data and learn whether the missingness is correlated with other features or occurs purely at random. Dropping missing data without proper analysis can often lead to models that are biased in the real world.

This article is an adaptation of lectures from Course 2 of the AI for Medicine Specialization by deeplearning.ai, where the following case study is discussed to illustrate the importance of properly analysing missing data. The ideas presented are not mine.

In this article, we’ll perform a complete case analysis (keeping only rows with no missing values) to see what happens when we don’t deal with missing data appropriately. We’ll look at a dataset of patients with the features patient ID, age, and BP (blood pressure), and build a prognostic model to predict the 10-year risk of death.

Dataset of patients with 10 year risk of death
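The course’s actual dataset isn’t reproduced here. To keep the later steps concrete, the sketch below builds a small synthetic stand-in; every column name, value range, and the age-dependent missingness pattern are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical patient records: age, systolic BP, and a binary 10-year
# mortality outcome whose risk rises with age (all values synthetic).
df = pd.DataFrame({
    "ID": np.arange(n),
    "Age": rng.integers(20, 80, size=n),
    "BP": rng.normal(120, 15, size=n).round(1),
})
risk = 1 / (1 + np.exp(-(df["Age"] - 50) / 10))
df["Death_10yr"] = (rng.random(n) < risk).astype(int)

# Inject missing BP values, more often for younger patients, mirroring
# the age-dependent missingness discussed later in the article.
missing = (df["Age"] < 40) & (rng.random(n) < 0.5)
df.loc[missing, "BP"] = np.nan

print(df.head())
print("Missing BP values:", df["BP"].isna().sum())
```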

We can see that the data contain many missing values for the BP attribute.

Creating train/test split

First, we’ll create a train/test split so that we can later compare feature distributions using histograms.

Splitting data into train and test sets
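The course notebook isn’t shown in this article; a minimal sketch of the split using scikit-learn’s train_test_split on the toy frame above could look like this:

```python
from sklearn.model_selection import train_test_split

# Feature and label names follow the synthetic frame above, not the
# course's exact code. Stratify on the outcome to keep class balance.
X = df[["Age", "BP"]]
y = df["Death_10yr"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```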

Dropping missing rows

A very naive approach to dealing with missing data is to simply drop the patients’ records where values are missing. Suppose we train a Random Forest classifier on such a dataset and find that the train and test accuracies are 87% and 84%, respectively. We might be tempted to conclude that our model has fair accuracy. Now we test the model on a new test set that has no missing records. When we run the model on this new test set, the accuracy drops to 61%! So what went wrong here?

Evaluating classifier on new test set
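The accuracy figures above come from the course. A rough sketch of the naive pipeline, dropping rows, fitting a random forest, and then scoring on a hypothetical complete “new” test set (all variables continue from the earlier snippets and are assumptions, not the course’s code), might look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Naive approach: drop every row with a missing BP, then fit the model.
train = pd.concat([X_train, y_train], axis=1).dropna()
test = pd.concat([X_test, y_test], axis=1).dropna()

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train[["Age", "BP"]], train["Death_10yr"])
print("Old test accuracy:", rf.score(test[["Age", "BP"]], test["Death_10yr"]))

# A hypothetical "new" test set with no missing values, standing in for
# the external data described in the article (synthetic, for illustration).
new_test = pd.DataFrame({
    "Age": rng.integers(20, 80, size=500),
    "BP": rng.normal(120, 15, size=500).round(1),
})
new_risk = 1 / (1 + np.exp(-(new_test["Age"] - 50) / 10))
new_test["Death_10yr"] = (rng.random(500) < new_risk).astype(int)
print("New test accuracy:", rf.score(new_test[["Age", "BP"]], new_test["Death_10yr"]))
```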

To visualize what could have gone wrong, we can plot the distributions of the old and new test sets and compare them feature by feature, side by side.

Comparing distributions of old and new test set
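A side-by-side histogram comparison can be sketched with matplotlib, reusing the `test` and `new_test` frames from the previous snippet:

```python
import matplotlib.pyplot as plt

# Compare the Age distribution of the old test set (after dropping rows)
# with the new, complete test set.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(test["Age"], bins=20)
axes[0].set_title("Old test set (rows dropped)")
axes[1].hist(new_test["Age"], bins=20)
axes[1].set_title("New test set (complete)")
for ax in axes:
    ax.set_xlabel("Age")
plt.show()
```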

Comparing the distributions of the feature ‘Age’, we can clearly see that the old distribution doesn’t match the distribution of our new test set. There are many more people with Age < 40 in the new test set than in the old dataset, because we dropped a significant number of rows.

Evaluating accuracy on subsets of patients

When we divide the input space into Age < 40 and Age ≥ 40 and evaluate the model separately on each division, we see that the model performs poorly for younger patients on both datasets. Since the number of younger patients in the new test set is high, their low accuracy has a larger influence on the overall accuracy.
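A sketch of this per-subgroup evaluation on the toy data (the age threshold of 40 comes from the course; the rest is assumed):

```python
# Evaluate the fitted model separately for younger and older patients on the
# new test set; the same split can be repeated for the old test set.
for name, subset in [("Age < 40", new_test[new_test["Age"] < 40]),
                     ("Age >= 40", new_test[new_test["Age"] >= 40])]:
    acc = rf.score(subset[["Age", "BP"]], subset["Death_10yr"])
    print(f"{name}: accuracy = {acc:.2f}")
```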

Hence, we can clearly see that the model is biased against younger patients. Was this perhaps due to dropping the rows with missing values? Did dropping those rows change the distribution of the dataset? We can find out by plotting histograms of the dataset before and after dropping.

Comparing distribution of dataset before and after dropping rows
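Continuing with the synthetic training data, such a before/after comparison could be plotted as follows:

```python
import matplotlib.pyplot as plt

# Compare the Age distribution of the training data before and after
# dropping rows with a missing BP value.
before = pd.concat([X_train, y_train], axis=1)
after = before.dropna()

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(before["Age"], bins=20)
axes[0].set_title("Before dropping rows")
axes[1].hist(after["Age"], bins=20)
axes[1].set_title("After dropping rows")
plt.show()
```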

Types of missing data

It is useful to be aware that data might be missing by chance or because of a dependence on some other variable; this awareness helps us build models that are unbiased. Unfortunately, we cannot generally be sure whether the data are really missing at random or whether the missingness depends on unobserved predictors.

There are three cases of missing data:

Missing completely at random

Suppose a doctor decides to record the BP of patients at random (by the flip of a coin). In this case, the missingness does not depend on anything; the probability of any BP record being missing is constant (0.5). For such a dataset, the distribution before and after dropping the missing rows will be similar. For data missing completely at random, dropping rows does not lead to a biased model.

P(missing) = constant = 0.5

Distributions before and after dropping rows are similar
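To make the mechanism concrete, the sketch below simulates it on a small, fully observed synthetic frame (assumed, not from the course) and checks how dropping rows affects the age distribution:

```python
# A fully observed synthetic frame used to illustrate all three mechanisms.
complete = pd.DataFrame({
    "Age": rng.integers(20, 80, size=10_000),
    "BP": rng.normal(120, 15, size=10_000).round(1),
})

# MCAR: every BP value goes missing with the same probability 0.5,
# independent of age and of the BP value itself.
mcar = complete.copy()
mcar.loc[rng.random(len(mcar)) < 0.5, "BP"] = np.nan

# The mean age barely moves after dropping rows with a missing BP.
print(complete["Age"].mean(), mcar.dropna(subset=["BP"])["Age"].mean())
```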

Missing at random

Now, suppose a doctor decides to always record BP for patients with Age ≥ 40, but if Age < 40, the BP is recorded by the flip of a coin (0.5 probability). The distribution for such a dataset will not be consistent before and after dropping the missing rows, because half of the records for patients with Age < 40 will be missing. For such data, the missingness depends only on the available (observed) information: age determines the probability of BP being missing. Dropping rows in such a dataset can lead to biased models.

P(missing | Age < 40) = 0.5 ≠ P(missing | Age ≥ 40) = 0

Distributions before and after dropping rows are not similar
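A corresponding sketch for the missing-at-random case, reusing the `complete` frame from the previous snippet:

```python
# MAR: BP is always recorded for Age >= 40; for Age < 40 it goes missing
# with probability 0.5, so missingness depends only on the observed age.
mar = complete.copy()
coin = rng.random(len(mar)) < 0.5
mar.loc[(mar["Age"] < 40) & coin, "BP"] = np.nan

# Dropping rows now under-represents younger patients: the mean age rises.
print(complete["Age"].mean(), mar.dropna(subset=["BP"])["Age"].mean())
```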

Missing not at random

Lastly, suppose a doctor records BP only when there are no patients waiting at the clinic. When patients are waiting, BP is not recorded. Here, the data are missing not at random. Furthermore, the variable ‘patients waiting’ is not even part of the dataset; it is unobserved during data collection. The missingness depends on information that is unavailable, and the probability of BP being missing is not constant.

P(missing) ≠ constant

The challenge here is that the distribution of such a dataset before and after dropping the missing values will still look similar, so we cannot really tell just by looking at the dataset that the data are not missing at random.

Distributions before and after dropping rows are similar even when data are not missing at random
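The sketch below illustrates this with the `complete` frame again. Instead of the unobservable ‘patients waiting’ variable, it lets the (never-stored) chance of skipping the measurement depend on the BP value itself, which is one simple way to produce not-at-random missingness; this substitution is made here only for illustration and is not the course’s example:

```python
# MNAR: the probability of skipping the BP measurement depends on the BP
# value itself (a stand-in for an unobserved driver such as a busy clinic),
# and that driver is never stored in the dataset.
p_missing = np.where(complete["BP"] > 130, 0.7, 0.1)
mnar = complete.copy()
mnar.loc[rng.random(len(mnar)) < p_missing, "BP"] = np.nan

# The Age distribution looks unchanged after dropping rows, so the problem
# is invisible from the dataset alone; only a comparison with the (normally
# unknown) complete data reveals that high BP readings are under-recorded.
print(complete["Age"].mean(), mnar.dropna(subset=["BP"])["Age"].mean())
print(complete["BP"].mean(), mnar["BP"].mean())
```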

Wrapping up

We’ve seen why it is important to study and analyse missing data before feeding it into ML models. Being aware of the consequences of dropping missing values helps in addressing the ethical challenges of AI. As the saying goes, “A machine learning model is only as good as the data it is fed.” This may not hold in every situation, but data quality has a strong influence on the quality of ML models.
