Dealing with Missing Data from Zero to Advanced

Hasan Hüseyin Coşgun
8 min read · Aug 23, 2023


Simple and advanced imputation; Drop, Mode, Median, KNN and MICE


Dealing with missing data is a prevalent and inherent challenge in data collection, particularly in extensive datasets. Numerous factors contribute to missing data, including participants providing incomplete information, non-responses from individuals who decide not to share data, poorly designed survey instruments, or the necessity to exclude data due to confidentiality concerns.

This article delves into various techniques for effectively managing missing data, with a specific focus on their implementation in Python. We will illustrate the advantages and disadvantages of each technique, enabling you to select the most suitable approach for a given situation.

Types of Missing Data

When assessing the potential impact of missing data on registry findings, it is important to consider the underlying reasons for missing data. Missing data is grouped into three categories:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Not missing at random (NMAR)

We won't dwell on these categories in this section; let's keep the focus on the application part, and leave a useful link for further reading.

How to identify missing data?

Several techniques can be employed to detect missing values in pandas. Presented below are some of the most commonly used methods for identifying missing values in the data.

Let’s get to know the dataset

You can find the source code and notebook here.

The data used in this study originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The main goal of this dataset is to predict the presence or absence of diabetes in patients, utilizing specific diagnostic measurements present in the dataset.
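Since the notebook's loading code isn't reproduced here, below is a minimal sketch. It assumes the file is named diabetes.csv and follows the common convention for this dataset of treating physiologically impossible zeros as missing values.

```python
import numpy as np
import pandas as pd

# Load the diabetes dataset (file name assumed).
df = pd.read_csv("diabetes.csv")

# In this dataset, missing measurements are recorded as zeros.
# Mark the physiologically impossible zeros as missing (NaN).
cols_with_hidden_nans = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols_with_hidden_nans] = df[cols_with_hidden_nans].replace(0, np.nan)

# Count the missing values per column, most missing first.
print(df.isnull().sum().sort_values(ascending=False))
```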

In real cases, many columns may contain missing data, so we need to understand how to approach the problem from many different perspectives. Good news 😅 there is plenty of missing data in our sample.

We can observe the missing data in the dataset. Let's analyze a little more and see what kind of picture we are facing: the Insulin variable, in particular, seems to have quite a lot of missing values. Visual analysis provides faster and more meaningful insight, so let's explore this with missingno.
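A sketch of the missingno calls that produce such plots (assuming the library is installed):

```python
import matplotlib.pyplot as plt
import missingno as msno

# Matrix plot: white gaps show where each column has missing values.
msno.matrix(df)
plt.show()

# Bar plot: the count of non-missing values per column.
msno.bar(df)
plt.show()
```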


Handling Missing Data

In this article, we will work only with the Insulin variable to observe the effects of the imputation methods. What follows, then, is work on a variable with 48.7% missing data.

It is not always sensible to apply the same operations to columns with few missing values and to columns with many. For example, a more accurate dataset can be obtained by applying different solutions at different missingness rates. For this reason, we continue by writing a function that reports the missing data proportionally.
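The notebook's exact helper isn't shown here, but a sketch of such a function might look like this:

```python
import pandas as pd

def missing_ratio_table(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    ratios = 100 * counts / len(frame)
    table = pd.DataFrame({"n_missing": counts, "ratio_%": ratios.round(1)})
    return table[table["n_missing"] > 0].sort_values("ratio_%", ascending=False)

print(missing_ratio_table(df))
```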

You can access the notebook here for a detailed review.

1. Data Dropping

Columns with more than 70% missing values are typically dropped before the analysis continues. In our dataset, however, no column exceeds 50%.

We may want to apply different operations according to the proportions of missing values in the data set. In our example, we want to drop the independent variables that have more than 25% missing values.

Let's first access the names of the variables we want to drop:
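A sketch, using the 25% threshold from above (the column names in the comment reflect how this dataset typically behaves):

```python
# Select the columns whose share of missing values exceeds 25%.
threshold = 0.25
cols_to_drop = df.columns[df.isnull().mean() > threshold].tolist()
print(cols_to_drop)  # typically ['SkinThickness', 'Insulin'] for this dataset
```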

If we actually wanted to drop them, we would continue like this:
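A one-line sketch (not applied in the rest of the article, since we keep Insulin to compare the imputers):

```python
# Drop the high-missing columns into a separate frame.
df_dropped = df.drop(columns=cols_to_drop)
```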

2. Simple Imputation Methods

The scikit-learn library provides a useful tool called SimpleImputer for addressing missing values in datasets. It replaces missing values with a specified fill value. SimpleImputer has a parameter called strategy, which offers four options (mean, the default; median; most_frequent; and constant), each representing a distinct imputation method.

2.1 Mean Imputation

Mean imputation replaces each missing entry with the average of the observed values: x̄ = (x₁ + x₂ + … + xₙ) / n.

Let’s see what happens when we fill the missing data with mean values.
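A minimal sketch with SimpleImputer, writing the result to a new column (the Insulin_mean name is just for comparison):

```python
from sklearn.impute import SimpleImputer

# Replace Insulin's missing values with the mean of its observed values.
mean_imputer = SimpleImputer(strategy="mean")
df["Insulin_mean"] = mean_imputer.fit_transform(df[["Insulin"]]).ravel()

print(df[["Insulin", "Insulin_mean"]].describe())
```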

2.2 Median Imputation
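Median imputation replaces each missing entry with the middle value of the observed data, which makes it more robust to outliers than the mean. The code mirrors the mean example; only the strategy changes:

```python
# Replace Insulin's missing values with the median of its observed values.
median_imputer = SimpleImputer(strategy="median")
df["Insulin_median"] = median_imputer.fit_transform(df[["Insulin"]]).ravel()
```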

Data distribution after Mean and Median imputation

Most imputation techniques can introduce bias. Simple imputation, in particular, tends to underestimate standard errors, and the problem worsens as the amount of missing data grows; with large proportions of missing data, simple imputation methods should be avoided.

3. Advanced Imputation Methods

3.1 K-Nearest Neighbour (KNN) Imputation

One commonly adopted strategy for addressing missing data is to employ a predictive model to estimate the absent values. This technique entails developing a separate model for each input variable containing missing entries.


While various models can be employed for this purpose, the k-nearest neighbor (KNN) algorithm has demonstrated consistent effectiveness and is commonly known as “nearest neighbor imputation.” This method involves identifying the nearest neighbors to the missing data points and leveraging their values to make informed imputations.

The default value of K is set to 5. Although there is no definitive method for determining the ideal value of K, a commonly used heuristic suggests that the optimal K is often the square root of the total number of samples in the dataset. Typically, an odd value is chosen for K to prevent ties in decision-making. To identify the most suitable K, an error plot or accuracy plot is commonly used.
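A quick sketch of that heuristic (nudging to an odd value to avoid ties):

```python
import numpy as np

# Heuristic: K ≈ sqrt(number of samples), rounded down and made odd.
k = int(np.sqrt(len(df)))
if k % 2 == 0:
    k += 1
print(k)
```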

KNN relies on a distance metric to find the nearest neighbors, so choosing a suitable metric is essential for the algorithm to function effectively. A few common examples:

  • Euclidean distance: the straight-line distance between two points
  • Manhattan distance: the sum of absolute differences across coordinates
  • Minkowski distance: a generalization of the Euclidean and Manhattan distances
  • Hamming distance: the number of positions at which two categorical vectors differ

Now it’s time to code
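A sketch with scikit-learn's KNNImputer, assuming the standard column names of this dataset; by default it measures similarity with the 'nan_euclidean' metric, a Euclidean distance that ignores missing entries:

```python
import pandas as pd
from sklearn.impute import KNNImputer

features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Each missing value is filled with the average of the K nearest rows.
knn_imputer = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df[features]), columns=features)

print(df_knn["Insulin"].describe())
```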

K-Nearest Neighbors (KNN) imputation tends to incur a higher computational cost than simple imputation methods, though it remains effective for datasets up to the scale of tens of millions of records. As the scatterplots show, KNN appears to have completed the missing values in a way that does not distort the normal distribution.

3.2 Multivariate Imputation by Chained Equations (MICE)

MICE, short for 'Multivariate Imputation by Chained Equations', is an advanced missing-data imputation technique that uses multiple iterations of machine-learning model training to predict the missing values, using the known values of other features in the data as predictors.

How does the MICE algorithm work?

Here is a quick intuition (not the exact algorithm):

1. Take the variable that contains missing values as the response 'Y' and the other variables as predictors 'X'.

2. Build a model on the rows where Y is not missing.

3. Use that model to predict the missing observations.

4. Repeat this multiple times with random draws of the data, and take the mean of the predictions.
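scikit-learn does not ship an imputer literally named MICE, but its IterativeImputer is modeled after that idea (and after R's mice package). A sketch, reusing the features list from the KNN example:

```python
# enable_iterative_imputer must be imported first: the API is still experimental.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import pandas as pd

# Round-robin: each feature with missing values is regressed on the others.
mice_imputer = IterativeImputer(max_iter=10, random_state=42)
df_mice = pd.DataFrame(mice_imputer.fit_transform(df[features]), columns=features)

print(df_mice["Insulin"].describe())

# For a closer analogue of true multiple imputation, set sample_posterior=True
# and repeat the fit with different random_state values, then pool the results.
```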

Conclusion

In this article, we analyzed the results of applying different methods for dealing with missing data. In the first stage, we identified the missing data and analyzed its density in the dataset, both visually and proportionally. We then continued by coding how to handle the detected missing data.

  • Removing missing data from the dataset seems to have increased the impact of outliers on the standard deviation.
  • Mean and median imputation concentrate weight at a single point; this suppresses the standard deviation and shortens the range.
  • KNN and MICE solve the missing-value problem without disturbing the normal distribution. In addition, MICE seems less affected by outliers.

First, we analyzed the results using dropping and the simple imputation methods. Mean and median imputation for a variable with a high proportion of missing data gives biased results and distorts the correlation with other variables. We then chose to go a little deeper and let the other variables in the dataset collaborate. In this way, we realized that it is more accurate to solve the problem by considering all variables together rather than treating each variable in isolation.

Among the advanced imputation methods, we used KNN and MICE to fill in the missing data. The results gave a considerably more faithful distribution than the simple techniques.

You can find the source code and notebook here.
