Missing Value Imputation Methods using Python

Mohd Hassan Khan
5 min read · Jan 14, 2024

In any real-world data collection, missing values can occur due to various reasons like errors in data entry, non-response in surveys, equipment malfunctions, or data corruption. Missing value imputation refers to replacing missing data with substituted values in a dataset. If you want to learn the methods we can use for missing value imputation, this article is for you. In this article, I’ll take you through a guide to missing value imputation methods with implementation using Python.

Missing Value Imputation Methods

Below are some of the missing value imputation methods most commonly used by Data Science professionals:

  1. Mean/Median/Mode Imputation
  2. Predictive Imputation
  3. Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)

Let’s go through each of these methods of missing value imputation and how to implement them using Python.

Mean/Median/Mode Imputation

These are statistical methods of imputation to replace missing values with the mean, median, or mode of the available values in a dataset.

  • Mean Imputation: Replaces missing values with the mean (average) of the available values. This method is suitable for numerical data that does not have outliers, as outliers can significantly affect the mean.
  • Median Imputation: Replaces missing values with the median of the available values. It is more robust than mean imputation, especially for data with outliers or a non-normal distribution.
  • Mode Imputation: Replaces missing values with the mode (the most frequently occurring value). This method is used for categorical data.

Let’s assume we have a dataset with some missing values. We will create a small dummy dataset and replace the missing values with the mean, median, and mode using Python:

import numpy as np
import pandas as pd

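# Create a sample dataset with two missing scores
data = {'Score': [25, np.nan, 30, np.nan, 29, 27, 32, 31]}
df = pd.DataFrame(data)

# Mean imputation: fill gaps with the average of the observed scores
df['Score_Mean'] = df['Score'].fillna(df['Score'].mean())

# Median imputation: fill gaps with the middle observed score
df['Score_Median'] = df['Score'].fillna(df['Score'].median())

# Mode imputation: mode() returns every tied value in sorted order, so [0]
# picks the smallest one (25 here, because all observed scores are unique)
df['Score_Mode'] = df['Score'].fillna(df['Score'].mode()[0])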

print(df)
   Score  Score_Mean  Score_Median  Score_Mode
0   25.0        25.0          25.0        25.0
1    NaN        29.0          29.5        25.0
2   30.0        30.0          30.0        30.0
3    NaN        29.0          29.5        25.0
4   29.0        29.0          29.0        29.0
5   27.0        27.0          27.0        27.0
6   32.0        32.0          32.0        32.0
7   31.0        31.0          31.0        31.0
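
Because every observed score in this example is unique, the mode column simply repeats the smallest score. Mode imputation is a more natural fit for categorical data, as noted earlier. Here is a minimal sketch, assuming a hypothetical Color column that is not part of the dataset above:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column (an assumption for illustration)
colors = pd.Series(['red', 'blue', np.nan, 'red', 'green', np.nan, 'red'],
                   name='Color')

# mode() ignores missing values by default; [0] takes the most frequent category
most_common = colors.mode()[0]
colors_filled = colors.fillna(most_common)

print(colors_filled.tolist())
```

Both gaps receive the most frequent category, which is 'red' in this made-up column.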

While these methods are simple and quick, they can bias estimates when the values are not missing at random. They also reduce the variability of the dataset, because every gap receives the same value, which leads to underestimated standard errors.
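
The loss of variability can be checked directly on the Score column from the example. The sketch below rebuilds that column so that it runs on its own; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Same scores as in the example above
scores = pd.Series([25, np.nan, 30, np.nan, 29, 27, 32, 31])
scores_mean_filled = scores.fillna(scores.mean())

# std() skips NaN by default, so this is the spread of the observed values only
std_observed = scores.std()
std_imputed = scores_mean_filled.std()

print(f"Std of observed values:    {std_observed:.2f}")
print(f"Std after mean imputation: {std_imputed:.2f}")
```

The imputed values sit exactly on the mean, so they add observations without adding any spread. In this small example, the sample standard deviation drops from about 2.61 to about 2.20.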

Predictive Imputation

Predictive imputation involves using statistical models to predict and fill in missing values based on the relationships observed in the rest of the data. Some methods include:

  • Regression Imputation: Uses a regression model to predict missing values based on other, related variables in the data.
  • K-Nearest Neighbors (KNN) Imputation: Identifies ‘k’ samples in the dataset that are similar to the observation with missing data and imputes values based on the average (or majority) of these ‘k’ neighbours.

For predictive imputation, let’s use k-nearest neighbours (KNN) through the KNNImputer class from the scikit-learn library:

from sklearn.impute import KNNImputer

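# A new sample dataset: Feature2 has one missing value
data = {'Feature1': [25, 20, 30, 40, 29, 27, 32, 31],
        'Feature2': [20, 25, np.nan, 45, 30, 25, 35, 40]}
df = pd.DataFrame(data)

# Predictive imputation using KNN: the missing Feature2 value becomes the
# mean of Feature2 over the 2 rows closest to it in Feature1
imputer = KNNImputer(n_neighbors=2)
# fit_transform returns a NumPy array, not a DataFrame
df_filled = imputer.fit_transform(df)

print(df_filled)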
[[25. 20.]
 [20. 25.]
 [30. 35.]
 [40. 45.]
 [29. 30.]
 [27. 25.]
 [32. 35.]
 [31. 40.]]
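
To see where the imputed 35 comes from, the neighbour search can be reproduced by hand. This is a simplified sketch, not the library's implementation: KNNImputer uses a NaN-aware Euclidean distance, and the assumption here is that with a single shared feature it ranks rows the same way as the absolute difference in Feature1.

```python
import numpy as np
import pandas as pd

df_knn = pd.DataFrame({
    'Feature1': [25, 20, 30, 40, 29, 27, 32, 31],
    'Feature2': [20, 25, np.nan, 45, 30, 25, 35, 40],
})

target = df_knn.loc[2]          # the row whose Feature2 is missing
candidates = df_knn.dropna()    # rows where Feature2 is observed

# Distance on the only feature shared with the incomplete row
distance = (candidates['Feature1'] - target['Feature1']).abs()
nearest = distance.nsmallest(2).index   # labels of the two closest rows

manual_value = candidates.loc[nearest, 'Feature2'].mean()
print(list(nearest), manual_value)
```

The two closest rows are 4 and 7 (Feature1 values 29 and 31), and the mean of their Feature2 values, 30 and 40, is the 35 that appears in the imputed array.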

Predictive methods can provide more accurate imputations than simple statistical methods because they use the information in the other features, especially when the data has complex relationships.
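
Regression imputation, the other predictive method listed above, can be sketched in a similar way. The example below is one possible approach rather than a standard recipe: it assumes a plain linear relationship between the two features and uses scikit-learn's LinearRegression on the same dummy data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df_reg = pd.DataFrame({
    'Feature1': [25, 20, 30, 40, 29, 27, 32, 31],
    'Feature2': [20, 25, np.nan, 45, 30, 25, 35, 40],
})

# Split rows by whether the target column is observed
known = df_reg[df_reg['Feature2'].notna()]
missing = df_reg[df_reg['Feature2'].isna()]

# Fit Feature2 ~ Feature1 on the complete rows only
model = LinearRegression()
model.fit(known[['Feature1']], known['Feature2'])

# Predict Feature2 for the incomplete rows and write the values back
df_reg.loc[missing.index, 'Feature2'] = model.predict(missing[['Feature1']])

print(df_reg)
```

Only the rows with a missing Feature2 are changed; the observed values are left as they are.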

Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)

These are imputation methods typically used in time series data or longitudinal studies where the ordering of observations is meaningful.

  • Last Observation Carried Forward (LOCF): Replaces a missing value with the last observed value before the missing one. It is based on the assumption that the best guess for a missing value is the one that was most recently observed.
  • Next Observation Carried Backward (NOCB): It is the reverse of LOCF. It replaces a missing value with the next observed value after the missing one.

For LOCF and NOCB, you can use the pandas ffill() and bfill() methods. Older code often calls fillna(method='ffill') and fillna(method='bfill') instead, but the method argument of fillna() is deprecated as of pandas 2.1:

import numpy as np
import pandas as pd

# Let's assume a time series with missing values
time_data = {'Time': pd.date_range(start='1/1/2023', periods=8, freq='D'),
             'Value': [1, np.nan, np.nan, 4, 5, np.nan, 7, 8]}
df_time = pd.DataFrame(time_data)

# LOCF: carry the last observed value forward
df_time['Value_LOCF'] = df_time['Value'].ffill()

# NOCB: carry the next observed value backward
df_time['Value_NOCB'] = df_time['Value'].bfill()

print(df_time)
        Time  Value  Value_LOCF  Value_NOCB
0 2023-01-01    1.0         1.0         1.0
1 2023-01-02    NaN         1.0         4.0
2 2023-01-03    NaN         1.0         4.0
3 2023-01-04    4.0         4.0         4.0
4 2023-01-05    5.0         5.0         5.0
5 2023-01-06    NaN         5.0         7.0
6 2023-01-07    7.0         7.0         7.0
7 2023-01-08    8.0         8.0         8.0

While LOCF and NOCB are straightforward and often used in clinical trials or studies, they can introduce significant bias and underestimate the variability in the data, especially if the data shows trends over time or if the missing values are not randomly distributed.
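
The bias under a trend is easy to illustrate with a series whose true values are known. The sketch below assumes a hypothetical, steadily increasing series, hides three of its values, and compares each method's fills with the hidden truth:

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth with an upward trend (an assumption for illustration)
true_values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Hide three values to simulate missing observations
observed = true_values.copy()
observed.iloc[[1, 2, 5]] = np.nan

locf = observed.ffill()
nocb = observed.bfill()

# Average error (imputed minus true) over the hidden positions only
gaps = observed.isna()
locf_error = (locf - true_values)[gaps].mean()
nocb_error = (nocb - true_values)[gaps].mean()

print(f"LOCF mean error: {locf_error:.2f}")
print(f"NOCB mean error: {nocb_error:.2f}")
```

With a rising trend, LOCF fills every gap with a value that is too low and NOCB with one that is too high, so the errors do not cancel out within either method.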

How to Choose an Imputation Technique?

In practice, choose among these methods according to the nature of the data and the specific context of the missing data.

For instance, mean imputation may not be suitable for data with a non-normal distribution or with outliers, and LOCF/NOCB can introduce bias in time series analysis if the data have trends or seasonality.

Always perform exploratory data analysis to understand the patterns of missingness and the distribution of your data before deciding on an imputation technique.
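
A first look at missingness can be as simple as counting the gaps per column. The sketch below is a minimal starting point, assuming a small DataFrame that combines the dummy columns used in this article:

```python
import numpy as np
import pandas as pd

df_check = pd.DataFrame({
    'Feature1': [25, 20, 30, 40, 29, 27, 32, 31],
    'Feature2': [20, 25, np.nan, 45, 30, 25, 35, 40],
    'Score': [25, np.nan, 30, np.nan, 29, 27, 32, 31],
})

# Number and share of missing values in each column
missing_counts = df_check.isna().sum()
missing_share = df_check.isna().mean()
print(pd.DataFrame({'missing': missing_counts, 'share': missing_share}))

# Summary statistics of the observed values (NaN is skipped)
print(df_check.describe())
```

The share of missing values and the gap between mean and median in describe() give a first hint about whether a simple statistical fill is reasonable or a predictive method is worth the extra effort.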

Summary

To summarize, these are some of the missing value imputation methods most commonly used by Data Science professionals:

  1. Mean/Median/Mode Imputation: When dealing with missing data in a dataset, a simple approach is to fill in the missing values with the mean, median, or mode of the respective feature.
  2. Predictive Imputation: Predictive imputation involves using statistical models to estimate and replace missing values. It’s based on the relationships found in the other features of the data.
  3. Last Observation Carried Forward (LOCF): This technique fills missing values with the last observed (non-missing) value.
  4. Next Observation Carried Backward (NOCB): Contrary to LOCF, NOCB fills the missing values with the next observed (non-missing) value.

I hope you liked this guide to missing value imputation methods and their implementation using Python. Feel free to ask questions in the comments section below.
