ML Imputation Methods and Why They’re Important

Jesse
Data-Centric AI Community
7 min read · Sep 30, 2023

Comparing Different Imputation Methods for Missing Data Using Sklearn

Handling missing data (see prior information here) is a crucial step in the data preprocessing pipeline. Incorrect imputation can lead to biased or incorrect conclusions. In this blog, we’ll compare various imputation techniques using the `sklearn` library: Dropping Data, Statistical Imputation, and Machine Learning Imputation.

Picture this: you’re a machine learning engineer at a healthcare company. You need to train a model that predicts whether a patient will develop type 1 diabetes in the next few years, and you need to make sure the model is trained on the best data possible. You don’t want any false positives, and you definitely don’t want any false negatives.

There’s a problem. Not all of the patient data you have for training is complete. Some of the records contain missing values; a doctor, nurse, or patient didn’t quite fill out all of the recommended information.

Healthcare domains in general are heavily affected by missing data, and at the same time explainability is extremely important. For that reason, I’ve written this post to walk through your options for filling in those missing values and how to implement them.

The code is available here.

1. Dropping Data

The simplest way to handle missing values is to drop them.

Python Code:

import pandas as pd
# Load your dataset
data = pd.read_csv('your_dataset.csv')
# Drop rows with missing values
data_dropna = data.dropna()

Explanation:

Dropping data removes rows or columns that contain missing values. Although it’s the easiest method, it is often not the best one, since you can lose a lot of information, especially if the dataset is small.
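If you still want to drop data but preserve more of it, pandas also lets you drop columns instead of rows, drop only the rows missing a specific field, or require a minimum number of filled-in values per row. A minimal sketch (the `blood_pressure` column name and the threshold of 8 are hypothetical):

# Drop columns that contain any missing values
data_drop_cols = data.dropna(axis=1)
# Drop only the rows missing a specific, critical field (hypothetical column name)
data_drop_bp = data.dropna(subset=['blood_pressure'])
# Keep only rows with at least 8 non-missing values
data_drop_sparse = data.dropna(thresh=8)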

Medical Interpretation:

In a medical dataset, dropping data might mean eliminating significant patient records. This could lead to biased results if, for instance, the missingness is related to a specific condition or treatment.

2. Statistical Imputation

Statistical imputation involves replacing missing values with a central tendency measure like mean, median, or mode.

Python Code:

from sklearn.impute import SimpleImputer
# Mean imputation (numeric columns only)
mean_imputer = SimpleImputer(strategy='mean')
data_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(data), columns=data.columns)
# Median imputation (numeric columns only)
median_imputer = SimpleImputer(strategy='median')
data_median_imputed = pd.DataFrame(median_imputer.fit_transform(data), columns=data.columns)
# Most frequent (mode) imputation (works for numeric and categorical columns)
mode_imputer = SimpleImputer(strategy='most_frequent')
data_mode_imputed = pd.DataFrame(mode_imputer.fit_transform(data), columns=data.columns)

Explanation:

- Mean Imputation: Replaces missing values with the mean of the column. Suitable for continuous data.

- Median Imputation: Uses the median of the column. Less sensitive to outliers than the mean. Useful for continuous data with skewed distributions.

- Most Frequent (Mode) Imputation: Replaces missing data with the mode of the column. Ideal for categorical data.
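In a real dataset, continuous and categorical columns usually need different strategies at the same time. A minimal sketch using sklearn’s ColumnTransformer to combine them (assuming `data` contains a mix of numeric and non-numeric columns):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

numeric_cols = data.select_dtypes(include='number').columns
categorical_cols = data.select_dtypes(exclude='number').columns

# Median for numeric columns, most frequent for categorical columns
mixed_imputer = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', SimpleImputer(strategy='most_frequent'), categorical_cols),
])
data_mixed_imputed = pd.DataFrame(
    mixed_imputer.fit_transform(data),
    columns=list(numeric_cols) + list(categorical_cols),
)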

Medical Interpretation:

Imagine a dataset containing blood pressure values. Using mean imputation might introduce unrealistic blood pressure values, especially if the data distribution is skewed. Using median or mode might be more appropriate.
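A quick sketch with made-up systolic blood pressure readings shows the effect: a couple of extreme values pull the mean into a range that no typical patient in the sample actually has, while the median stays plausible.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical systolic readings with two extreme outliers and one missing value
bp = pd.DataFrame({'systolic_bp': [118, 122, 125, 128, 130, 240, 250, np.nan]})
mean_filled = SimpleImputer(strategy='mean').fit_transform(bp)      # fills the gap with ~159
median_filled = SimpleImputer(strategy='median').fit_transform(bp)  # fills the gap with 128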

3. Machine Learning Imputation

This involves using algorithms to estimate missing values based on other data points.

a) KNN Imputation:

K-Nearest Neighbors (KNN) can be used to impute missing data by finding the ‘k’ training samples closest in distance.

Python Code:

from sklearn.impute import KNNImputer
# KNN imputation expects all-numeric features; keep the original column names
knn_imputer = KNNImputer(n_neighbors=5)
data_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(data), columns=data.columns)

b) MICE (Multiple Imputation by Chained Equations):

MICE is a statistical method that replaces missing data with multiple imputations by modeling each feature with missing values as a function of other features.

Python Code:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
data_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(data), columns=data.columns)

Explanation:

- KNN Imputation: This method assumes that similar data points exist in the dataset, and a missing value in a data point can be approximated as the average of the values of its ‘k’ nearest neighbors.

- MICE: A more advanced method that accounts for interdependencies between variables. It performs multiple imputations, creating several datasets. It can capture the uncertainty of imputed values.

Medical Interpretation:

Suppose you’re predicting the likelihood of a disease, and you use KNN imputation. If similar patients (based on other features) have the disease, the missing values could be imputed with values that lean towards having the disease. This can affect the outcome and could have real-world implications if not correctly interpreted.

Effects on Explainability:

1. Dropping Data: Easy to explain, but might lead to biased results.
2. Statistical Imputation: Relatively easy to explain. However, the choice between mean, median, or mode should be justifiable.
3. Machine Learning Imputation: Harder to explain to non-technical stakeholders. The models might capture complex patterns, but this complexity comes at the cost of interpretability.

Complexity of Machine Learning Based Imputation

1. Introductory Complexity of Machine Learning-based Imputation:

Imputation techniques based on machine learning models, such as K-Nearest Neighbors (KNN) and Multiple Imputation by Chained Equations (MICE), inherently introduce a layer of complexity to the data preprocessing pipeline. Unlike basic statistical imputations, which rely on straightforward measures of central tendency, these advanced methods employ iterative algorithms and pattern recognition to estimate missing values. The challenge begins when one attempts to decipher the underlying mechanisms by which these imputations occur, making it a non-trivial task for stakeholders.

2. KNN Imputation: Navigating Hyperparameters:

With KNN imputation, the challenge starts with the determination of the optimal ‘k’ — the number of nearest neighbors to consider. Different values of ‘k’ can lead to different imputed values, causing model performance to vary. When explaining the model, one has to justify the chosen ‘k’ value, understand its implications on the imputed data, and the repercussions on the resultant model’s performance. This introduces an additional dimension of hyperparameter tuning and can convolute the overall understanding of the model’s behavior.
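One pragmatic, if imperfect, way to justify the choice is to tune `n_neighbors` against downstream model performance rather than guessing. A minimal sketch (the feature matrix `X` and label vector `y` are assumed to already exist):

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('imputer', KNNImputer()),
    ('model', LogisticRegression(max_iter=1000)),
])
# Search over candidate k values, scoring each on cross-validated ROC AUC
grid = GridSearchCV(pipe, param_grid={'imputer__n_neighbors': [3, 5, 10, 20]}, cv=5, scoring='roc_auc')
grid.fit(X, y)
print(grid.best_params_)

This at least ties the chosen ‘k’ to a measurable outcome, though it doesn’t make the imputed values themselves any easier to explain.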

3. MICE: Iterative Estimation and Chain Equations:

MICE, on the other hand, uses a series of regression models, iterating over each feature with missing data to impute it based on other variables. Each iteration refines the imputed values. Explaining the dynamics between variables during each iteration, how the equations are chained, and the convergence criteria for the imputed values can be overwhelming, especially for stakeholders without a deep statistical background.
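If convergence is the sticking point, `IterativeImputer` exposes the relevant knobs directly: `max_iter` and `tol` control when iteration stops, and after fitting, `n_iter_` reports how many rounds were actually needed. A small sketch:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

mice = IterativeImputer(max_iter=25, tol=1e-3, random_state=0)
data_mice = pd.DataFrame(mice.fit_transform(data), columns=data.columns)
# How many rounds the chained equations ran before the imputed values stabilized (or max_iter was hit)
print(mice.n_iter_)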

4. Dependency on Feature Interactions:

Both KNN and MICE assume that the dataset’s features have inherent relationships. KNN relies on the premise that similar data points (based on feature distances) should have similar values for missing attributes. MICE operates under the assumption that one can model each feature with missing values as a function of other features. Understanding and explaining these intricate dependencies can be challenging, especially when certain features exhibit multicollinearity or when the relationships are non-linear.
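When the relationships are non-linear, one option is to swap the default linear estimator inside `IterativeImputer` for a tree-based one, at the cost of even more opacity. A hedged sketch:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Each feature with missing values is modeled with a random forest instead of the default BayesianRidge
forest_mice = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0), max_iter=10, random_state=0)
data_forest_imputed = pd.DataFrame(forest_mice.fit_transform(data), columns=data.columns)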

5. Masking the Impact of Missingness:

A pivotal concern in model explainability is understanding the impact of missing data on the final model. With simplistic imputations, such as mean or median, the imputed values don’t add variance. However, KNN and MICE, due to their adaptive nature, can mask the true impact of missingness. This obfuscation can lead to overconfident predictions or unwarranted trust in the model’s capability to handle missing data seamlessly.
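One way to keep the impact of missingness visible downstream is to have the imputer append explicit “was missing” flags, which the model (and anyone auditing it) can then see. sklearn’s imputers support this via `add_indicator`:

from sklearn.impute import KNNImputer

knn_flagged = KNNImputer(n_neighbors=5, add_indicator=True)
# Output contains the imputed features plus one binary indicator column per feature that had missing values
data_with_flags = knn_flagged.fit_transform(data)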

6. Handling Large Proportions of Missing Data:

When a substantial proportion of the data is missing, KNN and MICE can generate synthetic values based on existing patterns. While this may seem beneficial, it can artificially inflate the model’s performance metrics. Extrapolating insights from a model trained on extensively imputed data can be a precarious endeavor, given the challenge in deciphering genuine data-driven insights from imputation-induced artifacts.
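A rough way to sanity-check heavily imputed data is to hide a slice of known values, impute them, and measure how far off the reconstructions are. A minimal sketch on a single numeric column (the `glucose` column name is hypothetical, and the data is assumed to be all numeric):

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

col = 'glucose'
hidden_idx = data[col].dropna().sample(frac=0.2, random_state=0).index  # hide 20% of the known values
corrupted = data.copy()
corrupted.loc[hidden_idx, col] = np.nan
recovered = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(corrupted), columns=data.columns, index=data.index)
rmse = np.sqrt(mean_squared_error(data.loc[hidden_idx, col], recovered.loc[hidden_idx, col]))
print(f'Imputation RMSE on artificially hidden values: {rmse:.2f}')

A large error here is a warning that the model’s performance may rest on imputation-induced artifacts rather than genuine signal.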

7. Stochastic Nature of MICE:

Unlike KNN, which is deterministic given a fixed ‘k’, MICE is stochastic. Each run might yield slightly different imputations, leading to variability in model performance. This unpredictability, combined with the iterative nature of MICE, complicates model auditing, reproducibility, and validation, as stakeholders grapple with understanding the source of model variance.
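This variability can also be surfaced deliberately, in the spirit of true multiple imputation, by sampling from the posterior and running the imputer with several seeds. A minimal sketch:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Each seed yields a different plausible completed dataset; the spread across them is a rough measure of imputation uncertainty
imputed_versions = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_versions.append(pd.DataFrame(imp.fit_transform(data), columns=data.columns))

Training and evaluating on each version, then comparing the results, makes the variance attributable to imputation explicit instead of hidden.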

8. Compounding with Model Complexity:

When KNN or MICE imputation is paired with inherently complex models like neural networks or ensemble methods, the resultant system becomes a compounded black box. The imputation intricacies, combined with the model’s architecture, can make it nearly insurmountable to generate comprehensive, clear, and actionable model explanations, hindering interpretability.

9. Evaluation Metrics and Overfitting:

A key aspect of model development is evaluating performance using appropriate metrics. With KNN or MICE, there’s a risk of overfitting to the imputed values, especially if the imputation overly conforms to the patterns in the training data. This can manifest as optimistic performance metrics during training but disappointing results on unseen data. Discerning whether poor generalization is due to model architecture, imputation strategies, or other factors becomes a convoluted task.
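One safeguard is to keep the imputer inside a Pipeline so it is re-fit on each training fold and never sees the validation data, which makes the reported metrics a fairer estimate of generalization. A minimal sketch (again assuming a feature matrix `X` and labels `y`):

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),
    ('model', RandomForestClassifier(random_state=0)),
])
# The imputer is fit only on the training portion of each fold, so imputed values cannot leak from the validation fold
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')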

10. Conclusion: A Balance of Precision and Explainability:

In the realm of data science, precision and explainability often stand at opposite ends of the spectrum. While KNN and MICE imputation can capture intricate data patterns and potentially enhance model accuracy, they introduce significant complexity in the model explainability landscape. Stakeholders, especially in critical domains like healthcare or finance, need to weigh the benefits of increased precision against the challenges of deciphering and communicating the nuances of such advanced imputation techniques.

Conclusion:

Choosing an imputation method depends on the nature of the dataset, the amount of missing data, and the importance of interpretability. In medical datasets, where interpretability and accurate representation of data are crucial, one should choose the imputation technique with care, ensuring that the method does not introduce bias or misinterpretations.

About Me

Data Science Consultant, Generative AI enthusiast, and AI researcher obsessed with the space. I aspire to help others, contribute to the AI space, and keep learning all that I can about this amazing field of study and industry. I currently work as an AI Consultant and write to share the knowledge I gain from that work and other open-source contributions.

Linkedin | Github | Portfolio
