Demystifying Missing Data Recovery in Machine Learning

Dagang Wei
3 min read · Feb 7, 2024


Image generated with Gemini

This article is part of the series Demystifying Machine Learning.

Question:

I have a tabular dataset in which every element should, in theory, be available, but unfortunately some elements are missing. Can you suggest a machine learning model to recover the missing data?

Answer:

Recovering missing data in a tabular dataset is a common task in data preprocessing, and there are several approaches to handle it, depending on the nature and pattern of the missing data, as well as the ultimate goal of your machine learning project. Here are some models and techniques that are often used for imputation (the process of replacing missing data with substituted values):

1. Simple Imputation

  • Mean/Median/Mode Imputation: Replaces missing values with the mean, median, or mode of the column. The mean and median suit numerical data; the mode suits categorical data.
  • Constant Value: Fills in missing values with a constant, which is useful if you want to mark missing data with a specific value. A minimal scikit-learn sketch of both strategies follows this list.
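
As a quick illustration, here is a minimal sketch of both strategies using scikit-learn's SimpleImputer; the toy arrays and fill values are made up purely for demonstration:

```python
# A minimal sketch of simple imputation with scikit-learn's SimpleImputer.
# The toy data below is illustrative only.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Numerical columns: replace NaNs with the column mean (or strategy="median").
mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))

# Categorical columns: strategy="most_frequent" (mode) or a constant marker.
X_cat = np.array([["red"], [np.nan], ["red"]], dtype=object)
mode_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")
print(mode_imputer.fit_transform(X_cat))
print(constant_imputer.fit_transform(X_cat))
```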

2. KNN Imputation

Utilizes the k-nearest-neighbors (k-NN) algorithm to find the k samples most similar to the row containing the missing value and imputes it from those neighbors, typically by averaging their observed values. Because it accounts for similarity between instances, it can be more accurate than simple methods, especially when the data patterns are complex.
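
A minimal sketch with scikit-learn's KNNImputer; the array and parameter choices are illustrative:

```python
# A minimal sketch of KNN imputation with scikit-learn's KNNImputer (toy data).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing cell is filled with the mean of that feature over the
# n_neighbors rows most similar on the observed features.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
```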

3. Multiple Imputation by Chained Equations (MICE)

A more sophisticated approach that models each feature with missing values as a function of other features in a round-robin fashion. It creates multiple imputations (complete datasets) and analyzes them separately. The results are then combined to produce estimates that account for the uncertainty of the missing data.
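
scikit-learn's IterativeImputer is inspired by MICE but returns a single completed dataset; one way to sketch the multiple-imputation idea is to draw several completions with different random seeds and combine them (toy data, illustrative settings):

```python
# A minimal sketch of MICE-style imputation with scikit-learn's IterativeImputer.
# Note: IterativeImputer is still marked experimental in scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each draw models every feature with missing values as a regression on the
# other features, cycling round-robin until the imputations stabilize.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(np.mean(imputations, axis=0))  # combine the completed datasets
```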

4. Decision Trees Imputation

Decision Trees can be used to impute missing values by recursively partitioning the data into smaller subsets based on the observed features. The missing values are then imputed using the observed values in the same subset.
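
One way to sketch this is to plug a DecisionTreeRegressor into IterativeImputer as the per-feature estimator; the data and hyperparameters below are illustrative assumptions:

```python
# A minimal sketch of tree-based imputation: IterativeImputer with a
# DecisionTreeRegressor as the per-feature estimator (toy data).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each feature with missing entries is predicted by a decision tree fitted on
# the other features; the tree's leaf (subset) supplies the imputed value.
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=3),
                           random_state=0)
print(imputer.fit_transform(X))
```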

5. Matrix Factorization Techniques

Useful when the dataset can be treated as a matrix with missing entries (e.g., user-item ratings in recommendation systems). Techniques like Singular Value Decomposition (SVD) decompose the matrix into low-rank factors whose product reconstructs, and thereby imputes, the missing entries.
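
A minimal NumPy sketch of iterative SVD imputation; the matrix, rank, and iteration count are illustrative assumptions:

```python
# A minimal sketch of low-rank SVD imputation with NumPy (toy matrix).
import numpy as np

def svd_impute(X, rank=2, n_iter=50):
    """Fill NaNs by repeatedly projecting onto a low-rank SVD reconstruction."""
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        X_low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries, refresh only the missing ones.
        X_filled = np.where(mask, X_low_rank, X)
    return X_filled

X = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [1.0, np.nan, 4.0, 4.0]])
print(svd_impute(X, rank=2))
```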

6. Neural Network Models

  • Autoencoders: A type of neural network that learns a compressed representation of the data. Once trained, you can feed in data with missing values, and the network reconstructs them from the learned representation (a minimal sketch follows this list).
  • Generative Adversarial Networks (GANs): GANs can be trained to generate data similar to the input data. They can be adapted to impute missing values by learning the distribution of the dataset and generating plausible values for missing data points.
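
A minimal autoencoder sketch with Keras; the layer sizes, training schedule, and toy data are illustrative assumptions, and a production version would typically restrict the reconstruction loss to the observed cells:

```python
# A minimal sketch of autoencoder-based imputation with Keras (TensorFlow).
import numpy as np
import tensorflow as tf

def autoencoder_impute(X, hidden_dim=4, epochs=200):
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # warm start: mean imputation

    n_features = X.shape[1]
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden_dim, activation="relu"),  # compressed representation
        tf.keras.layers.Dense(n_features),                     # reconstruction
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_filled, X_filled, epochs=epochs, verbose=0)

    # Replace only the originally missing cells with the network's reconstruction.
    X_hat = model.predict(X_filled, verbose=0)
    return np.where(mask, X_hat, X)

X = np.random.default_rng(0).normal(size=(100, 5))
X[np.random.default_rng(1).random(X.shape) < 0.1] = np.nan  # knock out ~10% of cells
print(autoencoder_impute(X)[:3])
```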

Choosing the Right Model

  • Data Type and Pattern: Choose a method based on whether your data is numerical, categorical, or a mix of both, and the underlying patterns in your data.
  • Missing Data Mechanism: Understand if the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), as it influences the choice of imputation technique.
  • Model Complexity and Resources: More complex models like MICE, autoencoders, or GANs may provide better imputation but require more computational resources and expertise to implement and tune.
  • Final Application: The choice of imputation method can also depend on the downstream tasks (e.g., classification, regression, clustering) and how critical the imputation accuracy is for the overall project objectives.

It’s often beneficial to experiment with multiple imputation techniques to determine which works best for your specific dataset and use case.
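
One simple way to run such an experiment is to plug each imputer into a pipeline and score it on the downstream task; the dataset, estimator, and simulated missingness below are placeholder assumptions:

```python
# A minimal sketch of comparing imputers by downstream cross-validated performance.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
X = X.copy()
X[np.random.default_rng(0).random(X.shape) < 0.2] = np.nan  # simulate missingness

for name, imputer in [("mean", SimpleImputer()),
                      ("knn", KNNImputer()),
                      ("iterative", IterativeImputer(random_state=0))]:
    pipeline = make_pipeline(imputer, Ridge())
    score = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name}: R^2 = {score:.3f}")
```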
