A Brief and Beautiful Introduction to Imputation of Missing Data

Because data is beautiful

tanta base
3 min read · Dec 7, 2023

When working with real-world datasets, missing data is quite common. Data can be missing at random, or it can be missing systematically. Missing data can impact your downstream analysis by introducing bias and distorting inferences. It can also reduce statistical efficiency and make data engineering that much more complex! Imputing missing data is a crucial part of data analysis and of preparing your dataset for machine learning.

Do you know what data and flowers have in common?

Missing data can happen due to:

  • human error, such as a mistake during data collection or data entry
  • nonresponse, when a participant skips a question on a form or questionnaire
  • attrition, when a participant drops out of a study or test before it ends

You may have read about simply deleting rows with missing values, but unfortunately this will also bias your downstream analysis. Since data is so vital, you typically want to keep as much of it as you can.

I won’t go too deep into any one topic; use this as a gentle introduction or as a jumping-off point for further study. With that in mind, this article covers both machine learning and non-machine learning techniques to impute missing data.

Answer: Data and flowers are both beautiful!

Imputation

Imputation is the practice of replacing missing values with substitutes. Unit imputation replaces an entire data point, while item imputation replaces a component of a data point.

Non-Machine Learning Techniques

Here are some no-nonsense non-machine learning techniques to get you started:

  • Hot-deck imputation uses the same dataset and replaces a missing data point with a value from the most relevant row immediately prior to it.
  • Cold-deck imputation replaces missing data points with similar data points from another dataset.
  • Mean substitution replaces missing data points with the mean of the entire column. Use this technique for numerical data.
  • Median substitution can be used in place of the mean when outliers are present.
  • Mode substitution is used when you are working with categorical data.
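The techniques above can be sketched in a few lines of pandas. The DataFrame and column names here are made up for illustration:

```python
import numpy as np
import pandas as pd

# A toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, np.nan],
    "income": [50000.0, 62000.0, np.nan, 58000.0, 61000.0],
    "city": ["NYC", "LA", None, "NYC", "NYC"],
})

# Hot-deck (simplest form): carry the value from the row
# immediately prior forward into the gap
df["age_hotdeck"] = df["age"].ffill()

# Mean substitution for numerical data
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Median substitution when outliers are present
df["income_median"] = df["income"].fillna(df["income"].median())

# Mode substitution for categorical data
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Note that cold-deck imputation would look the same as hot-deck, except the fill values would come from a second, similar dataset rather than the one you are cleaning.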

These techniques are straightforward and fast to implement; however, they introduce bias, can be a naive approach, and are overall not the most accurate methods.

Machine Learning Techniques

Let's move on to some more sophisticated imputation techniques:

  • Regression Imputation can be used when your data points are missing at random and a correlation exists between the missing values and the other features. It imputes a missing data point by predicting its value from the other data points. You can use this technique for linear and non-linear relationships, and it is often used for numerical data.
  • Multiple Imputation by Chained Equations (MICE) is often used for data missing at random, when the missing points are correlated with the target, and when there are missing values in one or more columns. It works by essentially making multiple copies of the dataset, replacing the missing values with a placeholder (usually the mean), and then regressing on the dataset copies. At the end, the placeholder imputations are replaced with predicted values. Find out more here.
  • K-Nearest Neighbors can replace missing data by finding the “K” nearest neighbors (most similar rows) under a distance metric and averaging their values. This is often used for numerical data.
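As a rough sketch of the techniques above, scikit-learn offers `KNNImputer` for the KNN approach and `IterativeImputer`, a MICE-inspired imputer that regresses each incomplete column on the others. The toy matrix below is made up for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer
# IterativeImputer is still marked experimental in scikit-learn,
# so it requires this enabling import first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numerical data with missing entries (illustrative only)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

# KNN imputation: fill each missing value with the average of the
# same feature across the K most similar rows
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# MICE-style imputation: start from a simple initial fill (the mean
# by default), then model each column with missing values as a
# function of the other columns, cycling round-robin
mice = IterativeImputer(random_state=0)
X_mice = mice.fit_transform(X)

print(X_knn)
print(X_mice)
```

Both imputers return a complete matrix with the same shape as the input, so they slot directly into a preprocessing pipeline ahead of model training.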

Well, there you have it! I hope this sets you up for success when you're prepping data for analysis or machine learning.

Want more? Check out this article on feature engineering and this cheat sheet series on built-in algorithms in AWS:

Full list of all installments:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection and Image Classification and DeepAR here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization here
  • 5/5 for IP insights and reinforcement learning here


tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots, and bioinformatics.