An Introduction to Imputation: Solving problems of missing and insufficient data

Missing data is a common problem. Here’s how imputation can help.

Ben Dexter Cooley
Jul 30, 2020
Photo with modifications by Jason Blackeye on Unsplash

For many data scientists, analysts, or engineers, dealing with missing or bad data is an everyday experience. A dataset could have missing values for a key period of time, or perhaps the dataset contains outlier values that need to be corrected.

A common response is to look for new data or to work only with small, complete subsets of the dataset. However, the full dataset, once its limitations are corrected, can hold real insights. What can you do to preserve the integrity of the data while still mining it for useful signal?

Imputation can help solve this problem. Over the coming weeks, the Tagup team will publish several articles in a series on imputation, its applications, and how to apply it on a practical level.

In this article, we’ll briefly discuss what imputation is, how it can be useful, and how imputation using machine learning models differs from other standard methods of dealing with missing or insufficient data.

The process of imputation

Diagram showing the process of applying mean imputation to a column of data.

Imputation replaces missing or unreliable values with estimates derived from the data you do have. Each method of imputation has its own pros and cons (some are discussed later in the article), but you can think of any of them as a set of rules: if a dataset contains missing values, apply a certain calculation to create a “best guess” replacement.
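To make the "set of rules" idea concrete, here is a minimal sketch of mean imputation on a single column, using only the Python standard library. The function name `impute_mean`, the use of `None` to mark a missing value, and the sample readings are assumptions made for illustration; they are not part of the Tagup application.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# Hypothetical column with two missing readings (marked as None).
readings = [10.0, None, 14.0, 12.0, None]
filled = impute_mean(readings)
print(filled)
```

The rule here is simply "if a value is missing, use the average of everything that was observed." Other methods swap in a different calculation but follow the same pattern.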

Why imputation is useful

Imputation is a tool to recoup and preserve valuable data.

Imputation allows you to:

  • Troubleshoot what may be happening in periods of missing data by simulating possible values
  • Synchronize time scales for machine learning/modeling
  • Smooth extremely noisy data

A sample measurement from the Tagup application, showing periods of missing data in the gray shaded regions of the chart. The data shown is pre-imputation: the raw data in this chart is only interpolated between neighboring data points (predicting values at unsampled locations), rather than filled in by model-based imputation (replacing missing values with new values).

For example, imputation can be used to fill in missing sensor measurements if you lose data communication for a day. By identifying the time range (one day) and frequency of expected measurements, you can use imputation to simulate what “normal” operating conditions would look like for this time.
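As a rough sketch of that workflow, the example below builds the grid of timestamps a sensor was expected to report for one day, finds the gaps, and fills them. It uses plain linear interpolation between the nearest observed readings as a simple stand-in for a real imputation model. The function names, the six-hour reporting frequency, and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta

def expected_timestamps(start, end, step):
    """List every timestamp a healthy sensor would report in [start, end)."""
    stamps = []
    t = start
    while t < end:
        stamps.append(t)
        t += step
    return stamps

def fill_gaps_linear(samples, start, end, step):
    """Place samples on the expected grid and linearly interpolate the gaps.

    samples: dict mapping datetime -> float for the readings that arrived.
    Gaps before the first or after the last observation are left as None.
    """
    grid = expected_timestamps(start, end, step)
    values = [samples.get(t) for t in grid]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            delta = values[right] - values[left]
            values[i] = values[left] + delta * (i - left) / span
    return list(zip(grid, values))

# Hypothetical day: readings expected every 6 hours, two of them lost.
start = datetime(2020, 7, 1)
end = start + timedelta(days=1)
step = timedelta(hours=6)
samples = {start: 20.0, start + timedelta(hours=18): 26.0}

filled_day = fill_gaps_linear(samples, start, end, step)
for stamp, value in filled_day:
    print(stamp.isoformat(), value)
```

Putting the data on the expected grid first is what makes the gap visible. A model-based method would replace only the interpolation step.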

Imputation with machine learning

Commonly used methods of imputation include:

  • Mean or median imputation
  • Imputation using most frequent values
  • Linear regression imputation
  • Multivariate imputation by chained equation (MICE)
  • k-nearest neighbors algorithm (k-NN)
  • Hidden Markov Model (HMM)

For now, it’s useful to consider the following example: say you are monitoring a fleet of assets for a critical threshold alarm and you lose data communications for one of many sensor measurements. The missing data amounts to about 5% of the total time range. A traditional method of imputation, such as using the mean or perhaps the most frequent value, would fill in this 5% of missing data based on the values of the other 95%.

But this traditional approach has an inherent risk: alarms and thresholds are infrequent and often short. Certain “spikes” or “anomalies” in data, by their very nature, cannot be predicted based on what is considered an average value in the dataset.
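A toy example shows the risk. The numbers below are invented for illustration, including the alarm threshold and the "true" values, which in practice you would not have. The spike happens exactly while the data is missing, and filling the gap with the mean of the observed values hides it.

```python
from statistics import mean

ALARM_THRESHOLD = 90.0  # hypothetical alarm level

# Hypothetical ground truth: one short spike crosses the threshold.
true_values = [50.0, 52.0, 51.0, 95.0, 50.0, 49.0]
# Communication drops out exactly while the spike happens.
observed = [50.0, 52.0, 51.0, None, 50.0, 49.0]

# Traditional approach: fill the gap with the mean of the observed values.
fill = mean(v for v in observed if v is not None)
imputed = [fill if v is None else v for v in observed]

# Count how many readings cross the alarm threshold in each series.
true_alarms = sum(v > ALARM_THRESHOLD for v in true_values)
imputed_alarms = sum(v > ALARM_THRESHOLD for v in imputed)
print(true_alarms, imputed_alarms)
```

The imputed series looks perfectly healthy, because an average of normal readings can only ever produce another normal reading.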

Machine learning methods such as the k-nearest neighbors algorithm (k-NN) or Hidden Markov Model (HMM) use a more complex set of calculations for imputation. Unlike traditional methods, they also offer additional capabilities, such as:

  • Accounting for correlation between different features, rather than treating them separately
  • Accounting for uncertainty bounds
  • Imputing categorical values as well as numerical
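The first point, using correlation between features, can be sketched with a bare-bones k-NN imputer. This is a simplified illustration written with the standard library only, not a production implementation (scikit-learn's `KNNImputer` is one existing option). The function `knn_impute`, the two features, and the readings are assumptions made for the example.

```python
from math import sqrt
from statistics import mean

def knn_impute(rows, target, k=2):
    """Fill None in column `target` using the k most similar complete rows.

    rows: list of equal-length lists of floats; only `target` may be None.
    Similarity is Euclidean distance over the other columns.
    """
    others = [c for c in range(len(rows[0])) if c != target]
    complete = [r for r in rows if r[target] is not None]

    def distance(a, b):
        return sqrt(sum((a[c] - b[c]) ** 2 for c in others))

    result = []
    for row in rows:
        if row[target] is not None:
            result.append(list(row))
            continue
        nearest = sorted(complete, key=lambda r: distance(row, r))[:k]
        filled = list(row)
        filled[target] = mean(r[target] for r in nearest)
        result.append(filled)
    return result

# Hypothetical readings: [load, temperature]; temperature rises with load.
rows = [
    [1.0, 30.0],
    [1.1, 31.0],
    [5.0, 70.0],
    [5.2, 72.0],
    [5.1, None],  # temperature lost while the asset ran at high load
]
knn_filled = knn_impute(rows, target=1, k=2)
column_mean = mean(r[1] for r in rows if r[1] is not None)
print(knn_filled[-1][1], column_mean)
```

Because the missing temperature belongs to a high-load row, the k-NN estimate is drawn from the other high-load rows, while the plain column mean falls between the low-load and high-load groups and matches neither.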

In future posts within this series, we’ll break down in more detail the various applications of imputation using machine learning. We will also look at how to best visualize imputation results, and how to create and tune an imputation model.

Stay tuned for more articles on imputation methods by following When Machines Learn.

When Machines Learn

Sharing research and insight into applying machine learning to industrial asset management.

When Machines Learn

A blog to share research and work in applying machine learning in heavy industry. Focus includes asset management and process optimization.

Written by Ben Dexter Cooley

Visualization Software Engineer @ Pattern (Broad Institute). Designer, developer, data artist.