Missing Values in Dataset Preprocessing

Wojtek Fulmyk, Data Scientist
5 min read · Jul 27, 2023

Article level: Intermediate

My clients often ask me about the specifics of certain data preprocessing methods, why they’re needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this specific short writeup I will explain how to deal with missing values. Some understanding of specific terms will be helpful, so I have included short explanations of the more complicated terminology below. Give it a go, and if you need more info, just ask in the comments section!

MCAR (Missing Completely at Random) — Data is missing in a way that is unrelated to the observed or unobserved data.

MAR (Missing at Random) — Data is missing in a way that is related to the observed data but not the unobserved data.

MNAR (Missing Not at Random) — Data is missing in a way that is related to the unobserved data.

imputation — Replacing missing values with predicted values.

bias — Systematic error introduced by a method or model.

mean — The average value of a set of numbers.

regression — A method for modeling the relationship between variables.

stochastic — Involving a random element or process.

overfitting — Fitting the training data so closely that a model fails to generalize to new data.

principled statistical approach — A method that is based on sound statistical principles.

iterative algorithm — A method that repeats a process until a desired outcome is achieved.

constant value — A fixed value that does not change.

Missing Values

Missing values are a normal occurrence in data sets, and it’s important to handle them appropriately so as not to lose valuable information. In this article I will explore several techniques for handling missing values, including deletion, mean imputation, regression imputation, multiple imputation, hot deck imputation, expectation-maximization, and constant-value imputation.

MCAR

Deletion: Deletion involves removing entire rows or columns that contain missing values. While this approach is easy to implement, it can discard valuable information; it avoids bias only when the data are truly MCAR, and even then it shrinks the sample and reduces statistical power.
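
As a quick illustration, row- and column-wise deletion might look like this in pandas (a minimal sketch; the dataframe here is hypothetical):

import numpy as np
import pandas as pd

# Hypothetical dataframe with a few missing entries
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

# Drop every row that contains a NaN
df_rows = df.dropna()

# Drop every column that contains a NaN
df_cols = df.dropna(axis=1)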

Mean Imputation: Mean imputation involves replacing missing continuous values with the mean of that feature calculated from the observed values. It is easy to implement (a runnable example appears in the Useful Python Code section below), but it shrinks the feature’s variance and can distort its distribution and its correlations with other features.

MAR

Regression Imputation: Regression imputation involves training a regression model on complete samples, and then using it to estimate missing values. For instance, if a feature Y correlates strongly with features A and B, missing values in Y can be imputed by predicting Y from A and B. Stochastic regression imputation adds random noise to the predictions, preserving the natural variability that a purely deterministic fit would flatten out. Multivariate regressions can model linear relationships among multiple features at once.
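
As a rough sketch of how this might look with scikit-learn (the column names Y, A, and B are hypothetical, mirroring the example above):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataframe where Y has gaps but A and B are complete
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0],
    "Y": [1.1, 2.1, np.nan, 4.2, np.nan],
})

observed = df["Y"].notna()

# Fit a regression of Y on A and B using the complete rows
model = LinearRegression().fit(df.loc[observed, ["A", "B"]], df.loc[observed, "Y"])
preds = model.predict(df.loc[~observed, ["A", "B"]])

# Stochastic variant: add noise drawn from the residual spread so the
# imputed values keep some of the natural variability
residual_std = (df.loc[observed, "Y"]
                - model.predict(df.loc[observed, ["A", "B"]])).std()
df.loc[~observed, "Y"] = preds + np.random.normal(0, residual_std, len(preds))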

Hot Deck Imputation: Hot deck imputation involves replacing missing values with observed values from similar records. For instance, if a record has a missing value for feature Y, but has observed values for features A and B, a similar record with observed values for all three features can be used to impute the missing value in Y. Hot deck imputation can be implemented using various methods such as nearest neighbor matching or random selection from a pool of similar records.
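
A sketch of the nearest-neighbor variant (the random-donor variant appears in the Useful Python Code section below; column names here are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical dataframe: Y is missing in one row, A and B are observed
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [1.0, 2.5, 3.0],
    "Y": [10.0, np.nan, 30.0],
})

donors = df[df["Y"].notna()]
for i in df.index[df["Y"].isna()]:
    # Find the donor record closest in A and B (squared distance)
    dists = ((donors[["A", "B"]] - df.loc[i, ["A", "B"]]) ** 2).sum(axis=1)
    df.at[i, "Y"] = donors.loc[dists.idxmin(), "Y"]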

MNAR

Multiple Imputation: Multiple imputation is a technique that generates multiple imputed values for each missing entry. It involves training a model on complete data and using it to impute missing values, creating multiple imputed versions of the data set. Model training and prediction occur on each imputed set, and the results are averaged to obtain final predictions, accounting for imputation uncertainty. Multiple imputation provides a principled statistical approach but increases computational expense.
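
One practical way to approximate multiple imputation is scikit-learn’s IterativeImputer with sample_posterior=True, run several times with different seeds (a sketch of one possible setup, not the only approach):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])

# Draw several imputed versions of the data set; a downstream model
# would be fit on each version and the results pooled (averaged)
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_sets.append(imputer.fit_transform(X))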

Expectation-Maximization: Expectation-maximization (EM) is an iterative algorithm that estimates missing values by maximizing the likelihood of the observed data. EM involves alternating between two steps: an expectation step that computes the expected value of the missing data given the observed data and current estimates of model parameters; and a maximization step that updates the estimates of model parameters given the expected value of the missing data. EM can handle various types of data and models but may require careful initialization and convergence monitoring.
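
For intuition, here is a simplified EM-style sketch that assumes the rows come from a single multivariate normal distribution (it omits the covariance correction term that a full EM implementation would include):

import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    X = X.copy()
    miss = np.isnan(X)

    # Initialize missing entries with column means
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])

    for _ in range(n_iter):
        # M-step (simplified): re-estimate mean and covariance
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        X_old = X.copy()

        # E-step: replace missing entries with their conditional
        # expectation given the observed entries in the same row
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            cov_mo = cov[np.ix_(m, o)]
            cov_oo = cov[np.ix_(o, o)]
            X[i, m] = mu[m] + cov_mo @ np.linalg.solve(cov_oo, X[i, o] - mu[o])

        if np.max(np.abs(X - X_old)) < tol:
            break
    return X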

Simplest, but Not Best, Solution

Using a Constant Value: Another method for handling missing values is to replace them with a constant value, chosen either from domain knowledge or as a sentinel outside the observed range of the feature. While this approach is easy to implement, it introduces bias whenever the chosen constant misrepresents the underlying distribution of the missing data, and that holds regardless of whether the data are MCAR, MAR, or MNAR.
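
A minimal pandas sketch (the sentinel -1 here is arbitrary; in practice the constant should come from domain knowledge):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Replace every NaN with a fixed sentinel outside the observed range
df_const = df.fillna(-1)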

Useful Python Code

To give you some understanding of the code involved in this kind of preprocessing, I will show you how to generate random data with missing values, then impute the missing entries with a column-wise mean approach, and finally demonstrate the hot-deck imputation method.

Generating random data with missing values:

import numpy as np
import pandas as pd

# Example df of 10 rows and 5 columns
df = pd.DataFrame(np.random.randn(10, 5))

# Number of cells to set to NaN (about 20% of the data set; picking
# the same cell twice can make the true share slightly lower)
num_nan = int(df.size * 0.2)

# Scatter the NaN values at random positions
for _ in range(num_nan):
    i = np.random.randint(0, df.shape[0])
    j = np.random.randint(0, df.shape[1])
    df.iloc[i, j] = np.nan

# Ensure all observed values are positive (NaNs are unaffected)
df = df.abs()

print(df)

Mean Imputation (column-wise):

# Impute missing values with column-wise mean values
df.fillna(df.mean(), inplace=True)

# Print updated dataset
print(df)

Hot-deck Imputation:

# Impute missing values with (random-donor) hot deck imputation:
# each NaN is replaced by a value sampled at random from the observed
# entries of the same column. Run this on a dataframe that still
# contains NaNs (the mean-imputation step above fills df in place).
for col in df.columns:
    for i, val in enumerate(df[col]):
        if pd.isna(val):
            df.at[i, col] = df[col].dropna().sample().iloc[0]

# Print updated dataset
print(df)

Proper techniques help retain information and reduce bias from missing values when deleting them is not an option. However, you should be aware of the limitations and potential drawbacks of each technique, such as the introduction of bias or increased computational expense. Go ahead and choose whatever works; within each method there is a lot of interesting variation that can optimize your missing-data imputation (or deletion).

Trivia

  • Data missing not at random (MNAR) is the most challenging type of missing data for machine learning because the missingness depends on the unseen data values themselves. This means the missing data mechanism can’t be ignored — it must be explicitly modeled along with the rest of the data. The dependence on unseen data makes it more complex to handle than MCAR or MAR missing data.
  • For categorical variables with missing data, mode imputation is preferred over mean imputation in machine learning preprocessing. The mode is always a valid category from the original data, so it preserves the variable’s distribution and its relationships with other features, whereas a mean is generally not even a valid category and can distort distributions and summaries. Mode imputation is therefore the usual default for categorical predictors and targets (a short sketch follows below).
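
A one-line sketch of mode imputation for a categorical pandas Series (the color values are hypothetical):

import pandas as pd

s = pd.Series(["red", "blue", None, "red"])

# Fill missing categories with the most frequent observed category
s = s.fillna(s.mode().iloc[0])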

Wojtek Fulmyk, Data Scientist

Data Scientist, University Instructor, and Chess enthusiast. ML specialist.