The Last Step in Data Preprocessing: Handling Missing Values

Simone Azeglio
MLJC
Aug 30, 2020

We are at the end of this mini-series on Data Preprocessing (if you've missed the previous article, you can find it here): the last step to tackle is how to handle missing values. In this short article, after introducing the three categories of missing data, we'll go through a few simple techniques that can lead us to a solution.

“Missingness” is almost always informative by itself, and we should tell our algorithm when a value is missing. Even if we build a model to impute our values, we are not adding any real information: we're just reinforcing the patterns already provided by other features.
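For instance, scikit-learn lets us expose this information explicitly: SimpleImputer accepts an add_indicator=True flag that appends binary "was missing" columns to the imputed output. A minimal sketch (the toy array is just illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [np.nan, 3.0],
              [7.0, 6.0]])

# add_indicator=True appends one binary column per feature that had
# missing values, flagging where the original entries were NaN
imp = SimpleImputer(strategy="mean", add_indicator=True)
print(imp.fit_transform(X))
# [[1.  4.5 0.  1. ]
#  [4.  3.  1.  0. ]
#  [7.  6.  0.  0. ]]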

Basically, there are three categories of missing data:

  • MCAR (Missing Completely At Random), where the pattern of missingness is statistically independent of the data record. Example: you have a data set on a piece of paper and you spill coffee on it, destroying part of the data.
  • MAR (Missing At Random), where the probability distribution of the pattern of missingness is functionally dependent upon the observable components of the record. MCAR is a special case of MAR. Example: if a child does not attend an educational assessment because the child is (genuinely) ill, this might be predictable from other data we have about the child’s health, but it would not be related to what we would have measured had the child not been ill.
  • MNAR (Missing Not At Random), defined as any case that is not MAR, i.e. when the missingness is specifically related to what is missing. Example: a person does not attend a drug test because the person took drugs the night before.

Let’s see a few strategies to impute missing values, i.e. to infer them from the known part of the data.

Univariate Feature Imputation

We can rely on scikit-learn’s SimpleImputer class, which provides a few strategies for imputing missing values, such as imputing with a constant value or with a statistic (mean, median, etc.) of each column.

Let’s see how we can replace missing values (np.nan) using the mean value of the columns that contain them.
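A minimal sketch along the lines of the scikit-learn documentation (the toy data is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

# fit on training data: column means are computed ignoring the NaNs
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

# transform replaces each NaN with the mean of its column
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
# [[4.         2.        ]
#  [6.         3.66666667]
#  [7.         6.        ]]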

SimpleImputer can also be used in conjunction with pandas, in particular with data represented as strings or categoricals, by using the most_frequent or constant strategy:
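For example (a sketch mirroring the scikit-learn docs; the tiny DataFrame is illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

# each missing entry is replaced by the most frequent value in its column
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
# [['a' 'x']
#  ['a' 'y']
#  ['a' 'y']
#  ['b' 'y']]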

Multivariate Feature Imputation

Another approach is to use the IterativeImputer class, which models each feature with missing values as a function of the other features. How does each iteration work? At each step, one feature column is designated as the output y and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the known values of y, and is then used to predict the missing values of y. This is done for each feature in an iterative fashion.
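A minimal sketch (note that, at the time of writing, IterativeImputer is experimental and must be enabled explicitly; the toy data is illustrative):

import numpy as np
# IterativeImputer is experimental: this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# train on columns that roughly satisfy y = 2x, so the imputer can
# learn the relationship and use it to fill the gaps
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
print(np.round(imp.transform(X_test)))
# [[ 1.  2.]
#  [ 6. 12.]
#  [ 3.  6.]]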

IterativeImputer is very flexible: it allows you to use a variety of estimators. If you want to delve further into this class, take a look at Imputing missing values with variants of IterativeImputer.

Nearest Neighbors Imputation

The KNNImputer class offers imputation for filling in missing values using the k-Nearest Neighbors approach. By default, it uses a Euclidean distance metric that supports missing values, nan_euclidean_distances.

The following code snippet shows how to replace missing values using the mean feature value of the two nearest neighbors of samples with missing values:
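A minimal sketch (the toy matrix is illustrative):

import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = [[1, 2, nan],
     [3, 4, 3],
     [nan, 6, 5],
     [8, 8, 7]]

# each NaN is replaced by the mean of that feature over the two
# nearest neighbors, found with the nan_euclidean_distances metric
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print(imputer.fit_transform(X))
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]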

Thank you for your attention. In the next mini-series of articles we're going to introduce EDA (Exploratory Data Analysis) and some techniques for data visualization.

Below is a short list of additional articles and tutorials on Data Preprocessing tasks.

References & Additional Material

Preprocessing Data

Feature Selection

Encoding Categorical Features

Feature Importance

Handling Missing Values


Simone Azeglio
Physics of Complex Systems Master Student | University of Turin — Visiting Research Student | University of Ottawa.