Impute Missing Values the Right Way

Danil Zherebtsov
Sep 21, 2020 · 8 min read

If you’re in traditional machine learning, you are likely facing an ongoing battle with missing values (NaN) in your data.

This is the case in just about every Kaggle competition as well as in most of the tabular datasets out there.

Here I will go over all the viable approaches, provide the code in Python, and add a cherry on top at the end of this article.

If you’re eager to jump into coding as you read, don’t. Finish reading first.

By the end of this article you will be equipped to deal with this problem once and for all.

Spoiler: in this article I will elaborate on the issue concisely, but if you just need a one-line solution, jump to the last section: Impute all NaNs by ML with a single line of code.

Problem statement

Missing values are the empty records (cells) in your csv/excel/dataframe. They can appear in different types of columns:

numeric

binary

  • represented as numeric
  • represented as strings

categorical

  • represented as numeric
  • represented as strings

text

And there are different peculiarities in dealing with NaNs in each of these columns.

Common practice

Most practicing data scientists will want to get rid of the missing values quickly and start fitting models to deliver results.

Hence the quick solution is just to fill NaNs with certain constants for each type of column with missing values:

numeric — mean/median of the column:

df[col].fillna(df[col].mean(), inplace = True)

binary/categoric (regardless of the data type in the column)

  • most common category:
most_common_category = df[col].value_counts().index[0]
df[col].fillna(most_common_category, inplace = True)
  • introduce a new category

Despite its simplicity, this approach is viable: missing data may appear for a specific reason, and encoding that ‘reason’ as a new category can add useful information to the model.

new_category = -999 # this can also be a string
df[col].fillna(new_category, inplace = True)

text: a constant string

df[col].fillna('Missing_data', inplace = True)

While all of these options are viable and widely used, the implication of this approach is that it adds bias to your data.
Let me give you an example:
Suppose you have a dataset with 3 rows and you’re missing one value in the airport_area column.

So we know that the NY airport has an area of 10 000 sq.miles, the Atlanta airport 9 000 sq.miles, and the area for the Milwaukee airport is missing. Using the column-mean approach, we would fill the NaN for the Milwaukee area with 9 500 sq.miles, almost equating this small city’s airport to the ones in NY and Atlanta.

This way we have introduced faulty information into our data, and we will use it as ground truth when training our model. You get the idea…
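
To make that bias concrete, here is a minimal sketch using the hypothetical airport figures from the example above:

import pandas as pd

# hypothetical figures from the example above
df_demo = pd.DataFrame({'city': ['NY', 'Atlanta', 'Milwaukee'],
                        'airport_area': [10_000, 9_000, None]})

# column-mean imputation: (10000 + 9000) / 2 = 9500
df_demo['airport_area'].fillna(df_demo['airport_area'].mean(), inplace = True)
print(df_demo)
# Milwaukee is now credited with a 9 500 sq.mile airport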

Common practice deviations

Another solution is to drop the rows with missing values. But you have to make sure you’re left with enough data to train on. A good benchmark is keeping at least 80% of your original data, though this depends on the overall dataset size: e.g. if you have 1 000 000 rows in your original dataset with NaNs, then even dropping 400K rows should not cause problems.
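
A minimal sketch of that sanity check, assuming df is your dataframe:

rows_before = len(df)
df_dropped = df.dropna(axis = 0)  # drop every row that contains at least one NaN

retained = len(df_dropped) / rows_before
print(f'retained {retained:.1%} of the original rows')
# if retained is below ~0.8, consider imputing instead of dropping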

Dropping whole columns with NaNs is yet another option. This can be valid if a column consists of over 50–70 percent missing values; otherwise you may be removing valuable information from your data.
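
A sketch of the column-wise check, using 0.5 as the threshold:

nan_share = df.isnull().mean()  # fraction of NaNs in each column
cols_to_drop = nan_share[nan_share > 0.5].index.tolist()
df = df.drop(columns = cols_to_drop)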

Sometimes I have seen people placing zeros into the missing values across the whole dataset…

Advanced practice

A much better approach to dealing with missing values is predicting them with machine learning. This is a much more accurate solution to the problem.

Let’s get back to our Milwaukee airport_area example:

Now instead of filling the missing airport_area for Milwaukee based on the mean areas of NY and Atlanta, let’s have a look at the other known characteristics: e.g. daily passenger traffic, number of exits, number of airstrips, etc. Having trained on the known airport_area examples and their corresponding characteristics, a model can then predict the unknown airport_area based on the other characteristics of this object.

So first you need to find out where the missing values are located in a particular column:

col_nan_ix = df[df[col].isnull()].index

Next you need to split your data into a training set (with known values for this column) and a prediction set (where we will need to predict the missing values for this column). We will call them train/test for convenience, because this is no different from fitting a final model for your project:

col_test = df[df.index.isin(col_nan_ix)]
col_train = df.drop(col_nan_ix, axis = 0)
We will train on col_train, where the values for the column are known, and predict the unknown values for this column on col_test.

At this point you might be ready to fit your model. And depending on the model type you have selected, you might need to check for other missing values (in the independent features) in your col_train/col_test datasets.

[Image: missing data example in the independent features]

If this is the case, further preparation of the train/test sets in terms of NaNs in the independent features will depend on the model you choose to train on [feat1, feat2, feat3] and the target_col.

If you decide to use DecisionTreeRegressor, LinearRegression, or basically any other model that cannot train on data with NaNs (XGBoost/CatBoost being the exceptions), you will need to fill these NaNs as described above (Common practice). I have mentioned that this approach has its disadvantages, but as an intermediate step for filling NaNs in the created training subset’s independent features it is feasible, as long as the actual missing values you care about will be replaced by a model’s predictions.
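
A sketch of such an intermediate fill (medians for numeric features, the most common category otherwise; the fill values are learned on col_train only):

import pandas as pd

feature_cols = [c for c in col_train.columns if c != 'target_col']

for feat in feature_cols:
    if pd.api.types.is_numeric_dtype(col_train[feat]):
        fill_value = col_train[feat].median()
    else:
        fill_value = col_train[feat].value_counts().index[0]
    # fill both subsets with statistics learned on the training subset
    col_train[feat].fillna(fill_value, inplace = True)
    col_test[feat].fillna(fill_value, inplace = True)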

Sometimes you would opt for a simpler model just because you might have hundreds of columns with missing values and hundreds of thousands of rows, and fitting XGBoost models for each of them may be costly. If you still want to use such a heavy model, you might want to consider reducing the size of your training data col_train for each training dataset created to predict a particular column’s NaNs:

  • limit the number of rows (10K or so should be enough)
  • limit the number of independent features based on their correlation to the target_col (10–20 features); all features must be numeric for measuring correlations (see the sketch below)
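
A minimal sketch of both reductions, assuming all features are already numeric:

# cap the number of rows used for training
sample = col_train.sample(n = min(10_000, len(col_train)), random_state = 42)

# rank independent features by absolute correlation with the target column
correlations = sample.corr()['target_col'].abs().drop('target_col')
top_feats = correlations.nlargest(20).index.tolist()

col_train = col_train[top_feats + ['target_col']]
col_test = col_test[top_feats + ['target_col']]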

There is one more issue to deal with before we can fit any model: text in your data, i.e. a categorical column feat3 with values ['a','b','c']. Remember, all the data has to be numeric for a model to be able to train.

If you use an advanced model, such as XGB, RF, etc., it is totally okay to replace these values with integers (factorization). For linear models a better option would be one-hot encoding.
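
For the linear-model branch, a quick sketch of one-hot encoding with pandas (the factorization route is shown next):

import pandas as pd

# expand feat3 into 0/1 indicator columns
col_train = pd.get_dummies(col_train, columns = ['feat3'])
col_test = pd.get_dummies(col_test, columns = ['feat3'])

# align columns in case a category appears in only one of the subsets
col_train, col_test = col_train.align(col_test, join = 'left', axis = 1, fill_value = 0)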

Factorization example:

uniques = col_train['feat3'].dropna().unique().tolist()
mapping_dict = {key: val for val, key in enumerate(uniques)}
# mapping_dict for feat3, e.g.: {'a': 0, 'b': 1, 'c': 2}
# now replace the values in your col_train['feat3'] column
col_train['feat3'] = col_train['feat3'].map(mapping_dict)

The same has to be done to all the columns in your train/test subsets that are non-numeric.

Note: it is better to create a copy of the original dataframe at the beginning and transform the whole thing once, before creating all the splits for each column, rather than transforming each split separately.
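
A minimal sketch of that upfront transformation, keeping each column’s mapping for the reverse transformation later:

df_encoded = df.copy()
mappings = {}  # column -> {category: integer}

for c in df_encoded.select_dtypes(include = 'object').columns:
    uniques = df_encoded[c].dropna().unique().tolist()
    mappings[c] = {key: val for val, key in enumerate(uniques)}
    df_encoded[c] = df_encoded[c].map(mappings[c])  # NaNs remain NaN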

# select, initialize the model
from xgboost import XGBRegressor
model = XGBRegressor()
# fit on your training set for a particular column
model.fit(col_train.drop('target_col', axis = 1), col_train['target_col'])
# predict
pred_col = model.predict(col_test.drop('target_col', axis = 1))
# place your predictions into the locations of the missing values in your full dataset
df.loc[col_nan_ix, 'target_col'] = pred_col

Note: if your target column is a non-numeric categorical data type, it will have to be transformed to numeric as I have shown above. But we can’t just replace it with random numbers! We have to remember how this column was transformed, so we can reverse-transform the predictions back to the original values after training/prediction. This is where the created mapping_dict comes in handy: you just have to invert this dict before mapping the numerics back to text: reverse_mapping = {val: key for key, val in mapping_dict.items()}
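
Applied to our feat3-style example, the reverse transformation might look like this (rounding because a regressor returns floats; for a truly categorical target you would normally fit a classifier instead):

import pandas as pd

# invert the factorization mapping created earlier
reverse_mapping = {val: key for key, val in mapping_dict.items()}

# map the numeric predictions back to the original categories
pred_labels = pd.Series(pred_col).round().astype(int).map(reverse_mapping)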

This is basically the full pipeline for imputing one column’s missing values with XGBoost.

This was a simple example; with real data these choices may not be that obvious. E.g. a categorical column represented as numeric: in this case you need to spot such columns, somehow work out that they are categorical, and use that knowledge for further transformation/model configuration, etc.

If you want to impute NaNs using ML in hundreds of columns with missing values across your dataset, you will have to make a lot of choices: how to transform which column, what to do with NaNs in the independent features, which model to select and how to configure it, etc. This can become tedious.

Impute all NaNs by ML with a single line of code

short answer:

$ pip install verstack

context:

I have created an open-source library that does it all with just one call to a function. It basically imputes all the missing values in all your columns using advanced XGBoost models and returns a dataframe with all NaNs filled by their corresponding predictions.

Since the missing values prediction process can take a significant amount of time, the default parameters will utilize all the CPU cores your system has to offer and impute NaNs in all the columns in parallel. On my 12-core machine, imputation averages roughly a 10X speed-up compared to using a single core. This can be configured down to half of the available cores, or just one, in the parameters (see below).

To get started, just go to the console (terminal/cmd/anaconda prompt) and:

$ pip install verstack

in Python:

from verstack import NaNImputer
imputer = NaNImputer()
df_without_nans = imputer.impute(df_with_nans)

And that is it.

Let me give you an example on the House Prices: Advanced Regression Techniques dataset from Kaggle.

Below are the standard prints you are going to see on the imputation progress; they can be disabled with verbose = False.

[Image: imputation progress printout]

The default settings are configured for highest performance, but you can define them as you like:

All the parameters are set when initializing the imputer.

Say you would like to impute missing values in a list of specific columns, use the 20 most important features for each of these columns’ imputation, and deploy half of the available CPU cores for the job (so you can still check your email while the job gets done):

imputer = NaNImputer(nan_cols = ['col1', 'col2'], n_feats = 20, multiprocessing_load = 2)
df_imputed = imputer.impute(df)

Happy coding.

Links

Documentation page

Git

Pypi

Written by Danil Zherebtsov, Applied Machine Learning Engineer.