# Impute Missing Values the Right Way

Sep 21, 2020 · 8 min read

If you’re in traditional machine learning, you are likely facing an ongoing battle with missing values (NaN) in your data.

This is the case in just about every Kaggle competition as well as in most of the tabular datasets out there.

Here I will go over the viable approaches, provide the code in Python and add a cherry on top at the end of this article.

If you’re eager to jump into coding as you read, don’t. Finish reading first.

Spoiler: in this article I will elaborate on the issue concisely, but if you just need a one-line solution, skip to the last section.

# Problem statement

Missing values are the empty records (cells) in your csv/excel/dataframe. They can appear in different types of columns:

numeric

binary
• represented as numeric
• represented as strings

categorical
• represented as numeric
• represented as categoric

text

And there are different peculiarities in dealing with NaNs in each of these columns.
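Before choosing a strategy it helps to see which columns contain NaNs and what type each one is. A minimal sketch (the toy dataframe and the `nan_report` name are my own, purely for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy dataframe with one NaN in each column type
df = pd.DataFrame({
    'num':  [1.0, np.nan, 3.0],        # numeric
    'bin':  ['yes', 'no', np.nan],     # binary as strings
    'cat':  [np.nan, 'b', 'c'],        # categorical
    'text': ['foo', np.nan, 'baz'],    # free text
})

# count missing values per column alongside the column dtype
nan_report = pd.DataFrame({'n_missing': df.isnull().sum(),
                           'dtype': df.dtypes})
print(nan_report)
```

The dtype column is what tells you which of the strategies below applies.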

## Common practice

Most practicing data scientists want to quickly get rid of the missing values and start fitting models to deliver results.

Hence the quick solution is just to fill NaNs with certain constants, chosen per type of column with missing values:

numeric — mean/median of the column:

`df[col].fillna(df[col].mean(), inplace = True)`

binary/categoric (regardless of the data type in the column)

• most common category:

`most_common_category = df[col].value_counts().index[0]`
`df[col].fillna(most_common_category, inplace = True)`
• introduce a new category

Despite its simplicity, this approach is viable: data may be missing for a specific reason, and making this ‘reason’ a new category can add useful knowledge.

`new_category = -999 # this can also be a string`
`df[col].fillna(new_category, inplace = True)`

text — constant string:

`df[col].fillna('Missing_data', inplace = True)`

While all of these options are viable and widely used, the implication of this approach is that it adds bias to your data.
Let me give you an example:
Suppose you have a dataset with 3 rows and you’re missing one value in the airport_area column.

So we know that the NY airport has an area of 10,000 sq. miles, the Atlanta airport 9,000 sq. miles, and the area for the Milwaukee airport is missing. Using the column-mean approach we would fill the NaN for Milwaukee with 9,500 sq. miles, almost equating this small city’s airport to the ones in NY and Atlanta.

This way we have introduced faulty information into our statistics, and we will use this information as ground truth when training our model. You get the idea…
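The airport example above can be sketched in a few lines (the numbers come from the text; the column names are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'Atlanta', 'Milwaukee'],
    'airport_area': [10_000, 9_000, np.nan],
})

# mean imputation: Milwaukee gets (10000 + 9000) / 2 = 9500,
# as if its airport were nearly the size of NY's or Atlanta's
df['airport_area'] = df['airport_area'].fillna(df['airport_area'].mean())
print(df)
```

The fill is syntactically trivial, which is exactly why the bias it introduces is easy to overlook.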

## Common practice deviations

Another solution is to drop rows with missing values. But you have to make sure you’re left with enough data to train on. A good benchmark is to keep at least 80% of your original data, subject to overall dataset size: e.g. if your original dataset has 1,000,000 rows, even dropping 400K of them should not cause problems, because the absolute amount of remaining data is still large.

Dropping whole columns with NaNs is yet another option. This can be valid if a column consists of over 50–70 percent missing values; otherwise you may be removing valuable information from your data.
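Both dropping strategies can be sketched like this (the 50% column threshold and the 80% row benchmark are the figures from the text; the toy data is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, np.nan, np.nan, np.nan, 1],  # 80% missing
    'c': [1, 2, 3, 4, 5],
})

# drop columns that are more than 50% NaN
col_nan_share = df.isnull().mean()
df = df.drop(columns=col_nan_share[col_nan_share > 0.5].index)

# drop remaining rows with NaNs, but only if at least 80% of rows survive
if df.dropna().shape[0] / df.shape[0] >= 0.8:
    df = df.dropna()
print(df.shape)
```

Checking the surviving share before dropping is what keeps this from silently discarding too much training data.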

Sometimes I have seen people placing zeros into the missing values across the whole dataset…

A much better approach to dealing with missing values is predicting them with machine learning. This is a much more accurate solution to the problem.

Let’s get back to our Milwaukee airport_area example:

Now instead of filling the missing airport_area for Milwaukee based on the mean areas of NY and Atlanta, let’s look at other known characteristics: e.g. daily passenger traffic, number of exits, number of airstrips, etc. Having learned from the known airport_area examples and their corresponding characteristics, a model can then predict the unknown airport_area based on the other characteristics of this object.

So first you need to find out where the missing values are located in a particular column:

`col_nan_ix = df[df[col].isnull()].index`

Next you need to split your data into a training set (with known values for this column) and a prediction set (where we will predict the missing values for this column). We will call them train/test for convenience, because this is no different from fitting a final model for your project:

`col_test = df[df.index.isin(col_nan_ix)]`
`col_train = df.drop(col_nan_ix, axis = 0)`

At this point you might be ready to fit your model. And depending on the model type you have selected, you might need to check for other missing values (in the independent features) in your col_train/col_test datasets.

If this is the case, further train/test sets preparation in terms of NaNs in the independent features will depend on your model of choice to train on `[feat1,feat2,feat3]` and the `target_col`.

If you decide to use DecisionTreeRegressor, LinearRegression or basically any other model that cannot train on data with NaNs (unlike XGBoost/CatBoost), you will need to fill these NaNs as described above (Common practice). I have mentioned that this approach has its disadvantages, but as an intermediate step for filling NaNs in the training subset’s independent features it is feasible, as long as the actual missing values you will then replace come from a model’s prediction.

Sometimes you would opt for a simpler model just because you might have hundreds of columns with missing values and hundreds of thousands of rows, and fitting XGBoost models for each one of them may be costly. If you still want to use such a heavy model, you might want to consider reducing the size of your training data `col_train` for each training dataset created to predict a particular column’s NaNs:

• limit the number of rows (10K or so should be enough)
• limit the number of independent features based on their correlation with the `target_col` (10–20 features). All features must be numeric for measuring correlations
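The correlation-based feature selection could be sketched like this (assuming `col_train` is already all-numeric and `target_col` names the column being imputed; the random toy data is my own):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col_train = pd.DataFrame(rng.normal(size=(100, 5)),
                         columns=['feat1', 'feat2', 'feat3', 'feat4',
                                  'target_col'])

n_feats = 3  # in practice 10-20, as suggested above

# absolute correlation of every feature with the target, strongest first
corrs = col_train.corr()['target_col'].drop('target_col').abs()
top_feats = corrs.sort_values(ascending=False).index[:n_feats].tolist()

# keep only the most correlated features plus the target itself
col_train = col_train[top_feats + ['target_col']]
print(top_feats)
```

Row subsampling is just `col_train.sample(n=10_000)` on top of this, when the training subset is larger than that.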

There is one more issue to deal with before we can fit any model: text in your data — that is a categorical column `feat3` with values `['a','b','c']`. Remember, all the data has to be numeric for a model to be able to train.

If you use an advanced model, such as XGB, RF, etc., it is totally okay to replace these values with integers (factorization). For linear models a better option would be one-hot encoding.

Factorization example:

`uniques = col_train['feat3'].dropna().unique().tolist()`
`mapping_dict = {key:val for val, key in enumerate(uniques)}`
`# now replace values in your col_train['feat3'] column`
`col_train['feat3'] = col_train['feat3'].map(mapping_dict)`

The same has to be done for all the columns in your train/test subsets that are non-numeric.
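Factorizing every non-numeric column in one pass, while keeping the mappings for a later reverse transformation, might look like this (the `mappings` dict-of-dicts is my own convention, not a library feature):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'feat1': [1.0, 2.0, 3.0],
    'feat3': ['a', 'b', np.nan],
})

mappings = {}  # column name -> {category: integer}
for col in df.select_dtypes(exclude='number').columns:
    uniques = df[col].dropna().unique().tolist()
    mappings[col] = {key: val for val, key in enumerate(uniques)}
    df[col] = df[col].map(mappings[col])  # NaNs are left as NaN
print(df.dtypes)
```

Note that `map` leaves the NaNs in place, which is what you want: they are the values the model will predict later.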

Note: it is better to create a copy of the original dataframe at the beginning and transform the whole thing once, before creating the splits for each column, rather than transforming each split separately.

`# select, initialize model`
`from xgboost import XGBRegressor`
`model = XGBRegressor()`
`# fit on your training set for a particular column`
`model.fit(col_train.drop('target_col', axis = 1), col_train['target_col'])`
`# predict`
`pred_col = model.predict(col_test.drop('target_col', axis = 1))`
`# place your predictions into the locations of the missing values in your full dataset`
`df.loc[col_nan_ix, col] = pred_col`

Note: if your target column is a categorical non-numeric data type, it will have to be transformed to numeric as I have shown above. But we can’t just replace it with random numbers! We have to remember how this column has been transformed so we can reverse-transform it back to the original values after training/prediction. In this case the created mapping_dict will come in handy: you will just have to reverse this dict before mapping the numerics back to text. `decoding_map = {val:key for key,val in mapping_dict.items()}`
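Reversing the mapping after prediction might look like this (continuing with the `mapping_dict` from the factorization example; the predictions here are made-up integers):

```python
import pandas as pd

mapping_dict = {'a': 0, 'b': 1, 'c': 2}  # category -> integer used for training
decoding_map = {val: key for key, val in mapping_dict.items()}  # integer -> category

# hypothetical integer predictions from the model
pred_col = pd.Series([2, 0, 1])
pred_col_text = pred_col.map(decoding_map)
print(pred_col_text.tolist())  # ['c', 'a', 'b']
```

With a classifier the predictions come back as the same integers you trained on, so this round-trip restores the original labels exactly.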

This is basically the full pipeline for imputing missing values in one column using XGBoost.

This was a simple example; with real data the choices above may not be as obvious. E.g. a categoric column represented as numeric: you need to spot such columns, somehow understand that they are categoric, and take that into account in further transformation/model configuration, etc.

If you want to impute NaNs using ML in hundreds of columns with missing values across your dataset, you will have to make a lot of choices: how to transform which column, what to do with NaNs in the independent features, which model to select and how to configure it, etc. This can become tedious.

# Impute all NaNs by ML with a single line of code


I have created an open-source library that does all of this with a single function call. It imputes all the missing values in all your columns using XGBoost models and returns a dataframe with all NaNs filled by their corresponding predictions.

Since the missing values prediction process can take a significant amount of time, the default parameters will utilize all the available CPU cores your system has to offer and impute NaNs in all the columns in parallel. On my 12-core machine this averages an approximate 10X speedup compared to using a single core. This can be configured down to half of the available cores or just one in the parameters (see below).

To start, just go to the console (terminal/cmd/anaconda prompt) and run

`$ pip install verstack`

and in Python:

`from verstack import NaNImputer`
`imputer = NaNImputer()`
`df_without_nans = imputer.impute(df_with_nans)`

And that is it.

Let me give you an example on the house_prices_advanced_regression_techniques dataset from Kaggle.

During imputation you will see standard progress prints; they can be disabled with `verbose = False`.

The default settings are configured for highest performance, but you can define them as you like:

All the parameters are set when initializing the imputer.

Say you would like to impute missing values in a list of specific columns, use the 20 most important features for each of these columns’ imputation, and deploy half of the available CPU cores for the job (so you can still check your email as the job gets done):

`imputer = NaNImputer(nan_cols = ['col1', 'col2'], n_feats = 20, multiprocessing_load = 2)`
`df_imputed = imputer.impute(df)`

Happy coding.

Documentation page

Git

Pypi


Written by

## Danil Zherebtsov

Applied Machine Learning Engineer
