GUIDE TO MANAGE MISSING DATA

series: NO DATA SCIENTIST IS THE SAME! — part 7

Karin Gruijs
Cmotions
21 min read · Mar 4, 2022


This article is part of our series about how different types of data scientists build similar models differently. No human is the same and therefore also no data scientist is the same. And the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we will explore the four different approaches of our data scientists — Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are presented with the task of building a model to predict whether employees of a company — STARDATAPEPS — will look for a new job or not. Based on their distinct profiles discussed in the first blog you can already imagine that their approaches will be quite different.

In this article we will discuss Andy Stand's and Aki's approaches.
Let me first remind you of who Andy is:

Andy Stand: 'Understand is what we do'

Andy believes everything should be explainable. He is happy to sacrifice a bit of accuracy to achieve the clearest, most understandable solution. Simple regressions and decision trees are the most utilised tools in his toolbox. In handling missing values, too, he will choose methods he can explain easily.

Secondly we will see Aki’s approach for managing missing data.

Aki Razzi: ‘Accuracy is what truly matters’

Aki always wants to achieve the highest possible performance. Time and resources do not matter that much to her. She does not care whether a technique is easy to explain or not.

In the blog Beat Dirty Data we saw several challenges data scientists face when building a predictive model. Missing data is one of the most common. In this article we will discuss and compare several methods to handle these missings. Which method you choose depends on the business goal, available time, and your personal preference.

What is the problem with missing data?

Missings in categorical and numerical variables should be considered separately. For categorical variables, unknown values can simply be grouped into a separate category. You could for example make a dummy variable for it or do something else with it; see our blog on categorical variables.

You cannot do calculations with missing values. Formula-based algorithms like regression or neural networks cannot handle records with missing values: they need complete cases, without a missing value in any variable. Note that some algorithms, like decision trees or gradient boosting, can handle missing values. These algorithms therefore need less data preparation and could be faster to implement. But sometimes you want or need to use regression analysis or another formula-based algorithm that cannot cope with the missings.

Why not throw cases that have missing values away? This is usually NOT a good idea for several reasons:

  1. Maybe it does not sound very bad when each separate variable has just 1% missing values. But when you have 100 inputs with 1% missings randomly spread, only 37% of the data are complete cases (0.99 to the power of 100; see the quick check below this list). You might have too little data left to draw statistically sound conclusions and build a model on it. So only when missings are very rare could you choose to leave them out.
  2. And there is another reason not to throw columns or records with missings away: the missingness itself could be valuable information. Often the missingness is not random; there is a specific reason for it. Missingness can therefore be related to other variables and even to the target. Leaving the records out will lead to a biased model.
  3. You don't want to lose information.
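A quick check of the arithmetic in the first point:

```python
# Probability that a record is complete when 100 variables each have 1%
# missing values, assuming missingness is independent across variables
p_complete = 0.99 ** 100
print(f"{p_complete:.0%}")  # ~37%
```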

So throwing missings away could lead to statistically unstable, biased and unreliable predictive models. This forces you to do something about these missing values.

What to do with missing values?

First of all, you can try to find out what caused the missing values in a column or record. Maybe you can fix the problem, improve the process for future records and/or better know how to impute the value (i.e. replace it with a non-missing value).

Before replacing the missing values: make indicator variables! These variables are 1 when the value was missing and 0 when it was not. This way the information about the missingness itself is not thrown away; it could be crucial information to predict your target. For example, a missing salary could be predictive of whether someone defaults on a loan.

There are three imputation techniques to replace missing values:

1. Impute all missings in a column with the same value.

This could be the mean, the median or the most frequent value of the variable, or a value that makes sense businesswise. This is also called 'univariate imputation', since it is based on only one variable.

Explore the distributions of your numerical variables and then impute with: the mean (fairly symmetric distribution), the median (high skewness), the most frequent value, or a value that makes sense businesswise.
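As a minimal pandas sketch (the dataframe and its values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"experience_num": [1.0, 5.0, None, 3.0],
                   "company_size_num": [10.0, None, 10000.0, 50.0]})

# Fairly symmetric variable: the mean is a reasonable replacement
df["experience_num"] = df["experience_num"].fillna(df["experience_num"].mean())

# Very skewed variable: the median is more robust to the long tail
df["company_size_num"] = df["company_size_num"].fillna(df["company_size_num"].median())
```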

Multivariate imputation uses more than one variable. The missing value is estimated based on all other features. Examples:

2. Replace the missings with a value per group.

This could be done in two ways:

  1. This could be based on another (categorical) variable. For example, you can replace a missing salary with the median or mean value per age group (see the sketch after this list).
  2. Another way is to use KNN, the k-Nearest Neighbours method. Each sample's missing values are imputed using the mean value from the k nearest neighbours found in the training set.
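A sketch of the first option, with a hypothetical dataframe containing a salary and an age group column:

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["20-30", "20-30", "30-40", "30-40"],
                   "salary": [30000.0, None, 50000.0, 52000.0]})

# Replace each missing salary with the median salary of its age group
group_median = df.groupby("age_group")["salary"].transform("median")
df["salary"] = df["salary"].fillna(group_median)
```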

3. Predict the missing value based on other variables.

An even more sophisticated approach is to model the feature with missing values as a function of other features, and use that estimate for imputation.

A way of doing this is iterative imputation, or MICE: Multiple Imputation by Chained Equations. At every step, one feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y and then used to predict the missing values of y. This is done for each feature in an iterative fashion, until the algorithm converges or the maximum number of iterations has been reached.

In this article only the IterativeImputer of scikit-learn is used, but you can plug in other estimators like DecisionTreeRegressor, ExtraTreesRegressor (similar to missForest in R) or KNeighborsRegressor.

The question is: which method would you choose? Besides performance, your choice can be impacted by how interpretable you would like your model to be. Our character Andy only wants to use techniques that he can easily explain to the business, so he will stay away from options two and three. Aki, on the other hand, would choose a black-box technique if it results in a better predictive model.

Also, the choice could differ per dataset; there is no one-size-fits-all.
Let’s see what gives the best results on our example dataset!

Overview of how to handle missing values (source)

Initialization, import and prepare the data

  • First Andy installs the necessary packages to be able to do some data exploration and handle the missings with Python.
  • The next step is to import the data. The dataset contains mostly categorical or ordinal data with missing values; the numerical variables hardly contain any missings. Therefore some of the ordinal variables (with missing values) are recoded as if they were numerical for the purpose of this article.
  • Then Andy makes a list of the numerical variables and one of the categorical variables, because these are handled differently in later steps.
  • Another part of the preparation is to split the data into a train and a test dataset.
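In outline, this preparation could look as follows; the file name, column names and split parameters are assumptions for illustration, not the actual notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("hr_job_change.csv")  # hypothetical file name

# Numerical and categorical variables are handled differently later on
num_vars = [c for c in df.select_dtypes(include="number").columns if c != "target"]
cat_vars = df.select_dtypes(exclude="number").columns.tolist()

# Split into train and test, keeping the target distribution comparable
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"],
    test_size=0.25, random_state=42, stratify=df["target"])
```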

Explore the data

Which variables have missing values?

Andy starts by exploring the data extensively. He knows this is very important to make a well-considered choice on how to deal with the missing values. One part is to explore distributions with statistics and graphs to see how to replace the missings. How many records contain missings? Which variables contain unknowns? Are they categorical or numerical?

Based on the statistics and graphs, Andy will replace missing values with the median (high skewness), the mean (more symmetric), a very frequent value, or whatever makes sense businesswise.


Andy starts by checking the number of missings for each variable…
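Counting the missings per variable is a one-liner in pandas (continuing the hypothetical names from the setup sketch):

```python
# Number and share of missing values per variable, largest first
n_missing = X_train.isna().sum().sort_values(ascending=False)
print(n_missing[n_missing > 0])
print((n_missing / len(X_train)).round(3))
```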

Is there a relation between missingness and the target?

Andy looks at the target distribution of all data and compares it to the distribution of the target when only complete-case records are kept (no missing values). The percentage of data scientists looking for a job is 25% in the original dataset, much higher than the 16.8% in the dataset with only complete cases. The complete-cases dataset also has less than 50% of the original data left. So besides being much smaller, it is also a biased dataset.
Therefore Andy's conclusion is: 'it is better not to leave out missing records'.

Original train dataset has 15326 records.
Dataset after dropping all rows with missing values has 7153 records. This is only 46.7% of the original dataset.
The original train dataset has 25.0% of data scientists looking for a new job.
The dataset after dropping all rows with missing values has 16.8% of data scientists looking for a new job.
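These numbers can be reproduced along the following lines, assuming a 0/1 target and the hypothetical names from the sketches above:

```python
train = X_train.assign(target=y_train)
complete = train.dropna()

print(f"Original: {len(train)} records, "
      f"{train['target'].mean():.1%} looking for a new job")
print(f"Complete cases: {len(complete)} records "
      f"({len(complete) / len(train):.1%} of the original), "
      f"{complete['target'].mean():.1%} looking for a new job")
```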

Explore distributions using statistics, histograms and boxplots

The statistics show that 'company_size_num' has a large difference between mean and median, and its skewness is larger than 1 (you can also use 2 or 3 as a threshold). The histogram and boxplot also show a very skewed distribution for 'company_size_num'. Therefore Andy decides to impute with the median value instead of the mean.
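The check itself could look like this:

```python
import matplotlib.pyplot as plt

# Compare mean and median and compute skewness per numerical variable;
# skewness above 1 suggests imputing with the median instead of the mean
print(X_train[num_vars].agg(["mean", "median", "skew"]).T)

X_train["company_size_num"].plot.hist(bins=30, title="company_size_num")
plt.show()
X_train.boxplot(column="company_size_num")
plt.show()
```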

Based on the statistics, graphs and his business knowledge, he decides to treat missing values for the other variables as follows:

  • 'experience_num' is not too skewed, so he imputes with the mean value;
  • 'last_new_job_num' can be replaced with a zero, since he knows that a missing value means there was no last job;
  • for 'education_level_num' he thinks it makes sense to replace with the most common value: the mode.

Recode categorical variables, incl. missing values

Andy chooses to handle missing values as just another category for the categorical variables. He uses OneHotEncoder from scikit-learn. Another way to do it would be pandas.get_dummies; for the difference between the two, see this blog. If you want to use more sophisticated ways to deal with categorical variables, see our blog on categorical variables.
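A sketch of what that could look like; labelling the missings explicitly first keeps the encoder version-independent:

```python
from sklearn.preprocessing import OneHotEncoder

# Make missing values an explicit category before encoding
X_train[cat_vars] = X_train[cat_vars].fillna("missing")
X_test[cat_vars] = X_test[cat_vars].fillna("missing")

# handle_unknown="ignore" prevents errors on categories unseen in training;
# use sparse=False instead of sparse_output=False on scikit-learn < 1.2
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_cat = ohe.fit_transform(X_train[cat_vars])
X_test_cat = ohe.transform(X_test[cat_vars])
```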


Create missing indicator variables

Andy knows that missingness itself can be important information. Therefore, before imputing, he creates 'binary flags' indicating whether the value was missing or not.

This can be done in several ways, for example with scikit-learn's SimpleImputer(add_indicator=True), which imputes the missing values and creates the missing indicators in one step. You can also put it in a pipeline with other transformations; for an example of a pipeline, see Meta's blog.

However, Andy likes to be more in control. This way he can create indicator variables only for variables with more than 100 missing values; in his experience the indicator variable should have at least 100 '1's to use it in a statistically sound way for modelling. Another advantage of not using SimpleImputer is that the imputation itself can be done later, with the freedom to choose a different method per variable.
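Andy's hand-rolled version might look like this; the threshold of 100 comes from the text, the rest is a sketch:

```python
# Create a 0/1 missing indicator, but only for variables with more
# than 100 missing values in the training data
flag_cols = X_train.columns[X_train.isna().sum() > 100]
for col in flag_cols:
    X_train[f"{col}_missing"] = X_train[col].isna().astype(int)
    X_test[f"{col}_missing"] = X_test[col].isna().astype(int)
```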

Impute missing values

Impute missing values by hand

Andy chooses to do the missing value replacement for each variable separately (by hand). If you have a lot of variables this can be time consuming. On the other hand, everything is under control and can be explained to the business. That is how Andy likes it.

Based on the statistics, graphs and his business knowledge, Andy decides to treat missing values per variable as follows:

  1. 'experience_num' is not too skewed, so he imputes with the mean value;
  2. 'company_size_num' is very skewed, so he replaces with the median value;
  3. 'last_new_job_num' can be replaced with a zero, since he knows that a missing value means there was no last job;
  4. for 'education_level_num' he thinks it makes sense to replace with the most common value: the mode;
  5. other variables do not have missing values in the training set, but the test data can have missings. He replaces missing values in the test data with the mean value.

Andy creates a dictionary that contains pairs of variable names and their replacement values. These replacement values are calculated on the training dataset. In the next step the dictionary is applied to the test data; later on it can be applied to a new dataset.
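A sketch of that dictionary, with all replacement values computed on the training data only (the variable names follow the list above, the rest is an assumption):

```python
# Replacement values derived from the training data only
replace_values = {
    "experience_num": X_train["experience_num"].mean(),        # symmetric
    "company_size_num": X_train["company_size_num"].median(),  # very skewed
    "last_new_job_num": 0,                                     # missing = no last job
    "education_level_num": X_train["education_level_num"].mode()[0],  # most common
}
# Other numeric variables fall back to their training mean (for the test set)
for col in num_vars:
    replace_values.setdefault(col, X_train[col].mean())

# The same dictionary is applied to train, test and, later, new data
X_train = X_train.fillna(replace_values)
X_test = X_test.fillna(replace_values)
```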

Impute missing values by mean, median or most frequent value

Andy likes to do a lot of data preprocessing per variable or 'by hand', so he can explain all replacements. But Aki likes to test several approaches. Therefore she also tests what happens when you replace all missing values with the mean, median or most frequent value of each variable in the training dataset. She will use scikit-learn's SimpleImputer for these tests. Like Andy, she chooses to add the missing indicators separately (see 'create missing indicator variables' above).

First Aki replaces missings with the mean value of each variable in the training dataset, using SimpleImputer(strategy='mean').

Let’s see how the model performs if all missings in the numeric variables are replaced by the mean value!

Secondly Aki replaces missings with the median value of each variable in the training dataset, using SimpleImputer(strategy='median'), and thirdly with the most frequent value, using SimpleImputer(strategy='most_frequent').

Let's see how the model performs if all missings in the numeric variables are replaced by the median or the most frequent value!
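The three tests differ only in the strategy argument; a sketch:

```python
from sklearn.impute import SimpleImputer

for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    X_train_num = imputer.fit_transform(X_train[num_vars])
    X_test_num = imputer.transform(X_test[num_vars])
    # ...train the model on X_train_num and evaluate it on X_test_num
```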

Impute missing values per group using k-Nearest Neighbours

Aki also likes to test multivariate methods to replace the missings, besides the simple univariate methods she already tested. One way is the k-Nearest Neighbours method, in which each sample's missing values are imputed using the mean value from the k nearest neighbours found in the training set.

To do this in Python, you can use the KNNImputer of scikit-learn, for example KNNImputer(n_neighbors=5, add_indicator=True). This blog also shows you how to use KNNImputer.
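A minimal sketch, continuing the hypothetical names from earlier:

```python
from sklearn.impute import KNNImputer

# Each missing value becomes the mean of that feature over the 5 nearest
# neighbours; add_indicator=True appends the missing-indicator columns
knn_imputer = KNNImputer(n_neighbors=5, add_indicator=True)
X_train_num = knn_imputer.fit_transform(X_train[num_vars])
X_test_num = knn_imputer.transform(X_test[num_vars])
```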


Let's see whether the model performance is better when replacing missing values with the k-Nearest Neighbours algorithm!

Iterative imputing to replace missing values

Aki also likes to see whether a more sophisticated method works better: predicting the missings based on the other features and using the estimates for imputation.

One way of doing this is iterative imputation, also called MICE: Multiple Imputation by Chained Equations. At every step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y and then used to predict the missing values of y. This is done for each feature in an iterative fashion, until the algorithm converges or the maximum number of iterations has been reached.

Besides the default setup of scikit-learn's IterativeImputer you can plug in other estimators to predict the missings, like DecisionTreeRegressor, ExtraTreesRegressor (similar to missForest in R) or KNeighborsRegressor.

You can find very good scikit-learn explanations on these methods:

  1. Imputation of missing values
  2. sklearn.impute.IterativeImputer
  3. Imputing missing values with variants of IterativeImputer

Below you can see how to use the default IterativeImputer(max_iter=5). This blog also shows you how to use IterativeImputer.
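In sketch form, again with the hypothetical names from earlier:

```python
# IterativeImputer is still experimental, so this enable-import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# By default each feature is modelled with BayesianRidge; other estimators
# (e.g. ExtraTreesRegressor) can be passed via the `estimator` argument
it_imputer = IterativeImputer(max_iter=5, random_state=0)
X_train_num = it_imputer.fit_transform(X_train[num_vars])
X_test_num = it_imputer.transform(X_test[num_vars])
```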


Let's see how the model performs when using the more advanced iterative imputation to replace missing values!

In short…

What method works best for this dataset?
Overall we saw that several methods gave comparable results.

For this dataset both Andy and Aki choose to replace with the mean value and add missing indicators! The method is easy to explain and the model performance is highest for the mean imputation.

Of course replacing with the mean value is not always the best thing to do. Which method to choose depends on your data. If you have the time, you can find out what works best by testing several options. The methods you test and choose depend on what the business asks for (explainability or not), your available time and your own preference as a data scientist.

This article is part of our No Data Scientist Is The Same series. The full series is written by Anya Tonne, Jurriaan Nagelkerke, Karin Gruijs-Vodde and Tom Blanke. The series is also available on theanalyticslab.nl.

An overview of all articles on Medium within the series:

Do you want to do this yourself? Please feel free to download the Notebook on our gitlab page.
