The Mystery of Missing Data

Anirudh Dayma
Published in Analytics Vidhya
4 min read · Mar 26, 2020

It’s not about the model, it’s about what we feed it (the data). The secret lies within.

Missing data

In the real world, the data we come across isn’t as beautiful as we expect it to be; it will contain missing values. So before proceeding with any analysis, we need to deal with these missing values (not necessarily by deleting such entries, but by actually handling them; deleting entries would be the last option we would like to choose).

There are tons of ways to handle missing data, but before that we need to identify it, and prior to that we need to know what missing data is. So the first question we have is:

What is missing data and why do we even care about it?

While capturing data, it can happen that some values are not recorded. This may be due to corrupt data, a failure while extracting data from the source, or some other glitch. The resulting absence of values is termed missing data.

It is often said that the more beautiful the data, the more accurate the model (beautiful in this context means good data). If the data we provide to our model is incomplete, the model will not be able to capture the full picture of the data, and hence it will perform poorly. That is why we care so much about the data we feed our model.

How do we identify missing values?

There are multiple ways to identify missing values:

  • Using df.isnull() or df.isna() — both these methods would give the same result.

We will use the Titanic data-set to demonstrate the examples.

import pandas as pd

train = pd.read_csv('data/train.csv')
train.head()
train.isnull().sum()

So we can see that Age, Cabin and Embarked columns have missing values.

  • Using missingno library:
# Use pip install missingno to install this library
import missingno as msno
msno.matrix(train)

In this plot, the white lines indicate missing values.

msno.bar(train)

The above plot is a dual-axis graph: the left-hand axis shows the proportion of non-null values, ranging from 0 to 1, while the right-hand axis shows the corresponding row counts. The Age column is at about 0.8, i.e. roughly 80% of its values are not null (714 present out of 891). The Cabin column has only about 23% of its values, i.e. roughly 77% null (204 present out of 891).
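The proportions and counts that `msno.bar` plots can also be computed directly with pandas. A minimal sketch on a toy frame (the column names just mimic the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data: one fully
# observed column and one with missing entries.
df = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.1],
    "Age":  [22.0, np.nan, 26.0, np.nan],
})

# Fraction of non-null values per column: the left-hand
# axis of the msno.bar plot.
present = df.notnull().mean()
print(present["Age"])      # 0.5

# Absolute non-null counts: the right-hand axis.
print(df["Age"].count())   # 2
```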

How do we handle these missing values?

There are different techniques to handle missing values:

  1. Imputation:

There are different methods to impute values:

  • Traditional df.fillna() method.

In this approach, we replace the missing value with some statistic, most commonly the mean, median or mode.

train['Age'] = train['Age'].fillna(train['Age'].mean())

This replaces the missing values in the Age column with the mean; similarly, we can replace them with the median or mode. Note that fillna returns a new Series, so we assign the result back.
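A quick sketch of median and mode imputation on a toy series; note the small gotcha that `.mode()` returns a Series (there can be ties), so we take its first entry:

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 26.0, 30.0, np.nan, 26.0])

# Median imputation: median of [22, 26, 30, 26] is 26.0.
filled_median = age.fillna(age.median())

# Mode imputation: .mode() returns a Series, so take the
# first (most frequent) value.
filled_mode = age.fillna(age.mode()[0])

print(filled_median.tolist())  # [22.0, 26.0, 26.0, 30.0, 26.0, 26.0]
print(filled_mode.tolist())    # [22.0, 26.0, 26.0, 30.0, 26.0, 26.0]
```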

We can also fill missing values by propagating neighbouring observations.

train['Age'] = train['Age'].fillna(method='ffill')

'ffill' stands for 'forward fill' and propagates the last valid observation forward to the next valid one. The alternative is 'bfill', which works the same way, but backwards.
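A quick sketch of both directions on a toy series (`s.ffill()` is equivalent to `s.fillna(method='ffill')`, and likewise for `bfill`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: carry the last valid value forward.
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]

# Backward fill: pull the next valid value backward.
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```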

  • Using SimpleImputer.
import numpy as np
from sklearn.impute import SimpleImputer

# Using SimpleImputer to replace NaN values
# with the mean of that column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
train['Age'] = imp.fit_transform(train[["Age"]]).ravel()

Here you are free to choose any strategy, e.g. mean, median, most_frequent or constant.
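The "most_frequent" strategy also works on string columns, which makes it handy for something like the Embarked column. A small sketch on toy data (the values just mimic the Titanic port codes):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A toy column of embarkation ports with one missing value.
emb = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# most_frequent replaces NaN with the modal value ("S" here).
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(emb).ravel())  # ['S' 'C' 'S' 'S']
```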

2. Creating a unique category:

We can also create a separate category. For example, for the Cabin column we can create a new category called "unknown", giving the missing values a class of their own. We can use train['Cabin'] = train['Cabin'].fillna('unknown') to create this separate category.

3. Predicting the missing values:

We can use a machine learning algorithm to predict the missing values. For example, in our case we can predict “Age” from the other columns using linear regression. We can try different algorithms and check which gives us the best accuracy.
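A hedged sketch of that idea on a toy frame (the column names mimic the Titanic data, and the choice of Fare and Pclass as features is purely illustrative): fit a regressor on rows where Age is known, then predict it for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame in place of the Titanic data.
df = pd.DataFrame({
    "Fare":   [7.25, 71.28, 7.92, 53.1, 8.05, 26.0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age":    [22.0, 38.0, 26.0, np.nan, 35.0, np.nan],
})

features = ["Fare", "Pclass"]
known = df[df["Age"].notnull()]
unknown = df[df["Age"].isnull()]

# Train on the rows where Age is observed.
model = LinearRegression()
model.fit(known[features], known["Age"])

# Fill the gaps with the model's predictions.
df.loc[df["Age"].isnull(), "Age"] = model.predict(unknown[features])
print(df["Age"].isnull().sum())  # 0
```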

4. Deleting the entire column:

If we come across a scenario where most of the values in a column are null (the definition of “most” varies from person to person), for example a column with more than 75% null values, then there is little sense in keeping that column. In that case we should drop it.
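A minimal sketch of dropping columns past a null-value cutoff; the 50% threshold here is just an illustrative choice, as is the toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, np.nan],  # 75% missing
    "Age":   [22.0, 38.0, np.nan, 35.0],       # 25% missing
})

# Keep only columns whose fraction of nulls is at most 50%.
keep = df.columns[df.isnull().mean() <= 0.5]
trimmed = df[keep]
print(list(trimmed.columns))  # ['Age']
```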

To summarize,

Most of the data-sets we come across have missing values; dealing with them intelligently results in a model that is accurate and robust. Good domain knowledge also helps us decide how to deal with these missing values.

Feel free to drop comments or questions below, or you can find me on LinkedIn.
