Five ways to handle missing values in a dataset

TEAM DATALOGY
Published in The Startup · 8 min read · Jun 10, 2020

By: Ramandeep Singh

When you start learning the art of working with data, one of the most frequent problems you come across is handling missing values in a dataset. Missing values may or may not impact the accuracy of the model you are going to develop. Nevertheless, it is important to handle them, and doing so just takes some practice and common sense. It is a problem that ‘will’ be faced by anyone working with data (because we think anyone can learn to work with data if they are willing to!). If we do not handle the missing data properly, we might end up drawing inaccurate inferences from the data.

Firstly, let us try to understand what missing values are.

Let us take a scenario: you are at your favorite restaurant and have just had an amazing dinner with your friends. As you are about to leave, the waiter hands you a form (or a tablet, in some cases) and requests that you fill in a survey/feedback form. However, you are receiving back-to-back calls from home, and time becomes a constraint. In the rush, people often skip some details for lack of time, or because they do not feel those details are important. As a result, many people leave some fields or attributes blank, and those blanks become the missing values in the restaurant's data.

To analyze the data, we can now either delete/skip these records, or we can explore the possibility of filling them in (which is also called imputing).

However, even if it seems convenient, we cannot simply go about deleting the records. Allow me to explain: maybe it is important for the restaurant to know whether a person is employed or not. It is quite possible that employed people spend more money at the restaurant than counterparts like students do.

We will first describe some techniques to handle missing values, and then we will practice them on a dataset, a semi-cleaned version of the famous Titanic dataset. Feel free to follow along and download the dataset from the GitHub link: https://github.com/Raman-rd/Handling-missing-values

How to find the total number of missing values in a dataset?
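The original snippet here appeared as an image; a minimal sketch of the usual pandas calls, using a tiny hypothetical table in place of the real data:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the real dataset; NaN marks a missing value
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Missing values per column
print(df.isnull().sum())

# Total missing values in the whole DataFrame
print(df.isnull().sum().sum())
```

`isnull()` returns a Boolean mask of the same shape as the DataFrame, and summing it counts the `True` entries.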

The first and most obvious way to deal with missing values:

1. Dropping records with missing values

We can delete the records with missing values, BUT only if we have a very large dataset, because otherwise it may lead to information loss: our model might miss out on important information and may not perform as expected. Look at the following table as an example (NaN stands for a missing value).

Notice that the third record has missing values in every column except one. If we have a huge dataset, it may be best to delete this record, because otherwise we would have to estimate and fill in most of its values. This method only makes sense if you have enough data that the lost information does not matter. Most of the resources we checked suggest that if more than ~70–75% of a feature's values are missing, we should drop that feature.
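Both kinds of dropping, removing rows and removing a mostly-empty feature, can be sketched with pandas on a small hypothetical table:

```python
import numpy as np
import pandas as pd

# Hypothetical table; the third record is missing in every column
df = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 4.0],
    "Feature2": [np.nan, np.nan, np.nan, 5.0],  # 75% missing
    "Feature3": [7.0, 8.0, np.nan, 10.0],
})

# Drop every row that has at least one missing value
# (here only one of four rows survives: a lot of information lost)
print(df.dropna())

# Drop columns where more than ~70% of the values are missing
keep = df.columns[df.isnull().mean() <= 0.7]
print(df[keep])
```

Notice how aggressive row-wise `dropna()` is on a small table: a single sparse column can wipe out most of the records.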

2. Training a separate model to predict missing values

This method is quite interesting. Again, we will see a hypothetical example:

Uh-oh! Check out the first column (Feature 1): there are two missing values.

What if we train a separate model on the rows where values are not missing (Row 2 & Row 3), taking Feature 1 as the target and using the rest of the columns as features? Nothing wrong with that, right? This is pretty much what this method says: simply, where the data is available, use it to predict the missing values.

Some implementations of algorithms like Random Forest handle missing values on their own, by dropping them or filling them with the median or mode.

Decision tree algorithms (like ID3) can deal with missing values by ignoring them or by treating them as a separate category.

Let us cover an example of predicting missing values for the ‘Age’ feature in the Titanic dataset using Linear Regression:
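The original code here was an image; a sketch of the idea on a tiny hypothetical table (the column names mimic the Titanic data, and the feature choice is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare":   [80.0, 72.0, 26.0, 30.0, 8.05, 7.9],
    "Age":    [38.0, 40.0, 29.0, np.nan, 21.0, np.nan],
})

features = ["Pclass", "Fare"]
known = df[df["Age"].notnull()]      # rows where Age is present
missing = df[df["Age"].isnull()]     # rows where Age must be predicted

# Fit a regressor on the rows where the target is known...
model = LinearRegression()
model.fit(known[features], known["Age"])

# ...and use it to fill in the rows where Age is missing
df.loc[df["Age"].isnull(), "Age"] = model.predict(missing[features])
print(df["Age"])
```

The same pattern works with any regressor; linear regression is just the simplest choice.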

3. Using Statistical Techniques to fill missing values

Find the mean, median, or mode and use it to fill the missing values.

Mean: Replace missing values with the average. We choose the mean when the feature is continuous in nature.

Median: The middle value when we arrange the values in ascending order. Filling with the median will still shift the distribution, but the median is robust to outliers, which makes it a good choice for skewed features.

Mode: Replace the missing values with the value that has the highest frequency. If values are missing in a categorical column, we fill them with the mode, because it does not make sense to compute the mean of categories (say, of 1s and 0s).

Let us move on to the fun part: practicing on a dataset. The dataset is the famous Titanic dataset, in which we have to predict whether a passenger survived, based on some features.

As mentioned before, feel free to follow along and open your Jupyter Notebook!

Examining the TITANIC DATASET

It is a famous dataset available on Kaggle; passengers are classified into binary categories (1 = ‘Survived’ & 0 = ‘Not Survived’).

Let us import the libraries we will need:
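The import cell was originally shown as an image; these are the libraries the walkthrough relies on:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```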

Import the dataset:

Printing the top 5 rows for observation of the data frame:
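The loading cell was also an image; a sketch, assuming the CSV from the GitHub repo is saved locally as "titanic.csv" (with a tiny hypothetical fallback so the snippet runs even without the file):

```python
import numpy as np
import pandas as pd

try:
    # Assuming the repo's CSV is saved locally under this name
    df = pd.read_csv("titanic.csv")
except FileNotFoundError:
    # Tiny hypothetical stand-in so the snippet still runs standalone
    df = pd.DataFrame({
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, np.nan],
        "Cabin": [np.nan, "C85", np.nan],
    })

# Top 5 rows of the DataFrame
print(df.head())
```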

We can already see some NaN values.

We will plot a heatmap to get a bird's-eye view of all the missing values, using the seaborn library.
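The heatmap call was originally an image; a sketch on a small hypothetical table (with the real data, `df` would be the loaded Titanic frame):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the Titanic frame
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# Color only the missing entries; hide row labels and the color bar
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.show()
```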

We called heatmap from seaborn and asked it to color only the missing values in the heatmap.

We can note that the ‘Age’ column has ~20% missing values. Also, the ‘Cabin’ column has a relatively large number of missing values.

Consider the ‘Age’ column: we could drop it; however, maybe age had a significant effect on whether a passenger survived or not. If that is the case, we would miss out on an important feature.

Before making this hard decision, let us do some EDA to gain further insights.

We will now plot a count plot to check the number of passengers who survived and who did not.

Also, we will check whether survival depended on the gender of a passenger.
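Both count plots were originally images; a sketch of the calls on a small hypothetical frame (with the real data, `df` would be the loaded Titanic frame):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Sex": ["male", "male", "female", "female", "female"],
})

# Count of survivors vs non-survivors
sns.countplot(x="Survived", data=df)
plt.show()

# The same count, split by gender
sns.countplot(x="Survived", hue="Sex", data=df)
plt.show()
```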

We can notice that among the passengers who did not survive (class 0), the majority are males. Same as the ending of the movie.

Similarly, maybe the class a passenger was traveling in (Pclass) played a role in deciding whether they survived?
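A sketch of that plot, again on a small hypothetical frame:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Pclass": [3, 3, 2, 1, 2],
})

# Survival counts, split by passenger class
sns.countplot(x="Survived", hue="Pclass", data=df)
plt.show()
```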

It looks like Pclass 3 had the largest number of passengers who did not survive.

We will now calculate the mean age of every Pclass and impute the missing values accordingly!
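Computing the class-wise average is a one-line groupby; a sketch with hypothetical ages:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 40.0, 29.0, np.nan, 21.0, 23.0],
})

# Average age within each passenger class (NaNs are skipped automatically)
print(df.groupby("Pclass")["Age"].mean())
```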

To visualize the same, we will plot a boxplot using seaborn.
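A sketch of the boxplot call (same hypothetical frame as above):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 40.0, 29.0, np.nan, 21.0, 23.0],
})

# Age distribution per passenger class
sns.boxplot(x="Pclass", y="Age", data=df)
plt.show()
```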

Now we will just write a function to impute the missing values with the average age of the passenger's respective class (which is visible in the above boxplot).

We will now apply this function on the Age column.
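The function and its application were originally images; a sketch that computes the class averages from the data rather than hard-coding them (`impute_age` is our own name here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame
df = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 3],
    "Age": [38.0, 29.0, 22.0, np.nan, np.nan],
})

# Average age of each class, computed from the rows where Age is known
class_means = df.groupby("Pclass")["Age"].mean()

def impute_age(row):
    """Fill a missing Age with the average age of the row's Pclass."""
    if pd.isnull(row["Age"]):
        return class_means[row["Pclass"]]
    return row["Age"]

# Apply the function row-wise to the frame
df["Age"] = df.apply(impute_age, axis=1)
print(df["Age"])
```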

Let us check the heatmap again for the missing values:

No missing values in the age column!

If we look at the ‘Cabin’ column, ~70% of its values are missing. In this case, a similar approach will not be that useful. Maybe we can drop it entirely, or convert it into another feature, such as whether the cabin info is available or not.

On the other hand, it seems we can simply drop the rows with missing values in the ‘Embarked’ column, as there are very few of them.
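Dropping only the rows that lack a value in one specific column can be sketched with the `subset` argument of `dropna`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "Q"],
    "Fare": [7.25, 71.28, 80.0, 8.05],
})

# Only the rows missing Embarked are removed; other columns are untouched
df = df.dropna(subset=["Embarked"])
print(len(df))
```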

4. Feature Transformation

Some experts suggest that feature transformation can sometimes help. If we observe the Cabin column, we can convert it into a binary column on the basis of whether the cabin is known (=1) or not known (=0).

The implementation for the same is below:
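The implementation was originally shown as an image; a sketch of the transformation (the new column name `CabinKnown` is our own choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# 1 where the cabin is known, 0 where it is missing
df["CabinKnown"] = df["Cabin"].notnull().astype(int)
print(df)
```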

The above code creates a new column with the label “1” where the value of the feature is known and “0” where the value is missing.

5. Using IterativeImputer from sklearn.impute

It used to be part of the fancyimpute package at first, but it was slowly merged into scikit-learn. One can still import it from fancyimpute, but that will import from scikit-learn in the back-end. IterativeImputer imputes missing values while taking the other features into consideration in an iterative manner; it is used to impute continuous variables.

At each iteration, one feature column is selected as the output y, and the other columns are treated as the features X. A regressor is then fit on the (X, y) pairs where y is known, and the trained regressor is used to predict the missing y values.
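A minimal sketch of IterativeImputer on a tiny hypothetical matrix (note the explicit experimental-feature import that scikit-learn requires):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Iteratively model each feature from the others and fill the gaps
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The imputed entry should land near 5.0 here, since the imputer learns the relationship between the two columns from the complete rows.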

The link for the official documentation is below:

This is where we leave you to pursue your journey further!

For the latest data science-related updates, follow Datalogy on LinkedIn:

In this blog, we covered some methods to deal with missing values in Python. Let us know in the comments which techniques you use to handle missing values!
