Six steps to hone your Data: Data Preprocessing, Part 3

Anushkad · Published in GoDataScience
5 min read · Aug 26, 2020

This tutorial answers all the following questions:

  1. What are missing values?
  2. When does the problem of missing values arise?
  3. Why is it essential to handle missing values?
  4. How are missing numeric values handled?
  5. How are categorical missing values handled?

So far, we have a clear intuition of how to import the required libraries, load the dataset, and separate the features from the target variable.

What could be the next step?

We have our dataset. Is the data structured? Is our dataset accurate and complete?

What if something is inconsistent or some figures are missing?

Well, in this case, the predictions of our ML model might be incorrect.

To avoid this, we need to make sure our dataset is complete with all the values present.

With that problem stated, we arrive at our next preprocessing measure:

Step 3] Handling Missing Values

Why does this problem arise?

The data we receive is rarely homogeneous.

There are many possibilities for a dataset to end up having missing values.

For example, a listing for a one-bedroom house will not include the size of a second bedroom. That field is left empty, creating a missing value in the dataset.

Why is Handling Missing values important?

Many machine learning algorithms do not support data with missing values.

Most machine learning libraries raise an error if you try to build a model on data that contains missing values.

So you will need to choose one of the strategies to handle missing data values.

Now that we know why data ends up incomplete and the problems that can occur, we need a mechanism to avoid these errors.

How can we do that?

There are two ways to handle missing values in a dataset.

Note: In Python, NULL values are represented as NaN (Not a Number). So do not get confused between the two; they are the same!

1] Deleting a particular row

In this method, we remove every row that contains a missing value, or drop any column in which more than 75% of the values are missing.

But what if we are deleting some crucial information that might make our ML model work with more accuracy?

Well, we do not want that.

This method is not suggested unless you are sure you have an adequate amount of data available even after dropping rows and columns.
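As a rough sketch of this deletion strategy in pandas (the toy dataset below is invented for illustration, not the article's original data):

```python
import numpy as np
import pandas as pd

# Invented toy dataset; NaN marks the missing entries.
df = pd.DataFrame({
    "Age":    [25, np.nan, 31, 40, 22],
    "Salary": [50000, 62000, np.nan, 58000, 45000],
    "Bonus":  [np.nan, np.nan, np.nan, np.nan, 1000],  # 80% missing
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Keep only the columns where at most 75% of the values are missing.
cols_dropped = df.loc[:, df.isna().mean() <= 0.75]

print(len(rows_dropped))              # only one fully complete row survives
print(cols_dropped.columns.tolist())  # the mostly-empty Bonus column is gone
```

Notice how aggressive this is: dropping rows leaves a single observation here, which is exactly the data loss the warning above is about.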

2] Recommended Option : Imputation

This method is useful for features having numeric data like age, salary, year, etc.

Here, you calculate the mean, median, or mode of the column that contains the missing value and substitute that statistic for the missing entry.

This method keeps the rest of the data intact, avoiding the outright loss that comes with deleting rows or columns.

This method might not be accurate. Nonetheless, it yields better results than dropping the entire row or column.

Let’s walk through handling missing values by the method of imputation.

For this, we first need to import the famous ML library scikit-learn.

We will use its SimpleImputer class to handle missing data. (The older Imputer class it replaced was removed in scikit-learn 0.22.)

Let’s get started with the help of a small example dataset.

First, we repeat our two earlier preprocessing steps: importing all the necessary libraries and then the required dataset.

This time, we also need to import scikit-learn’s SimpleImputer class to handle missing data.
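A minimal sketch of the imports and a hypothetical dataset (the article's original dataset is not reproduced here, so the values below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces the removed Imputer class

# Hypothetical dataset; NaN marks the missing entries.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age":     [44, 27, np.nan, 38],
    "Salary":  [72000, np.nan, 54000, 61000],
})

print(data.isna().sum())  # one missing Age, one missing Salary
```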

Now we have to replace each missing entry with the mean value of all the entries in its column.
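A hedged sketch of mean imputation with SimpleImputer, using invented numeric values (note that only numeric columns are passed to the imputer):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric features with missing entries (invented values).
X = pd.DataFrame({
    "Age":    [44, 27, np.nan, 38],
    "Salary": [72000, np.nan, 54000, 61000],
})

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)  # returns a NumPy array

print(X_imputed)
```

The missing Age becomes the mean of the three observed ages, and the missing Salary the mean of the three observed salaries.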

SimpleImputer takes these parameters:

  1. missing_values: the placeholder that marks a missing entry; we can give it an integer or np.nan.
  2. strategy: the method used to substitute missing values: 'mean', 'median', 'most_frequent', or 'constant'. In the example above, missing values are replaced with strategy='mean'.
  3. fill_value: the constant to use when strategy='constant'. (The old Imputer’s axis parameter is gone; SimpleImputer always imputes column by column.)

Imputation is simply the process of replacing the missing values of our dataset.

We can do this by defining our custom function, or we can simply perform Imputation by using the SimpleImputer class provided by sklearn, as shown above.

Mean- and median-based imputation handle missing numeric values. But what if some categorical values are missing? How can we replace those?

Let us take a small dataset containing some categorical data having some missing values.

To replace a missing categorical value, the commonly used methods are:

  1. Most common class method

The missing value is replaced with the most frequent class in the column, which we can find with pandas’ value_counts function.
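A minimal sketch, assuming a made-up Color column rather than the article's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with missing entries.
df = pd.DataFrame({
    "Color": ["Red", "Blue", np.nan, "Blue", "Red", "Blue", np.nan],
})

# value_counts ranks the classes by frequency; idxmax picks the top one.
most_common = df["Color"].value_counts().idxmax()
df["Color"] = df["Color"].fillna(most_common)

print(most_common)  # Blue
print(df["Color"].tolist())
```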

2. Replacing the Missing data explicitly with ‘Unknown’

Sometimes, the fact that a categorical feature’s value is missing can itself be crucial information and should be encoded explicitly.

In this case, imputing the most common class would lose that information. Instead, we can replace the missing categorical values with ‘Missing’ or ‘Unknown.’

Let us consider the same dataset and work out this mechanism.
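A sketch of the explicit-category approach, reusing the same made-up Color column:

```python
import numpy as np
import pandas as pd

# The same hypothetical categorical column as before.
df = pd.DataFrame({
    "Color": ["Red", "Blue", np.nan, "Blue", "Red", "Blue", np.nan],
})

# Keep "missingness" as its own category instead of guessing a class.
df["Color"] = df["Color"].fillna("Unknown")

print(df["Color"].tolist())
```

Downstream models now see "Unknown" as an ordinary category, so the information that the value was missing is preserved.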

These are the basic methods of Imputation. You can build your ML models to impute your dataset. However, that is beyond the scope of this tutorial.

We have now completed Step 3 of the Data Preprocessing series. Congratulations on getting a step ahead in your ML journey!

