Handling Missing Values — Data Science
DataSet and Notebook used in this article can be found here:
Complete Notebook Link : Handling Missing Values
DataSet Link: Melbourne Housing Dataset
Introduction
A complete dataset is usually a big win for any data scientist or machine learning engineer. Unfortunately, more often than not, the datasets used to solve data science use-cases will have missing data. Missing data simply means that some cells in our dataset are empty.
Causes of Missing Data
- Data not being filled in, especially if the field is optional.
- Data being corrupted.
- Human error.
- If it was a survey, participants might quit the survey halfway.
- If data is collected automatically by computer applications, a malfunction could cause missing data, e.g. a sensor recording logs could fail.
- Fraudulent behavior, such as intentionally deleting data.
As you analyze your dataset, you need to wear your investigative hat. This means that if you are in a position to enquire about the causes of the missing data from the data custodian, then do it. If not, it is still okay. Regardless of whether you know or don’t know the cause of missing data, you still need to identify your missing values and handle them.
The Problem with Missing Values
Most machine learning algorithms are unable to handle incomplete data. A few algorithms can tolerate missing values, but in others they can cause results to deviate from the true result.
Handling Missing Values
How you handle missing values will depend on your problem statement and your data. There are three main ways to handle missing values.
- Removing the columns having missing values — if a column has more than 80% missing values, then it is better to drop the column.
- Removing the rows having missing values — if only a small percentage of rows have missing values, you can simply drop those rows. Say you have 1 million records and only 20 rows have missing values; you can drop those rows since they won't take away a lot of information from your data.
- Imputing the missing values — this is usually the preferred way of handling missing values, especially where a column is not more than 80% missing. Imputing means you fill the missing values with a valid value, for instance the mean, mode, or median.
Handling Missing Values in Pandas
Pandas writes the value NaN (Not a Number) where it finds a missing value.
Checking Missing Values
To check for missing values in Pandas, we use the isnull function. The result is a boolean mask of True and False values. We use the isnull function in combination with the sum function to get a count of the missing values, i.e. of the cells where the boolean result is True.
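A minimal sketch of this check, using a small toy DataFrame in place of the Melbourne housing data (the column names here are just illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data, with deliberate gaps
df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Airport West", None, "Albion"],
    "Price": [1480000.0, np.nan, 1035000.0, np.nan],
    "YearBuilt": [np.nan, 1970.0, np.nan, np.nan],
})

# isnull() returns a boolean mask; sum() counts the True values per column
missing_counts = df.isnull().sum()
print(missing_counts)
```

The printed Series shows one missing Suburb, two missing Prices, and three missing YearBuilt values for this toy data.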
Alternatives for Dealing with Missing Values
1. Deleting rows that have missing values. If only a few records contain missing values, we can drop all the rows having any missing value. This technique can lead to loss of information, so it should not be used when the count of rows with missing values is high.
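A sketch of dropping such rows with Pandas, again on a toy DataFrame rather than the real dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Airport West", "Albion"],
    "Price": [1480000.0, np.nan, 1035000.0],
})

# dropna() with its defaults removes every row that contains any NaN
df_rows_dropped = df.dropna()
print(len(df_rows_dropped))  # 2 rows remain
```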
2. Deleting columns with missing values. If we have columns with an extremely high share of missing values (say 80% of the data in the column is missing), then we can delete these columns. To do this, you can first get the percentage of missing values in each of your columns.
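One way to compute that percentage per column (the column names below are a toy stand-in for the real dataset's columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [1480000.0, np.nan, 1035000.0, 1600000.0],
    "BuildingArea": [np.nan, np.nan, np.nan, 150.0],
    "YearBuilt": [np.nan, 1970.0, np.nan, np.nan],
})

# mean() of the boolean mask is the fraction of NaNs; scale to a percentage
missing_pct = (df.isnull().mean() * 100).round(2)
print(missing_pct)
```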
With the above output, you can now set a threshold on the percentage of missing values above which a column should be deleted. For this scenario, I will delete columns with more than 55% missing values. (An ideal number would be 80%.)
From the output, BuildingArea and YearBuilt are the columns we will delete using the drop function.
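A sketch of applying the threshold and dropping the offending columns, assuming the same toy DataFrame as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Price": [1480000.0, np.nan, 1035000.0, 1600000.0],
    "BuildingArea": [np.nan, np.nan, np.nan, 150.0],
    "YearBuilt": [np.nan, 1970.0, np.nan, np.nan],
})

threshold = 55  # percent, per the text above
missing_pct = df.isnull().mean() * 100

# Select the columns above the threshold and drop them
cols_to_drop = missing_pct[missing_pct > threshold].index
df_reduced = df.drop(columns=cols_to_drop)
print(list(df_reduced.columns))  # ['Price']
```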
3. Replacing missing values with the mean, mode, or median (numerical data) — this technique works best with numerical features. Note, however, that this method can add variance and bias error.
We use the Pandas fillna function to do this.
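A minimal sketch of mean imputation with fillna, on toy prices rather than the real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": [100.0, np.nan, 300.0, np.nan]})

# Fill each NaN with the column mean (mean of 100 and 300 is 200);
# df["Price"].median() or .mode()[0] would work the same way
df["Price"] = df["Price"].fillna(df["Price"].mean())
print(df["Price"].tolist())  # [100.0, 200.0, 300.0, 200.0]
```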
Your business rules could also specify the values to fill your missing values with. You can specify the value directly as well; for instance, here I am replacing all missing values in BuildingSize with a default size of 100.
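A sketch of that fixed-value fill, using a toy column named BuildingSize as in the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"BuildingSize": [120.0, np.nan, 95.0, np.nan]})

# fillna with a literal value applies the business-rule default of 100
df["BuildingSize"] = df["BuildingSize"].fillna(100)
print(df["BuildingSize"].tolist())  # [120.0, 100.0, 95.0, 100.0]
```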
4. Replacing missing values with the most frequent or a constant value (categorical data)
Imputing using the mean, mode, or median works best with numerical values. For categorical data, we can impute using the most frequent or a constant value.
Let's use the sklearn impute package to replace missing categorical data with the most frequent value by specifying strategy='most_frequent'.
The code above will replace missing values in Regionname with the most frequent region.
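A sketch of that step, assuming a toy Regionname column in place of the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Regionname": ["Northern Metropolitan", "Southern Metropolitan",
                   "Northern Metropolitan", np.nan],
})

# SimpleImputer expects a 2D input, hence df[["Regionname"]];
# strategy='most_frequent' fills NaN with the column's mode
imputer = SimpleImputer(strategy="most_frequent")
df["Regionname"] = imputer.fit_transform(df[["Regionname"]]).ravel()
print(df["Regionname"].tolist())
```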
Let's use the sklearn impute package to replace categorical data with a constant value by specifying strategy='constant'. You also need to state which value will be used as the filler by specifying fill_value. In our case, we are going to fill missing values in the 'CouncilArea' column with the value 'other'. This technique can also be used when there are set business rules for the use case in context.
You can now see below that we have a new category called 'other' added to the 'CouncilArea' column.
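A sketch of constant-value imputation on a toy CouncilArea column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"CouncilArea": ["Yarra", np.nan, "Moonee Valley", np.nan]})

# strategy='constant' fills every NaN with the supplied fill_value
imputer = SimpleImputer(strategy="constant", fill_value="other")
df["CouncilArea"] = imputer.fit_transform(df[["CouncilArea"]]).ravel()
print(df["CouncilArea"].tolist())  # ['Yarra', 'other', 'Moonee Valley', 'other']
```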
5. Predicting missing values using algorithms — if you want to replace a categorical value, use a classification algorithm; if predicting a continuous number, use a regression algorithm. This method is also attractive because it can generate less biased estimates than simple imputation.
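A minimal sketch of the regression case, using a toy DataFrame where Price depends on Rooms (both names and the linear relationship are illustrative assumptions, not the article's actual model):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: Price grows linearly with Rooms; two prices are missing
df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Price": [400.0, 600.0, 800.0, np.nan, np.nan],
})

known = df[df["Price"].notnull()]    # rows to train on
unknown = df[df["Price"].isnull()]   # rows to predict

# Learn Price ~ Rooms from the complete rows, then fill the gaps
model = LinearRegression()
model.fit(known[["Rooms"]], known["Price"])
df.loc[df["Price"].isnull(), "Price"] = model.predict(unknown[["Rooms"]])
```

For a categorical target you would swap in a classifier (e.g. a decision tree) and the same fit/predict pattern applies.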