Diksha Bellani
5 min read, Jul 14, 2020


HANDLING MISSING VALUES

Imagine buying a box of 30 balls in 8 different colors. On opening the box, you unfortunately find 3 empty segments. How would you handle the missing balls?

Should you go to the seller and return the box? Should you buy other balls to fill the empty segments? Or could you predict the colors of the missing balls from the arrangement of the other balls and fill the empty segments with balls of the predicted colors?

Missing values are also referred to as NA (Not Available) values in pandas. The real-world data that we actually work on is rarely clean and homogeneous. Values can go missing in different ways across data sources, which makes the data even more complicated.

For example, a customer in a hotel may not always share his/her secondary address or PAN card details.

There can be various sources of missing values:

· There was a programming error

· Data was not collected due to a human error

· Data got accidentally deleted

· User chose not to fill out a field in the data

It’s important to understand these different types of missing data from a statistical point of view. The type of missing data will influence how you deal with filling in the missing values. We will discuss various methods for dealing with missing values below.

In Pandas missing data is represented in two ways:

1) None: Python uses the keyword None to represent null objects and variables.

2) NaN: Stands for Not a Number. It is a special floating-point value used to represent any value that is undefined or not representable.
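To make the two representations concrete, here is a minimal sketch. The series and their values are made up for illustration; the point is only that pandas treats both None and NaN as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a text column containing None, and a numeric
# column containing both None and NaN.
s_text = pd.Series(["red", None, "blue"])
s_num = pd.Series([1.0, None, np.nan])   # None is stored as NaN in a float column

# isnull() flags both markers as missing.
text_missing = s_text.isnull().tolist()
num_missing = s_num.isnull().tolist()

print(text_missing)
print(num_missing)
print(s_num.dtype)
```

Because NaN is a floating-point value, a numeric column that contains missing values is stored as float64 here.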

Pandas provides several functions for detecting, removing and replacing null values in a DataFrame.

Standard Null Values

Standard null values are the values which pandas can detect on its own. To find these values we use the functions isnull() and notnull(). These functions can also be used on a pandas Series to find the null values.

isnull() returns a DataFrame of Boolean values which are True for NaN values.

Checking null values using isnull()

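The above function returns a DataFrame of Boolean values which are True for NaN values and False for everything else.

isnull().sum() gives the number of null values in each column. Chaining a second sum(), as in isnull().sum().sum(), gives the total number of null values in the whole DataFrame.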
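A small sketch of these calls, using a made-up DataFrame (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

null_mask = df.isnull()               # True wherever a value is missing
null_counts = df.isnull().sum()       # missing values per column
total_nulls = int(null_counts.sum())  # missing values in the whole DataFrame

print(null_mask)
print(null_counts)
print(total_nulls)
```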

Checking null values using notnull(). This function is the inverse of isnull(): it returns a DataFrame of Boolean values which are False for NaN values and True for everything else.

using notnull()
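As a rough illustration on a made-up DataFrame, notnull() can also be used as a filter to keep only the rows where a column has a value:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a few missing entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

present = df.notnull()                     # True wherever a value is present
rows_with_age = df[df["age"].notnull()]    # keep only rows where age is known

print(present)
print(rows_with_age)
```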

Non-Standard Missing Values

Sometimes missing values appear in different formats, for example ‘?’, ‘n/a’, ‘NA’, ‘--’, etc.

Let us see how to treat such missing values.
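One possible way to treat them, sketched on made-up data, is to convert the placeholders to NaN so that the standard tools can see them. The placeholder list below is an assumption and should be adapted to whatever markers your data actually uses.

```python
import io

import numpy as np
import pandas as pd

# Markers that stand for "missing" in this hypothetical data set.
placeholders = ["?", "n/a", "NA", "--"]

# Option 1: the data is already in a DataFrame, so replace the markers with NaN.
raw = pd.DataFrame({
    "rooms": ["3", "?", "2", "--"],
    "owner": ["Y", "n/a", "N", "Y"],
})
nulls_before = int(raw.isnull().sum().sum())
cleaned = raw.replace(placeholders, np.nan)
nulls_after = int(cleaned.isnull().sum().sum())

# Option 2: declare the markers while reading the file.
# na_values adds to read_csv's default list of missing-value markers.
csv_text = "rooms,owner\n3,Y\n?,n/a\n2,N\n--,Y\n"
from_csv = pd.read_csv(io.StringIO(csv_text), na_values=placeholders)
nulls_from_csv = int(from_csv.isnull().sum().sum())

print(nulls_before, nulls_after, nulls_from_csv)
```

Once the placeholders are converted, isnull(), dropna() and fillna() treat them like any other missing value.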

NULL Value Treatment

· Deletion

· Replacing

· Imputation

· Using Algorithms Which Support Missing Values

· Predicting The Missing Values

Deletion or Dropping

Deletion means dropping the specific rows that contain missing values, or dropping an entire column. By default, dropna() drops every row that contains at least one missing value.

data.dropna(axis=1), where data is the DataFrame, will drop every column that contains any null values.
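A minimal sketch of both variants, on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: only the first row and the "name" column are complete.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [25, np.nan, 31],
    "city": ["Pune", "Delhi", None],
})

rows_dropped = df.dropna()          # drop rows with at least one missing value
cols_dropped = df.dropna(axis=1)    # drop columns with at least one missing value

print(rows_dropped)
print(cols_dropped)
```

Neither call modifies df itself; each returns a new DataFrame. Dropping is simple, but as the example suggests, it can throw away a lot of data.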

Imputation

Imputation means filling in the null or missing values. This is the most common way of handling missing data. One can fill the missing/null values with a forward-fill or a backward-fill. Forward-fill fills a missing value with the previous value, so it cannot fill a missing value in the first row. Likewise, backward-fill uses the next value, so it cannot fill a missing value in the last row.

Backward Fill

Forward Fill
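Both fills can be sketched on a small made-up Series that deliberately starts and ends with a missing value, to show the edge cases described above:

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values at the start, middle and end.
s = pd.Series([np.nan, 10.0, np.nan, 30.0, np.nan])

forward = s.ffill()    # each gap takes the previous value; the first row stays NaN
backward = s.bfill()   # each gap takes the next value; the last row stays NaN

print(forward.tolist())
print(backward.tolist())
```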

Mean, Mode, Median Imputation

One of the simplest ways to impute missing values is to fill in a constant, the mean of the variable, or another basic statistic such as the median or the mode. We use mode imputation for categorical data and median/mean imputation for numerical data.

Similarly we can perform Mode/Median imputation as well.
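Here is a minimal sketch of all three, on made-up data (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numerical and one categorical column.
df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 70.0],
    "color": ["red", "blue", "red", None],
})

# Numerical column: fill with the mean or the median of the known values.
mean_filled = df["age"].fillna(df["age"].mean())
median_filled = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value.
# mode() returns a Series (there can be ties), so take its first entry.
mode_filled = df["color"].fillna(df["color"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

The median is less affected by extreme values than the mean, which is why the two fills differ in this example.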

Replacing

Sometimes we just want to fill the missing values with some constant value. In this case we simply replace the NULL/missing value with that constant.
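For instance, a sketch with fillna() on made-up data, where the constants 0 and "Unknown" are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan],
    "city": ["Pune", None],
})

# A dict lets each column get its own constant.
filled = df.fillna({"age": 0, "city": "Unknown"})

print(filled)
```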

Using Algorithms Which Support Missing Values

KNN is a machine learning algorithm which works on the principle of distance measures. It can be used when there are nulls present in the dataset. To fill a missing value, KNN looks at the K nearest rows and takes the majority value among them (for categorical data) or their mean (for numerical data, which is what scikit-learn's KNNImputer does).

Another algorithm used is random forest, which is an attractive approach for imputing missing data. It has the desirable property of being able to handle mixed types of missing data.

Both of the above algorithms can be used for numerical features.

using KNNImputer()

By default, the output of a KNNImputer is a NumPy array, so the result needs to be wrapped in a DataFrame again if the column names are required.
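A minimal sketch with scikit-learn's KNNImputer follows. The data and the choice of n_neighbors=2 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numerical data with one missing height.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "weight": [50.0, 60.0, 62.0, 80.0],
})

# Each missing value is replaced by the mean of that feature
# over the 2 nearest rows (distance is computed on the known features).
imputer = KNNImputer(n_neighbors=2)
imputed_array = imputer.fit_transform(df)

# Wrap the array back into a DataFrame to recover the column names.
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)

print(type(imputed_array))
print(imputed_df)
```

Since KNN relies on distances, features on very different scales may need to be scaled first; that step is left out here to keep the sketch short.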

Predicting The Missing Values

Using the features which do not have any null values, we can build a machine learning model to predict the missing values. This method may result in better accuracy. We will be using linear regression to replace the nulls in the feature ‘age’, using the other available features. One can experiment with different algorithms and check which gives the best accuracy instead of sticking to a single algorithm.
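The idea can be sketched as follows. The DataFrame, the predictor columns "fare" and "siblings", and their values are all hypothetical; only the ‘age’ column comes from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "fare" and "siblings" are complete, "age" has gaps.
df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "siblings": [0, 1, 0, 2, 1],
    "age": [22.0, 32.0, np.nan, 52.0, np.nan],
})
predictors = ["fare", "siblings"]

# Split into rows where age is known (training) and unknown (to predict).
known = df[df["age"].notnull()]
unknown = df[df["age"].isnull()]

# Fit on the known rows, then predict age for the unknown rows.
model = LinearRegression()
model.fit(known[predictors], known["age"])

df_filled = df.copy()
df_filled.loc[unknown.index, "age"] = model.predict(unknown[predictors])

print(df_filled)
```

With real data, it is worth holding out some rows with a known age to check how well the model predicts before trusting the filled values.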

Conclusion

Almost every data set we come across will have some missing values which need to be dealt with. The approach to dealing with missing values depends heavily on the nature of the data. Handling them intelligently, so that they give rise to robust models, is a challenging task. We have gone through a number of ways in which nulls can be replaced. Imputing missing values often allows us to improve our model compared to dropping those columns.

Hope you enjoyed reading the blog!! Please share your feedback and topics that you would like to know about.
