5 WAYS TO HANDLE MISSING VALUES IN PYTHON

Ibekwe kingsley
Machine learning Mastery
4 min read · Mar 16, 2022

NAN NAN NAN NAN NAN NAN NAN

...OK, you are stuck now 😢

How do I handle these columns with missing values?

In this post I will share 5 proven ways to handle columns with missing values in your dataset.

Let's dive in!

What causes missing values? Missing values occur as a result of incomplete extraction, lack of information, corrupt data, incomplete data from engineers, or the confidentiality of some data. Let’s look at different ways of handling and imputing them.

To understand this better, we will use the Titanic dataset and some Python libraries: pandas, NumPy and scikit-learn.

Titanic dataset
Overview of missing columns
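Such an overview can be sketched with pandas as below. The small DataFrame is a made-up sample that only mimics a few Titanic column names (an assumption for illustration, not the real data); with the real file you would load it with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking a few Titanic columns (not the real data).
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46"],
    "Embarked": ["S", "C", "S", "S", np.nan, "S"],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()

# Share of missing values in each column, as a percentage.
missing_percent = df.isnull().mean() * 100

print(missing_counts)
print(missing_percent.round(1))
```

The percentage view is handy later on, when deciding whether a column is worth keeping.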
1. Deleting Rows or Columns:

In this method, we delete the rows or columns that have missing values. It is commonly used in data science, but it should only be applied when a row or column has more than 50% of its values missing and when there are still enough samples left in the dataset. Even then, deleting columns is not always advisable, because it can lead to loss of information.

Removing missing values by deleting Rows or Columns
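A minimal sketch of deletion with pandas, assuming a small made-up sample in place of the real Titanic file. The `threshold` variable and the sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking a few Titanic columns (not the real data).
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", "S", "S", np.nan, "S"],
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Drop only the columns where more than 50% of the values are missing.
threshold = 0.5
mostly_missing = df.columns[df.isnull().mean() > threshold]
trimmed = df.drop(columns=mostly_missing)

print(rows_dropped.shape, cols_dropped.shape, trimmed.shape)
```

Dropping by threshold is usually the gentler option, since it keeps columns such as Age that are only partly empty.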

2. Replacing with Mean/Mode/Median:

These measures of central tendency can save your life 😍 😊…

The mean and median apply to columns with numerical values, while the mode also works on columns with categorical values. We calculate the mean, median or mode of the feature and replace the missing values with it (don’t worry, pandas can handle that).

This way, you’re not dropping these columns from the dataset, and you’re also not shifting the column's average.

While not perfect, this method introduces values that have little impact on the overall dataset: when you fill with the mean, the column's mean stays the same no matter how many values you fill in, although its variance shrinks.

Replacing with Mean
Replacing with Median and mode
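Both replacements can be sketched with pandas `fillna`, again assuming a small made-up sample rather than the real Titanic file:

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking two Titanic columns (not the real data).
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 31.0],
    "Embarked": ["S", "C", "S", "S", np.nan, "Q", "S"],
})

# Numerical column: fill the gaps with the mean or the median.
mean_filled = df["Age"].fillna(df["Age"].mean())
median_filled = df["Age"].fillna(df["Age"].median())

# Categorical column: fill the gaps with the mode.
# mode() returns a Series (ties are possible), so take the first entry.
embarked_filled = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(embarked_filled.tolist())
```

The median is often the safer choice when the column has outliers, since a few extreme values can pull the mean away from the typical value.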

3. Filling missing value with a Constant:

You could also decide to fill the NA-marked values with a constant value. For example, you can put in a special string such as "Unknown" or a numerical value such as -1.

filling with a constant
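A short sketch of constant filling with pandas. The sample data and the placeholders `"Unknown"` and `-1` are illustrative assumptions; pick constants that cannot be confused with real values in your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking two Titanic columns (not the real data).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Categorical column: mark the gaps with a special string.
cabin_filled = df["Cabin"].fillna("Unknown")

# Numerical column: mark the gaps with a value outside the valid range.
age_filled = df["Age"].fillna(-1)

print(cabin_filled.tolist())
print(age_filled.tolist())
```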

4. Predicting the missing values:

Using the features that do not have missing values, we can predict the nulls with the help of a machine learning algorithm. This method may result in better accuracy, unless the feature with missing values has very high variance. We will use linear regression to replace the nulls in the feature ‘age’, using the other available features. You can experiment with different algorithms and check which gives the best accuracy instead of sticking to a single one.

Predicting with Linear Regression
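One way this could look with scikit-learn's `LinearRegression` is sketched below. The sample rows and the choice of `Pclass`, `SibSp` and `Fare` as predictors are assumptions for illustration, not necessarily the features used in the original screenshot.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical sample mimicking a few Titanic columns (not the real data).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 1, 2, 3],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 13.00, 21.08],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 27.0, 2.0],
})

# Predictors must be complete; these names are illustrative choices.
features = ["Pclass", "SibSp", "Fare"]

# Split into rows where Age is known (training) and unknown (to predict).
known = df[df["Age"].notnull()]
unknown = df[df["Age"].isnull()]

# Fit on the rows with a known Age.
model = LinearRegression()
model.fit(known[features], known["Age"])

# Write the predictions into a copy, leaving the original untouched.
df_filled = df.copy()
df_filled.loc[unknown.index, "Age"] = model.predict(unknown[features])

print(df_filled)
```

A linear model can predict implausible values (a negative age, for instance), so it is worth checking the filled column before training on it.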

5. Using Algorithms which support missing values:

There are many machine learning models that can work with missing values effectively, without the need to remove them first.

  1. KNN: KNN is a machine learning algorithm that works on the principle of distance measures. It can be used when there are nulls in the dataset: a missing value is estimated from the K nearest rows, taking the majority value among them (or their average for numerical features). In this particular dataset, taking into account a person’s age, sex, class, etc., we assume that people with the same values for these features paid a similar fare. You can take a look at the KNN documentation.
  2. Random Forest Model: This model produces robust results because it works well on non-linear and categorical data. It adapts to the structure of the data, balancing variance and bias, and tends to produce better results on large datasets. Whether missing values are accepted directly depends on the implementation and version you use. You can take a look at the Random Forest documentation.
  3. LightGBM: LightGBM ignores missing values when choosing a split, then allocates them to whichever side reduces the loss the most. You can take a look at the LightGBM documentation.
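The KNN idea from the list above can be sketched with scikit-learn's `KNNImputer`, which fills each gap with the average of that feature over the nearest rows. The sample data and `n_neighbors=2` are assumptions for illustration, and the features are left unscaled to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical sample mimicking a few Titanic columns (not the real data).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 24.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05, np.nan],
})

# Each missing Fare is replaced by the mean Fare of the 2 nearest rows,
# where distance is computed on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed)
```

In practice it usually helps to scale the features first, because a distance-based method is otherwise dominated by the columns with the largest range.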

Handling missing values properly while working on datasets can really help improve your model's performance and, most importantly, help you get the best out of your dataset.

I hope this solves your problem! 😍 😊
