How to handle Missing values?

r.aruna devi
Analytics Vidhya
Published in
3 min readMar 29, 2020

Before we start handling missing values.. Let’s get to know why we have missing values.
1. When there is NO INFORMATION.
2. Data exists but it was not recorded.
3. When transferring manual data to database.
4. Forgot to place the value.
5. Voluntarily making few values as null.

  • Create a dataframe and let’s check how missing value look in pandas dataframe.
Missing values are treated as “NaN” Not a Number.

How to check whether data has missing values or not and to find out how many missing values ?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load titanic dataset from kaggle.
df = pd.read_csv('titanic_train.csv')
# check number of missing values in each column.
df.isna().sum()

PassengerId has no missing value, whereas Age, Cabin and Embarked has missing values.

Now, let’s see how to handle it.

* Drop rows or columns.
* Fill the missing values with mean, median or mode.
* Make the missing column as target and create a model. Make the missing rows as test dataset and predict the value.

1. Drop records : You can drop the entire record from dataframe.

import pandas as pd 
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe from list
df = pd.DataFrame(dict)
# drop the entire record using dropna()
# axis 0: rows
# axis 1: column
df = df.dropna(axis=0, how='any')
df

Records which has null values are dropped.

Drop columns : Drop columns which has more missing value.

df.drop(['Score4'],axis=1,inplace=True)

Column Score4 has more null values.So, drop the column. When column has more than 80% to 95% missing value, drop it.

2. Fill the missing values using fillna(), replace().

For categorical column, take most frequent value and fill the missing value.

df['Score5'].value_counts()OUT:
Male 3
Female 1
Name: Score5, dtype: int64
#fill missing value with maximum repeated values.
df.fillna("Male",inplace=True)

Alternatively, try using imputer method from sklearn.preprocessing.

Imputer :

from sklearn.preprocessing import Imputer 
imputer = Imputer(missing_values = "NaN",
strategy = "mean",
axis = 0)
# strategy = mean or median or mode
# axis = 0 or 1
imputer = imputer.fit(df)
df = imputer.fit_transform(df)
df
array([[100. , 30. , 72.66666667],
[ 90. , 45. , 40. ],
[ 95. , 56. , 80. ],
[ 95. , 43.66666667, 98. ]])

Imputer replaces the NaN value with Mean or Median or Mode.

Conclusion:

  1. If missing values in target column, drop the record by dropna.
  2. If more than 70 % missing values in a column, drop the record by drop.
  3. Fill the missing by using fillna.
  4. fillna : take mean for integer column.
  5. fillna : take mode for categorical column.

--

--