Data Science for Beginners: Handling Missing Values With Pandas

Peace Ikeoluwa Adegbite
The Startup
Published in
5 min readOct 24, 2020

Missing values can appear as ‘NaN’ (Not a Number), ‘NA’ (Not Available), ‘n/a’, ‘na’, ‘?’, a blank space, an out-of-range value and in many other forms depending on the user(s) filling in the data. In real datasets, missing values are almost unavoidable and they can be caused by several reasons like corrupted data or unrecorded observations.

Learning to handle missing values in a dataset is very important because most machine learning models cannot handle missing values. In this tutorial, you’ll learn how to handle missing values using Pandas. Let’s get started!

Identifying Missing Values

To handle missing data, we need to first identify them. Pandas identifies some missing value forms by default as NaN values (e.g blank entry, ‘NA’, ‘null’, ‘nan’, ‘n/a’, ‘NULL’) but not some others. To make Pandas recognize other non-default missing value forms (e.g ‘?’, ’na’, ‘Nil’), we can make a list of them and pass them into Pandas’ .read_excel() method (since the dataset is an excel file in this case) as is done below:

import pandas as pd
pd.read_excel('Students_Scores.xlsx')

This outputs:

              Maths  English  Physics  Chemistry  Biology
Student ID

SA/2020/001 83.0 NaN 55.0 23.0 100.0
SA/2020/002 ? Nil 76.0 43.0 32.0
SA/2020/003 NaN NaN ? 56.0 54.0
SA/2020/004 56.0 90.0 ? 76.0 88.0
SA/2020/005 76.0 Nil 76.0 54.0 43.0
SA/2020/006 32.0 83.0 NaN 21.0 21.0
SA/2020/007 ? 75.0 NaN 97.0 54.0
SA/2020/008 ? 43.0 NaN NaN NaN
SA/2020/009 NaN 32.0 43.0 NaN NaN
SA/2020/010 NaN 12.0 Nil NaN NaN
missing_value_forms=['?','Nil']df= pd.read_excel('Students_Scores.xlsx',na_values= missing_value_forms, index_col= 'Student ID')df

This gives an output such that all missing value forms used are now recognized by Pandas:

              Maths  English  Physics  Chemistry  Biology
Student ID

SA/2020/001 83.0 NaN 55.0 23.0 100.0
SA/2020/002 NaN NaN 76.0 43.0 32.0
SA/2020/003 NaN NaN NaN 56.0 54.0
SA/2020/004 56.0 90.0 NaN 76.0 88.0
SA/2020/005 76.0 NaN 76.0 54.0 43.0
SA/2020/006 32.0 83.0 NaN 21.0 21.0
SA/2020/007 NaN 75.0 NaN 97.0 54.0
SA/2020/008 NaN 43.0 NaN NaN NaN
SA/2020/009 NaN 32.0 43.0 NaN NaN
SA/2020/010 NaN 12.0 NaN NaN NaN

When Pandas has been made to identify all missing value forms, we can then mark them out. To mark out missing values in a given dataset, Pandas isnull() and notnull() functions come in handy. Pandas isnull() marks all missing values as True while notnull() marks all missing values as False.

df.isnull()#Output is:            Maths  English  Physics Chemistry  Biology
Student ID
SA/2020/001 False True False False False
SA/2020/002 True True False False False
SA/2020/003 True True True False False
SA/2020/004 False False True False False
SA/2020/005 False True False False False
SA/2020/006 False False True False False
SA/2020/007 True False True False False
SA/2020/008 True False True True True
SA/2020/009 True False False True True
SA/2020/010 True False True True True

To check for missing values in the dataset or the number of missing values per column:

df.isnull().values.any()# Output is:
True

To check for the number of missing values in each column:

df.isnull().sum()

# Output is:
Maths 6
English 4
Physics 6
Chemistry 3
Biology 3
dtype: int64

Ways of handling missing values

Missing values can be handled by:

1. Deletion

2. Imputation

3. Interpolation

Deletion

Here rows or columns containing missing values are deleted. This is however not a good way to handle missing values especially when there are many missing values in the dataframe as it leads to loss of data.

To delete rows or columns containing missing values, Pandas dropna() function is used:

#to drop rows with missing valuesdf.dropna(axis=0,inplace=True)#to drop columns with missing valuesdf.dropna(axis=1,inplace=True)Note: axis=0 for rows, axis=1 for columns

Imputation

In this method, missing values are replaced by a constant value or they are replaced based on other observations in the dataset. Pandas fillna() function is used for imputing values. Imputation can be done by:

i. Replacing missing values with a predetermined constant value.

To fill all missing values in the dataframe with a constant:

df.fillna(0, inplace=True)#Output is:
Maths English Physics Chemistry Biology
Student ID
SA/2020/001 83.0 0.0 55.0 23.0 100.0
SA/2020/002 0.0 0.0 76.0 43.0 32.0
SA/2020/003 0.0 0.0 0.0 56.0 54.0
SA/2020/004 56.0 90.0 0.0 76.0 88.0
SA/2020/005 76.0 0.0 76.0 54.0 43.0
SA/2020/006 32.0 83.0 0.0 21.0 21.0
SA/2020/007 0.0 75.0 0.0 97.0 54.0
SA/2020/008 0.0 43.0 0.0 0.0 0.0
SA/2020/009 0.0 32.0 43.0 0.0 0.0
SA/2020/010 0.0 12.0 0.0 0.0 0.0

To fill a particular column in the dataframe with a constant:

df['Biology'].fillna(50, inplace=True)# Output is:
Student ID
SA/2020/001 100.0
SA/2020/002 32.0
SA/2020/003 54.0
SA/2020/004 88.0
SA/2020/005 43.0
SA/2020/006 21.0
SA/2020/007 54.0
SA/2020/008 50.0
SA/2020/009 50.0
SA/2020/010 50.0
Name: Biology, dtype: float64

ii. replacing missing values in a column with values from previous rows (forward or backward).

Forward fill: this fills missing values with values from previous rows in a forward manner

#forward filldf['Maths'].fillna(method='ffill')# Output is:
Student ID
SA/2020/001 83.0
SA/2020/002 83.0
SA/2020/003 83.0
SA/2020/004 56.0
SA/2020/005 76.0
SA/2020/006 32.0
SA/2020/007 32.0
SA/2020/008 32.0
SA/2020/009 32.0
SA/2020/010 32.0
Name: Maths, dtype: float64

Backward fill: this fills missing values using values from the later rows in a backward manner

#backward filldf['English'].fillna(method='bfill')# Output is:
Student ID
SA/2020/001 90.0
SA/2020/002 90.0
SA/2020/003 90.0
SA/2020/004 90.0
SA/2020/005 83.0
SA/2020/006 83.0
SA/2020/007 75.0
SA/2020/008 43.0
SA/2020/009 32.0
SA/2020/010 12.0
Name: English, dtype: float64

iii. replacing missing values in a column with the mean, median or mode of that column.

Mean

mean_physics= df['Physics'].mean()
df['Physics'].fillna(int(mean_physics),inplace= True)
# Output is:
Student ID
SA/2020/001 55.0
SA/2020/002 76.0
SA/2020/003 62.0
SA/2020/004 62.0
SA/2020/005 76.0
SA/2020/006 62.0
SA/2020/007 62.0
SA/2020/008 62.0
SA/2020/009 43.0
SA/2020/010 62.0
Name: Physics, dtype: float64

Median

median_english= df['English'].median()
df['English'].fillna(int(median_english),inplace= True)
#Output is:
Student ID
SA/2020/001 59.0
SA/2020/002 59.0
SA/2020/003 59.0
SA/2020/004 90.0
SA/2020/005 59.0
SA/2020/006 83.0
SA/2020/007 75.0
SA/2020/008 43.0
SA/2020/009 32.0
SA/2020/010 12.0
Name: English, dtype: float64

Mode

mode_physics= df['Physics'].mode()
df['Physics'].fillna(int(mode_physics),inplace= True)
#Output is:
Student ID
SA/2020/001 55.0
SA/2020/002 76.0
SA/2020/003 76.0
SA/2020/004 76.0
SA/2020/005 76.0
SA/2020/006 76.0
SA/2020/007 76.0
SA/2020/008 76.0
SA/2020/009 43.0
SA/2020/010 76.0
Name: Physics, dtype: float64

Interpolation

In this method, missing values are handled using Pandas interpolate() function. Interpolation can be done using different methods e.g linear, pad, nearest, quadratic method. Pandas however carries out linear interpolation by default.

inear interpolation
df['Chemistry'].fillna(df['Chemistry'].interpolate())
#Output is:
Student ID
SA/2020/001 23.0
SA/2020/002 43.0
SA/2020/003 56.0
SA/2020/004 76.0
SA/2020/005 54.0
SA/2020/006 21.0
SA/2020/007 97.0
SA/2020/008 97.0
SA/2020/009 97.0
SA/2020/010 97.0
Name: Chemistry, dtype: float64

The way you choose to handle the missing values in your dataset will depend on the type and description of the dataset you are working on.

Voila! You are now ready to handle missing values.

--

--