Dealing with Missing Values

Eliud Nduati
Published in Analytics Vidhya
5 min read · Aug 5, 2021

Part 1 in A Data Cleaning Journey

You made it to the first article in the “A Data Cleaning Journey” series!

No one likes data cleaning. It can be very frustrating, especially if you are dealing with missing values and are on a deadline.

You need to give yourself time to handle the data cleaning part of your data science process.

Sometimes you will wonder what to do with the missing values in your dataset.

When you finally have a clean dataset, your analysis will run faster and you can draw insights from it sooner as well.

Need a refresher on pandas and NumPy?

First things first. I assume you know how to use the pandas and NumPy libraries. If not, here are a few good resources you can use to refresh your memory:

  1. Introduction to Pandas DataFrames Part 1
  2. Pandas - Kaggle
  3. Introduction to NumPy, Pandas and Matplotlib

First let’s look at the data we’ll be using

We will use pandas to load and look at the data. The dataset can be found here and the notebook here.

First, we import the necessary libraries and read in the dataset:

# import libraries
import pandas as pd
import numpy as np

# read in the dataset
data = pd.read_csv("NFL2009-2018 (v5).csv")
data.head()

Our dataset has 255 columns. The snippet above shows only the first 5 rows. The next step is to look at how many missing or null values are in the dataset.

Let’s check for our missing datapoints

We do this using the code below. The first line counts the missing values in each column, while the second line shows those counts for the first 10 columns only.

# checking missing values
missing_values = data.isnull().sum()
missing_values[0:10]

From the output above, we learn that 5 of the first 10 columns have missing values.
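Raw counts are easier to judge when you also see them as percentages. Here is a minimal sketch on a small made-up DataFrame (the `toy` frame, its column names and its values are hypothetical, not taken from the NFL dataset); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "team": ["A", "B", None, "D"],
    "yards": [10.0, np.nan, np.nan, 4.0],
    "down": [1, 2, 3, 4],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Share of missing values in each column, as a percentage
percent_per_column = toy.isnull().mean() * 100

# Share of missing cells in the whole DataFrame
total_cells = toy.shape[0] * toy.shape[1]
percent_missing = missing_per_column.sum() / total_cells * 100

print(missing_per_column)
print(percent_per_column)
print(f"{percent_missing:.1f}% of all cells are missing")
```

A column that is mostly empty and a column with a handful of gaps usually call for different treatment, so this is a useful number to have before choosing a method.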

Why do we have missing data?

Before handling missing values, it is important to know why the values might be missing.

You might ask, so where does this missing data problem come from?

Some of the missing data might be due to human error. The person entering the data in the records might have made a mistake and left some of it out. But this is not the only reason. In other cases, when answering a survey or questionnaire, you might be reluctant to provide some data, such as your household income or the number of kittens you have :) That, too, leads to missing data.

Different methods of dealing with missing values.

  1. Dropping missing values

Dropping values is not the best way to deal with missing data. The purpose of your analysis is to draw the most significant insights. If you remove data, you reduce the significance of the insights you will get. But how do we go about dropping missing values?

If you decide to drop the columns that have missing values, here is what you do…

# drop columns with missing values
columns_without_na = data.dropna(axis=1)
columns_without_na.head()

The above code drops any column that has at least one missing value in it. The results are shown here:

If you remember from above, we initially had 255 columns in our dataset. Dropping the columns with missing values leaves only 50 columns. That’s a lot of data to drop!
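A quick way to see how much a column-wise drop costs is to compare the column counts before and after. The sketch below uses a small made-up DataFrame (hypothetical names and values, not the NFL data); with the real dataset you would compare `data.shape[1]` with `columns_without_na.shape[1]` in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three columns contain a missing value
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", None, "z"],
})

# axis=1 drops every column that has at least one missing value
toy_without_na = toy.dropna(axis=1)

print("Columns before: ", toy.shape[1])
print("Columns after:  ", toy_without_na.shape[1])
print("Columns dropped:", toy.shape[1] - toy_without_na.shape[1])
```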

It’s worse if we decide to drop all rows with missing values.

This removes all the rows in our dataset, because every row has at least one missing value.
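Here is a minimal sketch of that row-wise drop, using a small made-up DataFrame (hypothetical column names and values) rather than the NFL data. Calling `dropna()` with its default `axis=0` removes every row that contains at least one missing value, so a dataset where each row has a gap somewhere ends up empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in which every row has at least one missing value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, 3.0],
    "c": ["x", "y", None],
})

# With the default axis=0, dropna() drops every row containing a missing value
rows_without_na = toy.dropna()

print("Rows before:", toy.shape[0])
print("Rows after: ", rows_without_na.shape[0])
```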

  2. Imputing missing values

Imputing, put simply, is replacing the missing values with data you consider appropriate. However, regardless of the algorithm or method you use to impute the missing values, it will also lead to a loss of information.

You are introducing a value where one was missing. “Missing values” are in themselves a piece of information. Imputing values reinforces the patterns already in the data but does not necessarily add information to it.

# impute or replace missing values with 0 (fillna returns a new DataFrame)
data.fillna(0)
# fill missing values with a string
data.fillna("data missing")
# fill missing values in a specific numeric column with that column's mean
data["col1"] = data["col1"].fillna(data["col1"].mean())
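Since a missing value is itself a piece of information, one optional extra is to record where the gaps were before filling them. The sketch below runs the mean-imputation step on a small made-up column; the `toy` frame and the `col1_was_missing` flag column are hypothetical names chosen for illustration, and `col1` simply mirrors the placeholder name used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "col1" mirrors the placeholder name used above
toy = pd.DataFrame({"col1": [2.0, np.nan, 4.0, np.nan]})

# Record which rows were missing before overwriting them (illustrative name)
toy["col1_was_missing"] = toy["col1"].isnull()

# Mean of the observed values; NaN entries are skipped by default
col1_mean = toy["col1"].mean()

# Replace the missing entries with that mean
toy["col1"] = toy["col1"].fillna(col1_mean)

print(toy)
```

Whether the extra flag column is worth keeping depends on your analysis; it is only one way to avoid losing track of which values were imputed.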

Dealing with missing categorical data

When dealing with categorical data, a simple and often sensible approach is to label the missing values as “missing”, so that missingness becomes its own category. Alternatively, you can impute them with the mode (the most frequent value) of the column.
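Both options can be sketched in a few lines. The example uses a made-up categorical column (the `position` name and its values are hypothetical); treat it as an illustration under those assumptions rather than a recipe for the NFL data.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"position": ["QB", None, "WR", "QB", None]})

# Option 1: label the missing entries so they form their own category
labelled = toy["position"].fillna("missing")

# Option 2: fill with the most frequent value. mode() returns a Series
# (there can be ties), so take its first entry.
most_frequent = toy["position"].mode().iloc[0]
imputed = toy["position"].fillna(most_frequent)

print(labelled.tolist())
print(imputed.tolist())
```

Note that the mode ignores missing entries by default, so it reflects only the values that were actually observed.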

Thoughts!

In this article, I have summarized where missing data comes from and the ways to handle it. However, the approach you choose will depend on various factors, such as the percentage of missing values, the data types of the columns they occur in, and the type of analysis or model you are working on. As a data scientist, you will choose different methods of handling missing values at different times.

Look out for my next article in this series, Dealing With Unwanted Observations: Duplicates & Irrelevant Observations.
