Missing Values In Pandas DataFrame

Sachin Chaudhary
Published in Geek Culture
6 min read · Dec 13, 2021

In this article we will discuss missing values in a Pandas DataFrame. We will cover the following:

  • Why do missing values occur, and for what reasons?
  • How to check for missing values
  • How to handle missing values
  • Ways to manage missing values

Why do missing values occur?

When we acquire data for analysis, it may contain missing and duplicate values. For a sound analysis you need to remove duplicates and deal with missing values so that your final result is reliable.

Missing data occurs when no information is provided for one or more items. These missing values can create very big problems in real-life scenarios.

They can also affect any model the data is fed into.

Missing values can occur at random, intentionally, or by mistake. Common reasons include:

  • Loss of information
  • Disagreement in data uploading
  • Data unavailable at the time the DataFrame was created
  • Data exists but was not collected
  • Data does not exist

In Pandas, missing values are denoted by both NaN and None.
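To make the NaN and None point concrete, here is a minimal sketch. The Series values are made up for illustration and are not from the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical Series that mixes both missing-value markers
ser = pd.Series([1.0, np.nan, None, 4.0])

# In a numeric Series, None is converted to NaN on construction
print(ser)

# Both markers are reported as missing
missing_flags = ser.isna().tolist()
print(missing_flags)
```

Both the NaN and the None entries are flagged as missing, which is why the two are usually treated the same way in practice.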

How to check for missing values

In Pandas, missing values are usually represented by NaN. Most of the time, the terms “missing values” and “null values” are used interchangeably.

Pandas offers four basic methods to detect missing values:

  • isnull()
  • notnull()
  • isna()
  • notna()

The last two are aliases of the first two.

isna() behaves the same as isnull(), and notna() behaves the same as notnull().
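A quick way to convince yourself that the two pairs behave identically is to compare their results. This is a small sketch on a made-up DataFrame (the column names a and b are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one NaN and one None
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

# Compare the boolean masks produced by each pair of methods
same_null = df.isnull().equals(df.isna())
same_notnull = df.notnull().equals(df.notna())
print(same_null, same_notnull)
```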

These methods return a boolean object of the same shape as the input, marking each value as missing or not. They can be applied to both Series and DataFrames.

For example:

Step 1. First import the library, then read the data file. For this article I chose a file with some missing values. You can load the data and look at what the DataFrame holds.
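The article's file is not available here, so the sketch below reads a tiny hypothetical CSV from a string instead. Only the column names StartTime, FuelEconomy and Comments come from the article; the rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the article's data file (real file not given)
csv_text = """StartTime,FuelEconomy,Comments
07:10,,
07:25,28.5,traffic
08:00,30.1,
08:15,,
"""

# Empty CSV fields are read as NaN by default
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

With a real file you would pass its path to pd.read_csv() instead of the io.StringIO object.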

Step 2. Now we will check how many values each column contains and work out how many are missing in each column.

The count() function reports the number of non-missing values for each column.

Here, the “Comments” column has only 24 values; the rest are NaN, that is, missing. We can also check the entries column by column.

For the column “StartTime” no values are missing: the DataFrame has 205 rows in total, and applying the count() function to the column also gives 205.

But for columns like “FuelEconomy” and “Comments”, some values are missing.
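The counting step can be sketched as follows. The DataFrame is a small hypothetical one with the article's column names, so the numbers differ from the 205-row dataset described above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names follow the article
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})

total_rows = len(df)        # total number of rows
non_missing = df.count()    # non-missing values per column
print(total_rows)
print(non_missing)

# Column-wise check: equals total_rows only when nothing is missing
start_count = df["StartTime"].count()
print(start_count)
```

A column whose count is lower than the total number of rows contains missing values.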

Step 3. Now we will check which values are missing with the help of these functions.

The isnull() function returns True for missing values and False for valid values.

The notnull() function returns False for missing values and True for valid values.
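Here is a minimal sketch of both functions on a hypothetical DataFrame (the values are invented, the column names follow the article).

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
})

null_mask = df.isnull()      # True where a value is missing
valid_mask = df.notnull()    # True where a value is present
print(null_mask)
print(valid_mask)
```

The two masks are exact opposites of each other.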

To check directly whether the DataFrame contains any missing values, we can chain .values.any() after isnull().

A result of True shows that there are some missing values in the DataFrame.
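As a sketch, the whole-DataFrame check looks like this on a hypothetical DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00"],
    "FuelEconomy": [np.nan, 28.5, 30.1],
})

# isnull() gives a boolean DataFrame, .values turns it into a NumPy array,
# and .any() reduces it to a single True/False
has_missing = df.isnull().values.any()
print(has_missing)
```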

We can check the same for a Pandas Series as well.

The isnull() method is useful, but sometimes we only want to know whether any value at all is missing in a Series.

We can do that by chaining methods together; a fast way is .isnull().values.any():
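A minimal sketch of the same check on a Series. The name ser follows the article, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
ser = pd.Series([1.0, 2.0, np.nan, 4.0])

# Single True/False answer: is anything missing?
ser_has_missing = ser.isnull().values.any()
print(ser_has_missing)
```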

We can find out how many missing values there are by extending the chain with one more method.

We can use .sum() and chain it with the rest of the methods.

In the DataFrame, only two columns have missing values (cell number 20).

In cell number 21, the Series ser has only one missing value.
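The counting chain can be sketched like this. The data is hypothetical, so the counts are not those of the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the names follow the article
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})
ser = pd.Series([1.0, 2.0, np.nan, 4.0])

# True counts as 1, so summing the mask counts the missing values
missing_per_column = df.isnull().sum()
total_missing = df.isnull().sum().sum()
ser_missing = ser.isnull().sum()
print(missing_per_column)
print(total_missing, ser_missing)
```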

Now we will see how to handle missing values.

How to handle missing values

There is no single optimal way to handle missing values. Depending on the characteristics of the dataset and the task, we can choose to:

  • Drop the missing values
  • Replace the missing values
  • Fill the missing values

Dropping missing values simply means deleting the records in which missing values occur. For that we have the dropna() function; its syntax is below.

These are the default settings:
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Now we will look at each parameter with the help of examples.

With the default settings, all the records, that is, all the rows that contain missing values, are deleted.
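A minimal sketch of dropna() with its default settings, on a hypothetical DataFrame using the article's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})

# Default: drop every row that has at least one missing value
cleaned = df.dropna()
print(cleaned)

# The original DataFrame is left unchanged because inplace defaults to False
print(len(df), len(cleaned))
```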

Now with parameters:

  1. axis. It can be 0 or 1: 0 for rows and 1 for columns.
  2. how. You can pass ‘any’ (drop if any value is missing) or ‘all’ (drop only if all values are missing).
  3. thresh. The minimum number of non-missing values that must be present in each row (or column) for it to be kept. Its value is an integer.
  4. subset. Defines which columns to look in for missing values.
  5. inplace. Controls whether the change is applied to the original DataFrame. True modifies the DataFrame in place; False returns a new DataFrame and leaves the original unchanged.

thresh means → keep only the rows with at least N non-NaN values:
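The how and thresh parameters can be compared side by side. This sketch uses a made-up numeric DataFrame (columns a, b, c are hypothetical) whose rows have 3, 1, 2 and 0 valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows hold 3, 1, 2 and 0 non-missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 7.0, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, np.nan, 9.0, np.nan],
})

any_dropped = df.dropna(how="any")   # drop rows with any missing value
all_dropped = df.dropna(how="all")   # drop rows where everything is missing
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 valid values

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(at_least_two.index.tolist())
```

The how and thresh arguments are passed in separate calls here, since they express two different rules for the same decision.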

When we change the axis parameter to 1, the number of columns is reduced: two columns contain missing values, and those columns have now been deleted.
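Dropping columns instead of rows can be sketched as below, again on a hypothetical DataFrame with the article's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the three columns contain missing values
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})

# axis=1 drops every column that has at least one missing value
no_missing_cols = df.dropna(axis=1)
remaining = no_missing_cols.columns.tolist()
print(remaining)
```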

  • subset parameter. When you apply this parameter, dropna() looks for missing values only in the given columns and deletes the rows where those columns have missing values.
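A short sketch of subset on hypothetical data: only FuelEconomy is inspected, so rows with a missing Comments value survive.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})

# Drop rows only when FuelEconomy is missing; other columns are ignored
fuel_known = df.dropna(subset=["FuelEconomy"])
print(fuel_known)
```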

Another thing you can do to handle missing values is to replace them using the replace() function. The replace() method searches the entire DataFrame and replaces every occurrence of the specified value.
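Two common uses of replace() are sketched below. The sentinel value -999 and the column names are assumptions made for this illustration, not something taken from the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -999 was used as a "no reading" placeholder
df = pd.DataFrame({
    "speed": [60.0, -999.0, 72.0],
    "fuel": [-999.0, 28.5, 30.1],
})

# Turn the placeholder into a proper missing value everywhere
with_nan = df.replace(-999.0, np.nan)

# Or go the other way and replace every NaN with a chosen value
with_zero = with_nan.replace(np.nan, 0.0)

print(with_nan)
print(with_zero)
```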

The last treatment we can use to handle missing values is the fillna() method.

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

We will learn these parameters through examples.

The simplest example of fillna() passes only a fill value. It puts 1, or whatever value you pass to fillna(), in place of every NaN value.
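A minimal sketch of the simplest call, on a made-up numeric DataFrame (columns a and b are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
})

# Every NaN in the DataFrame becomes 1
filled = df.fillna(1)
print(filled)
```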

In the example below I fill the values in specified columns only.
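One way to fill only certain columns is to pass a dictionary that maps column names to fill values. This is a sketch on hypothetical data; the article's own example may have used a different approach.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "StartTime": ["07:10", "07:25", "08:00", "08:15"],
    "FuelEconomy": [np.nan, 28.5, 30.1, np.nan],
    "Comments": [None, "traffic", None, None],
})

# Only FuelEconomy is filled; Comments keeps its missing values
filled = df.fillna({"FuelEconomy": 0.0})
print(filled)
```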

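method parameter. In the method parameter we can pass ffill, bfill, pad, backfill, or None. ffill and pad carry the last valid value forward; bfill and backfill use the next valid value to fill backward.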
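Forward and backward filling can be sketched with the dedicated ffill() and bfill() methods, which do the same job as the method argument. Recent pandas releases deprecate passing method to fillna(), so the sketch uses the dedicated methods. The Series values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps in the middle and at the end
ser = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = ser.ffill()    # carry the last valid value forward
backward = ser.bfill()   # pull the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

Note that the last value stays NaN after a backward fill, because there is no later valid value to copy from.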

limit parameter. If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. It must be greater than 0 if not None.
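Both meanings of limit can be sketched on a small made-up Series: once with a forward fill, once with a plain fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a two-value gap and a trailing gap
ser = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# With a fill direction: at most 1 consecutive NaN is filled per gap
partial_ffill = ser.ffill(limit=1)

# With a plain value: at most 1 NaN is filled along the whole axis
one_filled = ser.fillna(0.0, limit=1)

print(partial_ffill.tolist())
print(one_filled.tolist())
```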

interpolate(): One more way to handle missing values in a DataFrame is the interpolate() function. By default it replaces each NaN with a value between the previous and next valid values, using linear interpolation.

Interpolation simply means estimating an unknown value with the help of known quantities.

Syntax and its parameters:

DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)
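A minimal sketch of linear interpolation on a made-up Series. It is shown on a Series for brevity; the same method is available on a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a one-value gap and a two-value gap
ser = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# Default method is linear: gaps are filled with evenly spaced values
linear = ser.interpolate()

# limit=1 fills at most one consecutive NaN in each gap
limited = ser.interpolate(limit=1)

print(linear.tolist())
print(limited.tolist())
```

The gap between 3 and 9 is filled in equal steps, so the estimates depend only on the neighbouring valid values.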
  • Thank you for taking the time to read this article.
  • Share it if you find it useful so that others can learn.
