Data Cleaning in Python using Pandas

Ashita Saxena
Jun 5, 2020 · 3 min read

Why Learn Data Cleaning?

Data scientists can end up doing a wide variety of things across a wide variety of industries, but almost every data science job shares at least one thing in common: data cleaning.

When part of our data is missing, for whatever reason, the accuracy of our predictions can drop sharply. This is where data cleansing comes in: cleaning the data first helps us get more accurate results.

According to IBM Data Analytics, you can expect to spend up to 80% of your time cleaning data.

Sources of Missing Values

Before we look at the code, let’s consider where the missing values in a dataset come from.

  • A user forgot to fill in a field.
  • Data was lost while being transferred manually from a legacy database.
  • There was a programming error.
  • Users chose not to fill out a field because of their beliefs about how the results would be used or interpreted.

Ways to Cleanse Missing Data in Python:

To cleanse missing data in Python, you can drop the missing values, replace them, fill them with a scalar value, or fill them forward or backward.

1️⃣ Dropping Missing Values

You can exclude missing values from your dataset using the dropna() method.

In order to see how the dropna() method works, let’s first create a data frame with random values:

Creating a dataframe
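The original data frame is not reproduced here, so the following is only a minimal sketch of how such a frame could be built. The column names, the seed, and the positions of the missing cells are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a fixed seed keeps the random values reproducible.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((5, 3)), columns=["a", "b", "c"])

# Blank out a few cells to simulate missing values (positions are arbitrary).
frame.iloc[1, 0] = np.nan
frame.iloc[3, 1] = np.nan
frame.iloc[3, 2] = np.nan

print(frame)
print(frame.isna().sum())  # count of missing values in each column
```

Counting the missing values per column with isna().sum() is a quick way to see how much cleaning a dataset needs before choosing one of the methods that follow.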

Now call the dropna() method:

>>> frame.dropna()

dropna() method

Hence, with the help of this method, every row that contains at least one missing value is dropped.
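As a rough illustration of what dropna() removes, here is a small sketch on a hand-made frame. The values and column names are hypothetical; the axis and how arguments are standard dropna() parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells.
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
    "c": [9.0, 10.0, 11.0, 12.0],
})

rows_dropped = frame.dropna()          # drop every row that has any NaN
cols_dropped = frame.dropna(axis=1)    # drop every column that has any NaN
all_missing = frame.dropna(how="all")  # drop a row only if all its values are NaN

print(rows_dropped)
print(cols_dropped)
print(all_missing)
```

Note that dropna() returns a new data frame and leaves the original untouched, so the result has to be assigned to a variable if you want to keep it.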

2️⃣ Replacing Missing Values

To replace each NaN we have in the dataset, we can use the replace() method.

>>> from numpy import nan

>>> frame.replace({nan: 0.00})

replace() method

In the above code, all the missing values are replaced by 0.00.
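A self-contained version of the same idea might look like the sketch below. The frame contents are made up for illustration, and np.nan is used as the missing-value marker.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# Map every NaN to 0.00; other values are left as they are.
replaced = frame.replace({np.nan: 0.00})

print(replaced)
```

Like dropna(), replace() returns a new data frame, so the original frame still contains its NaN values afterwards.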

3️⃣ Replacing with a Scalar Value

We can use the fillna() method for this.

>>> frame.fillna(7)

fillna() method
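The sketch below shows fillna() with a single scalar and, as one possible variation, with a different fill value per column passed as a dictionary. The data and the choice of fill values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

# Fill every NaN with the same scalar.
filled = frame.fillna(7)

# Fill each column with its own value: 0 for "a", the column mean for "b".
per_column = frame.fillna({"a": 0, "b": frame["b"].mean()})

print(filled)
print(per_column)
```

A single scalar is the simplest choice, while a per-column value such as the mean can be a better fit when the columns have very different scales.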

4️⃣ Filling Forward or Backward

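If we supply a method parameter to the fillna() method, we can fill forward or backward as needed. To fill forward, use pad or ffill; to fill backward, use bfill or backfill. In recent pandas releases this parameter is deprecated, and calling frame.ffill() or frame.bfill() directly is preferred.

>>> frame.fillna(method='pad')  # or frame.ffill() in recent pandas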
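To make the direction of the fill concrete, here is a small sketch using the DataFrame methods ffill() and bfill(). The frame is hypothetical and arranged so that some gaps have no neighbouring value to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" starts with a NaN and ends with two NaNs.
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

forward = frame.ffill()   # copy the last valid value downwards
backward = frame.bfill()  # copy the next valid value upwards

print(forward)
print(backward)
```

A forward fill cannot fill a gap at the top of a column, and a backward fill cannot fill one at the bottom, so some NaN values may remain after either call.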

Conclusion

In this look at data cleansing in Python, we saw how to clean missing data using two libraries, pandas and numpy. Since data scientists can spend up to 80% of their time cleaning and manipulating data, it is an essential skill to learn for data science.

Python is the “most powerful language you can still read”.

- Paul Dubois

THANK YOU!!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…
