Cleansing Your Data With Pandas

How to remove inconsistencies from your acquired data like a boss.

Published in

We Are Orb

4 min readDec 5, 2017

Not all things that glitter are gold. Same with the data you find. Data collected for crunching can be fairly inconsistent especially manually collected data. If not cleaned, this can lead to errors in your calculation and a total mess for your personal project or the organization you work for. You don’t want that happening. That is why the utmost step after data collection is data cleansing to prevent embarrassment and heartbreaks.

Pandas is an open source python library used for manipulating data. It has powerful data structure and algorithm optimizations and provide an easy to use interface for manipulating data to suit your goal.

In this article , I will demonstrate the basics of using pandas to cleanse your data.

Prerequisites

Basic knowledge in Python
Programming Environment — Jupyter Notebook is the best around. It makes your data project organized. You can use the live version here.
This horrible inconsistent dataset

Steps Involved

Pandas is a really powerful tool. We use just that.

So fire up your jupyter notebook and create a new session by clicking new then python3 on the top right corner.

import pandas as pd   # import the pandas librarydataset = pd.read_csv("location/to/dataset.csv")dataset

Add the code above to your first cell and press shift + enter . This will run the code above in the current cell. You will see this.

Here we added the pandas tool to our environment in the first line by importing it. Then on the second line, we loaded the dataset into memory using the read_csv function and the last line we displayed the data as table so we can take a quick glance of how the data is mapped so we can easily navigate also we find that, there are some inconsistencies with the data like null values. This action just gives us a tip of the iceberg but not the whole story. Next step I will show how to find these inconsistencies with the power of pandas

Finding the missing ones

You can check to find places where there were empty inputs and also see the number of empty inputs. Missing data must be handled depending on the requirements of project. For example we can either consider rows or columns with missing data as bad and take it away or replace the missing data with different values to create balance.

Let’s go !

dataset["STREET ADDRESS"].isnull()

This displays whether a data is missing or not in a Boolean style. True means there is missing data , False shows otherwise

dataset["NO. OVERWEIGHT"].isnull().value_counts()

This output is more intuitive and shows the number of missing and non missing data. This helps you to know the general effect of the inconsistency on your data and take actions.

Taking Action

If we don’t find better replacement for the missing data, the best option is to remove them. We do that by

dataset = dataset["STREET ADDRESS"].dropna()

This function drops all the rows with the missing data like it’s hot(pun intended) and creates a new dataset so we can work on the data effectively.

Exercise

Let’s go deeper with the data set.

Targeting the number of overweight students in each school.

Use lesson learnt from the previous section to find the number of missing values in the NO. OVERWEIGHT column in the dataset.

Filling the gap

You will find that there are missing values in that as well. But we can choose to fill the missing values with the average of all number of overweight students in each school.

no_overweight = dataset["NO. OVERWEIGHT"]
new_data = no_overweight.fillna(no_overweight.mean())

Here you realize that, we used average of the non -missing values to fill in the missing values to give it a consistency.

You can use the lessons learnt from the previous section to check if there are still missing data in this column. Try it.

Link to source code here

Where to go from here — Useful Resources:

There are various other methods of cleansing data, you can do string manipulations, joining different datasets, merging values,etc.

Panda’s official doc for data manipulation — and see all the other different actions you can take against the data to make it usable for your goal.

Kaggle’s dataset — The best place to find open data and start exploring and publish your results.

Show us some ❤ and 👏 and follow our publication for more awesome articles on data science from authors 👫 around the globe and beyond