Cleansing Your Data With Pandas
How to remove inconsistencies from your acquired data like a boss.
Not all things that glitter are gold. Same with the data you find. Data collected for crunching can be fairly inconsistent especially manually collected data. If not cleaned, this can lead to errors in your calculation and a total mess for your personal project or the organization you work for. You don’t want that happening. That is why the utmost step after data collection is data cleansing to prevent embarrassment and heartbreaks.
Pandas is an open source python library used for manipulating data. It has powerful data structure and algorithm optimizations and provide an easy to use interface for manipulating data to suit your goal.
In this article , I will demonstrate the basics of using pandas to cleanse your data.
Prerequisites
- Basic knowledge in Python
- Programming Environment — Jupyter Notebook is the best around. It makes your data project organized. You can use the live version here.
- This horrible inconsistent dataset
Steps Involved
Pandas is a really powerful tool. We use just that.
So fire up your jupyter notebook and create a new session by clicking new
then python3
on the top right corner.
import pandas as pd # import the pandas librarydataset = pd.read_csv("location/to/dataset.csv")dataset
Add the code above to your first cell and press shift + enter
. This will run the code above in the current cell. You will see this.
Here we added the pandas tool to our environment in the first line by importing it. Then on the second line, we loaded the dataset into memory using the read_csv
function and the last line we displayed the data as table so we can take a quick glance of how the data is mapped so we can easily navigate also we find that, there are some inconsistencies with the data like null values. This action just gives us a tip of the iceberg but not the whole story. Next step I will show how to find these inconsistencies with the power of pandas
Finding the missing ones
You can check to find places where there were empty inputs and also see the number of empty inputs. Missing data must be handled depending on the requirements of project. For example we can either consider rows or columns with missing data as bad and take it away or replace the missing data with different values to create balance.
Let’s go !
dataset["STREET ADDRESS"].isnull()
This displays whether a data is missing or not in a Boolean style. True means there is missing data , False shows otherwise
dataset["NO. OVERWEIGHT"].isnull().value_counts()
This output is more intuitive and shows the number of missing and non missing data. This helps you to know the general effect of the inconsistency on your data and take actions.
Taking Action
If we don’t find better replacement for the missing data, the best option is to remove them. We do that by
dataset = dataset["STREET ADDRESS"].dropna()
This function drops all the rows with the missing data like it’s hot(pun intended) and creates a new dataset so we can work on the data effectively.
Exercise
Let’s go deeper with the data set.
Targeting the number of overweight students in each school.
Use lesson learnt from the previous section to find the number of missing values in the NO. OVERWEIGHT column in the dataset.
Filling the gap
You will find that there are missing values in that as well. But we can choose to fill the missing values with the average of all number of overweight students in each school.
no_overweight = dataset["NO. OVERWEIGHT"]
new_data = no_overweight.fillna(no_overweight.mean())
Here you realize that, we used average of the non -missing values to fill in the missing values to give it a consistency.
You can use the lessons learnt from the previous section to check if there are still missing data in this column. Try it.
Where to go from here — Useful Resources:
There are various other methods of cleansing data, you can do string manipulations, joining different datasets, merging values,etc.
Panda’s official doc for data manipulation — and see all the other different actions you can take against the data to make it usable for your goal.
Kaggle’s dataset — The best place to find open data and start exploring and publish your results.
Show us some ❤ and 👏 and follow our publication for more awesome articles on data science from authors 👫 around the globe and beyond