Get Started With Pandas In 5 mins

A tutorial walkthrough of Python Pandas Library

Bhavani Ravi
Jan 10, 2019 · 7 min read

For those of you who are getting started with Machine learning, just like me, would have come across Pandas, the data analytics library. In the rush to understand the gimmicks of ML, we often fail to notice the importance of this library. But soon you will hit a roadblock where you would need to play with your data, clean and perform data transformations before feeding it into your ML model.

Why do we need this blog when there are already a lot of documentation and tutorials? Pandas, unlike most python libraries, has a steep learning curve. The reason is that you need to understand your data well in order to apply the functions appropriately. Learning Pandas syntactically is not going to get you anywhere. Another problem with Pandas is that there is that there is more than one way to do things. Also, when I started with Pandas it’s extensive and elaborate documentation was overwhelming. I checked out the cheatsheets and that scared me even more.

In this blog, I am going to take you through Pandas functionalities by cracking specific use cases that you would need to achieve with a given data.

Setup and Installation

Create virtualenv

virtualenv -p python3 venv
source venv/bin/activate

Install Pandas

pip install pandas

Jupyter Notebook

pip install jupyter
jupyter notebook

Jupyter by default runs in your system-wide installation of python. In order to run it in your virtualenv follow the link and create a user level kernel https://anbasile.github.io/programming/2017/06/25/jupyter-venv/

Sample Data

Load data into Pandas

From CSV File

import pandas
df = pandas.read_csv("path_to_csv")

From Remote URL

import pandas
df = pandas.read_csv("remote/url/path/pointing/to/csv")

From DB

db = # Create DB connection object 
cur = db.cursor()
cur.execute("SELECT * FROM <TABLE>")
df = pd.DataFrame(cur.fetchall())

Each of the above snippets reads data from a source and loads it into Pandas’ internal data structure called DataFrame

Understanding Data

# 1. shows you a gist of the data
df.head()
# 2. Some statistical information about your data
df.describe()
# 3. List of columns headers
df.columns.values

Pick & Choose your Data

Indexes

Selecting Columns

# 1. Create a list of columns to be selected
columns_to_be_selected = ["Total", "Quantity", "Country"]
# 2. Use it as an index to the DataFrame
df[columns_to_be_selected]
# 3. Using loc method
df.loc[columns_to_be_selected]

Selecting Rows

# 1. using numerical indexes - iloc
df.iloc[0:3, :]
# 2. using labels as index - loc
row_index_to_select = [0, 1, 4, 5]
df.loc[row_index_to_select]

Filtering Rows

1. Total sales > 200000
df[df["Total"] > 200000]
2. Total sales > 200000 and in UK
df[(df["Total"] > 200000) & (df["Country"] == "UK")]

Playing With Dates

In our sample dataset, the Date_of_purchase is of type string, hence the first step would be to convert them to the DateTime type.

>>> type(df['Date of Purchase'].iloc[0])
str

Converting Column to DateTime Object

>>> df['Date of Purchase'] = pd.to_datetime(df['Date of Purchase'])
>>> type(df['Date of Purchase'].iloc[0])
pandas._libs.tslibs.timestamps.Timestamp

Extracting Date, Month & Year

df['Date of Purchase'].dt.date    # 11-09-2018
df['Date of Purchase'].dt.day # 11
df['Date of Purchase'].dt.month # 09
df['Date of Purchase'].dt.year # 2018

Grouping

Statistical operations

df["Total"].sum()
df[["Total", "Quantity"]].mean()
df[["Total", "Quantity"]].min()
df[["Total", "Quantity"]].max()
df[["Total", "Quantity"]].median()
df[["Total", "Quantity"]].mode()

Now in a real-world application, the raw use of these statistical functions are rare, often you might want to group data based on specific parameters and derive a gist of the data.

Let’s look at an example where we look at the country-wise, country & Region-wise sales.

# 1. Country wise sales and Quantity
df.groupby("Country").sum()
# 2. Quantity of sales over each country & Region
df.groupby(["Country", "Region"])["Quantity"].sum()
# 3. More than one aggregation
df.groupby(["Country", "Region"]).agg(
{'Total':['sum', 'max'],
'Quantity':'mean'})

Pivot Table

import numpy as np
df.pivot_table(index=["Country"],
columns=["Region"],
values=["Quantity"],
aggfunc=[np.sum])

Another advantage of the Pivot Table is that you can add as many dimensions and functions you want. It also calculates a grand total value for you

import numpy as np
df.pivot_table(index=["Country"],
columns=["Region","Requester"],
values=["Quantity"],
aggfunc=[np.sum],
margins=True,
margins_name="Grand Total"
)

Okay, that was a lot of information in 5 minutes. Take some time in trying out the above exercises. In the next blog, I will walk you through some more deeper concepts and magical visualizations that you can create with Pandas.

Every time you start learning Pandas, there is a good chance that you may get lost in the Pandas jargons like index, functions, numpy etc., But don’t let that get to you. What you really have to understand is that Pandas is a tool to visualize and get a deeper understanding of your data.

With that mindset take a sample dataset from your spreadsheet and try deriving some insights out of it. Share what you learn. Here is the link to my jupyter notebook for you to get started.


Did the blog nudge a bit to give Pandas another chance?
Hold the “claps” icon and give a shout out to me on twitter. Follow to stay tuned on future blogs

bhavaniravi

Everything I learn and experience in Tech

Bhavani Ravi

Written by

🔸 Software engineer @kissflow🔸 Code — Speak — Write — Teach 🔸 Python — Chatbots — ML 🔸 WomenInTech 🔸

bhavaniravi

Everything I learn and experience in Tech

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade