Statistical Analysis in Python using Pandas
In the next few minutes, we shall get ‘Pandas’ covered — An extremely popular Python library that comes with high-level data structures and a wide range of tools for data analysis that every Machine Learning practitioner must be familiar with!
“Pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python” — Pandas’ Mission Statement
Salient Features of the Library —
- Fast and efficient data manipulation with integrated indexing
- Integrated tools for reading/writing in various formats — CSV, text files, MS Excel, SQL, HDF5 etc.
- Smart data-alignment, integrated handling of missing values
- Flexible in terms of reshaping/pivoting datasets
- Supports slicing, fancy indexing and subsetting of huge datasets
- Size mutability
- High performance in merging/joining data
- Hierarchial axis indexing
- Time series functionality
- Optimized performance
- Last but not the least, it’s an Open-Source Python library
For Frequently Asked Questions on Pandas, refer Pandas Documentation
Getting Started with Pandas
Pandas Installation
pip install pandas
conda install pandas #for Anaconda
Refer pandas· PyPI for troubleshooting
Importing Pandas
import pandas as pd
Loading Data
df = pd.read_csv('Data.csv') #Any local folder/link
Find the dataset here — Source: Kaggle (Predict the 2016 NCAA Basketball Tournament)
Useful Operations
- head()/tail()
df.head() #returns the first 5 rows of the datasetOut:
df.tails() #returns the last 5 rows of the datasetOut:
- shape()
df.shape #returns the dimensions of the dataframeOut:
(145289, 8)
- tolist()
df.columns.tolist() #extract all the column names as a listOut:
['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']
- describe()
df.describe() #shows count, mean, std etc. for each columnOut:
- max()
df.max() #returns max value for all columnsOut:
Season 2016
Daynum 132
Wteam 1464
Wscore 186
Lteam 1464
Lscore 150
Wloc N
Numot 6
dtype: objectdf['Wscore'].max() #returns max value for that columnOut:
186
- mean()
df['Lscore'].mean() #returns the mean of that columnOut:
64.49700940883343
- argmax()
df['Wscore'].argmax() #to identify the row indexOut:
24970
- value_counts()
df['Season'].value_counts() #shows how many times each item appears in the columnOut:
2016 5369
2014 5362
2015 5354
2013 5320
2010 5263
2012 5253
2009 5249
2011 5246
2008 5163
2007 5043
2006 4757
2005 4675
2003 4616
2004 4571
2002 4555
2000 4519
2001 4467
1999 4222
1998 4167
1997 4155
1992 4127
1991 4123
1996 4122
1995 4077
1994 4060
1990 4045
1989 4037
1993 3982
1988 3955
1987 3915
1986 3783
1985 3737
Name: Season, dtype: int64
Accessing Values
As per Pandas Documentation, iloc is an “integer-location based indexing for selection by position”
df.iloc[[df['Wscore'].argmax()]]
#to get attributes about the game, we need to use the iloc[] functionOut:
Let’s take this a step further. Let’s say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored.
df.iloc[[df['Wscore'].argmax()]]['Lscore']Out:
24970 140
Name: Lscore, dtype: int64
When you see data displayed in the above format, you’re dealing with a Pandas Series object, not a dataframe object.
type(df.iloc[[df['Wscore'].argmax()]]['Lscore'])Out:
pandas.core.series.Seriestype(df.iloc[[df['Wscore'].argmax()]])Out:
pandas.core.frame.DataFrame
The following is a summary of the 3 data structures in Pandas:
Haven’t ever really used Panels yet!
When you want to access values in a Series, you’ll want to just treat the Series like a Python dictionary, so you’d access the value according to its key (which is normally an integer index)
df.iloc[[df['Wscore'].argmax()]]['Lscore'][24970]Out:
140df.iloc[:3]Out:
df.loc[:3]
Notice the slight difference in that iloc is exclusive of the second number, while loc is inclusive.
Below is an example of how you can use loc to achieve the same task as we did previously with iloc.
df.loc[df['Wscore'].argmax(), 'Lscore']Out:
140df.at[df['Wscore'].argmax(), 'Lscore']Out:
140
Sorting
Let’s say that we want to sort the dataframe in increasing order for the scores of the losing team.
df.sort_values('Lscore').head()Out:
df.groupby('Lscore')Out:
<pandas.core.groupby.DataFrameGroupBy object at 0x106828650>
Filtering Rows Conditionally
Now, let’s say we want to find all of the rows that satisfy a particular condition.
For example, I want to find all of the games where the winning team scored more than 150 points.
The idea behind this command is you want to access the column ‘Wscore’ of the dataframe df (df[‘Wscore’]), find which entries are above 150 (df[‘Wscore’] > 150), and then returns only those specific rows in a dataframe format (df[df[‘Wscore’] > 150]).
df[df['Wscore'] > 150]Out:
This also works if you have multiple conditions. Let’s say we want to find out when the winning team scores more than 150 points and when the losing team scores below 100.
df[(df['Wscore'] > 150) & (df['Lscore'] < 100)]Out:
Grouping
Another important function in Pandas is groupby(). This is a function that allows you to group entries by certain attributes (e.g Grouping entries by Wteam number) and then perform operations on them.
The next command groups all the games with the same Wteam number and finds where how many times that specific team won at home, on the road, or at a neutral site.
df.groupby('Wteam')['Wscore'].mean().head()Out:
Wteam
1101 78.111111
1102 69.893204
1103 75.839768
1104 75.825944
1105 74.960894
Name: Wscore, dtype: float64df.groupby('Wteam')['Wloc'].value_counts().head(9)Out:
Wteam Wloc
1101 H 12
A 3
N 3
1102 H 204
A 73
N 32
1103 H 324
A 153
N 41
Name: Wloc, dtype: int64df.valuesOut:
array([[1985, 20, 1228, ..., 64, 'N', 0],
[1985, 25, 1106, ..., 70, 'H', 0],
[1985, 25, 1112, ..., 56, 'H', 0],
...,
[2016, 132, 1246, ..., 77, 'N', 1],
[2016, 132, 1277, ..., 62, 'N', 0],
[2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)
"""Now, you can simply just access elements like you would in an array"""df.values[0][0]Out:
1985
Dataframe Iteration
In order to iterate through dataframes, we can use the iterrows() function. Below is an example of what the first two rows look like.
Each row in iterrows is a Series object.
for index, row in df.iterrows():
print row
if index == 1:
breakOut:
Season 1985
Daynum 20
Wteam 1228
Wscore 81
Lteam 1328
Lscore 64
Wloc N
Numot 0
Name: 0, dtype: object
Season 1985
Daynum 25
Wteam 1106
Wscore 77
Lteam 1354
Lscore 70
Wloc H
Numot 0
Name: 1, dtype: object
Extracting Rows and Columns
The bracket indexing operator is one way to extract certain columns from a dataframe.
df[['Wscore', 'Lscore']].head()
"""The bracket indexing operator is one way to extract certain columns from a dataframe"""Out:
df.loc[:, ['Wscore', 'Lscore']].head()
#you can acheive the same result by using the loc functionOut:
type(df['Wscore']) #difference between both operationsOut:
pandas.core.series.Seriestype(df[['Wscore']])Out:
pandas.core.frame.DataFrame#only difference is the brackets arounddf[0:3]Out:
You’ve seen before that you can access columns through df[‘col name’]. You can access rows by using slicing operations.
df.iloc[0:3,:] #Here's an equivalent using ilocOut:
Data Cleaning
The following isnull function will figure out if there are any missing values in the dataframe, and will then sum up the total for each column.
In this case, we have a pretty clean dataset.
df.isnull().sum()Out:
Season 0
Daynum 0
Wteam 0
Wscore 0
Lteam 0
Lscore 0
Wloc 0
Numot 0
dtype: int64
- dropna() — This function allows you to drop all(or some) of the rows that have missing values.
- fillna() — This function allows you to replace the rows that have missing values with the value that you pass in.
Visualizing Data
An interesting way of displaying Dataframes is through matplotlib.
import matplotlib.pyplot as plt
%matplotlib inline
#import matplotlib, a popular library for Data Visualizationax = df['Wscore'].plot.hist(bins=20)
ax.set_xlabel('Points for Winning Team')Out:
<matplotlib.text.Text at 0x113ca8ed0>
Creating Kaggle Submission CSVs
This isn’t directly Pandas related, but I assume that most people who use Pandas probably do a lot of Kaggle competitions as well.
As you probably know, Kaggle competitions require you to create a CSV of your predictions. Here’s some starter code that can help you create that csv file.
import numpy as np
import csv
results = [[0,10],[1,15],[2,20]]
results = pd.np.array(results)
print resultsOut:
[[ 0 10]
[ 1 15]
[ 2 20]]firstRow = [['id', 'pred']]
with open("result.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(firstRow)
writer.writerows(results)
Other Useful Functions
- drop() — This function removes the column or row that you pass in (You also have the specify the axis).
- agg() — The aggregate function lets you compute summary statistics about each group.
- apply() — Lets you apply a specific function to any/all elements in a Dataframe or Series.
- get_dummies() — Helpful for turning categorical data into one-hot vectors.
- drop_duplicates() — Lets you remove identical rows.
Additional Resources
Pandas has been around for a while and there are a lot of other good resources if you’re still interested in getting the most out of this library.
- http://pandas.pydata.org/pandas-docs/stable/10min.html
- https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
- http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
- https://www.dataquest.io/blog/pandas-python-tutorial/
- https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view
- https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y
Do check out my GitHub Repositories for more implementations using Pandas —
Also, do not forget to go through — Pandas Documentation