Data Analytics using Pandas: A Quick Practical Tutorial 1
Pandas is a magnificent tool that empowers Python enormously for data analytics. It can read and transform structured data in tons of ways. This article will take you through some practical Pandas data transformations, using the English Premier League season 2014/2015 results as the dataset.
This practical tutorial gets you straight to the functions and methods of Pandas for manipulating data and DataFrames, without much explanation of syntax and options. I try to focus on how Pandas functions work for us to get the answers we need, and leave the readers to find out the huge array of options and the variety of other syntax patterns from other sources by themselves.
I summarize the Pandas methods/functions learnt (and what they do) for each section in its Pandas Summary subsection.
The Dataset
The explanation of column names and abbreviations can be read here. Some abbreviations which will be used in this tutorial are FTR (Full Time Result: H for a home win, D for a draw, A for an away win), HTR (Half Time Result, with the same H/D/A coding), and HomeTeam/AwayTeam (the names of the two teams in a match).
Coding Environments
I simply installed Anaconda 5.2 for macOS High Sierra (64-bit with Python 3.6) on my system. I use Jupyter Notebook as my coding IDE.
I put the E0.csv file under the datafile folder, as shown below.
Data Preparations
Loading The E0.csv file
The first step is to create a DataFrame (DF) by importing the .csv file. Simply use pd.read_csv(). In the code below, the first DF created from the E0.csv file is named d0.
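A minimal sketch of this step could look like the following (assuming the notebook sits next to the datafile folder):

import pandas as pd

d0 = pd.read_csv('datafile/E0.csv')   # create a DF from the csv file
display(d0)                           # Jupyter's rich table rendering
print(len(d0))                        # count the rows of the DF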
The display() is Jupyter Notebook's feature to magically display the table in a very eye-pleasing way. Using print() is OK, but display() is more beautiful on the Notebook.
At the end of the displayed table, its size is also shown: 381 rows × 68 columns. The last line showing 381 is the result of the print(len(d0)) command.
Pandas summary:
- pd.read_csv(): create a DF by importing a csv file.
- len(d0): count the rows of the DF.
Display only the First or the Last few Rows
From the display() command above, although some rows in the middle are skipped, it is still a long output. We can choose to display only the first or the last few rows using the .head() and .tail() methods. You can also specify the number of rows you want to display by passing it as a parameter, e.g., head(10) or tail(10).
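For example, something like:

display(d0.head())      # the first 5 rows by default
display(d0.tail(10))    # the last 10 rows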
Pandas summary:
- .head() and .tail(): view only the first few or the last few rows of the DF.
Checking for Missing Data
In the result of the tail() method above, you might notice that the last line of the data shows 380 NaN NaN ... NaN. The NaN indicates missing data. Let's examine the file by opening the E0.csv file in Jupyter Notebook (or in any text editor, e.g., Sublime), which looks like this:
When you scroll down to the very end, you probably see this.
Line 382 contains only an array of commas without any data in between. Therefore, line 382 causes the NaN to show up when we load the file using Pandas.
Further examination of the missing data can be done using the df.info() command. The df.info() shows the data types and, more importantly, the number of non-null rows (the number of rows with valid data) of each column.
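For example:

d0.info()   # prints the column names, non-null counts, and dtypes of all columns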
When scrolling down slightly and looking carefully, you can see that there are 3 unusual columns, namely SJH, SJD, and SJA, with only 40 non-null values (instead of 380 as in the other columns).
Let's take a look at the raw E0.csv file to see the root cause of the null data. In the file you will see the missing data in some columns, appearing as consecutive commas, as shown in the image below. These are the missing data of the columns named SJH, SJD, and SJA summarized by the df.info() above.
Pandas summary:
- df.info(): shows the name, number of non-null rows, and data type of each column.
Discarding (Dropping) Missing Data
Our objective here is to remove the last line (line 380 of Figure 6), which contains the missing data. To discard missing data, we can use df.dropna(how='any'). This removes all rows that contain a NaN.
However, dropping missing data can be trickier than you think. If we apply .dropna() to the entire dataset (the d0 here), any rows with missing data in the SJH, SJD, and SJA columns will be removed too. The 380 rows of all columns will be shortened down to only 40 rows, which is certainly NOT what we want.
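A one-line sketch of what goes wrong if we drop the NaN rows from the whole d0 (the row count follows from the df.info() output above):

print(len(d0.dropna(how='any')))   # only about 40 rows survive, because of the SJH/SJD/SJA columns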
In our tutorial here, we will use only the first few columns and not the SJH, SJA, and SJD. Therefore, we shall first select the columns we want to use, which excludes the SJH, SJA, and SJD columns, and then discard the NaN rows.
Pandas summary:
- df.dropna(how='any'): removes all rows that contain NaN.
Selecting Column(s) from DF
To select ONE column from a DF, we can use df['column name'], just like selecting an element from a List in Python. You can also use the df.column_name format. The first is easier to remember, as it is the same as the List element selection syntax, and it also supports column names with spaces. The latter is slightly easier to write.
The column selection will not affect the original DF (d0). We have to assign the output to a new DF (d1). The statement d1 = d0['HomeTeam'] creates a new DF named d1 by selecting all rows of the column named 'HomeTeam' from the DF d0.
To select several columns, use double brackets: df[['name1','name2']]. The code below combines the column selection and dropna() in one line. Notice that the line 380 with NaN is removed.
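A sketch of that one-liner; I am assuming the exact column list here (the original screenshot may keep slightly different columns, but it includes at least Date, HomeTeam, AwayTeam, FTR, and HTR, which are used in the later problems):

d1 = d0[['Date','HomeTeam','AwayTeam','FTR','HTR']].dropna(how='any')
display(d1)   # 380 rows; the all-NaN line 380 is gone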
Pandas Summary
- df['column name'] or df.column_name: select only a specific column of the DF.
- d0[['Date','HomeTeam','AwayTeam','FTR']]: select specific columns of the DF.
Data Analytics
Problem 1: Get the names of all teams that played this season
Put simply, get the unique names of all teams. We all know that there must be exactly 20 team names. An algorithm to solve this problem is simple: get the unique team names from either the HomeTeam or the AwayTeam column.
There are a few ways of doing this.
First, use .unique(). Thus print(d1['HomeTeam'].unique()) will give you the answer. Of course, print(d1.HomeTeam.unique()) will also work. The output is of type numpy.ndarray. To ensure correctness, we can use .nunique() to count the result.
Second, exploit the uniqueness property of Python's Set. Throw the HomeTeam names into a set and it will remove the duplicates automatically for you. Thus, print(set(d1['HomeTeam'])) will also give you the answer. The output is a set. To count the elements of the Set, we can use len(), as shown below.
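Both approaches could look roughly like this:

# approach 1: .unique() and .nunique()
print(d1['HomeTeam'].unique())    # numpy.ndarray of the 20 team names
print(d1['HomeTeam'].nunique())   # 20

# approach 2: a Python Set removes duplicates automatically
teams = set(d1['HomeTeam'])
print(teams)
print(len(teams))                 # 20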
Pandas Summary
- .unique(): get the unique elements of the column.
- .nunique(): count the unique elements.
- set(df['HomeTeam']): convert a DF column to a Set.
Problem 2: How many matches did Liverpool win this season?
Algorithm: the FTR column, Full Time Result, indicates which team won the match: H, D, or A. Thus, when Liverpool played as the home team, we have to filter only the rows with FTR=H, and when it played as the away team, we have to filter only the rows with FTR=A. Then we count the rows of both scenarios to get the final answer.
In practice, we first find the matches Liverpool won as the home team. Then, to get the answer for the away-team part, it is just a matter of copy and paste with some minor modifications.
In coding, there are a couple of ways to achieve this.
Solution 1
We can perform a series of filters. We first filter Liverpool from HomeTeam. Then we filter H from FTR. Then we count.
To filter Liverpool from HomeTeam, we use
d21 = d1[d1['HomeTeam'] == 'Liverpool']
(or d21 = d1[d1.HomeTeam == 'Liverpool']).
Then we can filter the d21 again with FTR=='H' using the same syntax:
d22 = d21[d21['FTR'] == 'H']
The last step is to count the result using len(d22).
To find all away wins, simply change HomeTeam to AwayTeam, and FTR=='H' to FTR=='A'.
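Putting both pieces together (the away-side names d21a and d22a are mine, not from the original code):

# home wins
d21 = d1[d1['HomeTeam'] == 'Liverpool']
d22 = d21[d21['FTR'] == 'H']
print(len(d22))

# away wins
d21a = d1[d1['AwayTeam'] == 'Liverpool']
d22a = d21a[d21a['FTR'] == 'A']
print(len(d22a))

print(len(d22) + len(d22a))   # total wins this season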
Solution 2
Instead of applying 2 filters on different columns, we can also set one column as the index of the DF (the index is the header of the rows), then select the rows we want using the index label. This is basically another way to filter the data by row. After that, we apply another filter on the other column. In this case, we will set HomeTeam as the index, filter the HomeTeam using the index selection method, then filter the FTR column to get the answer.
In the code shown below, we first create a new DF named d23, derived from d1 with only 2 columns, HomeTeam and FTR. I print out the d23 to show that the index is integers (0, 1, 2, …).
Then we set the HomeTeam as the index of the d23 using d23.set_index(). The inplace=True option applies the new index setting to the d23 itself. You can see from the result that the HomeTeam becomes the index, replacing the integers 0, 1, 2, …
We can use .loc[] to refer to specific rows of the DF using the index name (i.e., to filter the rows using the index). Thus, with the HomeTeam as the index, we can simply locate index='Liverpool' with d23.loc['Liverpool'].
To select the FTR=='H', we can just use the same filtering syntax as in Solution 1 above. Finally, we can use len() or .count() to get the final answer.
As shown in the code below, the FTR=='H' filter is applied first, followed by the .loc[] to filter Liverpool from the HomeTeam index.
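A sketch of this index-based approach:

d23 = d1[['HomeTeam','FTR']]              # a new DF with 2 columns and the integer index
display(d23)

d23.set_index('HomeTeam', inplace=True)   # HomeTeam replaces the integer index
display(d23)

# filter FTR=='H' first, then select the 'Liverpool' rows by index label
print(len(d23[d23['FTR'] == 'H'].loc['Liverpool']))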
Pandas Summary
- df[df.HomeTeam=='Liverpool']: filter data from a specific column.
- df.set_index('HomeTeam', inplace=True): set a specific column as the new index and apply it to the DF.
- df.loc['Liverpool']: select (filter) rows by index name.
Problem 3: Count the number of HOME wins of all teams
This problem will introduce you to a new Pandas method: .groupby(). The algorithm is quite simple: select the HomeTeam with FTR=H, then groupby team name, and then count.
To demonstrate how groupby() works, I split the code into d31, d32, d33, and d34. The actual answer to this question requires only the d31 and d34.
As shown in the code below, d31 derives from d1 with 2 columns, HomeTeam and FTR, and with the FTR=='H' filter. d32 creates a groupby of d31 using HomeTeam, then selects only the first element of each group using .first(). I print out d31 and d32 to show how groupby() affects the output. One easily unnoticed impact of groupby() is that the column we use for the groupby(), i.e., HomeTeam in our case, becomes the index of the output.
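A sketch of the four steps (the names d31 to d34 follow the article):

d31 = d1[d1['FTR'] == 'H'][['HomeTeam','FTR']]        # home wins only
display(d31)                                          # still the integer index

d32 = d31.groupby('HomeTeam').first()                 # the first element of each group
display(d32)                                          # HomeTeam becomes the index

d33 = d31.groupby('HomeTeam').get_group('Liverpool')  # peek inside one group
display(d33)                                          # the integer index is kept here

d34 = d31.groupby('HomeTeam').count()                 # home wins per team (the answer)
display(d34)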
Let's explain a bit more about how groupby() works. The d32 groups d31 by HomeTeam, so the FTR values of the same team are grouped together. The groupby() method involves 3 stages: split, apply, and combine. The groupby() command alone only splits the DF into groups and, at this stage, you cannot directly print out the group elements using print() or display(). We need an 'apply' method to process and combine the data of each group and produce the final output. Examples of apply methods are count(), first(), sum(), or mean(). We use .first() in d32 to obtain the first element of each group, just to demonstrate the groupby() process. Again, notice that the HomeTeam column becomes the index after applying the groupby() method.
From the code above, the d33 is also just for demonstration purposes. Here the .get_group() is used as an apply function to take a look at the elements of one group. Notice that the integer index is not replaced by the HomeTeam. The get_group() only constructs a DF to display the elements of the named group but does not apply any significant function to the elements.
The d34 applies .count() to the groupby(), which counts the elements in each group. The HomeTeam becomes the index of the DF. This is the final answer for this problem.
Pandas Summary
- d31.groupby('HomeTeam'): groups the DF by the specified column.
- d31.groupby('HomeTeam').first(): groupby() then obtain the first element of each group.
- d31.groupby('HomeTeam').get_group('Liverpool'): groupby() then create a DF that contains only the elements of the specified group.
- d31.groupby('HomeTeam').count(): groupby() then count the elements of each group.
Problem 4: Compute total SCORE of all teams
A team gets 3 points if it wins a match, 1 if it draws, and 0 if it loses. The total score of the season for each team is the summation of these points over all its matches.
This problem will introduce you to some new Pandas tricks: creating a new DF from a Dictionary, composing IF-ELSE statements, and using .append() to combine two DFs.
Algorithm: First we will create a new DF with a team-name column and a points column. The points column is a new column that does not exist in the original dataset (it does not exist in the .csv file). The point computation will be separated into 2 parts, home matches and away matches, since the points are derived from the FTR differently. Specifically, for home matches the FTR maps to points as H=3, D=1, A=0, while for away matches it is H=0, D=1, A=3. When we are done with the computation of both parts, we combine both DFs together and sum() the points.
The first part of the code and its outputs are shown below. This part creates the new DFs and converts the FTR values {H, D, A} to points ({3, 1, 0} for home matches and {0, 1, 3} for away matches).
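A sketch of this first part:

# home matches: H -> 3 points, D -> 1, A -> 0
d41 = pd.DataFrame({'Team': d1.HomeTeam,
                    'Points': [3 if i == 'H' else 1 if i == 'D' else 0 for i in d1.FTR]})

# away matches: A -> 3 points, D -> 1, H -> 0
d42 = pd.DataFrame({'Team': d1.AwayTeam,
                    'Points': [3 if i == 'A' else 1 if i == 'D' else 0 for i in d1.FTR]})

display(d41)
display(d42)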
Let's explain the code in more detail. The pd.DataFrame() creates a new DF. The new DF is created from a Dictionary data structure, as {column_name : element_list}. In our case, the Team column is taken straightforwardly from d1.HomeTeam. The Points column is a list of either 3 or 1 or 0, according to the IF-ELSE condition based on the FTR values.
The IF-ELSE condition begins by taking the elements from d1.FTR (as specified at the very end of the clause). Each element is then put through the IF statement ('H' or 'D' or something else, which can only be 'A' here). An integer of either 3, 1, or 0 is returned for each element of d1.FTR.
The code below combines d41 and d42 using .append(), then performs a groupby() on the Team column, and applies the sum() function.
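A sketch of that combination, chaining the three calls together (note that .append() was removed in pandas 2.0; on newer versions pd.concat([d42, d41]) does the same job):

d43 = d42.append(d41).groupby('Team').sum()   # total season points per team
display(d43)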
.append() combines 2 DFs together by simply appending the rows of one DF at the end of the other. For a smooth append, both DFs must have the same number of columns and the same column names.
Applying the sum() method to the groupby() adds all the points in each group together. If you want to see what happens before the sum() method, you can try the .get_group() like this:
d43 = d42.append(d41).groupby('Team').get_group('Liverpool')
display(d43)
Pandas Summary
- pd.DataFrame(): create a new DF from a Dictionary data structure.
- [3 if i=='A' else 1 if i=='D' else 0 for i in d1.FTR]: an example of applying IF-ELSE statements to the elements of df.column.
- d43 = d42.append(d41): appends all rows of one DF to another DF.
Problem 5: Count the WIN matches only with an HTR score lead (only wins where the team was also leading at half time)
There is an HTR field (Half Time Result) that specifies the result at half time: H, D, or A. This problem extends problem 3 by adding another condition: the HTR must be a win, on top of the FTR being a win.
Algorithm: This problem actually combines problems 2 and 3 together and thus comprises 2 main parts: filtering and grouping-by. The filtering part is similar to the solutions of problem 2:
- filter the result twice using the comparison clause (the == clause), just like solution 1 of problem 2. Or
- set one of the conditions (FTR or HTR) as the index, then use .loc[] along with the == clause, just like solution 2 of problem 2.
The grouping-by part is similar to problem 3. I put the code for the home-team scenario below. I leave the away-team scenario as an exercise for you.
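A sketch of the home-team scenario, using the chained-filter style of solution 1 (the name d51 is mine; d52 and d53 follow the article):

d51 = d1[['HomeTeam','AwayTeam','FTR','HTR']]
d52 = d51[d51['FTR'] == 'H']    # a full-time home win...
d52 = d52[d52['HTR'] == 'H']    # ...that was already a lead at half time
d53 = d52.groupby('HomeTeam')['FTR'].count()
display(d53.sort_values(ascending=False))   # a Series, so no by= parameter is needed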
In the code above I have added some new tricks, including:
- select the column right within the groupby() statement. Specifically, the groupby() statement in my code looks like this:
d53 = d52.groupby('HomeTeam')['FTR'].count()
The d52 has 4 columns before being grouped-by: HomeTeam, AwayTeam, FTR, and HTR. However, we want only 1 column after the .count(). To select specific column(s) from the result of the groupby() statement (i.e., d53), you can simply specify the column names after the groupby() clause. Here I select FTR. If you are not quite sure what I am talking about here, try taking out the ['FTR'] from the statement above and observe the result by yourself.
- sort the results of the count() method using .sort_values(). If the resulting DF has more than 1 column, you would need by='column_name' as another parameter of the .sort_values(), e.g., .sort_values(by='FTR', ascending=False).
Pandas Summary
- d53 = d52.groupby('HomeTeam')['FTR'].count(): group by HomeTeam and count. The result will have HomeTeam as the index and 1 column named FTR.
- .sort_values(by='FTR', ascending=False): sort the input by the FTR column from the greatest value to the least. If there is only 1 column, the by= parameter must be removed.
Problem 6: Create a Home-Win Table
Show a home-win table of all teams that looks like this:
Algorithm: This problem adds some new tricks for manipulating the groupby() method. Rather than performing sum() or count(), we append or join the data in each group together to create a nice-looking table. Before the groupby, we just select the HomeTeam and FTR columns, and replace the D and A letters with a dash (-).
Here is the code.
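A sketch of it (the names d61 and d62 are mine, not from the original screenshot):

d61 = d1[['HomeTeam','FTR']].replace({'D': '-', 'A': '-'})   # replace D and A with a dash
d62 = d61.groupby('HomeTeam')['FTR'].apply(' '.join)         # join each team's home results with spaces
display(d62)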
From the code, the .replace() simply finds and replaces the specified text in the entire DF (all rows of all columns). So pick the search term carefully, or the changes might be made where they are not intended.
The .apply(' '.join) joins the data in each group (the FTR of each team in this case) together, with a space in between.
Pandas Summary
- .replace(): search the entire DF for a text pattern and replace it with the specified text.
- .apply(' '.join): concatenate the data of each group (of the groupby() function).
That’s it for this tutorial. Hope it helps. Peace.