Data Analytics using Pandas: A Quick Practical Tutorial 1

LuckSpark · Aug 25, 2018 · 14 min read

Pandas is a magnificent tool that greatly empowers Python for data analytics. It can read and transform structured data in tons of ways. This article will take you through some practical Pandas data transformations, using the English Premier League season 2014/2015 results as the dataset.

This practical tutorial gets you straight to the Pandas functions and methods for manipulating data and DataFrames, without much explanation of syntax and options. I focus on how Pandas functions get us the answers we need, and leave it to readers to explore the huge array of options and variety of other syntax patterns from other sources.

I summarize the Pandas methods/functions learned in each section (and what they do) in its Pandas Summary subsection.

The Dataset

This is the direct link to download .csv file of English Premier League season 2014/2015 results used in this tutorial.

The explanation of column names and abbreviations can be read here. Some abbreviations which will be used in this tutorial are:

Figure 0: Some abbreviation descriptions of the E0.csv file. Only these columns are used in this tutorial.

Coding Environments

I simply installed Anaconda 5.2 for macOS High Sierra (64-bit with Python 3.6) on my system. I use Jupyter Notebook as my coding IDE.

I put E0.csv under the datafile folder, as shown below.

Figure 1: File locations viewed from Jupyter Notebook.
Figure 2: The E0.csv file is in the ./datafile folder.

Data Preparations

Loading The E0.csv file

The first step is to create a DataFrame (DF) by importing the .csv file. Simply use pd.read_csv(). In the code below, the first DF created from the E0.csv file is named d0.

Figure 3: Import Pandas and create DF by reading from a .csv file. Display the table on Jupyter notebook.

display() is a Jupyter Notebook feature that renders the table in a very eye-pleasing way. Using print() is OK, but display() looks much nicer in a notebook.

At the bottom of the output shown below, the size of the table is also shown — 381 rows × 68 columns. The last line showing 381 is the result of the print(len(d0)) command.

Figure 4: Size of the DF and output of len(d0).
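For readers following along without the screenshots, here is a minimal runnable sketch of this step. The tiny stand-in rows below are hypothetical, not taken from the real E0.csv:

```python
import pandas as pd

# With the real file, the load is one line (path as in Figure 2):
# d0 = pd.read_csv('./datafile/E0.csv')

# Hypothetical stand-in rows so the sketch runs on its own:
d0 = pd.DataFrame({
    'Date':     ['16/08/14', '16/08/14', '17/08/14'],
    'HomeTeam': ['Arsenal', 'Liverpool', 'Everton'],
    'AwayTeam': ['Crystal Palace', 'Southampton', 'Chelsea'],
    'FTR':      ['H', 'H', 'D'],
})

display(d0) if 'display' in dir() else print(d0)  # display() only exists inside Jupyter
print(len(d0))  # row count of the DF
```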

Pandas summary:

  • pd.read_csv() : to create a DF by importing a csv file
  • len(d0) : to count the rows of the DF

Display only the First or the Last few Rows

From the display() command above, although some rows in the middle are skipped, the output is still long. We can choose to display only the first or the last few rows using the .head() and .tail() methods. You can also specify the number of rows to display by passing it as a parameter, e.g., head(10) or tail(10).

Figure 5: The .head() method displays the first few rows of the DF.
Figure 6: The tail() method displays the last few rows of the DF.
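A quick sketch of both methods, on a hypothetical four-row stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the E0.csv DataFrame:
d0 = pd.DataFrame({'HomeTeam': ['Arsenal', 'Liverpool', 'Chelsea', 'Everton'],
                   'FTR':      ['H', 'H', 'D', 'A']})

print(d0.head(2))  # first 2 rows (default is 5)
print(d0.tail(2))  # last 2 rows (default is 5)
```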

Pandas summary:

  • .head() and .tail() : to view only the first few or the last few rows of DF.

Checking for Missing Data

In the image above, showing the result of the tail() method, you might notice that the last line of the data shows 380 NaN NaN ... NaN. NaN indicates missing data. Let’s examine the file by opening E0.csv in Jupyter Notebook (or any text editor, e.g., Sublime), which looks like this:

Figure 7: E0.csv screen shown on Jupyter Notebook.

When you scroll down to the very end, you probably see this.

Figure 8: The last lines of the E0.csv file.

Line 382 contains only a row of commas without data in between. Therefore, line 382 causes the NaN to show up when we load the file with Pandas.

Further examination of the missing data can be done with the df.info() command. df.info() shows the data type and, more importantly, the number of non-null rows (the number of rows with valid data) of each column.

Figure 9: df.info() shows the name, number of non-null rows, and data types of each column.

When scrolling down slightly and looking carefully, you can see 3 unusual columns, namely SJH, SJD, and SJA, with only 40 non-null values (instead of 380 as in the other columns).

Figure 10: Notice the 40 non-null columns: SJH, SJD, and SJA. This indicates that these 3 columns have 40 non-null rows of data, and probably 380 − 40 = 340 rows of missing data.

Let’s take a look at the raw E0.csv file to see the root cause of the null data. In the file you will see missing data in some columns, appearing as consecutive commas, as shown in the image below. These are the missing data of the SJH, SJD, and SJA columns summarized by df.info() above.

Figure 11: The missing data viewed through the raw text editor.
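The pattern can be reproduced in miniature. The CSV fragment below is a hypothetical imitation of E0.csv: a sparse SJH column plus a trailing line of bare commas:

```python
import io
import pandas as pd

# Hypothetical fragment mimicking E0.csv's sparse column and empty last line:
csv_text = "HomeTeam,FTR,SJH\nArsenal,H,2.0\nLiverpool,H,\nChelsea,D,\n,,\n"
d0 = pd.read_csv(io.StringIO(csv_text))

d0.info()                        # per-column non-null counts and dtypes
print(d0['SJH'].notna().sum())   # count the valid SJH rows directly
```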

Pandas summary:

  • df.info() : shows the name, number of non-null rows, and data types of each column.

Discarding (Dropping) Missing Data

Our objective here is to remove the last line (line 380 of Figure 6) containing the missing data. To discard missing data, we can use df.dropna(how='any'). This removes all rows that contain any NaN.

However, dropping missing data can be trickier than you think. If we apply .dropna() to the entire dataset (d0 here), any rows with missing data in the SJH, SJD, and SJA columns will be removed too. The 380 rows will be shortened to only 40 rows, which is certainly NOT what we want.

Figure 12: d0.dropna() will remove all rows with missing data, ending up with only 40 rows left.

In this tutorial we will use only the first few columns and not SJH, SJD, and SJA. Therefore, we shall first select the columns we want to use, which excludes the SJH, SJD, and SJA columns, and then discard the NaN rows.
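The trap, and the select-then-drop fix, in a self-contained sketch (the three-row frame is made up; the NaN layout imitates the real file):

```python
import numpy as np
import pandas as pd

# Stand-in frame: a sparse column (like SJH) plus an all-NaN last row (like line 382).
d0 = pd.DataFrame({'HomeTeam': ['Arsenal', 'Liverpool', np.nan],
                   'FTR':      ['H', 'D', np.nan],
                   'SJH':      [np.nan, 2.1, np.nan]})

# Dropping on the full frame also loses rows where only SJH is missing:
print(len(d0.dropna(how='any')))   # too aggressive

# Select the needed columns first, then drop:
d1 = d0[['HomeTeam', 'FTR']].dropna(how='any')
print(len(d1))                     # only the all-NaN row is gone
```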

Pandas summary:

  • df.dropna(how='any') : removes all rows that contain a NaN.

Selecting Column(s) from DF

To select ONE column from a DF, we can use df['column name'], just like selecting an element from a Python List. You can also use the df.column_name format. The first is easier to remember, as it is the same as the List element selection syntax, and it supports column names with spaces. The latter is slightly easier to type.

The column selection does not affect the original DF (d0). We have to assign the output to a new DF (d1). The line d1 = d0['HomeTeam'] creates d1 by selecting all rows of the column named 'HomeTeam' from the DF d0 (strictly speaking, a single column comes back as a Series rather than a DF).

Figure 13: Create a new DF, d1, from HomeTeam column of d0.

To select several columns, use double brackets — df[['name1','name2']]

Figure 14: Selecting multiple columns of a DF.

The code below combines the column selection and dropna() in one line. Notice that line 380 with NaN is removed.

Figure 15: Combine the selection of multiple columns and the dropna( ) in one line.
Figure 16: The last rows of Figure 15 show that the NaN row (row 380 of Figure 6) is removed.
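Both selection styles and the one-line select-and-drop, sketched on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with an all-NaN last row:
d0 = pd.DataFrame({'Date':     ['16/08/14', '17/08/14', np.nan],
                   'HomeTeam': ['Arsenal', 'Liverpool', np.nan],
                   'AwayTeam': ['Crystal Palace', 'Southampton', np.nan],
                   'FTR':      ['H', 'H', np.nan]})

s = d0['HomeTeam']          # single column (a Series); d0.HomeTeam is equivalent
# Double brackets select several columns; chain dropna() in the same line:
d1 = d0[['Date', 'HomeTeam', 'AwayTeam', 'FTR']].dropna(how='any')
print(len(d1))              # the NaN row is removed
```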

Pandas Summary

  • df['column name'] or df.column_name : select only a specific column of the DF.
  • d0[['Date','HomeTeam','AwayTeam','FTR']] : select specific columns of the DF.

Data Analytics

Problem 1 : Get the name of all teams played this season

Put simply, get the unique names of all teams. We know that there must be exactly 20 team names. The algorithm is simple: get the unique team names from either the HomeTeam or the AwayTeam column.

There are a few ways of doing this.

First, use .unique(). Thus print(d1['HomeTeam'].unique()) will give you the answer. Of course, print(d1.HomeTeam.unique()) also works. The output is of type numpy.ndarray. To ensure correctness, we can use .nunique() to count the result.

Figure 17: Get a list of unique elements of a column using df['column_name'].unique(). Use .nunique() to count.

Second, exploit the uniqueness property of Python’s Set. Throw the HomeTeam names into a set and it removes duplicates automatically. Thus print(set(d1['HomeTeam'])) will also give you the answer. The output is a set. To count the elements of the set, we can use len(), as shown below.

Figure 18: Get a list of unique elements of a column using Set.
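Both routes side by side, on a hypothetical column with a duplicate:

```python
import pandas as pd

# Hypothetical subset of HomeTeam values with a duplicate:
d1 = pd.DataFrame({'HomeTeam': ['Arsenal', 'Liverpool', 'Arsenal', 'Chelsea']})

print(d1['HomeTeam'].unique())     # numpy.ndarray of unique names
print(d1['HomeTeam'].nunique())    # count of unique names
print(len(set(d1['HomeTeam'])))    # same count via Python's Set
```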

Pandas Summary

  • .unique(): get the unique elements from the column.
  • .nunique() : count the unique elements.
  • set(df['HomeTeam']) : convert DF’s column to Set.

Problem 2: How many matches did Liverpool win this season?

Algorithm: the FTR column, Full Time Result, indicates which team won the match — H, D, or A. Thus when Liverpool played as the home team, we filter only rows with FTR=H, and when it played as the away team, we filter only rows with FTR=A. Then we count the rows of both scenarios to get the final answer.

In practice, we find Liverpool’s home wins first. Getting the away-team part is then just a matter of copy and paste with some minor modifications.

In code, there are a couple of ways to achieve this.

Solution 1

We can perform a series of filters. We first filter Liverpool from HomeTeam. Then we filter H from FTR. Then count.

To filter Liverpool from HomeTeam, we use

d21 = d1[d1['HomeTeam'] =='Liverpool']

(or d21 = d1[d1.HomeTeam=='Liverpool'])

Then we can filter the d21 again with FTR=='H' using the same syntax

d22 = d21[d21['FTR']=='H']

The last step is to count the result using len(d22).

Figure 19: All home wins of Liverpool this season.

To find all away wins, simply change HomeTeam to AwayTeam, and FTR=='H' to FTR=='A'.
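The whole of Solution 1, home and away parts together, on a hypothetical three-match frame:

```python
import pandas as pd

# Hypothetical mini-season:
d1 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Liverpool', 'Arsenal'],
                   'AwayTeam': ['Everton', 'Chelsea', 'Liverpool'],
                   'FTR':      ['H', 'D', 'A']})

d21 = d1[d1['HomeTeam'] == 'Liverpool']   # Liverpool's home matches
d22 = d21[d21['FTR'] == 'H']              # ... that were home wins
d23 = d1[d1['AwayTeam'] == 'Liverpool']   # Liverpool's away matches
d24 = d23[d23['FTR'] == 'A']              # ... that were away wins
print(len(d22) + len(d24))                # total wins
```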

Solution 2

Instead of applying 2 filters on different columns, we can also set one column as the index of the DF (the index is the header of the rows), then select the rows we want using the index label. This is basically another way to filter the data by row. Then we apply another filter on the other column. In this case, we set HomeTeam as the index, filter HomeTeam using index selection, then filter the FTR column to get the answer.

In the code shown below, we first create a new DF named d23, derived from d1 with only 2 columns, HomeTeam and FTR. I print out d23 to show that the index is integers (0, 1, 2, …).

Then we set HomeTeam as the index of d23 using d23.set_index(). The inplace=True option applies the new index to d23 itself. You can see from the result that HomeTeam becomes the index, replacing the integers 0, 1, 2, …

Figure 20: using .set_index() to set HomeTeam as index of the DF

We can use .loc[] to refer to specific rows of the DF by index name (i.e., to filter the rows by index). Thus, with HomeTeam as the index, we can simply locate index='Liverpool' with d23.loc['Liverpool'].

To select the FTR=='H', we can just use the same filtering syntax as in Solution 1 above. Finally we can use len() or .count() to get the final answer.

As shown in the code below, the FTR=='H' is applied first, followed by the .loc[] to filter Liverpool from the HomeTeam index.

Figure 21: Select FTR=H, then use df.loc[‘Liverpool’] to select only the rows with index = Liverpool.
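Solution 2 as a runnable sketch (the rows are hypothetical; note the order matches Figure 21, filtering FTR first and then locating the index):

```python
import pandas as pd

# Hypothetical home matches:
d23 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Liverpool', 'Arsenal', 'Liverpool'],
                    'FTR':      ['H', 'H', 'H', 'D']})

d24 = d23[d23['FTR'] == 'H']       # filter the result column first
d24 = d24.set_index('HomeTeam')    # HomeTeam becomes the row index
print(len(d24.loc['Liverpool']))   # rows with index 'Liverpool' = home wins
```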

Pandas Summary

  • df[df.HomeTeam=='Liverpool']: filter data from a specific column.
  • df.set_index('HomeTeam', inplace=True): set a specific column as a new index and apply to the DF.
  • df.loc['Liverpool']: select (filter) rows by specifying index name.

Problem 3: Count the number of HOME wins of all teams

This problem introduces a new Pandas method: .groupby(). The algorithm is quite simple — select the rows with FTR=H, then group by home team name, and then count.

To demonstrate how groupby() works, I split the code into d31, d32, d33, and d34. The actual answer to this question requires only the d31 and d34.

As shown in the code below, d31 derives from d1 with 2 columns, HomeTeam and FTR, filtered with FTR=='H'. d32 groups d31 by HomeTeam, then selects only the first element of each group using .first(). I print out d31 and d32 to show how groupby() affects the output. One easily unnoticed effect of groupby() is that the column we group by, i.e., HomeTeam in our case, becomes the index of the output.

Figure 22: d31 derived from d1, then being grouped-by as d32. Notice that ‘HomeTeam’ becomes index of d32.

Let’s explain a bit more about how groupby() works. d32 groups d31 by HomeTeam, so the FTR values of the same team are grouped together. The groupby() process involves 3 stages: split, apply, and combine. The groupby() call alone only splits the DF into groups and, at this stage, you cannot directly print the group elements using print() or display(). We need an ‘apply’ method to process and combine the data of each group and produce the final output. Examples of apply methods are count(), first(), sum(), and mean(). We use .first() in d32 to obtain the first element of each group, just to demonstrate the groupby() process. Again, notice that the HomeTeam column becomes the index after the groupby().

Figure 23: Demonstration of get_group() and count() methods being applied to groupby().

From the code above, d33 is also just for demonstration purposes. Here .get_group() is used to take a look at the elements of one group. Notice that the integer index is not replaced by HomeTeam. get_group() only constructs a DF to display the elements of the named group; it does not apply any aggregation to them.

d34 applies .count() to the groupby(), which counts the elements of each group. HomeTeam becomes the index of the DF. This is the final answer to the problem.
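The split-apply-combine steps above, condensed into a runnable sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical home wins (FTR=='H' already filtered):
d31 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Arsenal', 'Liverpool', 'Arsenal'],
                    'FTR':      ['H', 'H', 'H', 'H']})

g = d31.groupby('HomeTeam')          # split stage only; not printable by itself
print(g.first())                     # first row of each group (demo apply)
print(g.get_group('Liverpool'))      # peek at one group; integer index is kept
d34 = g.count()                      # home wins per team; HomeTeam is the index
print(d34)
```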

Pandas Summary

  • d31.groupby('HomeTeam') : groups the DF by the specific column
  • d31.groupby('HomeTeam').first(): groupby() then obtain the first elements of each group.
  • d31.groupby('HomeTeam').get_group('Liverpool'): groupby() then create a DF that contains only elements of the specific group.
  • d31.groupby('HomeTeam').count() : groupby() then counts elements of each group.

Problem 4: Compute total SCORE of all teams

A team gets 3 points if it wins a match, 1 if it draws, and 0 if it loses. A team’s total score for the season is the sum of these points over all its matches.

This problem will introduce you to some new Pandas tricks: create a new DF from a Dictionary, compose IF-ELSE statements, and use .append() to combine two DFs.

Algorithm: First we create a new DF with team name and points columns. The points column does not exist in the original dataset (the .csv file). The point computation is separated into 2 parts, home matches and away matches, since the points are derived from FTR differently. Specifically, for home matches the FTR values map to points as H=3, D=1, A=0, while for away matches they map as H=0, D=1, A=3. When both parts are done, we combine the two DFs and sum() the points.

The first part of the code and its output are shown below. This part creates new DFs and converts the FTR values {H, D, A} to Points {3, 1, 0}, according to the point computation for home and away teams.

Figure 24: d41 and d42 are new DFs that convert FTR result to Points.

Let’s explain the code in more detail. pd.DataFrame() creates a new DF. The new DF is built from a Dictionary data structure, as {column_name : element_list}. In our case, the Team column is taken straightforwardly from d1.HomeTeam. The Points column is a list of 3, 1, or 0, according to an IF-ELSE condition on the FTR values.

The IF-ELSE condition takes the elements of d1.FTR (as specified at the very end of the clause) and puts each element through the IF statement (‘H’, ‘D’, or something else, which can only be ‘A’ here). An integer of 3, 1, or 0 is returned for each element of d1.FTR.
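The dictionary-plus-list-comprehension construction, sketched on two hypothetical matches:

```python
import pandas as pd

# Hypothetical full-time results:
d1 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Arsenal'],
                   'AwayTeam': ['Everton', 'Chelsea'],
                   'FTR':      ['H', 'D']})

# Home points: H=3, D=1, A=0.
d41 = pd.DataFrame({'Team': d1.HomeTeam,
                    'Points': [3 if i == 'H' else 1 if i == 'D' else 0 for i in d1.FTR]})
# Away points: A=3, D=1, H=0.
d42 = pd.DataFrame({'Team': d1.AwayTeam,
                    'Points': [3 if i == 'A' else 1 if i == 'D' else 0 for i in d1.FTR]})
print(d41)
print(d42)
```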

The code below combines d41 and d42 using .append(), then performs a groupby() on the Team column and applies the sum() function.

Figure 25: d43 combines d41 and d42 together. Then groupby() and sum() the elements.

.append() combines 2 DFs by simply appending the rows of one DF to the end of the other. For a smooth append, both DFs must have the same number of columns and the same column names.

Applying the sum() method to the groupby() adds all points in each group together. If you want to see what happens before the sum(), you can try .get_group() like this:

d43 = d42.append(d41).groupby('Team').get_group('Liverpool')
display(d43)
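Note that .append() was removed in pandas 2.0; pd.concat() stacks the rows the same way, so the self-contained sketch below uses it instead. The team names and points are made up:

```python
import pandas as pd

# Hypothetical per-match points from the home and away computations:
d41 = pd.DataFrame({'Team': ['Liverpool', 'Arsenal'], 'Points': [3, 1]})
d42 = pd.DataFrame({'Team': ['Arsenal', 'Liverpool'], 'Points': [3, 0]})

# On pandas < 2.0 this was: d42.append(d41).groupby('Team').sum()
d43 = pd.concat([d42, d41]).groupby('Team').sum()
print(d43)   # total points per team, Team as the index
```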

Pandas Summary

  • pd.DataFrame(): create a new DF using Dictionary data structure.
  • [3 if i=='A' else 1 if i=='D' else 0 for i in d1.FTR] : example of applying IF-ELSE statements to the elements of df.column (here, the away-team point computation).
  • d43 = d42.append(d41): appends all rows of one DF to another DF.

Problem 5: Count the WIN matches with an HTR score lead (only wins that had a score lead at half time)

There is an HTR field (Half Time Result) that specifies the result at half time: H, D, or A. This problem extends problem 3 by adding another condition: the HTR must also be a win, on top of the FTR being a win.

Algorithm: This problem actually combines problems 2 and 3, thus comprising 2 main parts: filtering and grouping. The filtering part is similar to the solutions of problem 2:

  1. filter the result twice using the comparison clause (the == clause), just like solution 1 of problem 2. Or
  2. set one of the conditions (FTR or HTR) as the index, then use .loc[] along with the == clause, just like solution 2 of problem 2.

The grouping part is similar to problem 3. I put the code of the home-team scenario below, and leave the away-team scenario as an exercise for you.

Figure 26: Solution of problem 5 using set_index()
Figure 27: Solution of problem 5 using the comparison clause twice.

In the code above I have added some new tricks, including:

  • select the column right within the groupby() statement. Specifically, the groupby() statement in my code looks like this:

d53 = d52.groupby('HomeTeam')['FTR'].count()

d52 has 4 columns before being grouped: HomeTeam, AwayTeam, FTR, and HTR. However, we want only 1 column after .count(). To select specific column(s) from the result of the groupby() statement (i.e., d53), you can simply specify the column names after the groupby() clause. Here I select FTR. If you are not quite sure what I am talking about, try taking out the ['FTR'] from the statement above and observe the result for yourself.

  • sort the results of the count() method using .sort_values(). If the resulting DF has more than 1 column, you would need by='column_name' as a parameter of .sort_values(), e.g., .sort_values(by='FTR', ascending=False).
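Both tricks together, as a sketch on hypothetical rows already filtered down to FTR=='H' and HTR=='H':

```python
import pandas as pd

# Hypothetical matches that were wins at both half time and full time:
d52 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Liverpool', 'Arsenal'],
                    'AwayTeam': ['Everton', 'Chelsea', 'Spurs'],
                    'FTR':      ['H', 'H', 'H'],
                    'HTR':      ['H', 'H', 'H']})

# Pick one column inside the groupby chain, count, then sort descending:
d53 = d52.groupby('HomeTeam')['FTR'].count().sort_values(ascending=False)
print(d53)
```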

Pandas Summary

  • d53 = d52.groupby('HomeTeam')['FTR'].count(): group by HomeTeam and count. The result will have HomeTeam as an index and 1 column named FTR.
  • .sort_values(by='FTR', ascending=False) : sort the input by the FTR column from the greatest value to the least. If there is only 1 column, the by= parameter must be removed.

Problem 6: Create a Home-Win Table

Show a home-win table of all teams that looks like this:

Figure 28: A Home-Win table.

Algorithm: This problem adds some new tricks for manipulating the groupby() output. Rather than performing sum() or count(), we join the data in each group together to create a nice-looking table. Before grouping, we select just the HomeTeam and FTR columns, and replace the D and A letters with a dash (-).

Here is the code.

Figure 29: The code for Home-Win table problem.

From the code, .replace() simply finds and replaces the specified text in the entire DF (all rows of all columns). So pick the search term carefully, or changes might be made where they are not intended.

The .apply(' '.join) joins the data in each group, the FTR values of each team in this case, together with a space in between.
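The replace-then-join idea in miniature, on a hypothetical three-match frame:

```python
import pandas as pd

# Hypothetical home results:
d6 = pd.DataFrame({'HomeTeam': ['Liverpool', 'Liverpool', 'Arsenal'],
                   'FTR':      ['H', 'A', 'H']})

# Replace non-win results with a dash (whole-cell matches across the DF):
d6 = d6.replace('A', '-').replace('D', '-')

# Join each team's results into one display string:
table = d6.groupby('HomeTeam')['FTR'].apply(' '.join)
print(table)
```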

Pandas Summary

  • .replace() : search entire DF for text pattern and replace with the specified text.
  • .apply(' '.join) : concatenate the data of the group (of the groupby() function)

That’s it for this tutorial. Hope it helps. Peace.
