A Beginner’s Application of Python with Bit(s) of Git++

Published in

GreyAtom

12 min read · Sep 29, 2018

All files and code can be downloaded from here.

In this post, I will introduce basic applications of Python along with bits of Git (and GitHub). We will take one csv file with a few columns and apply the most common features and functionality to analyze and interpret the data. This post is more a practical application and revision of the basics of Git, GitHub and Python than a first-hand conceptual coverage. For a deep dive into individual functionality, I will share a few external references wherever applicable. For the courageous, the official documentation is of course the place to start! Inside Medium, you just need to use the search button in the top-right corner for any topic. Each module comes with some interesting problems; I guess it will be more fun to take up these small challenges and solve them first.

I am using OS X 10.13 for this post. Even though the differences in commands may be negligible, you may occasionally want to google terminal commands relevant to your OS. We will use jupyter notebook for Python to keep everything standardized. This blog post consists of the following parts:

1. Git & GitHub

Overview: What(s), Why(s) & Where(s)
The Hows of it
Further reading

2. Numpy

Introduction
The Hows of Numpy: IPLM1 dataset
And a bit of visualization with Matplotlib
Further references

3. Pandas

Introduction
The Hows of Pandas: IPLM1 dataset
+ Visualization on Series/DataFrames
Further references

1.1 Git & GitHub

Git is a modern, widely used distributed version control system, i.e. every user’s working copy has a complete history of changes
It’s a mature open-source project, created in 2005 by Linus Torvalds (the creator of Linux)
You can download & install Git from here
Git doesn’t get confused by the same file names and focuses on the file content itself!
Git combines delta encoding (for differences in content) with compression, and explicitly stores directory contents and version metadata objects. Don’t worry if you don’t understand much of this.
The content of the files, as well as the relationships between files and directories, versions, tags and commits, is protected by a cryptographic hashing algorithm called SHA-1, so any change to the content changes its identifier
GitHub is ‘the social network’ for programmers. Sign-up today (if not already) here.
GitHub (with its Octocat mascot) is a public forum by default, which gives others a view of what you are working on. One can easily use others’ code, or push changes or suggestions. This really lowers the barriers to collaboration, since many solutions are already available.
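To make the "content, not file name" point a little more concrete, here is a minimal Python sketch of how Git derives the ID of a file (a "blob"): it hashes a short header plus the raw bytes with SHA-1. The helper name git_blob_id and the sample bytes are made up for illustration; this is not how you would normally talk to Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a file with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Made-up sample content standing in for a small csv file.
ball_by_ball = b"innings,delivery,runs\n1,0.1,0\n"

# The file name is not part of what is hashed, so two files with
# identical content get the same ID, whatever they are called.
copy_a = git_blob_id(ball_by_ball)
copy_b = git_blob_id(ball_by_ball)

# Any change to the content produces a different ID.
changed = git_blob_id(ball_by_ball + b"1,0.2,4\n")

print(copy_a == copy_b)   # True
print(copy_a == changed)  # False
print(git_blob_id(b""))   # ID of an empty file
```

If you have Git installed, git hash-object <file> should print the same ID for a file with the same bytes.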

1.2 The Hows of Git & GitHub

Pre-requisite: You have installed Git and signed up for GitHub
Objective(s):

Git: To record a file that contains a short ball-by-ball summary of the first ever IPL (Indian Premier League) cricket match.
This file will be used for later examples. You would have downloaded it from the link shared at the top. You are also free to use any other file for Git (however, for Python we will refer to the IPL csv file)
GitHub: Create a repository
Upload the changes on GitHub
Git: Initialize a Readme file to give its content shape & form (in a branch)
Merge the changes to master
Upload the changes on GitHub

Steps

Open terminal or cmd (in windows)
Navigate to working directory: cd Desktop --> cd <Folder Name> (Folder is MLIntro for me)
Initialize Git: git init

Initialized empty Git repository in ../Desktop/ML/MLintro/.git/

Check Git status: git status

Git will show a list of untracked files, which should include IPLM1.csv

Ask Git to track IPLM1.csv file & stage it: git add IPLM1.csv (Git just remains silent, unless there is an error). You can find more on staging here.
Check Git status again: git status

Output

On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: IPLM1.csv

Let us commit the file to local repository (with a comment citing reason behind the change): git commit -m "IPL Match 1 added to local repo"
The output will look something like the following, with an abbreviated commit hash like ‘e0ff511’:

[master (root-commit) e0ff511] IPL Match 1 added to local repo
1 file changed, 226 insertions(+)
create mode 100644 IPLM1.csv

Go to GitHub → Your repositories
Create a new repo(the green button to the right) with a name say “ML101” (let’s keep it in right proportions to our aspiration)
Add a small description say “Practice basics of Git and Python”
Do NOT initialize a README, we shall do it later!
Click Create repository (another green button at the bottom)
Once you create the repo, GitHub will show several sets of commands to help you mirror your local repo on GitHub (Smart :), thanks Linus and the thousands of open-source contributors who took care of us!)
Since we have already created a local repository, we will focus on pushing it to GitHub
To check if remote already exists, enter in terminal/cmd: git remote -v
Then add a remote by the command given in GitHub: git remote add origin https://github.com/sannidh/ML101.git

It will add one remote named origin, listed twice: once for fetch and once for push (you can check by using git remote -v again)

Now let’s push the same to GitHub: git push -u origin master

Kudos! We are done pushing the repo to GitHub. Refresh your GitHub and you can see that our file has been added!

Master is the default branch. Branches are useful for code changes or new features not ready to be included in the main code yet, but have the possibility of being merged later on.
Now let’s create a branch called new_feature: git branch new_feature

To check whether you are on the new_feature branch or master, type in: git branch (the current branch is marked with an asterisk)
To navigate to new_feature: git checkout new_feature

Let’s create a README file in the new_feature branch by command: touch README.md

Open README.md file preferably in a text editor (Sublime Text works well for both Mac & Windows)

Enter text and save the readme file.
## This is the header for README.md file
Here are some bullets with asterisks
* Bullet 1
* Bullet 2
[Link to Google](http://www.google.co.in)
More text can be entered here

Add and commit the README.md file
git add README.md
git commit -m "Added README file to branch"
To check the Git log, type: git log -n 5 --oneline, which will show one-line entries for the last 5 commits
If you switch back to master (git checkout master), the README file will disappear from the folder, since it has only been committed on new_feature!
Let’s now merge this branch into master. We first need to switch to the master branch, into which we are merging:
git checkout master
git merge new_feature (from master)
To push the changes, use the command: git push -u origin master
This will push the README.md file to the same GitHub repo. It will also create a readme preview of the file in the repository. Do refresh the repository to see the changes!

1.3 Further Reading: Git & GitHub

Pro Git book, written by Scott Chacon and Ben Straub — Book
Introduction to Git & GitHub — Video Series
Git Quick Reference — Post

2.0 Numpy

Numpy is the core library for scientific computing in Python
The core object type is a multidimensional array called ndarray, and numpy has functions for working with it
An array is just a contiguous block of memory where every element has the same type and layout, which makes it efficient for a CPU to process
These elements can be dynamically transformed into different (acceptable) data-types for computation
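As a quick illustration of the points above, the sketch below builds a small array from made-up numbers and converts it to other data types with astype. The variable names are only for illustration.

```python
import numpy as np

# Made-up runs for five deliveries.
runs = np.array([0, 1, 4, 6, 2])

# Every element shares one dtype (the exact integer width depends on the platform).
print(runs.dtype)
# A freshly created array sits in one contiguous block of memory.
print(runs.flags["C_CONTIGUOUS"])  # True

# astype returns a new array converted to the requested data type.
as_float = runs.astype(np.float64)
as_text = runs.astype(str)

print(as_float.dtype)      # float64
print(as_float.itemsize)   # 8 bytes per element
print(as_text.dtype.kind)  # 'U' means unicode string
print(runs.sum())          # 13
```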

Pre-requisites

You have installed Anaconda (download)
Step-by-step guide
Numpy is commonly imported as np.
If you are new to Numpy, datacamp has an awesome hands-on introduction

2.1 Matplotlib

Most popular visualization library in Python
Matplotlib can help make line plots, histograms, power spectra, bar charts, errorcharts, scatterplots among others
A pyplot submodule is imported from matplotlib commonly as plt

2.2 Working with Numpy

Let’s upload the IPLM1.csv file into Anaconda’s working directory. You can work from local too. Download IPLM1.csv from here. I find it easier when the file is uploaded in Anaconda. It is internally managed in the system user/Documents folder.
On jupyter notebook’s home → Navigate inside Documents folder. Click the Upload button on top right corner. Browse and select the IPLM1.csv file and click on the blue Upload button once more
Yay! Let’s start the magic by importing Numeric Python in a new Python 3 notebook.
Open Anaconda → Click on jupyter notebook → New → Python 3
Command: import numpy as np

Objectives (Basics):

Create an array containing only the number 1, of shape (3,4,2) (3 blocks of 4 rows and 2 columns), say x
Create a random integer array of shape (4,2,3), say y
Reshape y to (3,4,2) (the total number of elements must match: 4*2*3 = 24 = 3*4*2)
Get x+y in z
Transpose z
Print some common properties of z

Solution (Basics):

1. Create & Check x:

2. Create & reshape y


3. Get z and let’s print some properties & perform operations on z
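The three steps above can be sketched as follows. The range of the random integers (0 to 9) is my own choice, since the objectives do not fix it.

```python
import numpy as np

# 1. x: an array of ones with shape (3, 4, 2)
x = np.ones((3, 4, 2))
print(x.shape)

# 2. y: random integers from 0 to 9 with shape (4, 2, 3),
#    reshaped to (3, 4, 2) so that it lines up with x (24 elements either way)
y = np.random.randint(0, 10, size=(4, 2, 3))
y = y.reshape(3, 4, 2)
print(y.shape)

# 3. z: element-wise sum of x and y
z = x + y

# Transposing a 3-D array reverses the order of its axes.
z_t = z.T
print(z.shape, z_t.shape)

# Some common properties of z
print(z.ndim, z.size, z.dtype)
print(z.min(), z.max(), z.mean())
```

Since x holds floats, z comes out as float64 even though y holds integers.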

Objective(s) #Operations on csv file :

Import csv file into a numpy ndarray by using genfromtxt function
Count the number of unique batsmen who faced a ball in the 1st innings (innings=1)
Find the sum of extras bowled in the 2nd innings (innings=2)
Find the count of sixes in the first innings
Find the total number of deliveries faced by BB McCullum and calculate his strike rate (= sum of runs / count of deliveries * 100)

Solution(s) #Operations on csv file:

1. Import IPLM1.csv:
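A minimal sketch of the import with genfromtxt could look like this. To keep it runnable without the download, it reads a tiny made-up stand-in for IPLM1.csv from memory; the column order (innings, delivery, batting_team, batsman, non_striker, bowler, runs, extras) and all the rows are my assumptions, chosen only to match the column positions used in the steps that follow. With the real file, you would pass its path as the first argument.

```python
from io import StringIO

import numpy as np

# Made-up stand-in for IPLM1.csv (header row + 4 deliveries).
sample_csv = StringIO(
    "innings,delivery,batting_team,batsman,non_striker,bowler,runs,extras\n"
    "1,0.1,Team A,Player A,BB McCullum,Bowler X,0,1\n"
    "1,0.2,Team A,BB McCullum,Player A,Bowler X,0,0\n"
    "1,0.3,Team A,BB McCullum,Player A,Bowler X,6,0\n"
    "2,0.1,Team B,Player B,Player C,Bowler Y,1,0\n"
)

# Read every field as a byte string of up to 50 characters, skipping the header.
ipl1_array = np.genfromtxt(sample_csv, delimiter=",", dtype="|S50", skip_header=1)

print(ipl1_array.shape)  # (4, 8)
print(ipl1_array.dtype)

# Columns are converted on demand, e.g. innings (column 0) to integers.
innings = ipl1_array[:, 0].astype(np.int64)
print(innings)
```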

2. Count number of unique batsmen who faced a ball, i.e strikers in first innings

Let’s find the column of the numpy array which holds the innings of the IPL match. It’s the first column, called innings (remember, indexing starts at 0, so it is column 0)! Now, remember we imported all elements as ‘|S50’, i.e. string type. To pass a condition to the array for the 1st innings, we need to convert innings to integer type. This conversion is done deftly by the numpy method astype(np.int64). Your task is to find the equivalents for the other datatypes!
This lets us pass the condition to the 0th column with value = 1, i.e. ipl1_array[:,0].astype(np.int64)==1
Now, the rest of the steps are logically simple. Subsetting ipl1_array with this condition and taking column 3 gives us the batsman for every delivery of innings=1: ipl1_array[ipl1_array[:,0].astype(np.int64)==1][:,3]
Then the set() function retrieves only the unique values. The length of the set gives us the count.
unique_batsmen=set(ipl1_array[ipl1_array[:,0].astype(np.int64)==1][:,3]) # returns a subset of unique batsmen who batted in 1st innings
print(len(unique_batsmen))

3. Find the sum of extras bowled in the 2nd innings (innings=2)

extras_inn2=ipl1_array[ipl1_array[:,0].astype(np.int64)==2][:,7].astype(np.int64)
There is a function called sum in numpy, which will add all the elements: print(np.sum(extras_inn2))

4. Find the count of sixes in the first innings

Here we will need to pass two conditions for subsetting, innings=1 and runs=6. This can be done by using the & operator, with each condition wrapped in round brackets.

Our good friend len will give the count

sixes_inn1=ipl1_array[(ipl1_array[:,0].astype(np.int64)==1) & (ipl1_array[:,6].astype(np.int64)==6)][:,6].astype(np.int64)
print(len(sixes_inn1))
print(sixes_inn1.size) will also return the number of elements, since sixes_inn1 is an array

5. Find the total number of deliveries faced by BB McCullum and calculate his strike rate

Let’s subset the deliveries faced by McCullum in an array called deliv_bbm. We have to convert the batsman column with astype(str) to compare it with the value ‘BB McCullum’ (np.str, a deprecated alias of the built-in str, has been removed from recent NumPy releases). You might recall that the original imported string is of length 50.
deliv_bbm=ipl1_array[ipl1_array[:,3].astype(str)=='BB McCullum'][:,1].astype(np.float64)
Similarly let’s get the runs array as a float value.
runs_bbm=ipl1_array[ipl1_array[:,3].astype(str)=='BB McCullum'][:,6].astype(np.float64)
strike_rate_bbm=np.sum(runs_bbm)/len(deliv_bbm)*100
print(strike_rate_bbm)

Objective (plot): Draw, and save locally, the frequency histogram of runs scored by BB McCullum in the first innings

Solution (plot): We need to first import the module which has the functionality of plotting a histogram.
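One possible sketch of the histogram is given below. The runs are made-up values standing in for the runs_bbm array, the output file name is hypothetical, and the Agg backend line is only there so the snippet also runs as a plain script; inside a jupyter notebook you can drop it.

```python
import matplotlib

matplotlib.use("Agg")  # file-only backend for scripts; not needed in a notebook

import matplotlib.pyplot as plt
import numpy as np

# Made-up runs per delivery, standing in for runs_bbm.
runs_bbm = np.array([0, 0, 4, 1, 6, 4, 0, 2, 6, 1, 4, 6], dtype=np.float64)

fig, ax = plt.subplots()

# One bin per possible score (0 to 6), centred on the whole numbers.
counts, bin_edges, _ = ax.hist(runs_bbm, bins=np.arange(-0.5, 7.5, 1))

ax.set_xlabel("Runs per delivery")
ax.set_ylabel("Number of deliveries")
ax.set_title("Runs scored by BB McCullum")

# Save the figure to the current working directory (hypothetical file name).
fig.savefig("bbm_runs_hist.png")
print(counts)
```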

Histogram of runs scored by BB McCullum

2.3 Further Reading: Numpy & Matplotlib

Numpy@datacamp
Thinking in Arrays — here
Matplotlib@datacamp

3.0 Pandas

Built on the numpy library, pandas provides a fast and efficient way to handle heterogeneous columns of the same dataset
Intuitive and easy to use for data analysis & manipulation
It can handle ordered and unordered, labelled or non-labelled data
Built-in I/O methods with efficient handlers for many filetypes (csv, xlsx, databases)
Pandas has built in methods to handle missing data
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Powerful functionalities: groupby, merge, pivoting, time-series handlers among many others
If you are new to Pandas, do spend 10–15 mins here

3.1 The Hows of Pandas: IPLM1 dataset

Let’s upload the same IPLM1.csv file into Anaconda’s working directory, if not done already. You can work from local too. Download IPLM1.csv from here. It is internally managed in the system user/Documents folder.
On jupyter notebook’s home → Navigate inside Documents folder. Click the Upload button on top right corner. Browse and select the IPLM1.csv file and click on the blue Upload button once more
Let’s start the magic by importing Numpy & Pandas in a new Python 3 notebook.
Open Anaconda → Click on jupyter notebook → New → Python 3
Command: import numpy as np
import pandas as pd
Let’s start by importing the csv file onto a DataFrame:
ipl_df=pd.read_csv('Documents/IPLM1.csv')
ipl_df.head() #Will return the first five rows by default, any number ex. 10 inside it will return that many rows (Rows indexed 0-9)
The first column is an index which, alas, again starts from 0; for those of you who have worked in Excel, it may feel like one of the default table styles!
Some common properties
ipl_df.info() #to know the datatypes in column
ipl_df.describe() #run summary statistics on numerical columns
ipl_df.rename(columns={'non_striker':'runner'}) #rename a column from one name to another; this returns a new DataFrame, while adding the argument inplace=True (a boolean, not the string 'True') will modify the DataFrame itself
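The difference between the two ways of calling rename is easy to miss, so here is a small sketch on a made-up two-row DataFrame (the player names are placeholders).

```python
import pandas as pd

# Made-up frame with the two columns we care about.
df = pd.DataFrame(
    {
        "batsman": ["Player A", "Player B"],
        "non_striker": ["Player B", "Player A"],
    }
)

# Without inplace, rename returns a new DataFrame and leaves df untouched.
renamed = df.rename(columns={"non_striker": "runner"})
print(list(df.columns))       # ['batsman', 'non_striker']
print(list(renamed.columns))  # ['batsman', 'runner']

# With inplace=True, df itself is modified and nothing is returned.
df.rename(columns={"non_striker": "runner"}, inplace=True)
print(list(df.columns))       # ['batsman', 'runner']
```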

Objective(s) #Operations on the DataFrame:

Find the count of batsmen who have faced a delivery
Plot of deliveries faced by each unique batsman
Find the names of batsmen who were ‘run out’ during the match
Find the frequency of ‘runs’ for example how many sixes, fours, twos & ones
Get first 10 deliveries for the 2nd innings of the match
Find out team-wise total runs
Pie-chart of total runs scored by each batsman in the first innings
Box plot of runs scored by RCB (Royal Challengers Bangalore)

Solution(s) #Operations on the DataFrame:

1. We need to take an array out of the batsman column. Subsetting is pretty similar to numpy. There are multiple ways to retrieve the unique ‘batsman’ values. Two functions help with the values & counts.

unique() for getting the values and nunique() to get the counts of unique values
We can take an array output of the unique batsman by: ipl_df['batsman'].unique() and then take a len() of the array.
We can also use the loc indexer (with square brackets) to index in an array-like manner: ipl_df.loc[:,'batsman'].nunique()
If we wish to have the frequency of occurrence, the function value_counts() returns it as a pandas Series (a Series is essentially a one-dimensional labelled array): print(ipl_df.loc[:,'batsman'].value_counts())
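Here is how the three approaches above compare on a tiny made-up batsman column (five deliveries, placeholder names apart from BB McCullum).

```python
import pandas as pd

# Made-up deliveries; one row per ball faced.
ipl_df = pd.DataFrame(
    {
        "batsman": [
            "Player A",
            "BB McCullum",
            "BB McCullum",
            "Player C",
            "BB McCullum",
        ]
    }
)

# unique() returns the distinct values as an array; len() counts them.
names = ipl_df["batsman"].unique()
print(len(names))  # 3

# nunique() returns the count directly.
print(ipl_df.loc[:, "batsman"].nunique())  # 3

# value_counts() returns a Series indexed by batsman, most frequent first.
counts = ipl_df["batsman"].value_counts()
print(counts["BB McCullum"])  # 3
```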

2. Now pandas is quite powerful for visualization, since the data is already present in a DataFrame. Calling matplotlib through pandas is much more straightforward. Let’s try to make the frequency graph for the different batsmen. Kindly note that ‘frequency’ is a misnomer here; it’s merely a bar chart. So let’s do that.
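A minimal sketch of such a bar chart, assuming the deliveries per batsman come from value_counts(). The data, labels and file name are made up, and the Agg backend line is only needed when running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # file-only backend for scripts; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up deliveries; one row per ball faced.
ipl_df = pd.DataFrame(
    {
        "batsman": [
            "Player A",
            "BB McCullum",
            "BB McCullum",
            "Player C",
            "BB McCullum",
        ]
    }
)

# Number of deliveries faced by each batsman.
deliveries = ipl_df["batsman"].value_counts()

# pandas hands the plotting over to matplotlib; kind="bar" draws one bar per batsman.
fig, ax = plt.subplots()
deliveries.plot(kind="bar", ax=ax, title="Deliveries faced by each batsman")
ax.set_xlabel("Batsman")
ax.set_ylabel("Deliveries")

fig.tight_layout()
fig.savefig("deliveries_by_batsman.png")  # hypothetical file name
```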

3. Let’s now find the batsmen who were ‘run out’ during the match. This will require a condition being passed onto the pandas DataFrame. From ipl_df.head(2) it seems we need to filter by out_type & get the out_batsman.

We need to pass the condition to the DataFrame: ipl_df[ipl_df['out_type']=='run out']
And retrieve only the batsman names: ipl_df['out_batsman'][ipl_df['out_type']=='run out']
The output will be something like this:
197 AA Noffke
Name: out_batsman, dtype: object

4. For the frequency of runs overall, it would probably be nifty to use the value_counts() function again: ipl_df['runs'].value_counts()

5. For the first 10 deliveries of the 2nd innings, we will need to subset the DataFrame with an innings condition & take the head of the result for the first 10 rows: ipl_df[ipl_df['innings']==2].head(10)

6. To find the team-wise runs, we use the built-in sum function together with the more famous groupby function: we group the runs column by batting team and take the sum of the runs column.
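A small sketch of the groupby step on made-up data follows. The column name batting_team is my assumption for the batting-team column; check ipl_df.columns for the actual name in the csv file.

```python
import pandas as pd

# Made-up deliveries for two teams.
ipl_df = pd.DataFrame(
    {
        "batting_team": ["Team A", "Team A", "Team A", "Team B", "Team B"],
        "runs": [0, 6, 4, 1, 2],
    }
)

# Group the rows by team, pick the runs column, and add it up per group.
team_runs = ipl_df.groupby("batting_team")["runs"].sum()
print(team_runs)
```

The result is a Series indexed by team, so team_runs['Team A'] gives the total for one team.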

7. Pie-chart of runs scored by batsmen in 2nd innings. It should be easy, we just need to pass an innings condition to our previous graph
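One way to sketch this is to keep the innings number as a parameter, so the same code serves either innings. The helper runs_by_batsman, the data and the file name are all made up for illustration, and the Agg backend line is only needed outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # file-only backend for scripts; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up deliveries across two innings.
ipl_df = pd.DataFrame(
    {
        "innings": [1, 1, 1, 1, 2, 2],
        "batsman": [
            "BB McCullum",
            "Player A",
            "BB McCullum",
            "Player A",
            "Player B",
            "Player B",
        ],
        "runs": [6, 1, 4, 2, 1, 4],
    }
)


def runs_by_batsman(df, innings):
    """Total runs per batsman for one innings (hypothetical helper)."""
    in_innings = df[df["innings"] == innings]
    return in_innings.groupby("batsman")["runs"].sum()


totals = runs_by_batsman(ipl_df, innings=1)

# kind="pie" draws one wedge per batsman; autopct adds percentage labels.
fig, ax = plt.subplots()
totals.plot(kind="pie", ax=ax, autopct="%1.0f%%")
ax.set_ylabel("")  # drop the default axis label, which is just the column name

fig.savefig("runs_by_batsman_pie.png")  # hypothetical file name
```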

8. Let’s try the box plot for runs scored by RCB in the 2nd innings

ipl_df[ipl_df['innings']==2]['runs'].plot(kind='box')

3.2 Further Reading: Pandas

Pandas@datacamp
10 minutes to pandas

Hope you had fun with the IPL dataset. We will cover more visualization in the next post.

Written by Sonik Mishra