Sonik Mishra
Sep 29, 2018 · 12 min read

All files and code can be downloaded from here.

In this post, I will introduce the basic applications of Python along with bits of Git(+Hub). We will take one csv file with a few columns and apply the most common features and functionalities to analyze and interpret the data. This post is more a practical application and revision of the basics of Git(+Hub) and Python than a first-hand conceptual coverage. For a deep dive into individual functionalities, I will share a few external references wherever applicable. For the courageous, the official documentation is of course the place to start! Inside Medium, you just need to use the search button in the top-right corner for any topic. Each module comes with some interesting problems; I guess it will be more fun to take on these small challenges and solve them first.

I am using OS X 10.13 for this post. Although the commands differ only negligibly between operating systems, you may occasionally want to google the terminal commands relevant to your OS. We will use Jupyter Notebook for Python to keep everything standardized. This blog post consists of the following parts:

  1. Git & GitHub
  • Overview: What(s), Why(s) & Where(s)
  • The Hows of it
  • Further reading

2. Numpy

  • Introduction
  • The Hows of Numpy: IPLM1 dataset
  • And a bit of visualization with Matplotlib
  • Further references

3. Pandas

  • Introduction
  • The Hows of Pandas: IPLM1 dataset
  • + Visualization on Series/DataFrames
  • Further references

1.1 Git & GitHub

  • Git is a modern, widely used distributed version control system, i.e. every user’s working copy has a complete history of changes
  • It’s mature and an open-source project, developed in 2005 by Linus Torvalds (the creator of Linux)
  • You can download & install Git from here
  • Git doesn’t get confused by the same file names and focuses on the file content itself!
  • Git combines delta encoding (for differences in content) with compression, and explicitly stores directory contents and version metadata objects. Don’t worry if you don’t understand much of this.
  • The content of the files, as well as the relationships between files and directories, versions, tags and commits, is identified by a cryptographic hash computed with the SHA-1 algorithm, so any change in content changes the hash
  • GitHub is ‘the social network’ for programmers. Sign-up today (if not already) here.
  • GitHub (the Octocat) is a public forum by default, which gives others a view of what you are working on. One can easily use others’ code or push changes or suggestions. This really lowers the barriers to collaboration, since many solutions are already available.
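To make the points about content and SHA-1 concrete, here is a minimal sketch of how Git derives the ID of a file ("blob") object, assuming Git's default SHA-1 object format. The helper name git_blob_hash and the sample content are made up for illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 ID Git would give a blob with this content (hypothetical helper)."""
    # Git hashes a small header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Made-up file content; note that the file name is not part of the hash.
ball_by_ball = b"innings,batsman,runs\n1,BB McCullum,4\n"
print(git_blob_hash(ball_by_ball))

# The empty blob has a well-known ID in every SHA-1 Git repository.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because only the content is hashed, two files with identical content but different names end up as the same object, which is why Git is not confused by file names.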

1.2 The Hows of Git & GitHub

  1. Pre-requisite: You have installed Git and signed up for GitHub
  2. Objective(s):
  • Git: To record a file that contains a short ball-by-ball summary of the first ever IPL (Indian Premier League) cricket match.
  • This file will be used for later examples. You would have downloaded it from the link shared at the top. You are also free to use any other file for Git (however, for Python we will refer to the IPL csv file)
  • GitHub: Create a repository
  • Upload the changes on GitHub
  • Git: Initialize a Readme file to give its content shape & form (in a branch)
  • Merge the changes to master
  • Upload the changes on GitHub

Steps

  1. Open terminal or cmd (in windows)
  2. Navigate to working directory: cd Desktop --> cd <Folder Name> (Folder is MLIntro for me)
  3. Initialize Git: git init
  • Initialized empty Git repository in ../Desktop/ML/MLintro/.git/
  1. Check Git status: git status
  • Git will show a list of untracked files, which should include IPLM1.csv
  1. Ask Git to track IPLM1.csv file & stage it: git add IPLM1.csv (Git just remains silent, unless there is an error). You can find more on staging here.
  2. Check Git status again: git status

Output

  • On branch master
  • No commits yet
  • Changes to be committed:
  • (use "git rm --cached <file>..." to unstage)
  • new file: IPLM1.csv
  1. Let us commit the file to local repository (with a comment citing reason behind the change): git commit -m "IPL Match 1 added to local repo"
  2. Output will show something as follows with an address like ‘e0ff511’:
  • [master (root-commit) e0ff511] IPL Match 1 added to local repo
  • 1 file changed, 226 insertions(+)
  • create mode 100644 IPLM1.csv
  1. Go to GitHub → Your repositories
  2. Create a new repo (the green button to the right) with a name, say “ML101” (let’s keep it in right proportions to our aspiration)
  3. Add a small description, say “Practice basics of Git and Python”
  4. Do NOT initialize a README, we shall do it later!
  5. Click Create repository (another green button at the bottom)
  6. Once you create the repo, it will show all sorts of commands to make sure you mirror your local repo into GitHub (Smart :), Thanks Linus and the thousands of Open Source contributors who took care of us!)
  7. Since we have already created a local repository, we will focus on pushing it to GitHub
  8. To check if remote already exists, enter in terminal/cmd: git remote -v
  9. Then add a remote by the command given in GitHub: git remote add origin https://github.com/sannidh/ML101.git
  • It will add one remote named origin, listed on two lines: one for fetch and one for push (you can check by using git remote -v again)
  1. Now let’s push the same to GitHub: git push -u origin master
  • Kudos! We are done pushing the repo to GitHub. Refresh your GitHub and you can see that our file has been added!
  1. Master is the default branch. Branches are useful for code changes or new features not ready to be included in the main code yet, but have the possibility of being merged later on.
  2. Now let’s create a branch called new_feature. git branch new_feature
  • To check whether you are in the new_feature branch or master; type in: git branch
  • To navigate to new_feature: git checkout new_feature
  1. Let’s create a README file in the new_feature branch by command: touch README.md
  • Open README.md file preferably in a text editor (Sublime Text works well for both Mac & Windows)

Enter text and save the readme file.

## This is the header for README.md file

Here are some bullets with asterisks

* Bullet 1

* Bullet 2

[Link to Google](http://www.google.co.in)

More text can be entered here

  1. Add and commit the README.md file
  2. git add README.md
  3. git commit -m "Added README file to branch"
  4. To check git logs type : git log -n 5 --oneline, which will show one liner logs for the last 5 log entries
  5. If you checkout to master (git checkout master), the README file will disappear from the folder!
  6. Let’s now merge this branch into master. We will need to navigate to master branch in which we are merging the branch:
  7. git checkout master
  8. git merge new_feature (from master)
  9. To push the changes, use the command: git push -u origin master
  10. This will push the README.md file to the same GitHub repo. It will also create a readme preview of the file in the repository. Do refresh the repository to see the changes!

1.3 Further Reading: Git & GitHub

  • Pro Git book, written by Scott Chacon and Ben Straub — Book
  • Introduction to Git & GitHub — Video Series
  • Git Quick Reference — Post

2.0 Numpy

  • Numpy is the core library for scientific computing in Python
  • The core object type is a multidimensional array called ndarray; numpy has functions for working with this ndarray
  • From the CPU’s point of view, an array is just a contiguous block of memory in which every element has the same type and layout
  • These elements can be dynamically transformed into different (acceptable) data-types for computation

Pre-requisites

  • You have installed Anaconda (download)
  • Step-by-step guide
  • Numpy is commonly imported as np.
  • If you are new to Numpy, datacamp has an awesome hands-on introduction

2.1 Matplotlib

  • Most popular visualization library in Python
  • Matplotlib can help make line plots, histograms, power spectra, bar charts, error charts and scatter plots, among others
  • A pyplot submodule is imported from matplotlib commonly as plt

2.2 Working with Numpy

  1. Let’s upload the IPLM1.csv file into Anaconda’s working directory. You can work from local too. Download IPLM1.csv from here. I find it easier when the file is uploaded in Anaconda. It is internally managed in the system user/Documents folder.
  2. On jupyter notebook’s home → Navigate inside Documents folder. Click the Upload button on top right corner. Browse and select the IPLM1.csv file and click on the blue Upload button once more
  3. Yay! Let’s start the magic by importing Numeric Python in a new Python 3 notebook.
  4. Open Anaconda → Click on jupyter notebook → New → Python 3
  5. Command: import numpy as np

Objectives (Basics):

  1. Create an array, say x, containing only the number 1, of shape (3,4,2) (rows, columns, layers)
  2. Create a random integer array, say y, of shape (4,2,3) (the same number of elements as x)
  3. Reshape y to (3,4,2) (the total number of elements must match: 4*2*3 = 24 = 3*4*2)
  4. Get x+y in z
  5. Transpose z
  6. Print some common properties of z

Solution (Basics):

  1. Create & Check x:

2. Create & reshape y

3. Get z and let’s print some properties & perform operations on z
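The three steps above can be sketched as follows. This is one possible solution rather than the original notebook's code; the random seed and the 0 to 9 value range for y are arbitrary assumptions.

```python
import numpy as np

# 1. x: an array of ones with shape (3, 4, 2)
x = np.ones((3, 4, 2), dtype=np.int64)
print(x.shape)

# 2. y: random integers with shape (4, 2, 3), then reshaped to match x.
#    The seed and the 0-9 range are arbitrary choices for this sketch.
rng = np.random.default_rng(seed=0)
y = rng.integers(low=0, high=10, size=(4, 2, 3))
y = y.reshape(3, 4, 2)  # works because 4*2*3 == 3*4*2 == 24

# 3. z: element-wise sum, its transpose and some common properties
z = x + y
z_t = z.T  # the transpose reverses the axes: (3, 4, 2) -> (2, 4, 3)

print(z.shape, z.ndim, z.size, z.dtype)
print(z_t.shape)
print(z.min(), z.max(), z.sum())
```

Reshaping only changes how the same 24 elements are indexed, so it fails with a ValueError if the element counts of the old and new shapes differ.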

Objective(s) #Operations on csv file :

  1. Import csv file into a numpy ndarray by using genfromtxt function
  2. Count the number of unique batsmen who faced a ball in the 1st innings (innings=1)
  3. Find the sum of extras bowled in the 2nd innings (innings=2)
  4. Find the count of sixes in the first innings
  5. Find the total number of deliveries faced by BB McCullum and calculate his strike rate (=sum of runs/count of delivery * 100%)

Solution(s) #Operations on csv file:

  1. Import IPLM1.csv :
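A minimal sketch of the import, assuming a comma-separated file with one header row. So that it runs without the real file, it reads a tiny made-up stand-in for IPLM1.csv from memory; the rows, team names and some column names are hypothetical, and only the column positions used later in this post (0 = innings, 3 = batsman, 6 = runs, 7 = extras) are kept.

```python
import io

import numpy as np

# Hypothetical miniature stand-in for IPLM1.csv (made-up rows and headers).
sample_csv = """innings,delivery,batting_team,batsman,non_striker,bowler,runs,extras
1,0.1,Team X,Batsman A,BB McCullum,Bowler P,0,1
1,0.2,Team X,BB McCullum,Batsman A,Bowler P,0,0
1,0.3,Team X,BB McCullum,Batsman A,Bowler P,6,0
1,0.4,Team X,BB McCullum,Batsman A,Bowler P,4,0
2,0.1,Team Y,Batsman B,Batsman C,Bowler Q,1,0
2,0.2,Team Y,Batsman C,Batsman B,Bowler Q,0,1
"""

# Read every field as a fixed-width byte string ('|S50') and skip the header row.
# With the real file, pass its path instead, e.g. 'Documents/IPLM1.csv'.
ipl1_array = np.genfromtxt(
    io.StringIO(sample_csv), delimiter=",", dtype="|S50", skip_header=1
)

print(ipl1_array.shape)  # (number of deliveries, number of columns)
print(ipl1_array.dtype)
print(ipl1_array[:2])
```

Importing everything as strings keeps the array homogeneous; individual columns are converted with astype when a numeric comparison is needed.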

2. Count number of unique batsmen who faced a ball, i.e strikers in first innings

  • Let’s find the column of the numpy array which holds the innings of the IPL match. It’s the first column, called innings, so its index is 0 (remember, the nth column is indexed n-1)! Now, remember we imported all elements as ‘|S50’, i.e. string type. To pass a condition to the array for the 1st innings, we need to convert innings to an integer type. This conversion is done deftly by the numpy method astype(np.int64). Now your task is to find the equivalents for the other datatypes!
  • This lets us pass the condition to the 0th column with value = 1, i.e. ipl1_array[:,0].astype(np.int64)==1
  • Now, the rest of the steps are logically simple. We take the subset of ipl1_array for innings=1 and pick the batsman column (index 3), which gives us the batsman’s name for every delivery: ipl1_array[ipl1_array[:,0].astype(np.int64)==1][:,3]
  • Then the set() function retrieves only the unique values. The length of the set gives us the count.
  • unique_batsmen=set(ipl1_array[ipl1_array[:,0].astype(np.int64)==1][:,3]) # returns a subset of unique batsmen who batted in 1st innings
  • print(len(unique_batsmen))

3. Find the sum of extras bowled in the 2nd innings (innings=2)

  • extras_inn2=ipl1_array[ipl1_array[:,0].astype(np.int64)==2][:,7].astype(np.int64)
  • There is a function called sum in numpy, which will add all the elements print(np.sum(extras_inn2))

4. Find the count of sixes in the first innings

  • Here we will need to pass two conditions for subsetting, innings=1 and runs=6. This can be done by using the & operator, with each condition wrapped in parentheses.

Our good friend len will give the count

  • sixes_inn1=ipl1_array[(ipl1_array[:,0].astype(np.int64)==1) & (ipl1_array[:,6].astype(np.int64)==6)][:,6].astype(np.int64)
  • print(len(sixes_inn1))
  • print(sixes_inn1.size) will also return the number of elements, since sixes_inn1 is an array
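Here is a small, self-contained sketch of combining two conditions with &, using made-up innings and runs values in place of the csv columns. The parentheses matter because & binds more tightly than ==.

```python
import numpy as np

# Made-up values standing in for the innings and runs columns.
innings = np.array([1, 1, 1, 2, 2, 1])
runs = np.array([6, 4, 6, 6, 0, 1])

# Each comparison yields a boolean array; & combines them element-wise.
is_six_in_inn1 = (innings == 1) & (runs == 6)

sixes_inn1 = runs[is_six_in_inn1]
print(len(sixes_inn1))                   # 2
print(sixes_inn1.size)                   # 2
print(np.count_nonzero(is_six_in_inn1))  # 2, counted straight from the mask
```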

5. Find the total number of deliveries faced by BB McCullum and calculate his strike rate

  • Let’s subset the deliveries faced by McCullum into an array called deliv_bbm. We convert the batsman column with astype(str) so that it can be compared with the value ‘BB McCullum’. You might recall that the original imported string is of length 50.
  • deliv_bbm=ipl1_array[ipl1_array[:,3].astype(str)=='BB McCullum'][:,1].astype(np.float64)
  • Similarly, let’s get the runs array as float values.
  • runs_bbm=ipl1_array[ipl1_array[:,3].astype(str)=='BB McCullum'][:,6].astype(np.float64)
  • strike_rate_bbm=np.sum(runs_bbm)/len(deliv_bbm)*100
  • print(strike_rate_bbm)

Objective (plot): Draw and download into local — the frequency histogram of runs scored by BB McCullum in the first innings

Solution (plot): We need to first import the module which has the functionality of plotting a histogram.
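A minimal sketch of the histogram with matplotlib's pyplot. The runs_bbm values below are made up so that the snippet runs on its own (in the notebook, use the runs_bbm array computed earlier), and the output file name and location are arbitrary choices.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # lets the sketch run without a display; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Made-up runs per delivery, standing in for the real runs_bbm array.
runs_bbm = np.array([0, 1, 4, 6, 4, 0, 1, 6, 2, 4], dtype=np.float64)

fig, ax = plt.subplots()
# One bin per possible score (0 to 6), centred on the integer values.
ax.hist(runs_bbm, bins=np.arange(-0.5, 7.5, 1), edgecolor="black")
ax.set_xlabel("Runs scored off a delivery")
ax.set_ylabel("Number of deliveries")
ax.set_title("BB McCullum: frequency of runs per delivery")

# Save the figure to local disk (a temporary folder here; any path works).
out_path = os.path.join(tempfile.gettempdir(), "bbm_runs_hist.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```

In a notebook, plt.show() displays the chart inline and savefig writes the downloadable copy.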

2.3 Further Reading: Numpy & Matplotlib

  1. Numpy@datacamp
  2. Thinking in Arrays — here
  3. Matplotlib@datacamp

3.0 Pandas

  • Built on numpy library, pandas provides a fast and efficient way to handle heterogeneous columns of the same dataset
  • Intuitive and easy to use for data analysis & manipulation
  • It can handle ordered and unordered, labelled or non-labelled data
  • Built-in I/O methods with efficient handlers for many filetypes (csv, xlsx, databases)
  • Pandas has built in methods to handle missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Powerful functionalities: groupby, merge, pivoting, time-series handlers among many others
  • If you are new to Pandas, do spend 10–15 mins here

3.1 The Hows of Pandas: IPLM1 dataset

  1. Let’s upload the same IPLM1.csv file into Anaconda’s working directory, if not done already. You can work from local too. Download IPLM1.csv from here. It is internally managed in the system user/Documents folder.
  2. On jupyter notebook’s home → Navigate inside Documents folder. Click the Upload button on top right corner. Browse and select the IPLM1.csv file and click on the blue Upload button once more
  3. Let’s start the magic by importing Numpy & Pandas in a new Python 3 notebook.
  4. Open Anaconda → Click on jupyter notebook → New → Python 3
  5. Command: import numpy as np
  6. import pandas as pd
  7. Let’s start by importing the csv file onto a DataFrame:
  8. ipl_df=pd.read_csv('Documents/IPLM1.csv')
  9. ipl_df.head() #Will return the first five rows by default, any number ex. 10 inside it will return that many rows (Rows indexed 0-9)
  10. The first column is an index which, alas, again starts from 0; for those of you who have worked in Excel, it may feel like one of the default table styles!
  11. Some common properties
  12. ipl_df.info() #to know the datatypes in column
  13. ipl_df.describe() #run summary statistics on numerical columns
  14. ipl_df.rename(columns={'non_striker':'runner'}) #rename a column from one name to another; by default this returns a new DataFrame, while adding the argument inplace=True (a boolean, not the string 'True') modifies the DataFrame itself
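The steps above can be tried on a tiny in-memory stand-in for the csv. The rows are made up, and only the column names innings, batsman, non_striker and runs are taken from this post.

```python
import io

import pandas as pd

# Made-up rows standing in for IPLM1.csv.
sample_csv = """innings,batsman,non_striker,runs
1,BB McCullum,Batsman A,4
1,Batsman A,BB McCullum,1
2,Batsman B,Batsman C,0
"""

ipl_df = pd.read_csv(io.StringIO(sample_csv))
print(ipl_df.head(2))     # first two rows, index starting at 0
ipl_df.info()             # column datatypes
print(ipl_df.describe())  # summary statistics of the numerical columns

# rename returns a new DataFrame and leaves ipl_df untouched ...
renamed = ipl_df.rename(columns={"non_striker": "runner"})
print(list(ipl_df.columns))
print(list(renamed.columns))

# ... unless inplace=True is passed (done on a copy here to keep ipl_df intact).
copy_df = ipl_df.copy()
copy_df.rename(columns={"non_striker": "runner"}, inplace=True)
print(list(copy_df.columns))
```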

Objective(s) #Operations on the DataFrame:

  1. Find the count of batsmen who have faced a delivery
  2. Plot of deliveries faced by each unique batsman
  3. Find the names of batsmen who were ‘run out’ during the match
  4. Find the frequency of ‘runs’ for example how many sixes, fours, twos & ones
  5. Get first 10 deliveries for the 2nd innings of the match
  6. Find out team-wise total runs
  7. Pie-chart of total runs scored by each batsman in the first innings
  8. Box plot of runs scored by RCB (Royal Challengers Bangalore)

Solution(s) #Operations on the DataFrame:

  1. We need to take an array out of the batsman column. Subsetting is pretty similar to numpy. There are multiple methods to retrieve the unique ‘batsman’ per se. Two functions are there to help with the values & counts.
  • unique() for getting the values and nunique() to get the counts of unique values
  • We can take an array output of the unique batsman by: ipl_df['batsman'].unique() and then take a len() of the array.
  • We can also use the loc indexer (note the square brackets) to index in an array-like manner: ipl_df.loc[:,'batsman'].nunique()
  • If we wish to have the frequency of occurrence, the value_counts() function returns it as a pandas Series, which is essentially a list with its own index: print(ipl_df.loc[:,'batsman'].value_counts())

2. Now pandas is quite powerful for visualization, since data is already present in a DataFrame. The way we call matplotlib in pandas is much more straightforward. Let’s try to make the frequency graph with different batsmen. Kindly note that frequency is a misnomer and it’s merely a bar chart. So let’s do that.
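A minimal sketch of that bar chart, assuming the batsman column holds one name per delivery. The DataFrame below is made up so that the snippet runs without the csv.

```python
import matplotlib

matplotlib.use("Agg")  # lets the sketch run without a display; not needed in Jupyter
import pandas as pd

# Made-up deliveries standing in for the batsman column of ipl_df.
ipl_df = pd.DataFrame(
    {
        "batsman": [
            "BB McCullum", "Batsman A", "BB McCullum",
            "BB McCullum", "Batsman B", "Batsman A",
        ]
    }
)

# value_counts gives deliveries faced per batsman; plot draws it via matplotlib.
deliveries = ipl_df["batsman"].value_counts()
ax = deliveries.plot(kind="bar")
ax.set_xlabel("Batsman")
ax.set_ylabel("Deliveries faced")
ax.figure.tight_layout()
print(deliveries)
```

The plot method returns a matplotlib Axes object, so labels and titles can be adjusted with the usual matplotlib calls.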

3. Let’s now find the batsmen who were ‘run out’ during the match. This will require passing a condition to the pandas DataFrame. From ipl_df.head(2) it seems we need to filter by out_type and get the out_batsman.

  • We need to pass the condition to the DataFrame: ipl_df[ipl_df['out_type']=='run out']
  • And retrieve only the batsman names: ipl_df['out_batsman'][ipl_df['out_type']=='run out']
  • The output will be something like this:
  • 197 AA Noffke
  • Name: out_batsman, dtype: object

4. For the frequency of runs overall, it would be probably nifty to use the value_counts() function again: ipl_df['runs'].value_counts()

5. For the first 10 deliveries of the 2nd innings, we subset the DataFrame with an innings condition and take the head of the result for the first 10 rows: ipl_df[ipl_df['innings']==2].head(10)

6. To find the team-wise runs, we chain the built-in sum function onto the more famous groupby function: we group the rows by batting team and take the sum of the runs column.
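A minimal sketch of the groupby step. The column name batting_team is an assumption (check ipl_df.columns for the real name in the csv), and the rows are made up.

```python
import pandas as pd

# Made-up rows; 'batting_team' is a hypothetical column name.
ipl_df = pd.DataFrame(
    {
        "batting_team": ["Team X", "Team X", "Team Y", "Team Y", "Team X"],
        "runs": [4, 6, 1, 0, 2],
    }
)

# Group the deliveries by team, then sum the runs within each group.
team_runs = ipl_df.groupby("batting_team")["runs"].sum()
print(team_runs)
```

Selecting the runs column before calling sum keeps the result a Series indexed by team, rather than summing every numeric column.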

7. Pie-chart of runs scored by batsmen in 2nd innings. It should be easy, we just need to pass an innings condition to our previous graph
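A minimal sketch of the pie chart, with made-up rows. The innings number is kept in a variable so that either innings can be plotted, and the percentage labels are an optional extra.

```python
import matplotlib

matplotlib.use("Agg")  # lets the sketch run without a display; not needed in Jupyter
import pandas as pd

# Made-up deliveries standing in for ipl_df.
ipl_df = pd.DataFrame(
    {
        "innings": [1, 1, 2, 2, 2, 2],
        "batsman": [
            "BB McCullum", "BB McCullum", "Batsman B",
            "Batsman C", "Batsman B", "Batsman C",
        ],
        "runs": [4, 6, 1, 2, 6, 0],
    }
)

innings_no = 2  # change to 1 for the first innings

# Filter on the innings, then total the runs per batsman.
runs_by_batsman = (
    ipl_df[ipl_df["innings"] == innings_no].groupby("batsman")["runs"].sum()
)

# autopct adds percentage labels to each slice.
ax = runs_by_batsman.plot(kind="pie", autopct="%1.0f%%")
ax.set_ylabel("")
ax.set_title(f"Share of runs by batsman, innings {innings_no}")
print(runs_by_batsman)
```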

8. Let’s try the box plot for runs scored by RCB in the 2nd innings

  • ipl_df[ipl_df['innings']==2]['runs'].plot(kind='box')

3.3 Further Reading: Pandas

Hope you had fun with the IPL dataset. We will cover more of visualization in the next post.

GreyAtom

GA DS

Sonik Mishra

Written by

Artificially Intelligent | Finance Professional | ML Hobbyist | IIM Indore | NIT Rourkela
