Web Scraping Cricket ODI Data from HowStat and Preprocessing the Data: Part 1

HARDI RATHOD
Analytics Vidhya
6 min read · May 26, 2019


Cricket is one of my favorite sports (although I can’t play it to save my life). And since the World Cup is drawing close, I am sure millions of cricket fans are trying to predict who is going to take the cricket glory home. I have already carried out some predictions for the Indian Twenty20 cricket league, the IPL, which are available at iplpredictormatches.pythonanywhere.com. I am hoping to build a similar predictor for the World Cup, too. This article covers the first step of that effort: the Data Gathering Phase.

The Data Gathering Phase can take up to 70-80% of the total time dedicated to any project. For gathering data, I am going to use web scraping, as most major cricket data is available on the web and can be accessed easily this way. HowStat is an excellently structured cricket statistics site, and it is the one I will be using in this article. Another great option is espncricinfo.com.

For this article, I will carry out only two tasks:

  1. Finding all the players who have ever played an ODI match, and
  2. Finding the scores of all the players in each year and how many matches they played in that year.

Let’s start with the first task. For web scraping, we will first import the following basic libraries:

Filename : scrapping.py

import csv  # for writing the scraped rows to a CSV file
import pandas as pd  # file operations
from bs4 import BeautifulSoup as soup  # scraping tool
from urllib.request import urlopen as ureq  # for requesting data from a link
import numpy as np
import re

Next, we will write code for web scraping using Beautiful Soup:

For the URL, I go to the HowStat website and decide to first take the data of players whose names start with the letter A, since HowStat groups players by the first character of their name. For simplicity, we begin with the character A.

Hence, the website URL is http://howstat.com/cricket/Statistics/Players/PlayerList.asp?Country=ALL&Group=A. Go to this link and press Ctrl+Shift+I to inspect the HTML code. Through this, you can see where the needed data lives in the HTML, which matters because we will scrape through that HTML. Next, since we need the data of all the players shown, we have two options.

  1. Take each piece of data individually.
  2. Take the whole table at once.

Obviously, the second option is more appealing and requires fewer lines of code.

For this, we need to find the table in the HTML code and note the value of its class attribute so that our code can locate it uniquely.

Inspecting the HTML shows that the table tag has the class attribute value TableLined.

Hence, our code becomes:

url = "http://howstat.com/cricket/Statistics/Players/PlayerList.asp?Group=A"
pagehtml = ureq(url)  # open the URL
page_soup = soup(pagehtml, "html.parser")  # parse the HTML (a new name avoids shadowing the imported BeautifulSoup alias)
table = page_soup.find("table", {"class": "TableLined"})
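If HowStat ever changes its markup, find() returns None and the loops below fail with a confusing error. A small guard (my own addition, not in the original code) makes this explicit:

if table is None:
    raise ValueError("Could not find the TableLined table; check the URL or page markup")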

Now, we need to access the individual cells of the table stored in the table variable. We need two for loops: one traversing rows and another traversing columns. We will store each row’s cells in a new variable, data.

rows = table.find_all('tr')  # find all tr tags (rows)
for tr in rows:
    data = []  # one list of cell values per row
    cols = tr.find_all('td')  # find all td tags (columns)
    for td in cols:
        data.append(td.text.strip())

Let’s save this data in a CSV file. So, our modified code yields:

with open('AZ.csv', 'a', newline='') as csvfile:
    f = csv.writer(csvfile)
    rows = table.find_all('tr')  # find all tr tags (rows)
    for tr in rows:
        data = []
        cols = tr.find_all('td')  # find all td tags (columns)
        for td in cols:
            data.append(td.text.strip())
        f.writerow(data)
        print(data)

Note: Here we open the file in append mode, i.e., 'a', so that whenever we add new data, our previous data won’t be erased.
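One caveat of append mode (my own note, not in the original): re-running the script appends the same rows again. If that happens, a short Pandas sketch like this can deduplicate the data; the output filename AZ_dedup.csv is hypothetical:

df = pd.read_csv('AZ.csv', header=1)
df = df.drop_duplicates()  # drop rows repeated by multiple runs
df.to_csv('AZ_dedup.csv', index=False)  # write to a new file to be safe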

If you check the CSV file created, you get all the data from the table.

Next, let’s preprocess this data.

If you observe the CSV file, the cells where ‘0’ or ‘null’ should appear are simply empty.

So, let’s replace the missing values with ‘0’. For this, we use the Pandas library.

First, we read the CSV file and then use the isnull() method to locate all the missing entries. True implies that the value is null.

After this, we simply replace the null values with ‘0’.

Here’s the code:

df = pd.read_csv("AZ.csv", header=1)  # header=1 so that the unnecessary first row is removed
print(df['ODIs'])
print(df['ODIs'].isnull())
df['ODIs'].fillna(0, inplace=True)
df['Tests'].fillna(0, inplace=True)
df['T20s'].fillna(0, inplace=True)
print(df)
df.to_csv('AZ2.csv', index=False)
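To confirm the cleaning worked, a quick check (my own addition) of the first few rows of the new file should show 0 where the blanks used to be:

print(pd.read_csv('AZ2.csv').head())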

So far, we have run this code only for the players whose names begin with A.

Let’s run the code for the other players as well. The only change in the code is the URL, where the Group identifier changes through A, B, C, …, Z.

So, we will now put the whole code in a function and use the ASCII values of the letters to build the URL.

e.g., 65 for A, 66 for B, 67 for C, and so on.
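Python’s built-in chr() turns an ASCII code into its character; a quick check (my own illustration) confirms the mapping:

print(chr(65), chr(66), chr(90))  # prints: A B Z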

Check out the new code:

import csv  # to do operations on CSV
import pandas as pd  # file operations
from bs4 import BeautifulSoup as soup  # scraping tool
from urllib.request import urlopen as ureq  # for requesting data from a link
import numpy as np
import re

def scrap(x):
    x = chr(x)  # change the integer into its character, e.g. 65 -> 'A'
    url = "http://howstat.com/cricket/Statistics/Players/PlayerList.asp?Group={}".format(x)
    pagehtml = ureq(url)
    page_soup = soup(pagehtml, "html.parser")  # parse the HTML; rebinding the imported soup name would break repeated calls
    table = page_soup.find("table", {"class": "TableLined"})
    with open('AZ.csv', 'a', newline='') as csvfile:
        f = csv.writer(csvfile)
        rows = table.find_all('tr')  # find all tr tags (rows)
        for tr in rows:
            data = []
            cols = tr.find_all('td')  # find all td tags (columns)
            for td in cols:
                data.append(td.text.strip())
            f.writerow(data)
            print(data)
    # Handling the missing values using Pandas
    df = pd.read_csv("AZ.csv", header=1)  # header=1 removes the unnecessary first row
    print(df['ODIs'])
    print(df['ODIs'].isnull())
    df['ODIs'].fillna(0, inplace=True)
    df['Tests'].fillna(0, inplace=True)
    df['T20s'].fillna(0, inplace=True)
    print(df)
    df.to_csv('AZ2.csv', index=False)

Now, we will call the scrap() function with different values.

e.g., to get the players whose names start with B, use:

scrap(66)

Run sequentially for all the letters, this appends every player’s data to the CSV file.
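A short loop (my own sketch, using the scrap() function defined above) covers the whole alphabet:

for code in range(65, 91):  # ASCII codes 65 to 90 map to 'A' through 'Z'
    scrap(code)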

Note: We have two CSV files:

  1. AZ.csv — contains raw data.
  2. AZ2.csv — contains preprocessed and cleaned data.

Now let’s find the players who have played at least one ODI match.

It can be done similarly for Tests and other formats; a sketch follows the code below.

Here is the code; my AZ2.csv file contains 4,539 player rows in total:

Filename : odiplayers.py

import csv

with open('AZ2.csv') as csvDataFile:
    data = list(csv.reader(csvDataFile))
    # converting the reader to a list so we can index into it easily

print("The players who played ODIs and the number of times they played are:")
print("-------------------------------------------------------")
for i in range(1, len(data)):  # skip the header row
    if data[i][4] != "0.0":  # column 4 holds the ODI count; "0.0" means none
        print(data[i][0], data[i][4])
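For Tests (or T20s), a small Pandas sketch along the same lines would work. The 'Tests' column name comes from the AZ2.csv header; taking the first column as the player name is my assumption about the file layout:

import pandas as pd

df = pd.read_csv('AZ2.csv')
name_col = df.columns[0]  # assumption: the first column holds the player name
tests = df[df['Tests'] != 0]  # keep players with at least one Test match
print(tests[[name_col, 'Tests']])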

In Part 2, I will solve the second problem, i.e., finding the scores of all the players in each year and how many matches they played in that year.
