Web Scraping IPL Statistics

Shubham Bhandari
8 min read · Aug 11, 2019


Photo by Aksh yadav on Unsplash

We will be creating a CLI application that extracts player and team statistics from the official website of the Indian Premier League. The app will be able to extract about 30 different statistics for the years 2008 through the latest edition.

Cricket is a bat-and-ball game played between two teams of 11 players each. It is the second most-watched sport in the world, and the IPL, or Indian Premier League, is one of the most lucrative and widely watched leagues in the sport.

Enough talking, let’s dive right into it.

We’ll be using Beautiful Soup for the job. It is a Python library used to extract data out of HTML and XML files. It works with a parser to provide idiomatic ways of navigating, searching and modifying the parse tree.

Next, we will be using Requests, which will help us download the HTML content from the official IPL website.

Since we are creating a CLI app, we will not provide a GUI; instead, we will build the command-line interface using PyInquirer, a collection of common interactive command-line user interfaces.

We will also use re for regular expressions, os to handle directory paths, and signal and sys to handle keyboard interrupts and exit the application gracefully. Last on the list are, of course, pandas and numpy for data manipulation.

I assume that the reader has basic knowledge of these packages or can look up the functions as they encounter them. The reader is also assumed to have a basic knowledge of HTML syntax.

Stats page of the official website of IPL.

I also recommend opening the IPL website and your favorite browser’s developer tools to follow along.

Overview of the application

The application will obtain the following data from the user:

  1. Year: The user can select any year from 2008 to the current season, or all-time records.
  2. Stats: Since the website provides many datasets, the user can select some or all of them.

After this, we will construct a URL. The URLs are of the form

https://www.iplt20.com/stats/<year>/<stats name>

Once we construct the URL, we will use the Requests package to download the HTML. This HTML will then be parsed with an HTML parser and converted into a Beautiful Soup object, which will then be used throughout to extract meaningful data.
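As a rough sketch of this step (the helper name make_soup and the example page most-runs are my own, based on the URL pattern above), it might look like this:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.iplt20.com/stats/'

def make_soup(year, stats_name):
    # e.g. year='2019', stats_name='most-runs' gives
    # https://www.iplt20.com/stats/2019/most-runs
    url = BASE_URL + year + '/' + stats_name
    html = requests.get(url).text               # download the HTML with Requests
    return BeautifulSoup(html, 'html.parser')   # parse it into a Beautiful Soup object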

When completed, our app will look something like this:

Querying user regarding the years for which data is to be extracted
Data collected and saved in the form of a CSV file

Let’s dive into the code.

Create a Python script file and name it whatever suits you. I am naming mine scrapper.py (so original!). Import the libraries as shown below:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np
from PyInquirer import prompt
import os
import signal
import sys

So far so good, let’s get our hands dirty in some Python. Shall we?

First of all, we must make arrangements to call the main() function when our script is run. Also, the UI may raise an exception in some IDEs (frankly, it’s PyCharm), so we will wrap the call to main() in a try-except block so that the application exits gracefully if it can’t run.

For now, leave the main() function blank, we will be coming back to it.
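A minimal sketch of that wiring, assuming a plain try-except around the call (the exit message is my own):

def main():
    pass  # we will fill this in later

if __name__ == '__main__':
    try:
        main()
    except Exception:
        # some IDE consoles cannot render the interactive prompt
        print('Unable to run the interactive UI in this console. Exiting.')
        sys.exit(1)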

We will also extract the available years and stats from the website itself rather than hard-coding them. This makes the app future-proof: when a new season’s data is added to the website, the app will automatically pick it up. In the following functions, we obtain all the years and stats available on the website.
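Here is a minimal sketch of those functions; the class names come from the breakdown that follows, while the selector inside get_stats() is an assumption about the site’s markup:

def get_year_stats():
    # download the stats landing page and build a soup object
    html = requests.get('https://www.iplt20.com/stats/').text
    soup = BeautifulSoup(html, 'html.parser')
    years = get_years(soup)
    stats_url, stats_title = get_stats(soup)
    return years, stats_url, stats_title

def get_years(soup):
    # year links carry 'sub-menu' somewhere in their class name
    years = [year.get_text() for year in
             soup.find_all('a', class_=re.compile(r'sub-menu*'))]
    # the display name 'All Time Records' maps to 'all-time' in the URL
    return ['all-time' if y == 'All Time Records' else y for y in years]

def get_stats(soup):
    # assumed selector: every stats link holds the page name in its href
    # and the display title in its text
    stats_url, stats_title = [], []
    for link in soup.find_all('a', href=re.compile(r'/stats/')):
        stats_url.append(link['href'].rstrip('/').split('/')[-1])  # e.g. 'most-runs'
        stats_title.append(link.get_text().strip())                # e.g. 'Most Runs'
    return stats_url, stats_title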

This is what is happening in the above snippet:

We are obtaining all the year values that the site hosts along with the available stats.

We call the function get_year_stats() in our main() function. This function obtains the HTML from the URL, creates its soup object, and passes it to the get_years() and get_stats() functions.

The stats are kept in two variables, stats_url and stats_title. The former stores the page name of each stats page and the latter stores the title that will be displayed to the user.

You see, this is what I am talking about: the red line shows the page name, while the orange ellipse highlights the page title. Notice how the two differ.

get_years(soup): Extracts the year values from the website. Let me break down its key line, which is composed of three parts:

  1. re.compile(r'sub-menu*'): creates a regex that matches tags whose class name contains 'sub-menu'.
  2. soup.find_all('a', class_=re.compile(r'sub-menu*')): finds all the 'a' tags with the matching class name.
  3. years = [year.get_text() for year in soup.find_all('a', class_=re.compile(r'sub-menu*'))]: a list comprehension that applies get_text() to every element returned by find_all().

Right after that, we replace ‘All Time Records’ with ‘all-time’, the former being the display name and the latter being the part used in the URL.

Similarly, we extract the stats titles and the stats page names in the get_stats(soup) function.

In the next step, we will create lists of dictionaries for PyInquirer. These will provide the user with options to select the year and stats.

We will use the years list to create a list of dictionaries with ‘name’ as the key and the year as the value.

We will use the stats_title entries as the names displayed to the user and the corresponding stats_url entries as their values, which will be returned to the program when the user selects a particular stat.

For example, name: Most Runs with the respective value: most-runs. The value is the page name in the URL and the name is the title of the page.

The following code prepares the PyInquirer questions and stores the user’s response in the answers variable.
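A rough sketch of prepare_question() and user_input(), following the structure described above (the prompt messages are my own, and the exceptions caught mirror the explanation below):

def prepare_question(years, stats_title, stats_url):
    years_q = [{
        'type': 'checkbox',
        'name': 'years',
        'message': 'Select the years to scrape',
        'choices': [{'name': year} for year in years],
    }]
    stats_q = [{
        'type': 'checkbox',
        'name': 'stats',
        'message': 'Select the stats to scrape',
        'choices': [{'name': title, 'value': url}
                    for title, url in zip(stats_title, stats_url)],
    }]
    return years_q, stats_q

def user_input(years_q, stats_q):
    answers = {}
    while not answers.get('years') or not answers.get('stats'):
        try:
            answers = prompt(years_q)
            answers.update(prompt(stats_q))
        except (TypeError, EOFError):
            # TypeError covers Ctrl+C, EOFError covers Ctrl+D
            sys.exit('Interrupted, exiting.')
    return answers['years'], answers['stats']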

Here we are using a while loop to handle an empty response from the user.

TypeError is caught to handle Ctrl+C.

EOFError is caught to handle Ctrl+D.

Let’s recap what we have done till now:

We have obtained the year values and the stats values (page names and page titles), and created a query that obtains the user’s response and stores it in a variable called answers.

Photo by Craig Hughes on Unsplash

It’s time we get to the main part, that is, extracting the players’ data:
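Here is a minimal sketch of the scraping driver; the team-ranking marker, error message, and output file name are my own choices, and the helpers it calls are sketched further below:

def scrap_data(years, stats):
    base = 'https://www.iplt20.com/stats/'
    for year in years:
        for stat in stats:
            team = (stat == 'team-ranking')  # hypothetical marker for the ranking page
            url = base + year if team else base + year + '/' + stat
            soup, columns = get_page(url, team)
            if soup is None:
                continue
            data = get_team_data(soup, columns) if team else get_player_data(soup, columns)
            save_data(data, columns, stat, year + '_' + stat + '.csv')

def get_page(url, team):
    # preliminary check: only parse the page if the request succeeded
    response = requests.get(url)
    if response.status_code != 200:
        print('Could not fetch', url)
        return None, None
    soup = BeautifulSoup(response.text, 'html.parser')
    columns = find_col(soup, team)
    return soup, columns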

We define a function called scrap_data(years, stats) that will iterate over all the selected years and stats page names. The pages can be divided into two types:

  1. The team ranking page
  2. The player stats pages

All the pages of the first type share the same HTML layout, and the same goes for the second type. That is why the function branches on the page type: in the first case we don’t need any page name, and in the second case the stats page name is concatenated to our base URL.

We then define a function called get_page(url, team) that will do a preliminary check and create a soup object of the page. The team argument is a boolean value set to True for the first type of page and False for the second type.

The rest of the code in the above snippet is self-explanatory. We call the find_col(soup, team) function in get_page(url, team), which returns the header values of all the columns present in the table on the webpage.

To proceed, we create three functions: find_col(soup, team), get_team_data(soup, columns), and get_player_data(soup, columns).

The first function finds the column headers, while the second and third functions scrape the first and second types of data, respectively.
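Here is a minimal sketch of the three functions, built around the four key lines quoted below; everything else is an approximation:

def find_col(soup, team):
    if team:
        # header row of the team ranking table
        columns = list(filter(None, soup.find('tr', class_='standings-table__header').get_text().split('\n')))
    else:
        # header row of the player stats tables
        columns = re.sub(r'\n[\s]*', '\n',
                         soup.find('tr', class_=re.compile(r'top-players__header*')).get_text()
                         ).strip().split('\n')
    return columns

def get_team_data(soup, columns):
    data = []
    for i in soup.find_all('td'):
        data.append(re.sub(r'\n[\s]*', ' ', i.get_text().strip()))
    # reshape the flat list of cells into rows of len(columns) values
    return np.array(data).reshape(len(data) // len(columns), len(columns))

def get_player_data(soup, columns):
    data = []
    for i in soup.find_all('td', class_=re.compile(r'top-players*')):
        data.append(re.sub(r'\n[\s]*', ' ', i.get_text().strip()))
    return np.array(data).reshape(len(data) // len(columns), len(columns))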

The whole magic happens in four lines. Let me break them down for you:

In find_col(), for the team ranking page:
columns = list(filter(None, soup.find('tr', class_='standings-table__header').get_text().split('\n')))

This finds the ‘tr’ tag with the specified class name, extracts its text, and splits it into a list. Finally, the blank entries are filtered out, leaving only the column headers.

In find_col(), for the player stats pages:
columns = re.sub(r'\n[\s]*', '\n', soup.find('tr', class_=re.compile(r'top-players__header*')).get_text()).strip().split('\n')

This finds the ‘tr’ tag whose class contains ‘top-players__header’, gets its text, replaces every run of newline and space characters with a single newline, strips the surrounding whitespace, and finally splits on ‘\n’ to form a list.

In get_team_data():
for i in soup.find_all('td'):
    data.append(re.sub(r'\n[\s]*', ' ', i.get_text().strip()))

This loop extracts all the ‘td’ tags and formats their values to collect the team data.

In get_player_data():
for i in soup.find_all('td', class_=re.compile(r'top-players*')):
    data.append(re.sub(r'\n[\s]*', ' ', i.get_text().strip()))

This extracts the ‘td’ tags whose class contains ‘top-players’ and formats the values into a list called data.

Near the end of get_player_data(soup, columns) we convert the data list into a numpy array and reshape it from a 1D array into a 2D array. The number of columns is given by the length of the columns list, and the number of rows by the integer division of the length of the data list by the length of the columns list. The same goes for get_team_data(soup, columns).
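As a toy illustration of that arithmetic (the numbers are made up):

data = ['x'] * 500     # 500 cell values scraped from a table
columns = ['c'] * 10   # 10 column headers
table = np.array(data).reshape(len(data) // len(columns), len(columns))
print(table.shape)     # (50, 10)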

So, we now have the extracted data in a 2D numpy array called data and the column headers in a list called columns.

The next step is to convert it to a dataframe and save it in the form of a CSV file.

We will create a save_data(data, columns, data_set_name, file_name) function which will create a dataframe from the data and build a file path based on the selected dataset, after which it saves the dataframe as a CSV file and notifies the user.
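A minimal sketch of save_data(), assuming a simple output layout (the directory structure and the printed message are my own):

def save_data(data, columns, data_set_name, file_name):
    df = pd.DataFrame(data, columns=columns)
    # keep the CSVs for each dataset in their own folder
    out_dir = os.path.join(os.getcwd(), 'data', data_set_name)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, file_name)
    df.to_csv(path, index=False)
    print('Saved', path)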

The remaining three functions handle keyboard interrupts.
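The original uses three small helpers for this; here is a condensed sketch of the idea using the signal module:

def keyboard_interrupt_handler(sig, frame):
    # called when the user presses Ctrl+C
    print('\nInterrupted. Exiting gracefully.')
    sys.exit(0)

# register the handler for SIGINT (Ctrl+C)
signal.signal(signal.SIGINT, keyboard_interrupt_handler)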

Whew! Nearly done. Just one more step. To tie all the code together we will edit our main() function.

The main() function will call get_year_stats() to get the year values, stats page names, and titles. After that, we call prepare_question(years, stats_title, stats_url) to prepare the query for the user, get the user’s input from the user_input(years_q, stats_q) function, and then start scraping by calling the scrap_data(years, stats) function.
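Putting it all together, and staying consistent with the sketches above, main() might read:

def main():
    years, stats_url, stats_title = get_year_stats()
    years_q, stats_q = prepare_question(years, stats_title, stats_url)
    selected_years, selected_stats = user_input(years_q, stats_q)
    scrap_data(selected_years, selected_stats)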

And we are done! Yipeee. Congratulations, you have created a CLI-based web scraper for IPL player and team stats.

Check out the GitHub repository for the complete code:

This was the first of many Medium articles to come. Comments are always welcome.

👏 if you like the article and learned from it.


