Web Scraping with Python — Indian Premier League Scores

Ankur Salunke
Sep 11, 2020


IPL 2020 is about to kick off in a few days, and just like every season, data science enthusiasts must be itching to get their hands on match data to slice and dice it for insights into the teams and players. The more ambitious of the lot will look to predict the results of these games.

The first part of any exploratory data analysis or model building is the collection of data. We can either wait for someone to collect the required data or proactively get it ourselves.

Web scraping comes to our rescue: it lets us collect data from a source all by ourselves, in the format we would like. Of course there are some limitations depending on the source, but we have greater control since we decide how and what we scrape from the data available there.

We will use espncricinfo.com to scrape the match and scores data, as they are kind enough to allow scraping with some restrictions. We will work with IPL 2019 data.

The infographic below gives a gist of how we will approach this task.

We have broken it down into three steps:

1. First we extract all the match links from the series results home page for IPL 2019. We use Scrapy to gather these links.

2. We then navigate to each match link and scrape the match details one by one, again using Scrapy.

3. Since we already have all the match links, we navigate to the commentary pages for each match, scrape the ball-by-ball details, save them to a dataframe and finally export it to a CSV file. Here we do not need Scrapy, since the commentary pages serve their data in JSON format.

Now let us move on to coding our scripts. We will work on two scripts: a Scrapy script to extract the match links and match details, and a regular Python script to extract the ball-by-ball details of all the matches.

I would recommend two resources to help you get started with Scrapy and XPath. The data in Scrapy's HTTP responses arrives as HTML and has to be parsed into something readable; XPath is one way to parse that HTML.

  1. Scrapy
  2. XPath

One important step before we start parsing the response data of the Scrapy requests is to study the structure of the web page we are trying to scrape. Below is an example of the source code for a web page on espncricinfo.

In Google Chrome we can right-click on the page and select "Inspect". The window on the right in the image above shows the source code in the Elements tab. We can search here for the data we want to scrape, and the tags/attributes in which it lives can then be targeted using XPath syntax.
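To make the XPath idea concrete, here is a tiny, self-contained example of parsing an HTML fragment with Scrapy's Selector. The snippet and class names are made up for illustration and are not espncricinfo's actual markup; the real structure has to be found through the Inspect window.

    from scrapy.selector import Selector

    # A made-up HTML snippet standing in for a fragment of a scraped page.
    html = '<div class="match"><span class="team">MI</span><span class="score">152/5</span></div>'
    sel = Selector(text=html)

    # XPath expressions pick out text nodes by tag name and class attribute.
    team = sel.xpath('//span[@class="team"]/text()').get()    # 'MI'
    score = sel.xpath('//span[@class="score"]/text()').get()  # '152/5'
    print(team, score)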

We move on to the code now. For the first two steps mentioned earlier, we will write a Scrapy spider.

Match Link and Details Extraction Scrapy Spider — Part 1

As part of the custom_settings for this spider we configure the output file and location, and we set a download delay of 15 seconds, since that is what espncricinfo recommends.

In the parse function we loop over the links extracted from the series results home page, then navigate to those match pages with another Scrapy request. Here we also pass a meta item, which lets us carry the match data along and fill it in with our parse_match function, as sketched below.
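Below is a minimal sketch of what Part 1 of the spider could look like. The start URL, settings values and XPath expression are placeholders standing in for whatever you find when inspecting the IPL 2019 series results page; it is not the exact code from the original gist.

    import scrapy


    class CricMatchSpider(scrapy.Spider):
        name = "cric_match"
        # Placeholder: replace with the IPL 2019 series results page URL.
        start_urls = ["https://www.espncricinfo.com/<ipl-2019-series-results-page>"]

        custom_settings = {
            # Export scraped items to a CSV file.
            "FEED_FORMAT": "csv",
            "FEED_URI": "match_details.csv",
            # 15-second gap between requests, as recommended by espncricinfo.
            "DOWNLOAD_DELAY": 15,
        }

        def parse(self, response):
            # Placeholder XPath: collect the href of every match link on the results page.
            for link in response.xpath('//a[contains(@href, "/match/")]/@href').getall():
                item = {"match_link": response.urljoin(link)}
                # Pass the partly filled item along in meta so parse_match can complete it.
                yield scrapy.Request(
                    response.urljoin(link),
                    callback=self.parse_match,
                    meta={"item": item},
                )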

Match Link and Details Extraction Scrapy Spider — Part 2

In the parse_match function we extract the match data from each match home page whose link we gathered in the parse function.

We extract the data using XPath expressions that target the required fields. In the code below, we can see that some basic preprocessing is done in addition to just extracting the data.

We take the item meta that we passed from the parse function, add all the extracted data to it, and yield the completed item.

The match data, along with the match home links, is exported to the match_details.csv file, as specified in the spider's custom settings.
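Continuing the same sketch, a hypothetical parse_match method (added to the spider class above) could look like the following. The field names and XPaths are placeholders; the real expressions depend on the match page's markup.

        def parse_match(self, response):
            # Retrieve the item carried over from parse() via the request meta.
            item = response.meta["item"]

            # Placeholder XPaths: the real expressions come from inspecting a match page.
            item["teams"] = response.xpath('//div[@class="teams"]//text()').getall()
            item["result"] = response.xpath('//div[@class="status-text"]/text()').get()
            item["venue"] = response.xpath('//td[@class="venue"]/text()').get()

            # Basic preprocessing: strip stray whitespace from string fields.
            item = {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}

            # Yielding the item lets the CSV feed exporter write it to match_details.csv.
            yield item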

Studying the source of the web pages takes some time, but it is rewarding in terms of finding the desired data. This step alone takes up about 50% of the effort, yet it makes the rest of the task much easier.

The Scrapy spider can be run from the terminal using the command below. The scrapy.cfg file should be in one of the parent folders.

scrapy crawl cric_match

Now we move on to the part where we extract the ball-by-ball details of all the matches.

This next section does not use Scrapy, only urllib3 and json, since the commentary data is available in JSON format. The espncricinfo commentary pages use infinite scrolling: the pages get loaded through AJAX calls. We isolated those calls using the Chrome developer tools → Network → XHR. We can reach the developer tools by right-clicking the page and selecting "Inspect".

Periods are 1 and 2 for the first and second innings. Pages correspond to the number of times the page gets extended during infinite scrolling; for T20 matches it does not go above 10. The league ID is espncricinfo's ID for IPL 2019, and the event ID is the match ID on espncricinfo. We get the event IDs from the links extracted by the Scrapy script.
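As an illustration of how these parameters fit together, the snippet below assembles a list of request URLs. The endpoint, the query-parameter names and the IDs are all placeholders, not confirmed API details; copy the exact XHR URL you isolated in the Network tab.

    # Placeholders: substitute the real XHR endpoint and parameter names you observed.
    BASE_URL = "<commentary-xhr-endpoint>"
    SERIES_ID = "<espncricinfo league ID for IPL 2019>"
    event_ids = ["<match IDs taken from the links in match_details.csv>"]

    request_urls = []
    for event_id in event_ids:
        for period in (1, 2):            # first and second innings
            for page in range(1, 11):    # infinite-scroll pages; stays under 10 for a T20
                request_urls.append(
                    f"{BASE_URL}?seriesId={SERIES_ID}&eventId={event_id}"
                    f"&period={period}&page={page}"
                )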

We place an HTTP request using urllib3. Since the data comes back in JSON format, we load it into "data" and flatten it using the json_normalize function.

After that we preprocess the data as per our requirement and finally export the processed data to score.csv.
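A rough sketch of that flow is shown below, assuming the request URLs built in the previous snippet and assuming the ball-by-ball records sit under a "comments" key in the JSON payload (an assumption, not a confirmed detail of the response).

    import json

    import pandas as pd
    import urllib3

    http = urllib3.PoolManager()
    frames = []

    for url in request_urls:  # the URLs assembled in the previous snippet
        response = http.request("GET", url)
        # Decode the response body and parse the JSON payload.
        data = json.loads(response.data.decode("utf-8"))
        # Flatten the nested JSON into a tabular dataframe; "comments" is an assumed key.
        frames.append(pd.json_normalize(data.get("comments", [])))

    balls = pd.concat(frames, ignore_index=True)
    # ...further preprocessing as required...
    balls.to_csv("score.csv", index=False)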

I hope this gave you an idea of the basics of web scraping. And since IPL 2020 is about to begin soon, these scripts can be used to create your own IPL 2020 dataset.

The entire code and output can be checked out here at my GitHub.

The data collection here can be improved vastly. Feedback and suggestions are welcome in the responses/comments.

Thank you for the patient read. Cheers!
