Mudassir Khan · Published in GreyAtom · Jun 30, 2017


Web Scraping

As an aspiring data scientist you tend to bump into all sorts of data problems: missing data, incomplete data, varying formats and so on. Converting them into structured, error-free data is the basis for any further work. At one point or another, getting data from websites becomes inevitable, and you have to learn at least the basics of web scraping to get the work done. I have used cricsheet's data (ball-by-ball structured data of cricket matches) extensively for a lot of cricket analysis and stats, but its coverage is limited. I wanted some older data not covered by it, and my only options were to either manually enter each ball's data or scrape it from the web, which looked like an improbable task for me.

Luckily I decided to scrape it, and that's when I ran into BeautifulSoup, a Python library for pulling data out of HTML and XML files. Using BeautifulSoup, Pandas, regular expressions and days of learning and hard work, I was finally able to write a one-click script that reads a csv file containing links to the missing cricket matches and converts them into a format I can push into my data preparation pipeline. I'll share some of what I learned in the process of making this.

Installing BeautifulSoup

If you are using Anaconda, it should already be installed; if you don't use Anaconda, follow this link.
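In case you need to install it yourself, the usual commands look something like this (assuming you have pip or conda available on your path):

# with pip
pip install beautifulsoup4

# or with conda
conda install beautifulsoup4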

HTML Tags

Learn a few basic tags if you don't know them already, like html, head, body, div, p, a and table, and pick up more as you need them. Learn about the id and class attributes, as they are very helpful in navigating to the element you want. Also understand the structure of HTML; it helps to think of it as a family tree, with tags higher up acting as parents of the tags nested inside them. You'll find this terminology a lot in the official documentation of BS.
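To get a feel for that family tree, here is a small sketch using a made-up snippet of HTML (the tag names and attribute values are purely illustrative): the p tag is a child of the div, and the div is its parent.

# a tiny, made-up HTML document to illustrate the tree structure
from bs4 import BeautifulSoup

html = '<html><body><div id="match-info" class="summary"><p>India won by 5 runs</p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
print(p.parent.name)         # 'div' -> the parent of the p tag
print(p.parent['class'])     # ['summary'] -> the class attribute of that parent div
print(soup.body.div.p.text)  # 'India won by 5 runs' -> walking down the tree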

Initializing Python Code

# import libraries
import requests
from bs4 import BeautifulSoup

# specify url
url = 'http://www.espncricinfo.com/icc-champions-trophy-2013/engine/match/566948.html'

# request html
page = requests.get(url)

# Parse the html using BeautifulSoup; you can use a different parser like lxml if it is installed
soup = BeautifulSoup(page.content, 'html.parser')

The next step is to find the required data. The url is a match scorecard and we want to find who won the match, so we have to find the shortest unique way to locate that data in the html. The easiest way to see where the required data lives is to find it in the browser: right click on the page and click Inspect, then click on the part of the page where it says 'India won by 5 runs' and the corresponding html should appear in the inspector. Looking at it, we can see that the element has a class attribute which looks unique. We'll use it to navigate to the data.

Navigating and Searching using BeautifulSoup

Since we know the unique class of the div, it makes our job easier and we can find it using the find() function.

# find searches the given tag (div) with given class attribute and returns the first match it finds
win_div = soup.find('div', class_='innings-requirement')

# Extracting the text out of the div
win_text = win_div.text

Use some simple regex to extract the team from the text.

# import regular expressions module
import re

# match everything before ' won' in the text
team = re.search('(.*) won', win_text).group(1)

In addition to searching the entire tree, you can easily navigate up, down and within a level. To illustrate this, let's start from the winning text and navigate to the date of the match. We first need to navigate to the parent div using find_parent().

win_parent = win_div.find_parent('div')

Then move to the next sibling using find_next_sibling().

next_div = win_parent.find_next_sibling('div')

To reach the date div we need to drop down a level in the tree. We can use next_element for that, which moves to whatever the next element is; after that we can skip ahead three divs by calling find_next() repeatedly until we reach the div containing the date.

date_div = next_div.next_element.find_next('div').find_next('div').find_next('div')

These are just some of the ways you can navigate and search for the data you need; I hope you find them helpful.

Tips:

1. Reference the official documentation whenever you are in a pickle.

2. There are many ways to access the same data; initially, don't fuss about finding the best way, just try to get the job done.

3. Some identifiers can seem unique but might not be; you can search all matching tags using find_all() to confirm it.

4. If you are scraping multiple pages, allow your script to pause for a second between requests (use time.sleep(1)) so that the servers don't get overloaded; see the sketch below.
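Putting tips 3 and 4 together, a loop over multiple pages might look roughly like this. The urls list and the innings-requirement class are just placeholders carried over from the example above, so adapt them to your own pages.

import time
import requests
from bs4 import BeautifulSoup

# example scorecard urls; replace with the links read from your csv
urls = [
    'http://www.espncricinfo.com/icc-champions-trophy-2013/engine/match/566948.html',
]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # tip 3: confirm the identifier really is unique before relying on find()
    matches = soup.find_all('div', class_='innings-requirement')
    if len(matches) == 1:
        print(url, '->', matches[0].text.strip())
    else:
        print(url, '-> found', len(matches), 'candidate divs, check the identifier')

    # tip 4: pause between requests so the servers don't get overloaded
    time.sleep(1)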
