Published in Analytics Vidhya

Image by Corinne Kutz on Unsplash

Web Scraping — BeautifulSoup

You don’t always have the data you need; there are times you need to go get it.

Web scraping can be an effective way to collect publicly available data that is embedded in web pages. Such data can serve a variety of use cases.

Let’s take an example.

Below is an ESPNcricinfo page that shows the Test career stats of top batsmen.

Wouldn’t it be neat to load all this data into a Python program for analysis instead of manually copying and cleaning it in a spreadsheet? That’s what web scraping can do.

Let’s see how.

We will need the following libraries:

1. requests: to send HTTP requests to the website and get its content

2. BeautifulSoup: to do all the beautiful work of web scraping

3. pandas: for analysis

Import libraries

Here are the steps we need.

1. Request the entire HTML content of the web page

Request content
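A minimal sketch of this step, assuming the address of the stats page is held in a variable named `url`. The function name `fetch_html` and the timeout value are illustrative choices, not taken from the original code.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Usage, with `url` set to the address of the page you want to scrape:
# html = fetch_html(url)
```

Calling `raise_for_status()` is optional, but it makes a failed request visible right away rather than later, when the parsing step finds no data.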

2. Create a BeautifulSoup object, passing it the content retrieved above

Create BeautifulSoup object
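As a rough sketch, this step could look like the following. A short inline HTML string stands in for the downloaded page so the example runs on its own, and the choice of `html.parser` is an assumption (other parsers such as lxml also work if installed).

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; in practice this is the HTML from step 1.
html = "<html><head><title>Sample stats page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# The soup object lets you navigate the document by tag name.
print(soup.title.get_text())
```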

3. Find the HTML tags or elements where the required data resides. You need to know the structure of the page; the browser’s ‘View page source’ option helps.

In this case, the data resides within elements with class = data1.

View page source — class = data1

All we need to do is find every element with the above class. This can be done using the find_all method available in BeautifulSoup, which returns all the matching tags in a list.

Find all instances of class = data1
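A small sketch of the lookup, run against a made-up table rather than the real page. It assumes the `data1` class sits on table rows; the player names and numbers are placeholders, not real statistics.

```python
from bs4 import BeautifulSoup

# Made-up table that imitates the assumed structure of the stats page.
html = """
<table>
  <tr class="head"><th>Player</th><th>Runs</th><th>100</th></tr>
  <tr class="data1"><td>Player A</td><td>120</td><td>3</td></tr>
  <tr class="data1"><td>Player B</td><td>95</td><td>1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) is used because class is a Python keyword.
rows = soup.find_all(class_="data1")

print(len(rows))  # the header row is skipped because its class is different
```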

4. Extract the data

As you see above, the data is within each of the <td> tags. The code for this step extracts the data value from each <td> tag and stores the values in another list.

Extract the data

a. The get_text method extracts all the data values into a ‘|’-separated string

b. The inner loop simply removes the newlines

c. The outer loop creates a list named final_data, which in turn contains a list of data points (final_attributes) for each player
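The three points above can be sketched as follows. This is a reconstruction from the description, not the original code, and it runs against a made-up table with placeholder names and numbers.

```python
from bs4 import BeautifulSoup

# Made-up rows; each <td> sits on its own line, so newlines end up in the text.
html = """
<table>
<tr class="data1">
<td>Player A</td>
<td>120</td>
<td>3</td>
</tr>
<tr class="data1">
<td>Player B</td>
<td>95</td>
<td>1</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all(class_="data1")

final_data = []
for row in rows:  # outer loop: one pass per player
    # Join every piece of text in the row with '|' as the separator.
    text = row.get_text("|")
    final_attributes = []
    for value in text.split("|"):  # inner loop: drop the newline-only pieces
        value = value.strip()
        if value:
            final_attributes.append(value)
    final_data.append(final_attributes)

print(final_data)
```

One caveat with this approach: a cell that is genuinely empty is dropped along with the newlines, which would shift the remaining values in that row. Iterating over `row.find_all("td")` instead keeps one entry per cell.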

5. Pack the data into a DataFrame

Create pandas DataFrame
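A minimal sketch of this step, assuming `final_data` is a list of lists as described in step 4. The column names and values here are placeholders; the real page has more columns.

```python
import pandas as pd

# Placeholder output of the extraction step: one inner list per player.
final_data = [
    ["Player A", "120", "3"],
    ["Player B", "95", "1"],
]

# Hypothetical column names; adjust them to match the scraped table.
columns = ["Player", "Runs", "Hundreds"]
df = pd.DataFrame(final_data, columns=columns)

# Scraped values arrive as strings, so convert the numeric columns.
for col in ["Runs", "Hundreds"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

The conversion matters for the analysis step: without it, sums, sorting, and histograms would operate on strings rather than numbers.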

6. Analyze the data

You are all set to use the scraped data. Let’s create a histogram of the number of hundreds scored by the players.

Create histogram
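One way to draw such a histogram, sketched with matplotlib. The DataFrame below holds placeholder numbers, and the column name `Hundreds` and the bin count are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data; in practice df comes from the previous step.
df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
        "Hundreds": [3, 1, 7, 4, 2],
    }
)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges along with the drawn bars.
counts, bin_edges, _ = ax.hist(df["Hundreds"], bins=5)
ax.set_xlabel("Number of hundreds")
ax.set_ylabel("Number of players")
ax.set_title("Distribution of hundreds")
```

In a script, call `plt.show()` afterwards to display the figure; in a notebook it is rendered automatically.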

Not many players with more than 40 test hundreds. Hmm…

Note: While web scraping works well with static HTML pages, I haven’t tried it on dynamic pages.

A word of caution about web scraping (Credits — Codecademy)

  • Always check the website’s Terms and Conditions before scraping
  • Read the statement on legal use of the data (you could get into trouble)
  • Do not spam the website with too many requests, as this can overload the server
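To illustrate the last point, here is a small sketch that pauses between consecutive requests. The helper name `fetch_all_politely` and the one-second default are illustrative assumptions; a suitable delay depends on the site.

```python
import time


def fetch_all_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages


# Demo with a fake fetch function so nothing is sent over the network.
demo_pages = fetch_all_politely(["page-1", "page-2"], fetch=str.upper, delay_seconds=0.1)
print(demo_pages)
```

In real use, `fetch` would be a function that downloads a page, for example one built on `requests.get`.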

Where have you used web scraping?


