Grabbing data from the internet: Web scraping

Will Karnasiewicz
May 6 · 5 min read

Whether in college, graduate school, or a 12-week intensive data science bootcamp, when you first start out you’re often just handed a dataset in an easy-to-read format such as a .csv (comma-separated values) file. You load your data, clean your data, build and evaluate your model, and go on your merry way. But then you quickly realize: you don’t just want to work with 40-year-old housing price data from Boston. And not every piece of data is structured data, i.e. data with a high degree of organization, such as data residing in a relational database or a tabular format.

In fact, “[a]ccording to IDC, the total volume of data will reach 163 zettabytes in 2025. It is expected that 80% of this will be unstructured data.” (Capgemini, 2018)

So if it’s not available in a structured format, how do I gather the data? One answer is web scraping!

Web Scraping

Web scraping is a method of extracting data from websites. Part art, part science, web scraping allows you to programmatically dig into a page’s HTML, grab the information you need, and transform it into structured data that can be easily consumed.

But first, a caveat! Many websites outlaw or effectively outlaw scraping, so make sure you know what rules are in place before scraping with wild abandon.

First check the site’s robots.txt page, which can be accessed at <your_url_goes_here>/robots.txt. Here the website outlines its Robots Exclusion Protocol, a standard format for communicating with web crawlers and other web robots about which parts of the website can and cannot be accessed. The robots.txt page lists user-agent names to identify specific bots, along with the web paths each crawler is allowed or disallowed from scraping.

Some websites disallow scraping altogether, including, not surprisingly, Facebook.

Facebook is explicit in its prohibition of web crawlers: facebook.com/robots.txt shows the standard format of a blanket web scraping ban. By contrast, owgr.com/robots.txt allows all bots to scrape all pages except those under the /Sitecore/ path.
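In robots.txt terms, those two policies look roughly like this (simplified; the live files may contain additional entries):

# facebook.com/robots.txt (simplified): a blanket ban for unlisted bots
User-agent: *
Disallow: /

# owgr.com/robots.txt (simplified): everything allowed except /Sitecore/
User-agent: *
Disallow: /Sitecore/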

So before you get started scraping, pull up the robots.txt and look for User-agent: *, which applies to all bots. Looking at the examples above, you’re prohibited from scraping all Facebook web pages, while you’re allowed to scrape the official world golf rankings as long as you don’t scrape their internal content management system pages at the /Sitecore/ path.

Scraping the Official World Golf Rankings

So let’s run through how to create a Python script for scraping the top golfers in the world according to the OWGR. We’re going to be using the following packages: requests for easily making HTTP requests, Beautiful Soup 4 for parsing data from HTML, and pandas for easily transforming our data into a structured form.
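First, the imports:

# Import the three packages used throughout the script
import requests
from bs4 import BeautifulSoup
import pandas as pd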

Next, we define the url path to which we want to make a request. I’ve already spent a fair bit of time on the Official World Golf Rankings website, so I know that we want the page with the top 300 players in the world.

Assigning a value to the url variable (we can also make this more flexible with f-strings)
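Something like the following sketch; the query parameter here is my assumption about how to request the top 300 players, not a confirmed OWGR URL:

# Hypothetical URL; check the actual query parameters on owgr.com
page_size = 300
url = f'http://www.owgr.com/ranking?pageSize={page_size}'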

Then, using requests and Beautiful Soup 4, we can generate a Beautiful Soup object of the webpage. The Beautiful Soup object is a representation of the whole document as a tree of HTML tags. You can also specify a parser; the “lxml” parser is most commonly used due to its speed and flexibility when parsing HTML.
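That step takes only two lines:

# Request the page and parse the response HTML into a soup object
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')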

Now you have a Beautiful Soup object!

Now let’s roll up our sleeves and dig into the HTML of the webpage with your browser’s developer tools. (For Google Chrome, right-click and select ‘Inspect’; for Apple’s Safari browser, Ctrl + Click and select “Inspect Element”. You may need to first enable the “Develop” menu by going to Preferences > Advanced.) Using the element selection tool you can quickly highlight an element and identify the corresponding HTML tags.

The rankings are organized neatly in an HTML table tag.

Using the .find() method on our soup object, we can grab the first HTML tag that matches certain criteria. In this case, we want the first <table> tag.

# Find the first table element on the page
table = soup.find('table')

Digging into the HTML further, we notice that the first row (<tr>) sits within the table head (<thead>) and contains the titles of each column of the table. Then, in the table body (<tbody>), each row represents a player, with separate table cell tags (<td>) for each value associated with that player. We can use .find_all() to select all the rows in the table, iterate through them, and select whichever elements we want from each row, as in the sketch below.

The full loop iterating through the HTML table
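Here’s a sketch of that loop. Only the ‘name’ cell class is confirmed in the walkthrough below; the class names for the country and rank cells are my assumptions:

# Build a list of dictionaries, one per player row
players = []
for row in table.find('tbody').find_all('tr'):
    player = {}
    player['name'] = row.find('td', {'class': 'name'}).text
    player['country'] = row.find('td', {'class': 'country'}).text  # class name assumed
    player['rank'] = row.find('td', {'class': 'rank'}).text        # class name assumed
    players.append(player)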

Let’s walk through the first assignment:

player['name'] = row.find('td', {'class': 'name'}).text

In this case, each row contains a <td> tag with the attribute class="name". Inside that tag is the name of the player (i.e. Dustin Johnson in the first row) as text. Using .find() as above, we specify the name of the tag ('td') along with any identifying attributes as a dictionary, with the attribute type as the key, e.g. {'class': 'name'}. If there is a matching tag, we can then grab its contents, including, in this case, the text via .text.

We can repeat the process for any information that we want to grab, including, in this case, country and rank. Once we have the entire table as a list of dictionaries we can use pandas to easily transform the data to a dataframe and export in a structured data format.
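The final step is just two lines of pandas (the output filename here is only an example):

# Convert the list of player dictionaries to a DataFrame and write a CSV
df = pd.DataFrame(players)
df.to_csv('owgr_top_300.csv', index=False)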

Now we have a CSV with the OWGRs!

As you might have gathered from this walkthrough, although the workflow is similar from site to site, the exact steps can vary greatly because webpage layouts differ so widely! And if a page is redesigned, your script will need to be rewritten.

But have no fear; this is where your creativity comes in! So go out there, pick a suitable website, and get scraping!

The time I won the US Open.

P.S. You can find this example script and more on my GitHub repo.

Written by Will Karnasiewicz

Data scientist; CFA charterholder and financial valuation specialist; avid golfer and racquet sport aficionado; homebrewing hobbyist; TWTR: @wmkarney
