Grabbing data from the internet: Web scraping

Will Karnasiewicz
May 6, 2019

Whether in college, graduate school, or a 12-week intensive data science bootcamp, when you first start out you’re often just handed a dataset in an easy-to-read format such as a .csv (comma-separated values) file. You load your data, clean your data, build and evaluate your model, and go on your merry way. But then you quickly realize: you don’t just want to work with 40-year-old housing price data from Boston. And not every piece of data is structured data, i.e. data with a high degree of organization, such as data residing in a relational database or in a tabular format.

In fact, “[a]ccording to IDC, the total volume of data will reach 163 zettabytes in 2025. It is expected that 80% of this will be unstructured data.” (Capgemini, 2018)

So if the data isn’t available in a structured format, how do you gather it? One answer is web scraping!

Web Scraping

Web scraping is a method of extracting data from websites. Part art, part science, web scraping allows you to programmatically dig into the HTML, grab the information you need, and transform it into structured data that can be easily consumed.

But first, a caveat! Many websites outlaw or effectively outlaw scraping, so make sure you know what rules are in place before scraping with wild abandon.

First, check the site’s robots.txt file, which can be accessed at <your_url_goes_here>/robots.txt. Here the website outlines its Robots Exclusion Protocol, a standard format for communicating with web crawlers and other web robots about which parts of the website can and cannot be accessed. The robots.txt file lists user-agent names to identify specific bots, along with the web paths each crawler is allowed or disallowed from scraping.

Some websites disallow scraping altogether, including, not surprisingly, Facebook.

So before you get started scraping, pull up the robots.txt and look for User-agent: *, which applies to all bots. For example, Facebook’s robots.txt prohibits scraping of all its pages, while the Official World Golf Rankings’ allows scraping as long as you stay away from its internal content management system pages at the /Sitecore/ path.
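If you’d rather check programmatically, Python’s standard library includes a parser for the Robots Exclusion Protocol. Here’s a minimal sketch; the www.owgr.com address is an assumption based on the example above:

```python
from urllib.robotparser import RobotFileParser

# www.owgr.com is an assumed address for the Official World Golf Rankings site
parser = RobotFileParser('http://www.owgr.com/robots.txt')
parser.read()  # fetch and parse the robots.txt file

print(parser.can_fetch('*', 'http://www.owgr.com/ranking'))    # True if allowed
print(parser.can_fetch('*', 'http://www.owgr.com/Sitecore/'))  # False, per the rule above
```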

Scraping the Official World Golf Rankings

So let’s run through how to create a Python script for scraping the top golfers in the world according to the OWGR. We’re going to be using the following packages: requests for easily making HTTP requests, Beautiful Soup 4 for parsing data from HTML, and pandas for easily transforming our data into a structured form.
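Assuming you’ve installed the packages (e.g. pip install requests beautifulsoup4 lxml pandas), the imports look like this:

```python
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing
import pandas as pd            # tabular data wrangling
```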

Next, we define the URL to which we want to make a request. I’ve already spent a fair bit of time on the Official World Golf Rankings website, so I know that we want the page with the top 300 players in the world.
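Something like the following; the query parameters here are an assumption about how the OWGR site exposes its top-300 list, so check the live site for the real path:

```python
# Hypothetical URL: verify the page number/size parameters against the live site
url = 'http://www.owgr.com/ranking?pageNo=1&pageSize=300&country=All'
```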

Then, using requests and Beautiful Soup 4, we can generate a Beautiful Soup object of the webpage. The Beautiful Soup object is a representation of the whole document as a tree of HTML tags. You can also specify a parser; the “lxml” parser is most commonly used due to its speed and flexibility when parsing HTML.
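In code, that step is just a couple of lines:

```python
response = requests.get(url)
# 'lxml' requires `pip install lxml`; the built-in 'html.parser' also works
soup = BeautifulSoup(response.text, 'lxml')
```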

Now let’s roll up our sleeves and dig into the HTML of the webpage with your browser’s developer tools. (In Google Chrome, right-click and select “Inspect”; in Apple’s Safari, Ctrl + click and select “Inspect Element” (you may need to first enable the “Develop” menu under Preferences > Advanced).) Using the element selection tool, you can quickly highlight an element on the page and identify the corresponding HTML tags.

Using the .find() method on our soup object, we can grab the first HTML tag that matches certain criteria. In this case, we want the first <table> tag.
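With the soup object from the previous step:

```python
table = soup.find('table')  # first <table> in the document, or None if there isn't one
```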

Digging into the HTML further, we notice that the first row (<tr>) sits within the table head (<thead>) and holds the titles of each column. Then, in the table body (<tbody>), each row represents one player, with table cell tags (<td>) for each value associated with that player. We can use .find_all() to select all the rows in the table and iterate through them, pulling out whatever elements we want from each row.
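A sketch of that iteration, building on the table variable from above:

```python
rows = table.find('tbody').find_all('tr')  # one <tr> per player
for row in rows:
    ...  # pull out the values we care about (see below)
```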

Let’s walk through the first assignment: grabbing each player’s name.

In this case, each row contains a <td> tag with the attribute class=“name”. Inside that tag is the player’s name (i.e. Dustin Johnson in the first row) as text. Using .find() as above, we specify the name of the tag (‘td’) along with any identifying attributes as a dictionary, with the attribute type as the key, e.g. {‘class’: ‘name’}. If there is a matching tag, we can then pull out its contents, in this case via its text.
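Inside the loop, that first assignment might look like this (with .strip() added to trim surrounding whitespace):

```python
name = row.find('td', {'class': 'name'}).text.strip()  # e.g. 'Dustin Johnson'
```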

We can repeat the process for any information we want to grab, including, in this case, country and rank. Once we have the entire table as a list of dictionaries, we can use pandas to easily transform the data into a dataframe and export it in a structured format.
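Putting it all together, a sketch of the full loop and export might look like this. The ‘name’ class comes from the walkthrough above, but the ‘rank’ and ‘ctry’ class names are assumptions about the page’s markup, so inspect the actual HTML before relying on them:

```python
players = []
for row in table.find('tbody').find_all('tr'):
    players.append({
        'rank': row.find('td', {'class': 'rank'}).text.strip(),     # assumed class name
        'name': row.find('td', {'class': 'name'}).text.strip(),
        'country': row.find('td', {'class': 'ctry'}).text.strip(),  # assumed class name
    })

df = pd.DataFrame(players)
df.to_csv('owgr_top_300.csv', index=False)  # export to a structured format
```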

As you might have gathered from this walkthrough, although the workflow is similar from site to site, the exact steps can vary greatly because webpage layouts vary widely! And if a page is redesigned, your script will likely need to be rewritten.

But have no fear; this is where your creativity comes in! So go out there, pick a suitable website, and get scraping!

P.S. You can find this example script and more on my GitHub repo.
