Scraping multiple pages with Workbench

Workbench
May 28, 2019 · 8 min read

For some time it’s been possible to do simple web scraping with Workbench. Today we’re announcing several features for more complex scrapes.


This post will teach you how to:

  • Scrape results which span multiple pages
  • Monitor a site for changes and get notified when new data is available
  • Scrape information which is not formatted in HTML tables
  • Scrape the URLs from links, not just the text

Figuring out which scraping technique you need

Web pages are designed for humans to read, not for computers to process data. Every web page is a beautiful flower, coded just a little bit differently than the others, which means that there is no one-size-fits-all scraping method. This tutorial will walk you through identifying the right scraping tool for the job and putting it to work.

1. Scraping one HTML Table

If you’re lucky, your page will present the data you want using HTML’s built-in table format, the <table> tag. Even if you don’t know how the page you want to scrape is coded, Scrape Table is the simplest scraping method to try first.

Let’s use it to scrape the list of recent audit reports from the California state government.

  1. Create a new workflow and choose Scrape Table as your data source.
  2. Paste the URL of the page into the URL field
  3. Press Scrape
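Outside Workbench, the same one-table scrape can be sketched in a few lines of Python with pandas. This is a hedged illustration, not how Workbench is implemented; the inline HTML is a stand-in for a real page, and with a live page you would pass its URL to read_html instead.

```python
import io
import pandas as pd

# Stand-in for a page containing one <table>; with a real page you would
# pass its URL to pd.read_html instead of this string.
page = io.StringIO("""
<table>
  <tr><th>Report</th><th>Date</th></tr>
  <tr><td>Audit 2019-101</td><td>2019-05-01</td></tr>
  <tr><td>Audit 2019-102</td><td>2019-05-15</td></tr>
</table>
""")

tables = pd.read_html(page)  # one DataFrame per <table> found on the page
df = tables[0]               # "position on page" is just an index into this list
print(df)
```

Because read_html returns one DataFrame per table it finds, picking a later table is simply a matter of indexing further into the list.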

You can see the completed workflow here.


If there is more than one table on the page, you may need to set the table’s position on the page to 2 or higher. You may also find tables where the column names incorrectly appear as the first row of data, which you can fix by selecting the First row is header option.

Many simple web pages publish tables that can be scraped using this method. It works for most tables on Wikipedia, and even large tables like the list of housing permits published by the City of San Jose.

Monitoring a page for changes

Scraping is often used for monitoring. In the page above, we’re interested in knowing when a new report is published.

Like other data sources on Workbench, you can set your scraper to Auto in order to run it on a schedule.

  • If new data is found, it will be automatically saved as a new version.
  • You can choose to have an email notification sent to you.
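Under the hood, change detection amounts to comparing a new scrape with the previous one. Here is a minimal sketch in Python; the fingerprint approach is an assumption for illustration, not how Workbench actually stores versions.

```python
import hashlib

def fingerprint(html: str) -> str:
    """Reduce a scraped page to a short hash so scrapes can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_new_data(previous: str, html: str) -> bool:
    """True when the latest scrape differs from the stored fingerprint."""
    return fingerprint(html) != previous

# On each scheduled run: scrape, compare, and only save/notify on a change.
old = fingerprint("<table><tr><td>Report 1</td></tr></table>")
changed = has_new_data(old, "<table><tr><td>Report 1</td><td>Report 2</td></tr></table>")
print(changed)
```

A scheduler that runs this comparison after every fetch only needs to keep the previous hash, not the whole page.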

2. When Scrape Table doesn’t work, try XPath

Sometimes the information on a page might look like a table, but the HTML might be coded in some different way. For scraping purposes, we need to deal with how the page is coded, not the way it looks. If you like, you can see how a page is coded by using the developer tools built into your browser.

Let’s suppose we want to scrape the University of British Columbia’s course listings, which look like this:


If the data on your page is not within an HTML <table> element, Scrape Table will tell you:


More advanced scraping jobs require two steps: Scrape HTML downloads and saves the HTML source code of one or more pages, which contains all the data we want (and more). HTML to Table then extracts the actual data we’re interested in.

To scrape our page:

  1. Start with Scrape HTML, enter the URL, and press Scrape. The result of this step is a single row containing the complete HTML source for that page (as well as some other information we’ll get to below).

  2. Add an HTML to Table step and choose the XPath selectors method.


XPath selectors

Workbench needs you to specify what content from the page should go in each column of your new table. Each column is defined by an “XPath selector” which is a short piece of code written in a special language designed for selecting parts of web pages.


XPath is a language based on the structure of the page’s HTML, which you can view using your browser’s developer tools. You can learn how to do all sorts of amazing things with XPath, but this can get complex.

For selecting content on a webpage, there is a simpler way: the Selector Gadget browser extension lets you click on the elements you want in your browser, and automatically writes the necessary XPath selectors for you.


Watch this short video to learn how to use Selector Gadget, or follow the instructions below it.

  • After installing the extension and activating it on our page (in Chrome, press the little magnifying glass icon in the upper right), we can click on one of the course titles to select the elements we want.
  • Once all the elements you want to get are selected in green or yellow, click on the XPath button to get the code that Workbench needs to extract these elements. In this case, the XPath for the title is //b.
  • Paste it back into Workbench, in the XPath selector field. Name the column Title.
  • Click the play button to extract this information from the previously scraped HTML.

You can scrape the course descriptions in the same way. Click +ADD to create a new column, then repeat the above steps. You should end up with an XPath of //dd for the descriptions.
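To see what those two selectors actually match, here is a sketch using lxml in Python. The HTML is a made-up fragment shaped like the course page (titles in <b>, descriptions in <dd>), not the real UBC markup.

```python
from lxml import html

# Hypothetical fragment shaped like the course page: titles in <b>,
# descriptions in <dd>.
page = html.fromstring("""
<dl>
  <dt><b>CPSC 110</b></dt><dd>Computation, Programs, and Programming.</dd>
  <dt><b>CPSC 121</b></dt><dd>Models of Computation.</dd>
</dl>
""")

titles = page.xpath("//b/text()")         # the Title column
descriptions = page.xpath("//dd/text()")  # the Description column
print(list(zip(titles, descriptions)))
```

Each selector yields one list of matches, which is exactly how HTML to Table builds a column per selector.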

You can see this scrape in action here.


3. Scraping multiple pages

Suppose we want to scrape the list of audit reports from the Colorado state government. It’s possible to isolate the titles, release dates, report numbers, and so on using Selector Gadget.


But there’s a problem: this list spans multiple pages, which are accessed using the Next button at the bottom. If we paste the URL into Scrape HTML we will get only the first page of results.

Fortunately, it’s possible to scrape multiple pages using Scrape HTML, if the page number appears in the URL.

  • Paste the URL without the page number into Scrape HTML.
  • Select the Series of numbered pages checkbox.

Notice that the first page number is zero by default, not one. Like many sites (but not all), leg.colorado.gov counts pages from zero, so when you go to the second page the URL ends with page=1.

  • We’ll scrape the data from pages 0 to 9 (for now, the limit is 10 pages).
  • Press Scrape. Workbench will download the HTML file for each page (this can take a few minutes) and produce a table with one row per page scraped. The columns show which URLs were scraped, when, whether any errors occurred (an HTTP status of “200” means the page was fetched successfully), and finally the raw HTML.
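What the Series of numbered pages option does can be pictured as generating one URL per page number and fetching each in turn. A sketch, with an illustrative base URL rather than the exact Colorado address:

```python
# Illustrative base URL; the real site appends a "page" query parameter
# that counts from zero.
base = "https://example.gov/audit-search?page={}"

# Pages 0 through 9 — the ten-page limit mentioned above.
urls = [base.format(n) for n in range(10)]
print(urls[0])
print(urls[-1])
```

Each of these URLs would then be downloaded, giving one row of raw HTML per page.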

We still need HTML to Table to extract the information we want from all this raw HTML source code.

Use Selector Gadget to set up each column you want to scrape, and HTML to Table will automatically combine the results from all pages.

Here is the completed workflow.


4. Extracting link URLs with XPath

XPath is so flexible you may want to use it even when the simpler Scrape Table method works. For example, the XPath method can extract the URLs from links, while Scrape Table only extracts the link text.

Let’s return to our colorado.gov scrape.

Each title links to a page with a summary of each report and a download button. Let’s collect all the links to the download pages, not just the title text. If you’ve already got the link text extracted, this requires only one small change to the XPath selector.

  • Copy the XPath from the Title column, paste it into a new column, and add /@href to it.

Note that the extracted URLs are relative to the page they were scraped from, meaning that they look like /audits/division-youth-services-reporting. To turn them into a URL that you can use in a browser or feed to another Scrape HTML step, we need to prepend https://leg.colorado.gov to each link. You can do so with a Formula step.
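The prepending that the Formula step performs is the standard way of resolving a relative URL against its site. In Python it looks like this; urljoin handles the joining correctly whether or not the base has a trailing slash.

```python
from urllib.parse import urljoin

# A relative URL as extracted by the /@href selector.
relative = "/audits/division-youth-services-reporting"

# Resolve it against the site to get a browser-usable absolute URL.
absolute = urljoin("https://leg.colorado.gov", relative)
print(absolute)
```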


You can see these steps all put together in final form, extracting the links from the first ten pages.


A Growing Toolkit

With the set of tools described in this post you will be able to scrape many different kinds of sites, especially if you invest some time learning how to read the structure of HTML and how XPath works. But there are cases where more advanced tools are needed, such as sites which require you to submit a form or a search query before they show results.

We are building out a suite of progressively more powerful scraping tools in Workbench, and you can help: send us links to pages you need to scrape, with just a few words to give us context.

We’d love to hear about your scraping problems!
