Scraping multiple pages with Workbench
For some time it’s been possible to do simple web scraping with Workbench. Today we’re announcing several features for more complex scrapes.
This post will teach you how to:
- Scrape results which span multiple pages
- Monitor a site for changes and get notified when new data is available
- Scrape information which is not formatted in HTML tables
- Scrape the URLs from links, not just the text
Figuring out which scraping technique you need
Web pages are designed for humans to read, not for computers to process data. Every web page is a beautiful flower, coded just a little bit differently than the others, which means that there is no one-size-fits-all scraping method. This tutorial will walk you through identifying the right scraping tool for the job and putting it to work.
1. Scraping one HTML Table
If you’re lucky, your page will present the data you want using the built-in table features of HTML (the <table> tag). Even if you don’t know how the page you want to scrape is coded, Scrape Table is the simplest scraping method to try first.
Let’s use it to scrape the list of recent audit reports from the California state government, at https://www.bsa.ca.gov/reports/recent.
- Create a new workflow and choose Scrape Table as your data source.
- Paste the URL of the page into the URL field https://www.bsa.ca.gov/reports/recent
- Press Scrape
You can see the completed workflow here.
If there is more than one table on the page, you may need to set the table’s position on the page to 2 or higher. You may also find tables where the column names incorrectly appear as the first row of data, which you can fix by selecting the option First row is header.
Many simple web pages publish tables that can be scraped using this method. It works for most tables in Wikipedia, and even large tables like this list of housing permits published by the City of San Jose.
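Under the hood, this kind of table extraction boils down to locating the `<table>` rows and cells in the page’s HTML. Here is a minimal sketch in Python, using only the standard library on a small, well-formed HTML fragment (the markup below is invented for illustration; Workbench’s own parser handles real-world, messier HTML):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed <table> fragment (invented for illustration).
html = """
<table>
  <tr><th>Report</th><th>Date</th></tr>
  <tr><td>Audit 2019-101</td><td>2019-05-01</td></tr>
  <tr><td>Audit 2019-102</td><td>2019-06-12</td></tr>
</table>
"""

root = ET.fromstring(html)
rows = root.findall(".//tr")

# The first row holds the column names -- the "First row is header" case.
header = [cell.text for cell in rows[0]]
data = [[cell.text for cell in row] for row in rows[1:]]

print(header)  # ['Report', 'Date']
print(data)    # [['Audit 2019-101', '2019-05-01'], ['Audit 2019-102', '2019-06-12']]
```

This is only a sketch of the idea: real pages often wrap rows in `<thead>`/`<tbody>`, nest markup inside cells, or aren’t well-formed XML at all, which is exactly why a dedicated tool like Scrape Table is handy.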
Monitoring a page for changes
Scraping is often used for monitoring. In the page above, we’re interested in knowing when a new report is published.
Like other data sources on Workbench, you can set your scraper to Auto in order to run it on a schedule.
- If new data is found, it will be automatically saved as a new version.
- You can choose to have an email notification sent to you.
2. When Scrape Table doesn’t work, try XPath
Sometimes the information on a page might look like a table, but the HTML might be coded in some different way. For scraping purposes, we need to deal with how the page is coded, not the way it looks. If you like, you can see how a page is coded by using the web inspector built into your browser.
Let’s suppose we want to scrape the University of British Columbia’s anthropology course calendar, which looks like this:
If the data on your page is not within an HTML <table> element, Scrape Table will tell you:
More advanced scraping jobs require two steps: Scrape HTML downloads and saves the HTML source code of one or more pages, which contains all the data (and more). HTML to Table then extracts the actual data we’re interested in.
To scrape our anthropology course calendar:
- Start with Scrape HTML, enter the URL, and press Scrape. The result of this step is a single row containing the complete HTML source for that page (as well as some other information we’ll get to below).
- Add an HTML to Table step and choose the XPath selectors method.
Workbench needs you to specify what content from the page should go in each column of your new table. Each column is defined by an “XPath selector” which is a short piece of code written in a special language designed for selecting parts of web pages.
XPath is a language based on the tree structure of HTML, which you can view using your web browser’s inspector tool. You can learn how to do all sorts of amazing things with XPath here, but this can get complex.
For selecting content on a webpage, there is a simpler way: The SelectorGadget browser extension lets you click on the elements you want in your browser, and automatically writes the necessary XPath selectors for you.
- If you use Chrome, you can install the SelectorGadget browser extension.
- On other browsers, you can install it as a bookmark by following these instructions.
Watch this short video to learn how to use SelectorGadget, or follow the instructions below it.
- After installing the extension and activating it on our anthropology course calendar (in Chrome, press the little magnifying glass icon in the upper right) we can click on one of the course titles to select the elements we want.
- Once all the elements you want to get are selected in green or yellow, click on the XPath button to get the code that Workbench needs to extract these elements. In this case, the XPath for the title is //b.
- Paste it back into Workbench, in the XPath selector field. Name the column Title.
- Click the play button to extract this information from the previously scraped HTML.
Scrape the course descriptions in the same way: click +ADD to create a new column, then repeat the steps above. You should end up with an XPath of //dd for the descriptions.
You can see this scrape in action here.
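The extraction that HTML to Table performs with those selectors can be sketched in plain Python. The snippet below uses the standard library’s ElementTree, which supports a subset of XPath (so //b is written as .//b), on a simplified, invented stand-in for the course calendar’s markup; the real page is more complex:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for the course calendar's HTML
# (invented for illustration -- the real page's markup differs).
html = """
<dl>
  <dt><b>ANTH 100 Introduction to Anthropology</b></dt>
  <dd>An overview of the discipline.</dd>
  <dt><b>ANTH 201 Ethnographic Methods</b></dt>
  <dd>Fieldwork techniques and research design.</dd>
</dl>
"""

root = ET.fromstring(html)

# ElementTree's findall() takes a limited XPath: //b becomes .//b here.
titles = [el.text for el in root.findall(".//b")]        # the Title column
descriptions = [el.text for el in root.findall(".//dd")]  # the Description column

print(titles)
print(descriptions)
```

Each selector yields one list of matching elements, and each list becomes one column of the output table, which is exactly the mental model to keep when writing your own selectors.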
3. Scraping multiple pages
Suppose we want to scrape this list of audit reports from the Colorado state government, at https://leg.colorado.gov/audit-search. It’s possible to isolate the titles, release date, report numbers, and so on using Selector Gadget.
But there’s a problem: this list spans multiple pages, which are accessed using the Next button at the bottom. If we paste https://leg.colorado.gov/audit-search into Scrape HTML we will get only the first page of results.
Fortunately, it’s possible to scrape multiple pages using Scrape HTML, if the page number appears in the URL.
- Go to https://leg.colorado.gov/audit-search and press Next at the bottom of the page.
- Notice that the URL changes to https://leg.colorado.gov/audit-search?page=1
- Paste the URL without the page number into Scrape HTML.
- Select the Series of numbered pages checkbox.
Notice that the first page number is zero by default, not one. Like many sites (but not all), leg.colorado.gov counts pages from zero, so when you go to the second page the URL ends with page=1.
- We’ll scrape the data from pages 0 through 9 (for now, the limit is 10 pages).
- Press Scrape. Workbench will download the HTML for each page (this can take a few minutes) and produce a table with one row per page scraped. The columns show which URL was scraped, when, whether any errors occurred (an HTTP status of “200” means success), and finally the raw HTML.
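The series of URLs that a numbered-pages scrape visits can be sketched in a couple of lines. This is only an illustration of the URL pattern (nothing is downloaded here, and Workbench builds these URLs internally):

```python
# Build the ten URLs for a zero-based numbered-pages scrape, as on
# leg.colorado.gov. Sketch only: no pages are actually fetched.
base_url = "https://leg.colorado.gov/audit-search"

urls = [f"{base_url}?page={n}" for n in range(0, 10)]

print(len(urls))  # 10
print(urls[0])    # https://leg.colorado.gov/audit-search?page=0
print(urls[-1])   # https://leg.colorado.gov/audit-search?page=9
```

If a site counted pages from one instead, the range would start at 1; checking what the URL looks like after you press Next (as in the steps above) tells you which convention the site uses.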
We still need HTML to Table to extract the information we want from all this raw HTML source code.
Use Selector Gadget to set up each column you want to scrape, and HTML to Table will automatically combine the results from all pages.
Here is the completed workflow.
4. Extracting link URLs with XPath
XPath is so flexible that you may want to use it even when the simpler Scrape Table method works. For example, the XPath method can extract the URLs from links, while Scrape Table only extracts the link text.
Let’s return to our colorado.gov page.
Each title links to a page with a summary of each report and a download button. Let’s collect all the links to the download pages, not just the title text. If you’ve already got the link text extracted, this requires only one small change to the XPath selector.
- Copy the XPath from the Title column, paste it into a new column, and append /@href to it.
Note that the extracted URLs are relative paths, meaning that they look like /audits/division-youth-services-reporting. To turn them into URLs that you can use in a browser or feed to another Scrape HTML step, we need to prepend https://leg.colorado.gov to each link. You can do so with a Formula step.
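Prepending the site root is the same operation as resolving a relative URL against a base, which Python’s standard library does with urllib.parse.urljoin. A minimal sketch (the second path below is an invented example):

```python
from urllib.parse import urljoin

base = "https://leg.colorado.gov"
relative_links = [
    "/audits/division-youth-services-reporting",
    "/audits/some-other-report",  # invented example path
]

# Resolve each relative path against the site root.
absolute_links = [urljoin(base, link) for link in relative_links]

print(absolute_links[0])
# https://leg.colorado.gov/audits/division-youth-services-reporting
```

Simple string concatenation also works for root-relative paths like these; urljoin has the advantage of also handling paths relative to the current page correctly.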
You can see these steps all put together in final form, extracting the links from the first ten pages, in this live workflow.
A Growing Toolkit
With the set of tools described in this post you will be able to scrape many different kinds of sites, especially if you invest some time learning how to read the structure of HTML and how XPath works. But there are cases where more advanced tools are needed, such as sites which require you to submit a form or a search query before they show results.
We are building out a suite of progressively more powerful scraping tools in Workbench — and you can help: Get in touch, and send us links to pages you need to scrape, with just a few words to give us context.
We’d love to hear about your scraping problems!