Monitoring data sources to find stories with Workbench

Workbench
Jun 28, 2018


There’s a lot of data out there. Online, offline, in the public cloud or in the private cloud, locked in PDFs, and of course on social media. There are a lot of formats too, and so-called ‘standards’ fighting for dominance: API protocols, query languages, database types and file formats.

For many skilled journalists, the investigative process often starts with a few technical questions: How do I access the data? When is it updated? What tools should I use to clean it, analyze it and make a chart? Not every newsroom has a data team the size of The Upshot, and developers are overwhelmed with requests.

Workbench is a platform designed to help journalists find stories in data, without coding.

In this post, we’ll take a look at how you can use Workbench’s modular environment to access and monitor data. First, we’ll load data from a government API to create an embeddable chart that updates when its data source changes. Then, we’ll build a custom scraper that sends you an email notification every time a new audit report is published on a specific web page.

Create your account here.

I. Build a live chart using governmental data

Every quarter, San Francisco publishes a snapshot of all the buildings currently in the process of development, from initial proposal to completion. The city is going through a notorious housing crisis, so let’s use this data to visualize how many affordable and market-rate housing units are currently in the planning process.

This is what you’re going to build: https://app.workbenchdata.com/workflows/1961/

Connect to a database via a public API

  • Add the module Add from URL.
  • In a new tab, open the dataset page and copy the API endpoint URL, as shown here:
Access the database using the API endpoint’s URL
Each row is a project going through the city’s planning process
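Outside Workbench, the Add from URL step amounts to fetching a tabular file from an endpoint. A rough pandas sketch of the same idea follows; the inline sample stands in for the API response, and the column names are assumptions based on the steps in this tutorial, not the dataset’s real schema.

```python
import io
import pandas as pd

# In practice you would pass the API endpoint URL directly, e.g.:
#   df = pd.read_csv("https://data.sfgov.org/resource/<dataset-id>.csv")
# Here a small inline sample stands in for the API response.
sample_csv = io.StringIO("""Neighborhood,market_rate_units,affordable_units
Mission,120,30
Mission,80,20
SoMa,200,50
""")

df = pd.read_csv(sample_csv)
print(df.shape)  # → (3, 3): one row per project in the planning pipeline
```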

Group projects by neighborhood, and calculate the sum of affordable and market-rate units for each.

  1. Add the module Group below Add from URL.
  2. Choose the column Neighborhood to group by.
  3. Under Operation, select SUM and apply it to the column market_rate_units.
  4. Add a second operation, select SUM and apply it to the column affordable_units.
Projects are now grouped by neighborhoods
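The Group module performs a standard group-and-aggregate. As a rough pandas equivalent of steps 1–4 above (the project values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for the city's dataset.
projects = pd.DataFrame({
    "Neighborhood": ["Mission", "Mission", "SoMa"],
    "market_rate_units": [120, 80, 200],
    "affordable_units": [30, 20, 50],
})

# Group by Neighborhood and SUM both unit columns.
totals = (
    projects
    .groupby("Neighborhood", as_index=False)[["market_rate_units", "affordable_units"]]
    .sum()
)
print(totals)
```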

Create a chart

  • Add the module Column Chart using the search bar. The first column will be interpreted as the X axis, and all other columns as series.

Monitor changes in the source data

The city publishes a new dataset every quarter, without providing an exact date. Fortunately, you can automate the workflow to check the source periodically, send you an email if new data has been found, and update the chart.

  • Back in the Add from URL module, click Manual to open the Workflow Update dialog. Turn the feature ON, set the frequency to 1 day, and check the option to receive email notifications.

Your chart is live! Every time new data is found, the chart will be updated and you will receive an email.
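Workbench handles the polling and diffing for you; conceptually, each check boils down to re-fetching the source and comparing it against the last version seen. A minimal sketch of that idea, with inline strings standing in for real HTTP responses:

```python
import hashlib

def fingerprint(payload: bytes) -> str:
    """Hash a fetched payload so two versions can be compared cheaply."""
    return hashlib.sha256(payload).hexdigest()

# Two simulated fetches of the quarterly dataset (stand-ins for HTTP responses).
last_seen = fingerprint(b"Q1 snapshot of the development pipeline")
latest = fingerprint(b"Q2 snapshot of the development pipeline")

if latest != last_seen:
    # This is where the chart would refresh and the email would go out.
    print("New data found: update chart and notify by email")
```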

II. Create a custom scraper to monitor when a document is published online

The City Auditor’s Office of the City of Atlanta regularly publishes audit reports on its website. Those reports are instrumental to understanding the state of the city’s administration, but they are not always announced.

Let’s scrape the existing list of published reports and set up an alert to receive an email notification if that list ever changes.

Choose the web page to scrape

http://www.atlaudit.org/audit-reports.html

Find the XPath selector to target the list

You need an “XPath selector” to tell Workbench which parts of the web page to scrape. Modern web browsers include “Inspector” tools to help.

  • In a new browser tab, scroll down to the “list of all publications released by the City Auditor’s Office”. Right-click on the first link in that list, and choose Inspect in the menu to open the Inspector panel.
Right click and ‘inspect’ a link to open the inspector
  • In the Inspector panel, the <a> element corresponding to the link you are inspecting is highlighted. Right-click that element and choose Copy and then Copy XPath.
Right click an element to copy its XPath selector
  • Back in Workbench, paste the XPath selector you just copied into the module, and click Scrape.
Result after scraping using the XPath selector of the first element in the list

You successfully scraped one link from the list. To monitor whether the list changes over time, we now need to scrape every link in the list.

Edit the XPath selector to scrape the entire list

Let’s look closely at the HTML element in the browser’s inspector:

In HTML, elements are nested within each other

The <a> tag we selected is nested within a <li> parent, which is itself nested within a <ul>.

Now look closely at the XPath selector:

//*[@id="wsite-content"]/div[3]/ul/li[1]/a

ul/li[1] means “select the first <li> inside the <ul>”. If we delete [1], all the <li> elements within the parent <ul> will be selected, and with them every <a> tag within each <li>.

  • Edit the XPath you copied into the module as follows, in order to select all the items in the list:
//*[@id="wsite-content"]/div[3]/ul/li/a
  • Click Scrape
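The effect of dropping [1] can be checked outside the browser. The snippet below runs both selectors against a tiny stand-in for the page’s markup using Python’s standard-library ElementTree, which supports this subset of XPath; the real page is more complex, so the paths here are simplified.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the audit-reports page (hypothetical file names).
html = """
<div id="wsite-content">
  <ul>
    <li><a href="report-2018.pdf">Audit 2018</a></li>
    <li><a href="report-2017.pdf">Audit 2017</a></li>
    <li><a href="report-2016.pdf">Audit 2016</a></li>
  </ul>
</div>
"""
root = ET.fromstring(html)

first_only = root.findall(".//ul/li[1]/a")  # li[1]: only the first item
all_links = root.findall(".//ul/li/a")      # no index: every item in the list

print(len(first_only), len(all_links))  # → 1 3
```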

Monitor for changes

Finally, let’s set up the module to monitor the web page and automatically scrape it when the list is updated.

In the module, click Manual to open the Workflow Update window. Turn the feature ON, choose a frequency, and check Email me when data changes.

Click Apply.

Your custom scraper is set! If elements are added to or deleted from the selected list, the new version will be scraped and Workbench will notify you by email.

These are just two examples of how Workbench can help you work with data without coding. Check out our integrated tutorial section to learn more by doing.

Integrated tutorial in Workbench


Workbench

The data journalism platform with built-in training. A project of Columbia Journalism School