There’s a lot of data out there. Online, offline, in the public cloud or in the private cloud, locked in PDFs, and of course on social media. A lot of formats too, and so-called ‘standards’ fighting for dominance: API protocols, query languages, database types and file formats.
For many skilled journalists, the investigative process often starts with a few technical questions: How do I access the data? When is it updated? What tools should I use to clean it, and which to analyze it and make charts? Not every newsroom has a data team the size of The Upshot’s, and developers are overwhelmed with requests.
Workbench is a platform designed to help journalists find stories in data, without coding.
In this post, we’ll take a look at how you can use Workbench’s modular environment to access and monitor data: First we’ll load data from a government API to create an embeddable chart that updates when its data source changes. Then, we’ll build a custom scraper that sends you email notifications every time a new audit report is published on a specific webpage.
Create your account here.
I. Build a live chart using governmental data
Every quarter, San Francisco publishes a snapshot of all the buildings currently in the process of development, from initial proposal to completion. The city is going through a notorious housing crisis, so let’s use this data to visualize how many affordable and market-rate housing units are currently in the planning process.
This is what you’re going to build: https://app.workbenchdata.com/workflows/1961/
Connect to a database via a public API
- Add the module ‘Add from URL’.
- In a new tab, open the dataset page (https://data.sfgov.org/Housing-and-Buildings/Affordable-Housing-Pipeline/aaxw-2cb8) and copy the API endpoint URL.
- Paste the URL into the module.
- Click ‘Update’ to load the data. You are now connected to the City’s database!
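If you ever want to reproduce this step in code, here is a minimal sketch. It assumes the conventional Socrata CSV-export URL pattern (`/api/views/<dataset-id>/rows.csv`), which is our assumption, not something Workbench requires — the endpoint shown on the dataset page is the authoritative one:

```python
# Sketch: fetching the same dataset outside Workbench.
# Assumes the common Socrata convention that a dataset page such as
# https://data.sfgov.org/Housing-and-Buildings/Affordable-Housing-Pipeline/aaxw-2cb8
# exposes a CSV export at /api/views/<dataset-id>/rows.csv.
from urllib.request import urlopen


def socrata_csv_url(domain: str, dataset_id: str) -> str:
    """Build the CSV-export URL for a Socrata dataset."""
    return f"https://{domain}/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD"


url = socrata_csv_url("data.sfgov.org", "aaxw-2cb8")
# csv_bytes = urlopen(url).read()  # uncomment to actually download the data
```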
Group projects per neighborhood, and calculate the sum of affordable and market-rate units for each of them.
- Add the module ‘Group’ below ‘Add from URL’.
- Choose the column ‘Neighborhood’ to group by.
- Under Operation, select ‘SUM’ and apply it to the column ‘market_rate_units’.
- Add a second operation, select ‘SUM’ and apply it to the column ‘affordable_units’.
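Under the hood, this step is a classic group-and-aggregate. A minimal Python sketch of the same operation, run on a few hypothetical sample rows (the column names match the dataset; the values are made up):

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical sample rows with the same columns as the SF dataset.
sample = StringIO("""\
Neighborhood,market_rate_units,affordable_units
Mission,120,40
Mission,80,25
SoMa,300,60
""")

# Group by Neighborhood and sum both unit columns, like the Group module does.
totals = defaultdict(lambda: {"market_rate_units": 0, "affordable_units": 0})
for row in csv.DictReader(sample):
    for col in ("market_rate_units", "affordable_units"):
        totals[row["Neighborhood"]][col] += int(row[col])

print(totals["Mission"])  # {'market_rate_units': 200, 'affordable_units': 65}
```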
Create a chart
- Add the module ‘Column Chart’ using the search. The first column will be interpreted as the X axis, and all other columns as series.
Monitor changes in the source data
The city publishes a new dataset every quarter, without providing an exact date. Fortunately, you can automate the workflow to check the source periodically, send you an email when new data is found, and update the chart.
- Back in the ‘Add from URL’ module, click ‘manual’ to open the Workflow Update dialog. Turn the feature on, set the frequency to ‘1 day’, and check the option to receive email notifications.
Your chart is live! Every time new data is found, the chart will be updated and you will receive an email.
II. Create a custom scraper to monitor when a document is published online
The City Auditor’s Office of the City of Atlanta regularly publishes audit reports on its website. Those reports are instrumental to understanding the state of the city’s administration, but their publication is not always announced.
Let’s scrape the existing list of published reports and set up an alert to receive an email notification if that list ever changes.
Choose the web page to scrape
- Create a new workflow and add the module ‘Scrape with XPath’.
- Paste in the following URL, where all the public reports are listed: http://www.atlaudit.org/audit-reports.html
Find the XPath selector to target the list
You need an “XPath selector” to tell Workbench which parts of the web page to scrape. Modern web browsers include “Inspector” tools to help.
- In a new browser tab, scroll down to the “list of all publications released by the City Auditor’s Office”. Right-click on the first link in that list, and choose Inspect in the menu to open the Inspector panel.
- In the Inspector panel, the <a> element corresponding to the link you are inspecting is highlighted. Right-click that element and choose Copy, then Copy XPath.
- Back in Workbench, paste the XPath selector you just copied into the module, and click ‘Scrape’.
You successfully scraped one link from the list. To monitor whether the list changes over time, we now need to scrape every link in it.
Edit the XPath selector to scrape the entire list
Let’s look closely at the HTML element in the browser’s inspector:
The <a> tag we selected is nested within a <li> parent, which is itself nested within a <ul>.
Now look closely at the XPath selector:
//*[@id="wsite-content"]/div[3]/ul/li[1]/a
ul/li[1] means “select the first <li> inside the <ul>”. If we delete [1], all the <li> elements within the parent <ul> will be selected, and with them every <a> tag within each <li>.
- Edit the XPath selector in the module as follows, in order to select all the items in the list:
//*[@id="wsite-content"]/div[3]/ul/li/a
- Click ‘Scrape’.
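You can check the same selector logic in plain Python. The sketch below runs both variants of the XPath against a simplified, made-up stand-in for the page’s structure (the link names and hrefs are hypothetical). Python’s built-in `xml.etree` supports only a subset of XPath, which happens to cover this selector; real scrapers typically use lxml for full XPath support:

```python
import xml.etree.ElementTree as ET

# A simplified, made-up stand-in for the audit-reports page structure:
# a <ul> of report links inside the third <div> of #wsite-content.
html = """
<html><body>
  <div id="wsite-content">
    <div>header</div>
    <div>intro</div>
    <div>
      <ul>
        <li><a href="/report-2019.pdf">Report 2019</a></li>
        <li><a href="/report-2020.pdf">Report 2020</a></li>
        <li><a href="/report-2021.pdf">Report 2021</a></li>
      </ul>
    </div>
  </div>
</body></html>
"""
root = ET.fromstring(html)

# ElementTree uses './/' where full XPath would use a leading '//'.
first = root.findall(".//*[@id='wsite-content']/div[3]/ul/li[1]/a")
all_links = root.findall(".//*[@id='wsite-content']/div[3]/ul/li/a")

print(len(first), len(all_links))  # 1 3
```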
Monitor for changes
Finally, let’s set up the module to monitor the webpage and automatically scrape it when the list is updated.
- In the module, click ‘manual’ to open the Workflow Update window. Turn the feature ON, choose a frequency, and check ‘Email me when data changes’.
- Click ‘Apply’.
Your custom scraper is set! If new elements are published or deleted within the selected list, the new version will be scraped and Workbench will notify you by email.
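Change detection of this kind can be as simple as hashing the scraped list of links and comparing it with the previous run. A minimal sketch of that idea (the sample link lists are made up, and the email step is left as a comment; this illustrates the general technique, not Workbench’s actual implementation):

```python
import hashlib


def fingerprint(links: list[str]) -> str:
    """Hash a scraped list of links so two runs can be compared cheaply.

    Sorting first makes the comparison insensitive to ordering changes.
    """
    joined = "\n".join(sorted(links))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def has_changed(previous: list[str], current: list[str]) -> bool:
    return fingerprint(previous) != fingerprint(current)


# Hypothetical link lists from two consecutive scrapes.
old = ["/report-2019.pdf", "/report-2020.pdf"]
new = ["/report-2019.pdf", "/report-2020.pdf", "/report-2021.pdf"]

if has_changed(old, new):
    # In a real monitor, send the notification email here.
    print("list changed")
```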
These are just two examples of how Workbench can help you work with data without coding. Check out our integrated tutorial section to learn more by doing.