What Workbench can do for data

Workbench
5 min readMay 8, 2018

We are building Workbench to be a simpler, more powerful way to work with data, especially for journalists. Data journalism involves steps such as getting the data, cleaning it, and visualizing it. It also requires many other tasks that might be less obvious, like checking for updated data, scraping tables out of web pages, and sending a completed story to your editor for review.

There are tools for each of these; there is no tool that does all of these together, without requiring any coding — and automatically records every step in a readable, reproducible, sharable “workflow.” Workbench is currently in private beta (sign up here!) but it already does some a lot of things. You can:

  • Scrape data from web pages
  • Clean data using an Open Refine-like interface
  • Monitor live data for changes
  • Use spreadsheet formulas and regular expressions
  • Export live charts and data
  • Build and share custom processing modules

Scrape data

A data tool that you can’t get data into isn’t much good, which is why Workbench has a variety of Add Data modules.

You can upload files, grab data from a public URL, load Google Sheets document, or just paste in data. But sadly, much of the world’s data is locked into web pages that were never designed to be machine-readable. If you’ve used Google Sheets, you may have used the ImportHTML command to read tables and lists off of web pages. Workbench Scrape Table module works exactly the same way:

Here the “4” means scrape the 4th table on the page, which is Wikipedia’s list of the longest rivers:

You can also use HTML Scraper to scrape a list of URLs, saving the HTML for each one. Full xpath support, like Google Sheets’ ImportXML, is coming soon.

Clean data

The data that journalists work with is messy. Really messy. It comes from literally everywhere and is full of all sorts of errors. If you’ve been working with data for a while, you’ve probably discovered Open Refine. It’s best known for its faceting and clustering features which make it easy to standardize values and fixed types. So, we made our own.

The Refine module shows a list of all values in a column, sorted by how many times each appears. It’s useful both for cleaning and filtering. You can include or exclude any value (perhaps we don’t want the blank rows) or edit all rows with the same value at the same time, by double clicking (we can change every “Seattle” to “Seattle -Tacoma.”)

Refine is the crown jewel of Workbench’s cleanup tools, which also include modules for filtering values, dropping empty rows and empty columns, splitting and merging columns, filling empty cells, or just directly editing values.

Monitor data sources for changes

Every Add Data module has an “update button” that checks for new data. Or, you can make Workbench check for you automatically.

Previously loaded data is automatically versioned and saved, so you can go back to it later — for example if the data you were using was deleted! And when Workbench finds that the data has changed, it can notify you by email.

Formulas and regular expressions

Maybe you’re a spreadsheet genius and you know all about how to use formulas to clean or calculate. Or maybe you’ve mastered the art of regular expressions to search through text or extract information. Or perhaps you’d prefer Python? Workbench supports them all.

Here’s an Excel formula to extract the first word of every value in the second column. In Workbench, formulas are automatically “filled” downward, so a formula creates an entire new column, not just one cell. The Formula module also supports Python syntax if that’s what you prefer.

And here’s the same thing using our Regular Expression Extractor:

Export live charts and data

Workbench currently has basic column and line charts you can embed live into your own pages. Just click on the embed icon (</>) to get code you can paste into your HTML pages, or your CMS.

This chart will automatically update whenever the workflow does, and reflect changes in the data data source, if you’ve set your Add Data module to do so. Of course, you can simply download a static PNG or SVG file.

In fact, every single module has a live API endpoint which serves CSV or JSON, which you can access from the menu on top right of each module. You can use this to feed the output of Workbench to other systems — perhaps use it to drive a live interactive visualization.

Build and share custom modules

Workbench is designed to get as much as possible done without coding. But there will always be places where only programming will go. You can always just paste in some code to create a custom processing step.

Code is better when it’s shared, and so you can package it up to create a custom module, and import it into the Module Library for others to use. This way, developers can support specific teams and projects, and contribute back to the Workbench community.

Custom modules can even produce custom visualizations! Think of Workbench as a package manager for user-friendly data processing tools.

Try it yourself

Workbench is currently in private beta, and you can sign up here. Or, join us as a developer to create your own modules. Either way, we hope you’ll share our excitement for this powerful new way to work with data.

--

--

Workbench

The data journalism platform with built-in training. A project of Columbia Journalism School