Beyond spreadsheets: Imagining the data journalism workflow of the future

The spreadsheet was a great innovation (in 1979), but there’s so much that spreadsheets don’t do. Today we’re happy to announce Workbench, our next-generation data journalism platform. Try out these demos to see how it works. For this first release, we’re concentrating on monitoring data for stories, and you can read more about that here. But in this post we want to talk about the big picture: what’s wrong with the way journalists work now, and how it could change for the better.

Create your account here.

So let’s imagine how a data journalist might get a story done in the post-spreadsheet era. The topics that data journalists cover may change, and we’re sure to see some amazing new interactive experiences, but the process of creating a story will always include the basic steps of getting data, cleaning it, and analyzing it. Ideally, journalists will also publish this process, to make their story transparent and reproducible. And of course, reporters will still need to learn how to work with data. We’ve re-imagined every one of these steps, and Workbench is our working prototype.

City budget, some time in the future

Meet Sue, a newly hired local reporter covering city politics. She’s just learned that this year’s city budget will be published next month (though it’s impossible to say exactly when, because City Hall is notoriously unpredictable). Her editor has asked for an interactive visualization of the new budget, perhaps a classic treemap.

An example city budget treemap (Seattle, 2014)

The only problem is, Sue hasn’t really worked with data before. Fortunately, there is a fantastic online training program that uses an integrated environment to teach data journalism. It’s kind of like the polished learn-to-code online courses of today, but it’s not a coding course because many data journalism tasks that used to require coding are now handled by more accessible and efficient tools. (Sue’s elders like to remind her that it used to require coding just to publish a page on the web!) It’s not spreadsheet training either. Instead, Sue learns to work in a new kind of data processing environment built around the dataframes used in professional software such as R and Pandas. The course teaches hard skills plus more abstract computational thinking, and includes tasks that are important in journalism such as scraping, cleaning, and monitoring data.

Workbench already includes an integrated training system that guides you step by step through using the software to accomplish specific tasks. This summer we are piloting a Workbench-based course at Columbia Journalism School, with the ultimate goal of developing the first end-to-end, self-taught data journalism training program.

Workbench includes integrated training

Long tail data monitoring

Sue is gaining confidence in her ability to work with data, but the new city budget isn’t out yet. What she needs is a notification when it’s released. There are monitoring services for popular national data, such as unemployment figures and campaign finance disclosures, but the long tail of millions of local government data sets isn’t covered. Fortunately, her data journalism platform offers a comprehensive suite of do-it-yourself monitoring tools. She’s able to set up an alert using a simple web scraper that watches for new data sets posted to the city government blog. (Even in the future, local government IT infrastructure is stuck in the past.)

Workbench can already monitor many kinds of data sources, such as streams from Twitter, your colleague’s shared Google Drive folder, or a flaky local government site. It even duplicates the functionality of Google Sheets’ IMPORTHTML and IMPORTXML to scrape tables and lists from pages. You can get an email alert when any data source changes, or set up custom filters to notify you only under certain conditions — like when the fire department posts three or more tweets within ten minutes. It’s a fully programmable monitoring tool, except you’ll never need to write any code.
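For readers curious what a filter like the fire department example amounts to underneath, here is a minimal sketch in plain Python. It is not how Workbench implements its filters (in Workbench you configure this without code), and the function name `burst_detected` and the timestamps are made up for illustration. It simply checks whether any ten-minute span contains three or more events.

```python
from datetime import datetime, timedelta


def burst_detected(timestamps, min_count=3, window=timedelta(minutes=10)):
    """Return True if at least `min_count` events fall within any span of `window`."""
    times = sorted(timestamps)
    start = 0
    for end in range(len(times)):
        # Move the left edge forward until the span from start to end fits the window.
        while times[end] - times[start] > window:
            start += 1
        # Count the events currently inside the window.
        if end - start + 1 >= min_count:
            return True
    return False


# Hypothetical tweet times: three within nine minutes, then one much later.
tweets = [
    datetime(2018, 5, 1, 14, 0),
    datetime(2018, 5, 1, 14, 4),
    datetime(2018, 5, 1, 14, 9),
    datetime(2018, 5, 1, 16, 30),
]

if burst_detected(tweets):
    print("Alert: three or more tweets within ten minutes")
```

The sliding window only ever moves forward, so the check stays fast even on a long stream of events.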

Workbench’s notification settings

The big day

One afternoon, Sue’s phone buzzes. The budget numbers are out! She downloads the new data and immediately discovers that it’s in the wrong format, full of typos and inconsistent entries that need to be standardized, and labeled with cryptic department abbreviations. This will need a lot of cleanup before it can become a visualization.

So she sets to work. After loading the data into her editing environment, she removes the information she doesn’t need, and reshapes the table from long to wide format. She then standardizes the ‘Expense’ column by using the Refine tool to do some bulk category editing. She cleans up typos and other errors, and groups related budget lines into larger categories that her readers will better understand. Then she does a “join” operation with a table that translates department abbreviations into full names. Finally, the data is clean enough to drop into a treemap.
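To make those steps concrete, here is a rough sketch of the same kinds of operations in plain Python, using only the standard library. Everything in it is an assumption for illustration: the column names, the sample rows, the `EXPENSE_FIXES` and `DEPT_NAMES` tables, and the `clean` function are invented, and the steps run in a single pass rather than in the order Sue performs them. It is not how Workbench does this internally.

```python
from collections import defaultdict

# Made-up rows in "long" format: one row per department, expense line, and year.
raw = [
    {"dept": "PD", "expense": "Salaries", "year": 2017, "amount": 100, "clerk": "jb"},
    {"dept": "PD", "expense": "salaries ", "year": 2018, "amount": 110, "clerk": "jb"},
    {"dept": "PD", "expense": "Overtime", "year": 2018, "amount": 15, "clerk": "mk"},
    {"dept": "FD", "expense": "Salareis", "year": 2018, "amount": 80, "clerk": "mk"},
]

# Bulk category edits: map each messy value to the label readers will see.
EXPENSE_FIXES = {"salaries": "Personnel", "salareis": "Personnel", "overtime": "Personnel"}

# Lookup table that translates department abbreviations into full names.
DEPT_NAMES = {"PD": "Police Department", "FD": "Fire Department"}


def clean(rows):
    """Standardize categories, expand department names, and reshape long to wide."""
    wide = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Standardize the category; unknown values pass through (trimmed) unchanged.
        key = row["expense"].strip().lower()
        category = EXPENSE_FIXES.get(key, row["expense"].strip())
        # "Join" against the lookup table, keeping the abbreviation if there is no match.
        dept = DEPT_NAMES.get(row["dept"], row["dept"])
        # Reshape: one output row per (department, category), one column per year.
        # Lines merged into the same category are summed.
        wide[(dept, category)][row["year"]] += row["amount"]
    # The unneeded "clerk" column is dropped simply by never copying it.
    return [
        {
            "department": dept,
            "category": category,
            **{str(year): total for year, total in sorted(years.items())},
        }
        for (dept, category), years in sorted(wide.items())
    ]


for row in clean(raw):
    print(row)
```

The resulting rows, one per department and category with a column per year, are the kind of tidy table a treemap can be built from.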

The Refine module can be used for rapid renaming and filtering

This is a lot of complex editing, but no data is ever lost. Instead, the software records every step as Sue works. She can go back at any time to see what she did or to try something else, and she has a living record of her entire process. For example, it’s easy to see exactly which budget items she merged into the “administration” category. She sends this full workflow — not just the final chart — to her editor for double checking and approval.

Workbench is built on the concept of transparency as a side effect. As you do your normal work with data, you build up a “workflow,” which is a list of every step of the process. And Workbench already includes a powerful suite of data cleaning tools, including the Refine module.
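One way to picture a workflow is as an ordered list of described steps that can be replayed against the untouched source data. The sketch below illustrates that idea in plain Python. It is an assumption about the general pattern, not Workbench’s actual data model: the `Step` and `Workflow` classes, the `sum_by_item` helper, and the sample budget rows are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Table = List[dict]


@dataclass
class Step:
    description: str                 # human-readable record of what was done
    apply: Callable[[Table], Table]  # the transformation itself


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)

    def add(self, description: str, fn: Callable[[Table], Table]) -> "Workflow":
        self.steps.append(Step(description, fn))
        return self

    def run(self, table: Table, upto: Optional[int] = None) -> Table:
        """Replay the first `upto` steps (all by default) on a copy of the data."""
        result = [dict(row) for row in table]  # the source rows are never modified
        for step in self.steps[:upto]:
            result = step.apply(result)
        return result

    def log(self) -> List[str]:
        return [f"{i}. {step.description}" for i, step in enumerate(self.steps, 1)]


def sum_by_item(table: Table) -> Table:
    totals = {}
    for row in table:
        totals[row["item"]] = totals.get(row["item"], 0) + row["amount"]
    return [{"item": item, "amount": amount} for item, amount in totals.items()]


# Made-up budget lines.
source = [
    {"item": "Mayor's office", "amount": 5},
    {"item": "Clerk", "amount": 3},
    {"item": "Parks", "amount": 7},
]
ADMIN = {"Mayor's office", "Clerk"}

workflow = Workflow()
workflow.add(
    "Merge Mayor's office and Clerk into 'administration'",
    lambda t: [{**r, "item": "administration"} if r["item"] in ADMIN else r for r in t],
)
workflow.add("Sum amounts by item", sum_by_item)

print("\n".join(workflow.log()))
print(workflow.run(source))          # result after every step
print(workflow.run(source, upto=1))  # go back and inspect the state after step 1
```

Because each step keeps its description next to its transformation, the record of which items were merged into “administration” comes for free, and any earlier state can be rebuilt by replaying fewer steps.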

Show your work, share your work

Sue understands that trust comes from transparency. After all, the readers of her story should be able to tell exactly how her visualization was generated, even trace every step if they want to. That’s why Sue publishes her workflow along with the visualization. Readers can click the “source” link to examine her process in a live environment.

As she watches her story spread on social media, Sue thinks about how hard this sort of transparency used to be. In the bad old spreadsheet days you had to write up a separate “methodology post”! Or you could do all your work with code, but then readers couldn’t understand it.

Meanwhile, in another city, Chris has just been assigned to cover the new city budget. He’s never done a city budget before either, but he knows a fast way to learn how. Most data stories include a link to the source workflow, so he searches for “city budget treemap” and finds Sue’s story. He opens up Sue’s workflow, and clicks “Duplicate” to start his own story. It’s just like forking a project on GitHub, he thinks, except it’s so much easier…

It’s not fiction

Today is the first release of Workbench, the data journalism platform with all the features described above: integrated training, configurable monitoring, transparent workflows, and a duplicate button. We’ve been working with journalists in newsrooms large and small, from SF to Karachi and beyond, to understand what needs to change about the way data journalism is done today.

We hope that you’ll try Workbench and join our community. Send us your gnarliest data journalism problems and we’ll see if we can make them smoother. Share your successful workflows so other journalists can learn. Write a plug-in to extend Workbench’s capabilities for everyone, or join us on Gitter to get involved in this open source project.

We have a long way to go before Workbench is powerful enough to execute all the stories journalists need to tell, and an even longer way to go before data journalism is as easy to learn, easy to do, transparent, and sharable as it should be — and we hope you’ll be a part of making that happen.

Workbench is an open source project of Columbia Journalism School with the goal of making sophisticated data journalism accessible and transparent. Sign up here.