A different approach to transparent data journalism

“Show your work!” has been the data journalists’ rallying cry for many years. It just hasn’t been easy to do. Workbench is designed to make data journalism radically more transparent, but it’s not a transparency tool — instead, it’s a powerful data journalism tool that just happens to create transparent notebooks as you work.

A workflow which produces a live chart of the number of times Trump tweets each day

We are hardly the first to suggest that data journalism would be better if everyone could see exactly how it worked. Does that story on unemployment count people who have given up looking for work? How exactly does that analysis of surgical complication rates work? Rather than just trusting that a statistic means what a visualization says it means, or that an analysis is appropriate, it should be possible, and easy, for readers to check the reporter’s work. It’s also important for editors to be able to understand what a reporter did on a data story before it’s published, and the whole community benefits when reporters are able to share and learn from each other’s work. Just like a citation, not everyone will care to follow the link. But the mere fact that the work is publicly documented builds trust.

But transparency, and its close cousin reproducibility, are hard to achieve. Current reproducible data journalism tools and workflows are code-based, built on top of RStudio (like this and this), Jupyter Notebook (like this), or the command line (like this and this). Many news organizations already publish the code behind their stories, but doing so takes a commitment to reproducible scripts and extra work, because you have to publish not just the story but all of the data and code behind it.

Workbench takes a different approach: it’s a platform for all stages of data journalism, from scraping to monitoring to cleaning to visualization, where everything is automatically reproducible. You can work with your data by stacking a variety of “modules” that operate on tables (or dataframes, as they’re called in Pandas and R). Or you can just sort, filter, and edit the data in the spreadsheet interface. Every action you take, such as editing a cell value or applying a filter, is recorded as you work. The resulting “workflow” is a live, sharable, top-to-bottom record of how your raw data became a story. You can add notes to make the process even clearer, turning the workflow into a Jupyter notebook-style data document. While you can add Python code if you want, Workbench can handle most data journalism tasks without code. The workflow is designed to be understandable for non-technical readers, editors, and colleagues.

Every spreadsheet edit (or sort, or filter, or…) becomes a reproducible step
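
To make the idea of a recorded workflow concrete, here is a minimal Python sketch of a table pipeline in which every action is stored as a named step and can be replayed on the raw data. This is not Workbench’s actual implementation or API: the `Workflow` class, its methods, and the sample rows are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A table is modeled as a list of row dictionaries (a stand-in for a dataframe).
Table = List[Dict]


@dataclass
class Workflow:
    """Hypothetical recorder: each step is a (note, function) pair."""
    steps: List[Tuple[str, Callable[[Table], Table]]] = field(default_factory=list)

    def add(self, note: str, fn: Callable[[Table], Table]) -> "Workflow":
        # Record the step instead of applying it immediately.
        self.steps.append((note, fn))
        return self

    def run(self, raw: Table) -> Table:
        # Copy the rows so the raw data is never modified in place.
        table = [dict(row) for row in raw]
        for _note, fn in self.steps:
            table = fn(table)
        return table

    def describe(self) -> List[str]:
        # Human-readable, top-to-bottom record of what was done.
        return [f"{i}. {note}" for i, (note, _fn) in enumerate(self.steps, 1)]


# Made-up sample data, for illustration only.
raw = [
    {"city": "Albany", "complaints": 12},
    {"city": "Buffalo", "complaints": 40},
    {"city": "Utica", "complaints": 7},
]

workflow = (
    Workflow()
    .add("Keep cities with at least 10 complaints",
         lambda t: [r for r in t if r["complaints"] >= 10])
    .add("Sort by complaints, highest first",
         lambda t: sorted(t, key=lambda r: r["complaints"], reverse=True))
)

result = workflow.run(raw)
for line in workflow.describe():
    print(line)
print(result)
```

Because the steps are stored rather than applied once and forgotten, anyone holding the raw data and the step list can rerun the analysis and read, in order, what was done to it.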

This approach gets around two huge barriers to transparency and reproducibility. First, reproducibility usually requires code. It’s hard to exactly redo an analysis in Excel, but easy to run a script again. Unfortunately, this requires the ability to write that script in the first place — including doing data analysis programmatically in a language like R or Python. Second, reporters are very busy people! The job has only gotten more demanding and complex in recent years as one person is often expected to master a variety of skills such as reporting, writing, data analysis, web production, and engaging readers on social media. Transparency requires publishing a separate record of the data analysis process, in addition to the story, and no one wants extra work!

Instead of asking journalists to learn new skills and do more work, Workbench is designed to make their lives easier. It includes core tools like sorting, filtering, and charting. But spreadsheets already do all of that well. The main attraction is help with all the other parts of story production, all the necessary operations that are often difficult to do without coding. Workbench includes a variety of customizable scrapers, live data monitoring and alerts, a cleaning tool that operates similarly to OpenRefine (formerly Google Refine), and so on. For example, you could create a workflow that monitors the Twitter account of the local emergency services department and notifies you when there are three or more tweets within ten minutes. Or you could upload messy campaign finance records to Google Sheets and feed them to a workflow that does automated standardization of names and addresses. Workbench can already do many things, and you can extend it by creating new modules. Transparency and reproducibility are major goals of the Workbench project, but that’s not the promise we are making to reporters.
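
For readers curious what the emergency services alert involves under the hood, the rule “three or more tweets within ten minutes” can be expressed as a sliding-window check over timestamps. The sketch below is one plausible way to write it in Python, not a description of how Workbench does it. The function name `find_bursts`, its parameters, and the sample timestamps are assumptions made for illustration, and fetching the tweets is left out entirely.

```python
from collections import deque
from datetime import datetime, timedelta


def find_bursts(timestamps, min_count=3, window=timedelta(minutes=10)):
    """Return each timestamp at which at least `min_count` events
    (including that one) fall within the preceding `window`."""
    recent = deque()  # events still inside the current window
    alerts = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are more than `window` older than the newest one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= min_count:
            alerts.append(ts)
    return alerts


# Made-up tweet times, for illustration only.
tweet_times = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 4),
    datetime(2024, 1, 1, 9, 9),
    datetime(2024, 1, 1, 9, 30),
]

alerts = find_bursts(tweet_times)
print(alerts)
```

With these sample times, the third tweet at 9:09 is the only one that completes a group of three within ten minutes, so it is the only alert. The tweet at 9:30 arrives after the earlier ones have left the window.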

We want every data journalist to show their work. But any way you slice it, that means a big change to existing workflows, and no one has time for that. Transparency itself is not a great motivator when you’re overworked. So a transformative transparency tool has to be, first of all, a transformative data journalism tool.

Workbench is an open source project of Columbia Journalism School with the goal of making sophisticated data journalism accessible and transparent. Sign up here to try the beta.