Building tools to bring data-driven reporting to more newsrooms

How I aim to use my fellowship to develop an open-source ecosystem of tools for data journalism

Simon Willison
JSK Class of 2020
Dec 19, 2019


I applied for the JSK Fellowship with the following proposal: How might we grow an open-source ecosystem of tools to help data journalists collect, analyze and publish the data underlying their stories?

My starting point for this question was my Datasette open-source project. Datasette is a tool for exploring and publishing data. It provides an interface for exploring small or large datasets, an API for integrating that data into custom applications and a collection of tools for publishing that data to the internet.
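
Every table in a Datasette instance is automatically exposed as a JSON endpoint, which is what makes the custom-application piece work. Here’s a quick illustration in Python, using a hypothetical instance URL; the _shape=array parameter asks for the rows as a plain JSON array:

import json
import urllib.request

# Hypothetical Datasette instance: every table gets a .json endpoint.
url = "https://example.com/data/outages.json?_shape=array"
with urllib.request.urlopen(url) as response:
    rows = json.load(response)  # a list of row objects

print(rows[0])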

I designed Datasette based on my experience working with the Datablog team at the Guardian from 2008 to 2010. I was confident that it could solve the problems I was tackling a decade ago, but my first priority for the fellowship was to find out whether it addressed the challenges faced by journalists and data journalists today.

Three months into the fellowship, I’m confident that the answer is yes. I’ve been collaborating with teams of journalism students, talking with other fellows, and starting to spin up collaborations with journalists outside of Stanford. I’m seeing a strong signal that the tools I have built so far are applicable to the day-to-day work of data-driven reporting.

Modern reporters have many different avenues for gathering data to support their reporting: FOIA requests, government open data portals, leaks or carefully constructed scrapers. But once you’ve gathered the raw data, what’s step two? This is where I aim to help: I want to reduce the space between “I have the data” and “I can now view, analyze and share that data” as much as possible.

My current toolchain has two steps: first, convert the data from whatever format it’s in to SQLite, a fast, free, widely available database. I have a growing collection of tools for handling this conversion. Second, use Datasette itself to explore, publish and visualize the resulting databases.
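
Here’s a minimal sketch of that two-step toolchain using the sqlite-utils Python library (one of my conversion tools); the file, database and table names are hypothetical placeholders:

import csv
import sqlite_utils

# Step 1: convert the raw data (a CSV here) into a SQLite database.
db = sqlite_utils.Database("data.db")
with open("outages.csv", newline="") as f:
    db["outages"].insert_all(csv.DictReader(f))

# Step 2: hand the database to Datasette from the command line:
#   datasette data.db                    (explore it locally)
#   datasette publish cloudrun data.db   (publish it to the web)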

Evolving Datasette through side-projects

An important tenet of software design is eating your own dog food — actively using the tools you are building in order to best understand how they work and what needs to be improved.

I’ve always loved side-projects, and Datasette dramatically reduces the cost of building something new. I’ve been taking full advantage of this during my fellowship so far, and the projects I’ve been building have really helped me figure out where to go next with the Datasette project.

My three most significant side-projects so far have been PG&E Outages, Dogsheep and Niche Museums.

PG&E outages

At the start of October, California’s largest electricity provider PG&E started cutting off power to wide swathes of the state as a precautionary defence against wildfires. I started scraping PG&E’s outage map in June, and I used Datasette to publish and analyze the data I had collected. I wrote up full details of the project in Tracking PG&E outages by scraping to a git repo. Here’s a heatmap of the outages (rendered using Kepler) based on that data:

PG&E outages in California from October 1st 2019 onwards, rendered as a heatmap
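
The scraper itself follows a very simple pattern: fetch the latest snapshot, write it to a file in a git repository, and commit it if anything changed. Here’s a stripped-down sketch, with a placeholder URL standing in for PG&E’s actual outage feed:

import subprocess
import urllib.request

# Placeholder URL, not PG&E's real endpoint.
URL = "https://example.com/outages.json"

with urllib.request.urlopen(URL) as response:
    snapshot = response.read()
with open("outages.json", "wb") as f:
    f.write(snapshot)

# git commit exits non-zero when nothing changed, hence check=False.
subprocess.run(["git", "add", "outages.json"], check=True)
subprocess.run(["git", "commit", "-m", "Latest snapshot"], check=False)

Run that on a schedule and the commit history becomes a timestamped record of every change to the data.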

Dogsheep

Dogsheep is a collection of tools I’ve been building for personal analytics. I’m exporting my data from services such as Twitter, Google, LinkedIn, Foursquare, GitHub and Apple HealthKit and loading it into a private Datasette instance so I can analyze and visualize it over time. It turns out these services generate a lot of data, so this has been really useful for testing Datasette against larger databases.
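
Each of these tools boils down to the same move: read an export from a service, then write it into a SQLite table. A simplified sketch, with a hypothetical export file and record shape:

import json
import sqlite_utils

db = sqlite_utils.Database("private.db")

# Hypothetical export: a JSON file containing a list of objects,
# each with an "id" field.
with open("twitter-export.json") as f:
    records = json.load(f)

# upsert_all with a primary key means re-running an import updates
# existing rows instead of duplicating them.
db["tweets"].upsert_all(records, pk="id")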

Niche Museums

My favourite hobby is finding and exploring niche museums: tiny museums about very specific subjects. Once you start looking for them, they turn up everywhere. The United States has 15,891 branches of Starbucks but at least 30,132 museums!

Niche Museums is a new website I am building to document my explorations. I try to post a new museum every day, and under the hood the site is entirely powered by Datasette.

Building the site in this way has helped me explore all kinds of interesting potential directions for the Datasette ecosystem. Just one example: I built a new plugin called datasette-atom to help implement an Atom feed for the site.

Next steps: hosted Datasette

To take advantage of the tools I’m building, journalists need to be able to use them.

The biggest challenge I’m seeing right now is around usability. Running my tools requires first installing them on a laptop, and then understanding how to use the command-line terminal to run them. This is not a trivial thing to learn to do!

My focus for the next few months will be on addressing this problem. I plan to offer a hosted version of Datasette aimed at small newsrooms, providing a secure, pre-configured instance of the tool for newsrooms that want to try it out. I also plan to build versions of my conversion tools that run as part of that web application, so journalists can upload CSV files and other formats through their browser and have them converted automatically and made available within their private Datasette.
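
As a rough sketch of that upload-and-convert flow (not an actual implementation), the server-side step might look something like this with Flask and sqlite-utils; the endpoint, field name, filenames and database are all hypothetical:

import csv
import io

import sqlite_utils
from flask import Flask, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload_csv():
    # A journalist POSTs a CSV from their browser; it becomes a table
    # that their private Datasette instance can serve immediately.
    upload = request.files["file"]
    table = upload.filename.rsplit(".", 1)[0]  # "outages.csv" -> "outages"
    reader = csv.DictReader(io.TextIOWrapper(upload.stream, encoding="utf-8"))
    db = sqlite_utils.Database("newsroom.db")
    db[table].insert_all(reader)
    return f"Imported {db[table].count} rows into {table!r}"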

Want to get involved?

One of the great challenges of open-source software is that, because it’s available for free, people often use it without letting you know what they are doing with it.

I’m desperately interested in your feedback! If you try out any of the tools I’m building, please let me know how it goes. If you’re interested in using them but don’t know how to get started, please get in touch. And if you want to be first in line to try out the hosted version, you can contact me at swilliso@stanford.edu.
