Help me use your data

I've been interviewed a couple of times recently by people interested in understanding how best to publish data to make it useful for others. Once by a start-up and a couple of times by researchers. The core of the discussion has essentially been the same question: “how do you know if a dataset will be useful to you?”

I've given essentially the same answer each time. When I'm sifting through dataset descriptions, either in a portal or via a web search, my first stage of filtering involves looking for:

  1. A brief summary of the dataset: e.g. a title and a description
  2. The licence
  3. Some idea of its coverage, e.g. geographic coverage, scope of time series, level of aggregation, etc
  4. Whether it’s in a usable format

Beyond that, there’s a lot more that I'm interested in: the provenance of the data, its timeliness and a variety of quality indicators. But those four pieces of information are what I'm looking for right at the start. I’ll happily jump through hoops to massage some data into a better format. But if the licence or coverage isn't right then it's useless to me.

We can frame these as questions:

  1. What is it? (Description)
  2. Can I use it? (Licence)
  3. Will it help answer my question, in whole or in part? (Coverage)
  4. How difficult will it be to use? (Format, technical characteristics)
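
As a rough illustration, the four questions can be turned into a quick check against a dataset's metadata record. This is only a sketch: the field names (`description`, `licence`, `coverage`, `format`) and the example record are made up for illustration and don't correspond to any particular portal's schema.

```python
# Map each (hypothetical) metadata field to the question it answers.
REQUIRED_FIELDS = {
    "description": "What is it?",
    "licence": "Can I use it?",
    "coverage": "Will it help answer my question?",
    "format": "How difficult will it be to use?",
}


def unanswered_questions(metadata: dict) -> list[str]:
    """Return the questions that a metadata record leaves unanswered."""
    return [
        question
        for field, question in REQUIRED_FIELDS.items()
        if not str(metadata.get(field) or "").strip()
    ]


# A made-up record that describes the data but says nothing about coverage.
example = {
    "description": "Hourly temperature observations",
    "licence": "CC-BY-4.0",
    "format": "JSON",
}

missing = unanswered_questions(example)
for question in missing:
    print("Unanswered:", question)
```

A publisher could run something like this over their own catalogue to spot descriptions that leave a potential user guessing.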

It’s frustrating how often these essentials aren't readily available.

Here’s an example of why this is important.

A weather data example

I'm currently working on a project that needs access to local weather observations. I want openly licensed temperature readings for my local area.

My initial port of call was the Met Office Hourly Site Specific Observations. The product description is a useful overview and the terms of use make the licensing clear. Questions 1 & 2 answered.

However, I couldn't find a list of sites to answer Question 3. Eventually I found the API documentation for the service that would give me a list of sites. But I can only access that with an API key. So I've signed up, obtained a key, made the API call, downloaded the JSON, converted it into CSV, uploaded it to Carto and then made a map.
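
The JSON-to-CSV step is the sort of thing that takes a few lines of script. Here's a minimal sketch, assuming the site list has already been downloaded and is a flat JSON array of records. The field names and sample values are hypothetical and are not the actual shape of the Met Office response, which would need checking against their documentation.

```python
import csv
import io
import json

# Hypothetical sample standing in for the downloaded site list.
SAMPLE_JSON = """
[
  {"id": "1", "name": "Site A", "latitude": "51.0", "longitude": "-2.0"},
  {"id": "2", "name": "Site B", "latitude": "52.5", "longitude": "-1.5"}
]
"""

# Assumed column names; adjust to whatever the real records contain.
FIELDS = ["id", "name", "latitude", "longitude"]


def sites_to_csv(json_text: str) -> str:
    """Convert a JSON array of site records into CSV text."""
    sites = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for site in sites:
        # Keep only the columns we care about, leaving gaps blank.
        writer.writerow({field: site.get(field, "") for field in FIELDS})
    return buffer.getvalue()


csv_text = sites_to_csv(SAMPLE_JSON)
print(csv_text)
```

The resulting CSV, with latitude and longitude columns, is the kind of file a mapping tool can plot directly.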

And now I can answer Question 3. The closest site is in Bristol and so the service isn't useful to me at all. Time wasted, but hopefully not all of the effort, because now you can just look at the map. But the Met Office could simply have published a map. There is a map of the whole network, but not all of those sites contribute to the open dataset.
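
If all you want is the nearest site rather than a map, the same list can answer that directly. The sketch below uses the haversine formula to rank sites by great-circle distance. The site names and all coordinates are invented for illustration.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Invented sites as (name, latitude, longitude).
SITES = [
    ("Site A", 51.0, -2.0),
    ("Site B", 52.5, -1.5),
]

# Invented "home" location.
HOME = (51.1, -2.1)


def nearest_site(home, sites):
    """Return (name, distance_km) for the site closest to home."""
    name, lat, lon = min(
        sites, key=lambda s: haversine_km(home[0], home[1], s[1], s[2])
    )
    return name, haversine_km(home[0], home[1], lat, lon)


name, distance_km = nearest_site(HOME, SITES)
print(f"Nearest site: {name}, about {distance_km:.0f} km away")
```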

So I started to look at the OpenWeatherMap API. They also have an API endpoint that exposes weather data for a specific station, or for stations within a geographic area. But again, they've not actually published a map that would let me see if there are any local to me. I might have missed something so I've asked them.

In both cases I'm having to invest time and some technical effort in answering questions that should be answered by the documentation. They could even use their own APIs to create an interactive map for people to use!

As a result I'm going to end up using wunderground. By browsing the user-facing part of their site I’ve been able to confirm there are several local weather stations. Hopefully these will be exposed via the API. (But I’m going to have to dig a bit to check on the terms of use. Sigh.)

If you really want me to use your data then you need to help me to use it. Think about my user experience. Help me understand what your dataset contains before I have to actually poke around inside it.


Originally published at blog.ldodds.com on August 31, 2016.