Open Data requires democratizing analysis of that data

If you throw your data onto a public FTP server, does it make a sound? Too often, Open Data is assumed to be only about legality: if you free it from copyright and license restrictions and make it available for download, the thinking goes, your data are now open. Yet data that you release by simply throwing it onto a public server tends not to be all that useful. The most successful Open Data initiatives are those that empower people who were not using that data in the first place.

There are five aspects to empowering citizens, scientists and businesses with open data:

  1. Machine Readability. The most powerful uses of data are those enabled by automated applications that provide the data to end-users in a context that is immediate and understandable to those users. Publishing data in the form of PDFs or graphs that cannot be transformed and embedded in other contexts is not that useful.
  2. Quality Control. Many datasets are inherently complex and come from domains where there is a lot of subtlety. Yet, to empower users with that data, the data will have to get into the hands of developers and other laypersons who will not have that domain savvy. Consistent, thorough, and transparent quality control is a must so that the data are not misused.
  3. Ease of use. If someone needs to expend sweat, tears and blood in order to carry out simple analysis on your dataset, your Open Data initiative is not meeting its full potential. This has particular implications for the format of your data, how you partition it and whether people can analyze the data in situ. Ideally, people should be able to interact with your data without installing any software or provisioning any hardware; they should not have to download your data in order to compute with it.
  4. Support science. Ease of use and quality control should not come at the expense of serious scientific uses of the data. You should not “dumb” the data down in an effort to make it easy to use. It should be possible to load the data into statistical packages (but without having to download the entire dataset or spin up a cluster of machines). This too has implications for the format in which you distribute your data and the documentation you have in place around it.
  5. Collaboration and verifiability. It should be possible for citizens and scientists who carry out analyses on your dataset to easily disseminate both their methods and their results in a form that makes it easy to replicate and verify.

For a concrete example, take the Global Historical Climatology Network (GHCN) dataset. The GHCN data are already freely available and machine readable (as comma-separated text files), and good quality control has been applied to ensure that the data are spatially and temporally coherent. However, the dataset didn’t quite meet its potential on the other three aspects: ease of use, support for science, and collaboration and verifiability.

At Google, we decided to empower anyone with just a web browser to carry out queries on this dataset, with no need to provision hardware or install software. In other words, they can bring their computation to the data. Thus, you can now go to http://bigquery.cloud.google.com/, type in a query (sample queries are shown in this blog post on GHCN data) and get the results. The results can be exported into a text file or into a Google Sheet and shared with anyone in the world. Until the volume of data you process reaches one terabyte, it’s all free. That takes care of repeatability and publishing.

While our sample queries can address some of the problems related to ease of use, the GHCN data is (in the words of a Google engineer who looked at its schema) somewhat goofy. Every observation gets its own line in the .csv file, and we loaded the data as-is into BigQuery. This makes the queries really verbose, inefficient and ultimately Hard. But we also show you how to create a View (see our blog post and the description page for more details) that reformats the data on the fly into something that is a lot less goofy and error-prone.
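To make the "one observation per line" problem concrete, here is a minimal Python sketch of the kind of reshaping such a View performs. It is not the actual View from the blog post: the rows, the station names, and the `pivot_observations` helper are all hypothetical, and the divide-by-ten scaling assumes values stored in tenths of a unit, as in the GHCN daily files.

```python
from collections import defaultdict

# Hypothetical rows in the one-observation-per-line layout:
# (station, date, element, value). These are made-up illustration values,
# not real GHCN records.
LONG_ROWS = [
    ("STATION_A", "2015-01-01", "TMAX", 56),
    ("STATION_A", "2015-01-01", "TMIN", -12),
    ("STATION_A", "2015-01-01", "PRCP", 30),
    ("STATION_A", "2015-01-02", "TMAX", 61),
    ("STATION_B", "2015-01-01", "PRCP", 0),
]

# Elements to turn into columns (assumed subset, chosen for illustration).
ELEMENTS = ("TMAX", "TMIN", "PRCP")


def pivot_observations(rows, elements=ELEMENTS):
    """Reshape one-row-per-observation data into one row per station and day."""
    grouped = defaultdict(dict)
    for station, date, element, value in rows:
        if element in elements:
            # Assumption: raw values are stored in tenths of a unit.
            grouped[(station, date)][element] = value / 10.0

    wide = []
    for station, date in sorted(grouped):
        record = {"station": station, "date": date}
        for element in elements:
            # Missing observations become None rather than being dropped.
            record[element.lower()] = grouped[(station, date)].get(element)
        wide.append(record)
    return wide


if __name__ == "__main__":
    for record in pivot_observations(LONG_ROWS):
        print(record)
```

With the data in this shape, a question like "on which days did the minimum temperature fall below freezing?" becomes a filter on a single column instead of a self-join across observation rows.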

But what about serious, scientific use of the data? We have that covered as well. CloudShell is your free Linux home directory on the Cloud: you get 5GB of “home drive”, plus a microVM accessible as a tab in your web browser. From CloudShell, you can launch a Docker container that will run Datalab, a Jupyter notebook that knows about BigQuery and other tools on Google Cloud Platform. In fact, here’s the Datalab notebook that underlies the blog post about GHCN. So, you can do analysis of the data in Python on Google Cloud to your heart’s content. As long as the microVM suffices in terms of processing power, it’s all free as well.

For data to be truly open, the analysis of that data needs to be easy, repeatable and support the full range of users, from laypersons to scientists. As datasets move to the cloud, we are increasingly living in Jim Gray’s world. As he originally envisioned, allowing users to bring their computation to the data (instead of having to move the data to a place where they can compute with it) makes it possible to create easy-to-use and interoperable datasets that support a wide variety of users and uses.

Note: Everything here is a personal view. I do not speak for Google.
