Small Data Metadata

A Founding Principle, Not an Afterthought

Small data are the digital traces that individuals generate as a byproduct of daily activities, such as sending e-mail or exercising with fitness trackers. [source: Workshop on Small Data Experimentation, call for papers]

In this post, we advocate for standardizing metadata about small data and for building tools to create, manage, process, and reason about it. Small data metadata is not only critical to foster an ecosystem of producers and consumers of small data; it is also essential to provide guarantees, such as privacy, that are inherent to the data itself.

Engraving The Confusion of Tongues by Gustave Doré (1865)

Introduction

The English dictionary defines an afterthought as “an item or thing that is thought of or added later”. The World Wide Web is full of such afterthoughts: privacy, security, content rating, and structured data are just a few examples. The problem with afterthoughts is that you keep paying the price for them over and over, and you often need to build expensive workarounds.

If we look at structured data on the Web, for every crawl, search engines need to re-extract structure from the raw HTML, structure that is often present on the publishing side but gets “lost in translation”. Standards like Schema.org try to avoid this, but only a limited (yet growing) set of pages supports the standard. What if the standard had been baked into the HTML specification?

We see a similar pattern with open data. The lack of adoption of schemas to represent datasets (despite efforts such as Schema.org Dataset, DCAT, and DSPL), the lack of taxonomies to describe the nature of the data, and the lack of unique identifiers have created a situation where open datasets are hard to find, hard to use, and hard to combine, as described in [Barbosa et al. 2014]. The lack of standards is also a barrier to commercial offerings.

The case for small data metadata (SDMD)

Small data, very loosely defined, is still in its infancy, much as the Web was circa 1996. More and more devices are being connected (Fitbit, Jawbone, Garmin, etc.), and major phone manufacturers are now embedding “small data” features in their offerings, most recently Apple’s ResearchKit for medical research [Apple 2015]. The recently announced White House Precision Medicine Initiative will also rely on personalized data being collected and shared.

Assuming small data follows a trajectory similar to the Web’s, we should expect lots of data to be shared by people (producers) and lots of data to be consumed in experiments (by consumers).

End users will need to respond to numerous solicitations to share their data. We should not expect this to be done manually; rather, it will be mediated by a software agent.

Data collectors will need to find relevant users based on profiles, demographics, locations, etc. and describe to them clearly the kind of data they want to collect and for what purpose.

Experimenters will need to find relevant datasets with data collected according to their specifications.

And all interactions between these entities should provide support for privacy, trust, “commerce” in a broad sense, and automated reasoning. Such an ecosystem will require robust metadata capable of answering questions like:

  • [user] Is experiment X compatible with my privacy policy?
  • [user] Is the purpose of experiment X compatible with my values?
  • [data collector] Find users who are married couples with two kids in region X.
  • [experimenter] Find the best datasets for experiment X, given a budget and a purpose.
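As a sketch of how the first question above could be answered automatically, consider a user’s software agent comparing an experiment’s declared data requests against a locally stored privacy policy. Everything here — the field names, the policy structure, the purpose labels — is hypothetical, since no SDMD standard exists yet:

```python
# Hypothetical sketch: a user agent deciding whether an experiment's
# data request is compatible with the user's privacy policy.
# All field names and the policy structure are illustrative, not a standard.

# The user's policy: for each kind of data, the set of purposes
# for which it may be shared.
privacy_policy = {
    "heart_rate": {"research"},               # research use only
    "step_count": {"research", "commercial"}, # broadly shareable
    "location": set(),                        # never shared
}

def is_compatible(policy, experiment):
    """Return True if every kind of data the experiment requests
    is allowed by the policy for the experiment's stated purpose."""
    purpose = experiment["purpose"]
    return all(
        purpose in policy.get(kind, set())
        for kind in experiment["data_requested"]
    )

experiment = {
    "name": "Sleep and exercise study",
    "purpose": "research",
    "data_requested": ["heart_rate", "step_count"],
}

print(is_compatible(privacy_policy, experiment))  # True: both kinds allowed for research
```

A real agent would of course need a shared vocabulary of data kinds and purposes — which is precisely what SDMD standards would provide — so that `"heart_rate"` means the same thing to the user, the data collector, and the experimenter.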

Small data metadata (SDMD) is organized around schemas, taxonomies, and unique identifier schemes. Rather than reinventing the wheel, we should build upon existing standards. Schemas will most likely extend DCAT (with concepts such as Dataset, Catalog, and Distribution; see the overview) or Schema.org Dataset. For taxonomies, we can leverage the W3C Simple Knowledge Organization System (SKOS), or W3C OWL if more advanced reasoning is needed. For identifiers, we can reuse W3C Uniform Resource Identifiers (URIs). Some sectors have already started to work on this issue, e.g. Open mHealth for health data.

SDMD will be able to reuse tools that apply to generic datasets. But it will also require domain-specific features, with attributes such as user demographics, kind of data collected, unit and frequency, measured vs. reported, timestamp, unique hash, digital signature, privacy policy, content license, reward, price, purpose of the experiment, scope of the experiment, etc.
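To make this concrete, here is what such a record might look like as JSON-LD built on the Schema.org Dataset vocabulary. The properties under the `sdmd:` prefix (and its namespace URL) are invented for this sketch — no such vocabulary has been standardized:

```python
import json

# Hypothetical SDMD record as JSON-LD, reusing Schema.org Dataset and
# adding illustrative small-data attributes under an invented "sdmd:" prefix.
record = {
    "@context": {
        "@vocab": "http://schema.org/",
        "sdmd": "http://example.org/sdmd#",  # placeholder namespace
    },
    "@type": "Dataset",
    "name": "Daily step counts, anonymized cohort",
    "license": "https://creativecommons.org/licenses/by-nc/4.0/",
    "sdmd:kindOfData": "step_count",
    "sdmd:measuredVsReported": "measured",   # sensor data, not self-reported
    "sdmd:unitFrequency": "steps per day",
    "sdmd:purpose": "research",
    "sdmd:privacyPolicy": "http://example.org/policies/no-redistribution",
}

print(json.dumps(record, indent=2))
```

Because the record extends a vocabulary that generic dataset tools already understand, a search engine could index it as an ordinary Schema.org Dataset while SDMD-aware agents exploit the additional attributes.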

SDMD offers interesting avenues of research around data modeling, automatic reasoning, constraint programming, matching algorithms and auctions.

Conclusion

Naysayers might argue that the Web’s successful growth is due to the absence of crippling standards for publishing data. That is probably true. At the time, publishing was done by humans for humans, and the published information was, at first, rather trivial in nature.

With small data, the context is different. Data is mediated by computer programs for computer programs, and the published information is far from trivial. Rather, it is intensely personal and requires careful handling.

For this reason, developing and enforcing small data metadata standards should be seen not as crippling but as empowering for the ecosystem. The definition of such standards, and the related research, should be encouraged today.

For small data, metadata must be a founding principle, not an afterthought.
