Flip flops or thongs? Using vocabularies to harmonise marine data

Rob Thomas
opengovintelligence
4 min read · Sep 4, 2018

One of the challenges in bringing together data from multiple sources is the semantics of how the data variables in a dataset, or the information provided for context (also called metadata), are named. Even within a single scientific discipline, such as marine science, different regions around the globe use different lists of terms, or vocabularies. Furthermore, sub-disciplines may use different terms to name things that cut across multiple disciplines. To take an everyday example, even when people speak the same language, regional variations in what is understood by a word can lead to confusion.

Are these trousers (UK) or pants (USA)? Are these flip flops (UK) or thongs (Australia)?

Efforts to standardise the terms a community uses, and what those terms are understood to mean, have resulted in the creation of many controlled vocabularies. An important part of agreeing the terms to be included within a vocabulary is to agree unambiguous definitions. These controlled vocabularies provide lists of terms with agreed definitions over a broad range of disciplines (e.g. oceanography, geography and geology) that are relevant to describing things clearly. Where these vocabularies are available as online resources with unique URIs, using these URIs in metadata and to label linked open statistical data (LOSD) removes the ambiguities associated with data labelling. It also enables records to be interpreted appropriately as Linked Data by machines. This opens up datasets to a whole world of possibilities for computer-aided manipulation, distribution and long-term reuse.
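As a rough illustration of why URIs help, here is a minimal Python sketch. It assumes each dataset carries a concept URI alongside its human-readable column label. The URI, the record layout and the `group_by_concept` helper are all hypothetical and invented for this example; a real URI would come from a published vocabulary.

```python
from collections import defaultdict

# Hypothetical concept URI (an assumption for illustration only).
SEA_TEMP = "http://example.org/vocab/sea-water-temperature"

# Two datasets whose publishers chose different column labels but
# annotated the column with the same concept URI in their metadata.
datasets = [
    {"label": "Temperature of the water column", "concept": SEA_TEMP, "values": [11.2, 11.4]},
    {"label": "sea temperature", "concept": SEA_TEMP, "values": [10.9]},
]


def group_by_concept(datasets):
    """Pool values by concept URI, ignoring the free-text labels."""
    grouped = defaultdict(list)
    for ds in datasets:
        grouped[ds["concept"]].extend(ds["values"])
    return dict(grouped)


merged = group_by_concept(datasets)
```

Because the grouping key is the URI rather than the label, the two differently named columns end up in the same bucket without any string matching.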

The benefit of controlled vocabularies to the Irish OGI pilot can be illustrated by a scenario in which data values are pulled together from datasets published by a range of organisations. Sea temperature is an important variable in understanding the real-time conditions of the marine environment. It matters in scenarios such as search and rescue, because the temperature of the water column indicates the potential survival time of an individual awaiting rescue by the emergency services.

Example of terms relating to temperature in the SeaDataNet Discovery Parameter vocabulary published on the NERC Vocabulary Server. Note the definitions, which remove ambiguity over what each term covers.

In bringing together sea temperature data from a variety of sources, one dataset may have a column labelled “Temperature of the water column” and another might have “sea temperature” or even “temperature”. To the human eye the similarity is obvious, but a computer would not necessarily be able to interpret these as the same thing. There is a range of options to overcome problems such as these:

1. Hard code all possible naming options for sea temperature into the aggregation software. Whenever a new dataset introduces a new term for sea temperature, a new version of the software must be developed and released.

2. Everyone uses the same set of terms for labelling their datasets. This may work moving forwards, but the work involved in updating all historic data archives makes it unrealistic. There may also be slight nuances between terms, so that a like-for-like replacement is not possible, and getting universal agreement may not be simple.

3. Publish the controlled vocabularies on the web as Linked Data and publish mappings between related and overlapping vocabularies. Use software that can reason across these mappings.
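The third option can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular aggregation tool works: the three vocabulary URIs are hypothetical, and the mappings are treated as symmetric, transitive "same meaning" links (in the spirit of the SKOS `exactMatch` property). The helper names `build_graph`, `equivalent_concepts` and `canonical` are invented for this example.

```python
from collections import defaultdict, deque

# Hypothetical URIs for the same idea in three different vocabularies.
A_TEMP = "http://example.org/vocabA/TEMP"
B_SST = "http://example.org/vocabB/sea_temperature"
C_T = "http://example.org/vocabC/temperature_water"

# Published mappings, assumed here to mean "these two terms are equivalent".
EXACT_MATCHES = [(A_TEMP, B_SST), (B_SST, C_T)]


def build_graph(pairs):
    """Turn mapping pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def equivalent_concepts(uri, graph):
    """Breadth-first search for every URI reachable through the mappings."""
    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def canonical(uri, graph):
    """Pick one stable representative for a group of equivalent URIs."""
    return min(equivalent_concepts(uri, graph))


graph = build_graph(EXACT_MATCHES)
same_thing = canonical(A_TEMP, graph) == canonical(C_T, graph)
```

Note that no mapping links vocabulary A directly to vocabulary C; the software reaches that conclusion by following the two published mappings. When a new vocabulary appears, only a new mapping needs to be published, and the software itself stays unchanged.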

Consider power sockets: there are many regional variations, but the solution has been to use adapters rather than for everyone to adopt the same socket type. This is a nice analogy for how mappings between vocabularies can solve the problems posed by the variety of vocabularies available. The task is easier if the data provider has used a published vocabulary rather than terms of their own invention.

(image: https://www.flickr.com/photos/kewl/7006904747)

The Marine Institute uses environmental community vocabularies from the NERC Vocabulary Server (NVS) to describe its online data holdings within existing ISO 19139 metadata content, and it aims to use these same vocabularies in Linked Open Statistical Data (LOSD) within the OpenGovIntelligence (OGI) Irish pilots. The NVS is maintained by the British Oceanographic Data Centre, which co-ordinates requests for new terms and for the modification or deprecation of existing terms on a regular basis. It governs the content on behalf of infrastructures (e.g. SeaDataNet) or publishes the vocabularies on behalf of user communities (e.g. Climate and Forecast Conventions). Because the vocabularies are published using the Simple Knowledge Organisation System (SKOS), a W3C standard, this facilitates co-creation.
