Linked Open Statistical Data, Served Simply
Hiding the complexity of RDF Data Cubes behind GraphQL & JSON-LD
The recently launched LinkedSDGs pilot application, hosted by the United Nations’ DESA, showcases the value of linked open data for connecting diverse information resources relevant to the Sustainable Development Goals. One of the key data assets employed in this platform is the collection of SDG statistical data series, which are served by a prototype Web API combining GraphQL and JSON-LD mechanisms for pain-free and efficient delivery of linked open statistical data to data consumers on the Web.
LinkedSDGs statistical data API: GraphQL + JSON-LD
The gist of the adopted approach is simple and comes down to two points:
- The data is served using a plain GraphQL API, whose schema adheres to the core SDMX data cube model (see: Figure 1)
- If requested, the GraphQL responses include a couple of additional JSON-LD-specific keys, which enable automatic mapping of the data to the RDF Data Cube Vocabulary, i.e., the W3C-recommended ontology for representing linked open statistical data.
For instance, on executing the following GraphQL query:
{
  DataSet(series: SL_DOM_TSPDCW) {
    _id
    _type
    notation
    label
    unitMeasure {
      _id
      _type
      prefLabel
    }
  }
}
we obtain a plain JSON response with some metadata about the SDG series SL_DOM_TSPDCW:
{
  "data": {
    "DataSet": {
      "_id": "http://metadata.un.org/sdg/SL_DOM_TSPDCW",
      "_type": "DataSet",
      "notation": "SL_DOM_TSPDCW",
      "label": "Proportion of time spent on unpaid care work, by sex, age and location (%)",
      "unitMeasure": {
        "_id": "http://metadata.un.org/sdg/codes/units/PERCENT",
        "_type": "Code",
        "prefLabel": "Percentage"
      }
    }
  }
}
However, if the same query is sent with the Accept header set to application/ld+json, the response is augmented with two special JSON-LD keys, @context and @id, which determine how the returned data should be converted to a valid RDF graph:
{
  "data": {
    "@context": "https://raw.githubusercontent.com/UNStats/LOD4Stats/master/sdg-data/sdg-series-data-cubes-context.jsonld",
    "@id": "@graph",
    "DataSet": {
      "_id": "http://metadata.un.org/sdg/SL_DOM_TSPDCW",
      "_type": "DataSet",
      "notation": "SL_DOM_TSPDCW",
      "label": "Proportion of time spent on unpaid care work, by sex, age and location (%)",
      "unitMeasure": {
        "_id": "http://metadata.un.org/sdg/codes/units/PERCENT",
        "_type": "Code",
        "prefLabel": "Percentage"
      }
    }
  }
}
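From the client side, switching between the two responses is only a matter of the Accept header. The following is a minimal Python sketch of how such a request could be prepared using the standard library alone. The endpoint URL is a placeholder (the article does not state the API address), and sending the query as a JSON body via POST is an assumption based on common GraphQL-over-HTTP practice, not a documented feature of this API.

```python
import json
import urllib.request

# Placeholder endpoint: the real API address is not given in the article.
ENDPOINT = "https://example.org/graphql"

QUERY = """
{
  DataSet(series: SL_DOM_TSPDCW) {
    _id
    _type
    notation
    label
    unitMeasure {
      _id
      _type
      prefLabel
    }
  }
}
"""


def build_request(query, as_json_ld=False, endpoint=ENDPOINT):
    """Prepare a GraphQL POST request, optionally asking for JSON-LD."""
    accept = "application/ld+json" if as_json_ld else "application/json"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )


plain_request = build_request(QUERY)
linked_request = build_request(QUERY, as_json_ld=True)
```

Passing either request object to urllib.request.urlopen would send it; the two requests differ only in the Accept header, so the query itself stays untouched.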
The all-important JSON-LD context, supplied in a separate file, defines a direct mapping from GraphQL type and field names to the URIs of the matching RDF terms, such as:
"DataSet": "http://purl.org/linked-data/cube#DataSet"
"Observation": "http://purl.org/linked-data/cube#Observation"
"Slice": "http://purl.org/linked-data/cube#Slice"
"unitMeasure": "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure"
"Code": "http://www.w3.org/2004/02/skos/core#Concept"
"prefLabel": "http://www.w3.org/2004/02/skos/core#prefLabel"
As a result, we can immediately process the returned data with a JSON-LD library and obtain a fragment of an RDF graph, which contains linked open statistical data expressed in the RDF Data Cube Vocabulary with some additional annotations in other standard W3C terminologies:
<http://metadata.un.org/sdg/SL_DOM_TSPDCW> <http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure> <http://metadata.un.org/sdg/codes/units/PERCENT> .
<http://metadata.un.org/sdg/SL_DOM_TSPDCW> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#DataSet> .
<http://metadata.un.org/sdg/SL_DOM_TSPDCW> <http://www.w3.org/2000/01/rdf-schema#label> "Proportion of time spent on unpaid care work, by sex, age and location (%)" .
<http://metadata.un.org/sdg/SL_DOM_TSPDCW> <http://www.w3.org/2004/02/skos/core#notation> "SL_DOM_TSPDCW" .
<http://metadata.un.org/sdg/codes/units/PERCENT> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://metadata.un.org/sdg/codes/units/PERCENT> <http://www.w3.org/2004/02/skos/core#prefLabel> "Percentage" .
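To make the mapping concrete, here is a deliberately simplified Python sketch that reproduces the triples above by hand. It is not a JSON-LD processor and does not fetch the context file: the term table copies the URIs quoted in the context excerpt, while the entries for label and notation, and the treatment of _id and _type as resource identifier and rdf:type, are assumptions inferred from the N-Triples output shown. In practice, a real JSON-LD library (for example PyLD or rdflib) would perform this step.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Terms quoted in the context excerpt, plus "label" and "notation",
# which are inferred from the N-Triples output (assumption).
TERMS = {
    "DataSet": "http://purl.org/linked-data/cube#DataSet",
    "Code": "http://www.w3.org/2004/02/skos/core#Concept",
    "unitMeasure": "http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure",
    "prefLabel": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "notation": "http://www.w3.org/2004/02/skos/core#notation",
}


def to_triples(node):
    """Flatten one GraphQL object carrying _id/_type into N-Triples lines."""
    subject = f"<{node['_id']}>"
    triples = []
    if "_type" in node:
        triples.append(f"{subject} <{RDF_TYPE}> <{TERMS[node['_type']]}> .")
    for key, value in node.items():
        if key in ("_id", "_type") or key not in TERMS:
            continue  # identifiers are handled above; unmapped keys are dropped
        predicate = f"<{TERMS[key]}>"
        if isinstance(value, dict):
            # Nested object: link to it, then emit its own triples.
            triples.append(f"{subject} {predicate} <{value['_id']}> .")
            triples.extend(to_triples(value))
        else:
            # Plain literal; json.dumps provides simple string escaping.
            triples.append(f"{subject} {predicate} {json.dumps(str(value))} .")
    return triples


response = {
    "data": {
        "DataSet": {
            "_id": "http://metadata.un.org/sdg/SL_DOM_TSPDCW",
            "_type": "DataSet",
            "notation": "SL_DOM_TSPDCW",
            "label": "Proportion of time spent on unpaid care work, by sex, age and location (%)",
            "unitMeasure": {
                "_id": "http://metadata.un.org/sdg/codes/units/PERCENT",
                "_type": "Code",
                "prefLabel": "Percentage",
            },
        }
    }
}

triples = to_triples(response["data"]["DataSet"])
print("\n".join(sorted(triples)))
```

The point of the sketch is that nothing in the GraphQL payload changes: the RDF view comes entirely from looking up each type and field name in the context.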
In short, by complementing GraphQL with some key JSON-LD features, the LinkedSDGs statistical data API implements a recently advocated concept of:
pure GraphQL inside & (optional) linked data outside
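On the server side, this idea can be pictured as a thin post-processing step applied after the ordinary GraphQL execution. The Python sketch below is only an illustration of that step, not the actual LinkedSDGs implementation: the function names are hypothetical, and the Accept header parsing is intentionally naive (it ignores quality values).

```python
# Context URL as it appears in the JSON-LD response shown in the article.
CONTEXT_URL = (
    "https://raw.githubusercontent.com/UNStats/LOD4Stats/master/"
    "sdg-data/sdg-series-data-cubes-context.jsonld"
)


def wants_json_ld(accept_header):
    """True if application/ld+json is among the listed media types."""
    media_types = (
        part.split(";")[0].strip().lower() for part in accept_header.split(",")
    )
    return "application/ld+json" in media_types


def finalize_response(graphql_result, accept_header=""):
    """Return the GraphQL result as is, or with the JSON-LD keys added."""
    if not wants_json_ld(accept_header):
        return graphql_result  # pure GraphQL: nothing is touched
    data = {"@context": CONTEXT_URL, "@id": "@graph"}
    data.update(graphql_result["data"])  # original fields follow the new keys
    return {**graphql_result, "data": data}


result = {"data": {"DataSet": {"_id": "http://metadata.un.org/sdg/SL_DOM_TSPDCW"}}}
plain = finalize_response(result, "application/json")
linked = finalize_response(result, "application/ld+json")
```

Because the JSON-LD keys are added only at the very end, resolvers and the schema remain ordinary GraphQL, which is what keeps the linked data layer optional.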
Why does it matter?
This conceptually and technically straightforward approach provides three potentially invaluable services to the web data community.
Firstly, the prototypical 80% of data consumers who only come for the statistics don’t need to bother with the peculiarities of the RDF data model or data serialization formats. They can happily get things done with plain JSON served by a familiar GraphQL interface. And coincidentally, GraphQL, with its strict typing schema, fine-grained object structures and parameter-controlled filtering capabilities, happens to be a very convenient language for selective querying of large pools of statistical data, as demonstrated on the LinkedSDGs platform.
Secondly, the small portion of users whose use case does actually require what the linked data representation primarily offers, namely semantic interoperability across different datasets, can… well, they too can get their RDF data easily with the use of GraphQL and JSON-LD. Most importantly, they can also go a long way before needing to handle those RDF data cubes directly, as the conversion to RDF is merely the final step of the data consumption process, supported by a JSON-LD library.
And lastly, and perhaps most interestingly, with the LinkedSDGs example, statistical data providers might obtain a new, simple and flexible API model for serving their datasets as linked open data, which has always been a highly desirable, yet highly challenging prospect. Evidently, linked open statistical data has a longstanding history of… essentially not being there. And that for a number of reasons, partially alluded to above. Its generation is cumbersome and requires some specialized knowledge of RDF in general and the RDF Data Cube Vocabulary in particular. Sadly, this vocabulary, even though crafted with the full seriousness and professionalism of a W3C standardization process, is quite notorious for its apparent complexity and a few “loose ends”, where the data modeling prescriptions remain somewhat ambiguous.
Even if valid RDF data is eventually produced, the immediate next question is how to serve it to potential consumers. A downloadable RDF file? Sure, but that pushes the full processing burden onto the user. Resolvable URIs? They are good for dereferencing individual resources perhaps, but a complete hindrance for those willing to consume larger portions of data. A SPARQL endpoint? It’s often hard to maintain and even harder for an average developer to access. A REST API? Right, that’s one possibility. However, designing intuitive, granular access to complex data structures via REST endpoints is a creative design challenge in its own right, which makes the resulting services potentially non-interoperable with one another.
The combination of GraphQL with JSON-LD allows for virtualizing and serving statistical data as linked open statistical data with minimal effort, by bringing together the best of two worlds:
- the powerful GraphQL data fetching, structuring and querying mechanisms over arbitrary data sources and back-ends;
- the flexible JSON-to-RDF conversion capabilities supported by the JSON-LD standard.
For more information on the LinkedSDGs pilot application see: https://sustainabledevelopment.un.org/LinkedSDGs/about.