Querying DBpedia with GraphQL

Because getting your JSON-LD should be simple

Source: https://goo.gl/u7J5r3

Can accessing linked data be easier, lighter, and friendlier than what standard SPARQL endpoints currently offer? Yes, it can, or at least it should. In this blog post, we give an overview of HyperGraphQL, a prototype GraphQL interface for querying RDF triplestores, which combines:

  1. the simplicity of the GraphQL query language,
  2. low complexity of supported requests, ensured by GraphQL schemas,
  3. front-end-friendly and hypermedia-enabled JSON-LD responses.

SPARQL endpoints

A SPARQL endpoint: the mythical gate to the promised land of the Linked Open Data cloud and, coincidentally, one of the most controversial building blocks of the Semantic Web technology stack. Try Googling for “SPARQL endpoints” and your top two hits are likely to be these: (1) a W3C-owned and maintained wiki page listing some 60-odd publicly available SPARQL endpoints, and (2) a 2013 blog post by Dave Rogers, still popular in the community, entitled “The enduring myth of SPARQL endpoints”. It is rather hard to tell which of the two better attests to the disappointing state of SPARQL endpoints on today’s web: a handful of showcase instances, many of which appear to be down or temporarily unavailable anyway (and that is nearly 10 years after SPARQL was standardised and 20 after the Semantic Web project was launched), or a poignant post methodically exposing many of the fundamental flaws of the very concept itself. As a quick recap, these flaws can be roughly summarised as follows:

Too much expressive liberty permitted in SPARQL, combined with non-trivially sized datasets in open environments like the web, must inevitably place too much burden on live RDF stores and take a critical toll on their availability.

This claim is easy to verify in practice. For instance, the query:

SELECT * WHERE { ?x ?y ?z }

issued at the DBpedia endpoint will simply attempt to download the entire dataset via the query service. A more complex graph query pattern, with several joins or aggregates, or even a simple one involving orderings, such as:

SELECT * WHERE { ?x rdfs:label ?l } ORDER BY ?l LIMIT 100

might easily blow up the expected query time to minutes or more. SPARQL endpoints? Well, quoting the neat concluding metaphor by Dave Rogers: “There is a reason there are no ‘SQL Endpoints’”.
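If you want to try either query from a script rather than a browser, the sketch below builds a plain SPARQL Protocol GET request using only the Python standard library. The endpoint URL, the ten-second timeout, and the function names are assumptions made for illustration; nothing is sent until `run_query` is called.

```python
import urllib.parse
import urllib.request

# Assumed address of the public DBpedia endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_sparql_request(query: str, endpoint: str = DBPEDIA_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the query string."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query: str, timeout_seconds: float = 10.0) -> bytes:
    """Send the query; urlopen raises if the server is too slow or refuses."""
    with urllib.request.urlopen(build_sparql_request(query), timeout=timeout_seconds) as response:
        return response.read()


ordered_labels = "SELECT * WHERE { ?x rdfs:label ?l } ORDER BY ?l LIMIT 100"
request = build_sparql_request(ordered_labels)
print(request.full_url)
```

Calling `run_query(ordered_labels)` is then enough to see for yourself how long the server takes, or whether the timeout fires first.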

Acknowledging this inherent trade-off between scale, the complexity of data requests, and the availability of servers has recently led to an interesting new perspective on the feasibility of deploying linked data on the web. Originally proposed by Ruben Verborgh and his colleagues from Ghent University, the idea goes under the label of Linked Data Fragments (LDF).

The central premise underpinning the LDF concept is that, by suitably restricting the expressiveness of the queries a client can send to the server, a better balance on the scale-availability spectrum can potentially be achieved, which should ultimately make the open deployment of linked data more pragmatic and sustainable.
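One way to picture such a restriction is a server that answers only single triple patterns, one page at a time, and leaves any joins to the client. The toy sketch below does this over an in-memory list. The data, the page size, and the function name are invented for illustration and are not taken from any LDF implementation.

```python
from typing import Optional

Triple = tuple[str, str, str]

# A tiny stand-in dataset; prefixed names are kept as plain strings.
TRIPLES: list[Triple] = [
    ("dbr:Danilo_Tognon", "rdf:type", "dbo:Person"),
    ("dbr:Danilo_Tognon", "foaf:name", "Danilo Tognon"),
    ("dbr:Andreas_Ekberg", "rdf:type", "dbo:Person"),
    ("dbr:Andreas_Ekberg", "foaf:name", "Andreas Ekberg"),
]


def match_pattern(subject: Optional[str], predicate: Optional[str], obj: Optional[str],
                  page: int = 0, page_size: int = 2) -> list[Triple]:
    """Return one page of triples matching a single pattern; None is a wildcard."""
    pattern = (subject, predicate, obj)
    matches = [
        triple for triple in TRIPLES
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]
    start = page * page_size
    return matches[start:start + page_size]


people = match_pattern(None, "rdf:type", "dbo:Person")
print(people)
```

Each request costs the server a bounded amount of work, which is the kind of trade the paragraph above describes: less expressiveness per request in exchange for better availability.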

Semantic Web challenge

Another alleged cause of the lower-than-expected adoption of linked data technologies in modern web applications is the steep learning curve faced by newcomers to the semantic world. Let’s face it: rightly or not, SPARQL and RDF(S) appear to many as overly complex and conceived out of a visionary scientific agenda rather than bottom-up engineering practice. The linked data paradigm is powerful and it does offer some vital possibilities, but its core standards are simply hard for an average developer to comprehend quickly and put to work. Unlike in the times when SQL was coming of age, that developer now has to digest and get her head around the next disruptive library, framework, or API design philosophy every other week.

In another popular and provocative blog post, “JSON-LD and Why I Hate the Semantic Web”, Manu Sporny, one of the chief architects behind JSON-LD, writes:

“That’s not to say that TURTLE, SPARQL, and Quad stores don’t have their place, but I always struggle to point to a typical startup that has decided to base their product line on that technology (versus ones that choose MongoDB and JSON on a regular basis). […] I like JSON-LD because it’s based on technology that most web developers use today. It helps people solve interesting distributed problems without buying into any grand vision.”

The irony here is that JSON-LD is nothing other than linked data, simply dressed in the much more developer-friendly and familiar outfit of JSON syntax. It’s not the principles of linked data that are disconnected from the realities of applied engineering. Rather, the major hindrance to mainstream adoption of the Semantic Web stack seems to be the scarcity of pragmatic tools that implement those principles and offer immediate benefits at a reasonably low entry cost.

HyperGraphQL

So what does GraphQL have to do with all of this? GraphQL is a query language for APIs developed originally at Facebook, which since its first public release in 2015 has been quickly gaining popularity among developers. Following the typical tech hype cycle, some have already crowned it a successor of the whole RESTful API concept, which is probably excessive and essentially incorrect. Nonetheless, some strengths of GraphQL are undeniable:

  1. it employs simple schemas as a mechanism of defining the scope and shape of data that is to be exposed by the service,
  2. it supports a pragmatic and intuitive query language, influenced by the JSON syntax,
  3. it returns predictable results as JSON objects.

On a closer look, these might be just the key features to address some of the critical shortcomings of SPARQL endpoints, and the problem of deployment of linked open data in general. Let us consider a small application scenario.

Figure: Querying an RDF store via a GraphQL server. The response contains linked data as a JSON-LD object.

Suppose we want to host an instance of DBpedia and make part of the dataset publicly available. For example, we might want to expose only the data about people, their names, as well as the dates and places of their birth. With GraphQL it is straightforward to define a schema that would provide the required view on the data model.

type Query {
  people: [Person]
}
type Person {
  name: [String]
  birthDate: String
  birthPlace: City
}
type City {
  label: [String]
}
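To see why such a schema works as a gatekeeper, here is a minimal sketch that mirrors the three types as a Python dictionary and reports any requested field the schema does not declare. The dictionary layout and the function name are illustrative assumptions; a real GraphQL server performs this validation for you.

```python
# Field name -> target type, mirroring the schema above (list markers dropped).
SCHEMA = {
    "Query": {"people": "Person"},
    "Person": {"name": "String", "birthDate": "String", "birthPlace": "City"},
    "City": {"label": "String"},
}


def undeclared_fields(selection: dict, type_name: str = "Query", path: str = "") -> list[str]:
    """List every selected field path that the schema does not declare."""
    problems = []
    for field, children in selection.items():
        field_path = f"{path}/{field}"
        target = SCHEMA.get(type_name, {}).get(field)
        if target is None:
            problems.append(field_path)
        elif children:
            # Descend into the sub-selection using the field's target type.
            problems.extend(undeclared_fields(children, target, field_path))
    return problems


allowed = {"people": {"name": {}, "birthPlace": {"label": {}}}}
too_curious = {"people": {"name": {}, "spouse": {"name": {}}}}
print(undeclared_fields(allowed))
print(undeclared_fields(too_curious))
```

A request for `spouse` never reaches the store, because the schema simply has no such field on `Person`.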

In order to link this schema to DBpedia’s vocabulary, we need to fix some kind of mapping to the URIs of RDF resources, say something like:

{
  "Person": "http://dbpedia.org/ontology/Person",
  "City": "http://dbpedia.org/ontology/City",
  "name": "http://xmlns.com/foaf/0.1/name",
  "birthDate": "http://dbpedia.org/ontology/birthDate",
  "birthPlace": "http://dbpedia.org/ontology/birthPlace",
  "label": "http://www.w3.org/2000/01/rdf-schema#label"
}

With these two components in place, we can now use the GraphQL query language as the medium for making suitable data requests. For example, the following GraphQL query fetches two instances of Person from DBpedia with their names and birth dates, together with the URIs of the DBpedia resources (_id) and the basic types (_type) via which they were matched:

{
  people(limit:2) {
    _id
    _type
    name (lang:"en")
    birthDate
  }
}

Under the hood, GraphQL queries of this form still have to be rewritten to SPARQL and executed over a SPARQL endpoint. However, the expressive freedom and the potential complexity of the supported requests can now be greatly restricted and controlled by the GraphQL service, which is the only client authorised to access the SPARQL endpoint of our instance. Technically speaking, the allowed queries are all tree-shaped, with their breadth and depth determined by the structure of the schema. Under these restrictions, they correspond to just a small and, arguably, lightweight fragment of the underlying SPARQL language, which should ultimately take a lot of stress off our DBpedia instance.
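To make the rewriting step concrete, here is one possible sketch of it. This is not HyperGraphQL's actual translation. It assumes the query has already been parsed into a nested dictionary of fields, reuses the URI mapping from earlier, and leaves out `_id`, `_type`, and the `lang` filter for brevity. The names `ROOT_TYPES`, `field_patterns`, and `to_sparql` are made up for this example.

```python
MAPPING = {
    "Person": "http://dbpedia.org/ontology/Person",
    "City": "http://dbpedia.org/ontology/City",
    "name": "http://xmlns.com/foaf/0.1/name",
    "birthDate": "http://dbpedia.org/ontology/birthDate",
    "birthPlace": "http://dbpedia.org/ontology/birthPlace",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
}
# Hypothetical helper table: root query field -> type of the returned instances.
ROOT_TYPES = {"people": "Person"}


def field_patterns(parent_var: str, selection: dict, lines: list[str]) -> None:
    """Add one OPTIONAL block per selected field, nesting for sub-selections."""
    for field, children in selection.items():
        var = f"{parent_var}_{field}"
        lines.append(f"OPTIONAL {{ {parent_var} <{MAPPING[field]}> {var} .")
        field_patterns(var, children, lines)
        lines.append("}")


def to_sparql(root_field: str, selection: dict, limit: int) -> str:
    """Rewrite a tree-shaped selection into a single SPARQL SELECT query."""
    type_uri = MAPPING[ROOT_TYPES[root_field]]
    lines = [
        "SELECT * WHERE {",
        # The subquery applies the limit to instances rather than to result rows.
        f"{{ SELECT ?x WHERE {{ ?x a <{type_uri}> }} LIMIT {limit} }}",
    ]
    field_patterns("?x", selection, lines)
    lines.append("}")
    return "\n".join(lines)


print(to_sparql("people", {"name": {}, "birthDate": {}}, limit=2))
```

Every field becomes one triple pattern hanging off its parent variable, so the generated query has exactly the same tree shape as the GraphQL request, and the schema bounds how large that tree can get.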

The response to the query above should naturally be a JSON-LD object consisting of two parts:

{
  "people": [
    {
      "_id": "http://dbpedia.org/resource/Danilo_Tognon",
      "_type": "http://dbpedia.org/ontology/Person",
      "name": "Danilo Tognon",
      "birthDate": "1937-10-9"
    },
    {
      "_id": "http://dbpedia.org/resource/Andreas_Ekberg",
      "_type": "http://dbpedia.org/ontology/Person",
      "name": "Andreas Ekberg",
      "birthDate": "1985-1-1"
    }
  ],
  "@context": {
    "people": "http://hypergraphql/query/people",
    "_id": "@id",
    "_type": "@type",
    "name": "http://xmlns.com/foaf/0.1/name",
    "birthDate": "http://dbpedia.org/ontology/birthDate"
  }
}

The first part is a valid GraphQL response of an expected shape and content. While for some clients this might already be a fully satisfactory answer to their request, some others might require a bit extra — namely, the full semantic context of the retrieved linked data. Such context is the pivotal element of JSON-LD objects, which facilitate efficient, semantics-preserving communication of hypermedia content, such as linked data.
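The context itself can be derived mechanically from the fields a client asked for. The sketch below shows one way to do that, assuming the URI mapping from earlier and the `http://hypergraphql/query/` namespace seen in the sample response. The function names are hypothetical and the real service may assemble its context differently.

```python
MAPPING = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "birthDate": "http://dbpedia.org/ontology/birthDate",
    "birthPlace": "http://dbpedia.org/ontology/birthPlace",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
}
# The two built-in fields map onto JSON-LD keywords rather than URIs.
BUILT_INS = {"_id": "@id", "_type": "@type"}
# Namespace for root query fields, as it appears in the sample response.
QUERY_NAMESPACE = "http://hypergraphql/query/"


def build_context(root_field: str, fields: list[str]) -> dict:
    """Map every key used in the response to a JSON-LD keyword or a URI."""
    context = {root_field: QUERY_NAMESPACE + root_field}
    for field in fields:
        context[field] = BUILT_INS.get(field) or MAPPING[field]
    return context


def attach_context(data: dict, root_field: str, fields: list[str]) -> dict:
    """Turn a plain GraphQL-style result into a JSON-LD object."""
    return {**data, "@context": build_context(root_field, fields)}


context = build_context("people", ["_id", "_type", "name", "birthDate"])
print(context)
```

A client that only cares about the data can ignore the extra `@context` key, while a linked data client gets everything it needs to interpret the rest.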

To explore more closely how the context affects the interpretation of a JSON document, you can paste this object into the JSON-LD playground and see how the meaning of the data is reflected and preserved when converting the content to different formats. It’s because of this semantic transparency that we are just one transformation step away from recovering this small subset of DBpedia, encapsulated in our HyperGraphQL response:

@prefix hgql: <http://hypergraphql/query/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
[] hgql:people <http://dbpedia.org/resource/Danilo_Tognon>,
    <http://dbpedia.org/resource/Andreas_Ekberg> .
<http://dbpedia.org/resource/Danilo_Tognon>
    a dbo:Person ;
    dbo:birthDate "1937-10-9" ;
    foaf:name "Danilo Tognon" .
<http://dbpedia.org/resource/Andreas_Ekberg>
    a dbo:Person ;
    dbo:birthDate "1985-1-1" ;
    foaf:name "Andreas Ekberg" .
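For the curious, the transformation step can be imitated in a few lines. The sketch below is not a JSON-LD processor: it handles only the flat response shape used in this post, does not distinguish IRIs from literals, and uses a made-up `_:root` label for the blank node. In practice you would hand the object to a proper JSON-LD library.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

response = {
    "people": [
        {
            "_id": "http://dbpedia.org/resource/Danilo_Tognon",
            "_type": "http://dbpedia.org/ontology/Person",
            "name": "Danilo Tognon",
            "birthDate": "1937-10-9",
        },
    ],
    "@context": {
        "people": "http://hypergraphql/query/people",
        "_id": "@id",
        "_type": "@type",
        "name": "http://xmlns.com/foaf/0.1/name",
        "birthDate": "http://dbpedia.org/ontology/birthDate",
    },
}


def to_triples(doc: dict) -> list[tuple[str, str, str]]:
    """Flatten a response of the shape above into (subject, predicate, object)."""
    context = doc["@context"]
    # Find which keys the context aliases to the @id and @type keywords.
    id_key = next(key for key, value in context.items() if value == "@id")
    type_key = next(key for key, value in context.items() if value == "@type")
    triples = []
    for root_field, items in doc.items():
        if root_field == "@context":
            continue
        for item in items:
            subject = item[id_key]
            triples.append(("_:root", context[root_field], subject))
            for key, value in item.items():
                if key == id_key:
                    continue
                predicate = RDF_TYPE if key == type_key else context[key]
                triples.append((subject, predicate, value))
    return triples


for triple in to_triples(response):
    print(triple)
```

The point is that nothing outside the response is needed: the `@context` alone tells us which URI each JSON key stands for.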

So what is HyperGraphQL? It is a GraphQL query interface over RDF stores driven by the exact principles and offering the functionalities discussed above. In being so, it seamlessly ties together some of the most interesting web technologies today: linked data, GraphQL and JSON-LD.

Summary

HyperGraphQL is a GraphQL-based interface for querying RDF stores, serving two key objectives:

  • hiding the complexities of the Semantic Web stack behind a simpler GraphQL interface that is more familiar to many clients;
  • providing a flexible mechanism for restricting access to RDF triplestores down to naturally definable subsets of queries, which can be efficiently handled, thus minimising the impact on the stores’ availability.

A HyperGraphQL response is a JSON-LD object conveying the full semantic context of the retrieved data. This makes it a natural Linked Data Fragment interface and a query layer for hypermedia-enabled web APIs powered by RDF stores.

Demo

Try a live demo over DBpedia at: http://hypergraphql.org/graphiql.

Code

Check out the GitHub repository at: https://github.com/semantic-integration/hypergraphql.

Acknowledgments: the HyperGraphQL prototype is currently under development at Semantic Integration Ltd. Thanks are due to my colleagues and contributors to this project: Philip Coates, Charles Ivie and Richard Loveday.