How to cite Linked Open Data

How should people cite records in our database? This question was put to me by Matteo Valleriani in anticipation of the publication of the Sphaera database, and I forwarded it to our publication manager, Lindy Divarci. Surely there must be a guideline somewhere.

Lindy pointed me to this entry in the Chicago Manual of Style. Following this guideline, people could cite our database records as, for example:

“De la Sfera del Mondo, di Alisandro Piccolomini, divisa in Libri quattro” in Sphaera Database (http://sphaera.mpiwg-berlin.mpg.de/id/item/d4df40f1-2b83-477b-8f25-ecf14cfc2537; accessed December 14, 2017). http://db.sphaera.mpiwg-berlin.mpg.de.

Problem solved! (although I could not find any suggestion on how to do an in-text citation)

But wait.

What are we actually referencing here?

The database we have is RDF-based: a linked data resource that does not contain database records in the strict sense. An object-based database contains objects that can be addressed. A relational database is made up of tables, in which a row can represent a database record. In our database, however, we merely have sets of triples that link together to form a graph.

The identifier we use above to reference the database record actually only identifies one entity in a triple.

<http://sphaera.mpiwg-berlin.mpg.de/id/item/d4df40f1-2b83-477b-8f25-ecf14cfc2537> is a URI that, in our case, identifies a physical book. What we’d like to reference is the bibliographic data that surrounds this book and which users can see when they navigate the database: information on the authors, place and date of publication, etc.

In many Linked Data sources this is not a real issue, as an endpoint will typically retrieve all triples where the given URI is used as subject or object, and this delineates a database record.
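To make the idea concrete, here is a minimal sketch of that kind of lookup over a toy in-memory list of triples. The `ex:` predicates, the collection and place entities, and the `describe` helper are all made up for illustration; only the book URI and title come from our database, and a real endpoint would do this with a query rather than a list comprehension.

```python
# Toy triple store: each triple is (subject, predicate, object).
# Everything prefixed "ex:" is a hypothetical placeholder.
BOOK = "http://sphaera.mpiwg-berlin.mpg.de/id/item/d4df40f1-2b83-477b-8f25-ecf14cfc2537"

TRIPLES = [
    (BOOK, "ex:hasTitle", "De la Sfera del Mondo"),
    (BOOK, "ex:publishedIn", "ex:place/1"),
    ("ex:collection/1", "ex:contains", BOOK),
    ("ex:place/1", "ex:hasLabel", "a place label"),  # one hop further away
]


def describe(uri, triples):
    """Return every triple in which `uri` appears as subject or object."""
    return [t for t in triples if t[0] == uri or t[2] == uri]


record = describe(BOOK, TRIPLES)
for triple in record:
    print(triple)
```

The label of the place is not part of the result, because that triple does not mention the book URI directly. For flat data models this immediate neighbourhood is often a good enough approximation of a record.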

The URI of a book record resolved in a web browser retrieves the triples and displays a button to access the book record view.

However, in a CIDOC/RDF database such as ours, the entities linking immediately to and from a given URI are usually not enough. Information such as the author of a book is encoded in a branch that is formed by a path that connects multiple entities. Defining boundaries within such a graph that could denote a database record is not a trivial task.
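A small sketch may help to show why the immediate neighbourhood falls short. It assumes a simplified, hypothetical path (book, creation event, person, name) with invented `ex:` predicates; it is not the actual model used in our database, only an illustration of an event-based path in which the author's name sits three hops away from the book.

```python
# Hypothetical event-style path: book -> creation event -> person -> name.
TRIPLES = [
    ("ex:book/1", "ex:hasTitle", "De la Sfera del Mondo"),
    ("ex:book/1", "ex:wasCreatedBy", "ex:creation/1"),
    ("ex:creation/1", "ex:carriedOutBy", "ex:person/1"),
    ("ex:person/1", "ex:hasName", "Alisandro Piccolomini"),
]

# Assumed sequence of predicates leading from a book to its author's name.
AUTHOR_PATH = ["ex:wasCreatedBy", "ex:carriedOutBy", "ex:hasName"]


def follow_path(start, predicates, triples):
    """Walk from `start` along `predicates`, one hop per predicate."""
    nodes = {start}
    for predicate in predicates:
        nodes = {o for (s, p, o) in triples if s in nodes and p == predicate}
    return nodes


# What a one-hop lookup on the book would return as objects.
neighbours = {o for (s, p, o) in TRIPLES if s == "ex:book/1"}
# What a reader of the record actually expects to see.
author_names = follow_path("ex:book/1", AUTHOR_PATH, TRIPLES)

print(neighbours)
print(author_names)
```

The one-hop lookup yields the title and an anonymous creation event, while the name only appears once the full path is followed. A record would have to be defined as a set of such paths, and deciding which paths belong to it is exactly the boundary problem.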

One option would be to not reference individual entities, but Named Graphs. Named Graphs allow sets of triples to be addressed via a unique URI. The problem is that in our case, although we do have named graphs, the data that makes up the bibliographic records does not reside in a single graph, nor can it be represented in this way. An author, for example, may be linked to several books. The information of the author can therefore not be represented in the graph of a book without being duplicated.
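The duplication problem can be sketched with quads, i.e. triples that carry the name of their graph as a fourth element. The graph names, predicates and the `self_contained` helper below are assumptions made for the example, not our actual graph layout.

```python
# Each quad is (subject, predicate, object, graph). All "ex:" names are hypothetical.
QUADS = [
    ("ex:book/1", "ex:hasAuthor", "ex:person/1", "ex:graph/book1"),
    ("ex:book/2", "ex:hasAuthor", "ex:person/1", "ex:graph/book2"),
    ("ex:person/1", "ex:hasName", "Alisandro Piccolomini", "ex:graph/persons"),
]


def triples_in(graph, quads):
    """Return the triples stored in the named graph `graph`."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph]


def self_contained(graph, quads):
    """Add to a graph's triples those describing any entity it points to."""
    own = triples_in(graph, quads)
    referenced = {o for (_, _, o) in own}
    borrowed = [(s, p, o) for (s, p, o, g) in quads
                if s in referenced and g != graph]
    return own + borrowed


book1 = self_contained("ex:graph/book1", QUADS)
book2 = self_contained("ex:graph/book2", QUADS)

# Triples that would have to be stored twice to make both graphs complete.
duplicated = set(book1) & set(book2)
print(duplicated)
```

As stored, neither book graph contains the author's name. Making each graph a complete record means copying the same person triple into both, and every further book by that author adds another copy.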

So what is the solution? As our URIs do not identify records, but merely atomic entities of a record, they do not actually reference what researchers using the database would want to cite. What they probably want to cite are the database records as they appear in the browser when they navigate the dataset.

It turns out we don’t actually have anything that identifies these records. Which is good news! It means we can assign new identifiers and, while we’re at it, pick ones that are a tad more user-friendly and recognisable than our existing URIs.

DOIs are assigned to the book records.
View of a book record with DOI

So tl;dr: we assign DOIs to reference our records. The DOIs resolve to the record view of the book. The URIs of the book entities retrieve the immediate triples; when resolved in a web browser, they present users with a button that leads to the record view.

An added value of using DOIs is that it gives us more control over versions of our database records, another non-trivial issue with RDF databases. While our URIs are persistent, the links to and from them may change with future releases of the database. As a result, the data retrieved later may differ from what researchers saw when they cited the record. By using DOIs, we can assign new identifiers with subsequent releases of the dataset, while existing DOIs can continue to point to earlier versions of the database.
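The versioning idea can be sketched as a small registry in which each DOI is bound to one frozen snapshot of a record, while the entity URI stays the same across releases. The DOI strings, release names, and the `RecordSnapshot` and `DoiRegistry` classes are invented for this illustration and say nothing about how DOIs are actually registered or resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordSnapshot:
    entity_uri: str  # persistent URI of the book entity
    release: str     # dataset release the snapshot was taken from
    fields: tuple    # (label, value) pairs as shown in the record view


class DoiRegistry:
    """Maps each DOI to exactly one immutable record snapshot."""

    def __init__(self):
        self._snapshots = {}

    def mint(self, doi, snapshot):
        # A DOI is never reassigned; a changed record gets a new DOI.
        if doi in self._snapshots:
            raise ValueError(f"{doi} is already assigned")
        self._snapshots[doi] = snapshot

    def resolve(self, doi):
        return self._snapshots[doi]


BOOK = "http://sphaera.mpiwg-berlin.mpg.de/id/item/d4df40f1-2b83-477b-8f25-ecf14cfc2537"

registry = DoiRegistry()
# Made-up DOIs and release names, for illustration only.
registry.mint("10.99999/example.v1", RecordSnapshot(
    BOOK, "release-1", (("title", "De la Sfera del Mondo"),)))
registry.mint("10.99999/example.v2", RecordSnapshot(
    BOOK, "release-2", (("title", "De la Sfera del Mondo"),
                        ("author", "Alisandro Piccolomini"))))

cited = registry.resolve("10.99999/example.v1")
print(cited.release, cited.fields)
```

Someone who cited the first DOI still gets the record as it looked in the first release, even though the same book URI now also has a richer record under the second DOI.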