Interim Report: Linked Texts

Published in Pelagios, Jun 6, 2019

Matteo Romanello, 3 October 2018

Thanks to the generous funding received from Pelagios, we were able to organise a Working Group workshop held at Duke University on June 20–21, 2018. The workshop brought together 10 attendees in person, plus 4 remote participants who followed and joined our conversation from afar. It is worth noting that many of the participants are also involved in work on the Distributed Text Services (DTS) specification, presented in more detail below, which is very relevant for our WG.

What were the goals?

Agree upon standards for identifiers and identifier resolution (e.g. APIs).
Agree upon standards (e.g. ontologies) to express metadata and links between texts.
Establish a common data source of identified text repositories/providers.

Given these overall objectives, the agenda was intentionally kept fairly open, with some short input presentations during the first half day, leaving enough time for discussion, as well as for some impromptu presentations, over the two remaining half days.

Main points of discussion

We started the meeting with presentations on the (then) current architecture and state of the Distributed Text Services (DTS) specification, on a prototype Handle Service developed for Canonical Text Services (CTS), and on the current state of Linked Data infrastructure for representing the relationships between texts (especially ancient texts) online.

CTS is built around the idea of unique identifiers, which use a URN syntax for canonical works. The majority of CTS identifiers in the wild are based upon the Thesaurus Linguae Graecae’s identifier system for Greek texts and the Packard Humanities Institute’s for Latin texts. CTS IDs are structured hierarchically: first Text Groups, e.g. the works of Vergil (phi0690); then Works, e.g. the Aeneid (phi003); and then particular editions, e.g. the Perseus version of the Aeneid, taken from Greenough’s edition of the works of Vergil (perseus-lat1). A CTS URN for this edition would be urn:cts:latinLit:phi0690.phi003.perseus-lat1.
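To make the structure of such an identifier concrete, here is a minimal sketch in Python that splits a CTS URN into the levels described above. The class and function names are ours and purely illustrative; the sketch only handles the text group, work and edition levels, plus an optional trailing passage component, which is an assumption that goes beyond what is described in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str                 # e.g. "latinLit"
    text_group: str                # e.g. "phi0690" (Vergil)
    work: Optional[str] = None     # e.g. "phi003" (the Aeneid)
    version: Optional[str] = None  # e.g. "perseus-lat1" (a particular edition)
    passage: Optional[str] = None  # assumed optional final component, e.g. "1.1"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its components, raising ValueError if malformed."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, hierarchy = parts[2], parts[3].split(".")
    if not namespace or not 1 <= len(hierarchy) <= 3 or not all(hierarchy):
        raise ValueError(f"unsupported work component in {urn!r}")
    # Pad the hierarchy so that missing levels become None.
    text_group, work, version = hierarchy + [None] * (3 - len(hierarchy))
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, text_group, work, version, passage)


aeneid = parse_cts_urn("urn:cts:latinLit:phi0690.phi003.perseus-lat1")
print(aeneid.text_group, aeneid.work, aeneid.version)
# prints: phi0690 phi003 perseus-lat1
```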

This model works quite well for things like the Aeneid, but is not so effective for other types of texts, such as papyri or inscriptions, which tend to be organized thematically or geographically and around the publication of print editions. It also tends to break down around the edges of literary works, for example when authorship attribution or scholarly opinion about the structure of works changes. Works with more than one citation scheme are also a problem.

What to do with the URNs is another question. In an online environment, a resolver system is required to retrieve a text, given a CTS ID. While such a system is relatively straightforward to construct for a given collection of texts, it would be nice to have something into which one could plug any CTS ID and find out what editions are available. A Handle Service would permit the registration of CTS or other IDs and redirect queries using them to the appropriate locations.
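As a rough illustration of what such a resolver would do, the sketch below keeps a small in-memory registry of edition-level CTS URNs and their locations. It is not the prototype Handle Service nor any existing system: the `Resolver` class, its method names and the example.org address are all hypothetical, and a real service would need persistence, redirects and governance that are left out here.

```python
from typing import Dict, List


class Resolver:
    """Toy in-memory registry mapping edition-level CTS URNs to locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, edition_urn: str, url: str) -> None:
        """Record where a given edition can be retrieved."""
        self._locations[edition_urn] = url

    def resolve(self, edition_urn: str) -> str:
        """Return the registered location, or raise KeyError if unknown."""
        return self._locations[edition_urn]

    def editions_of(self, urn: str) -> List[str]:
        """List registered editions at or below a text group or work URN."""
        prefix = urn + "."  # the dot avoids matching e.g. phi0031 for phi003
        return sorted(u for u in self._locations if u == urn or u.startswith(prefix))


resolver = Resolver()
# Hypothetical location: example.org is a placeholder, not a real text service.
resolver.register(
    "urn:cts:latinLit:phi0690.phi003.perseus-lat1",
    "https://example.org/texts/phi0690.phi003.perseus-lat1",
)
print(resolver.editions_of("urn:cts:latinLit:phi0690.phi003"))
# prints: ['urn:cts:latinLit:phi0690.phi003.perseus-lat1']
```

Plugging in the work-level ID for the Aeneid returns every registered edition of it, which is the "find out what editions are available" behaviour described above.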

One big advantage of CTS is that it comes with some built-in semantic categories and provides “nouns” for authors (or other convenient / traditional groupings of texts), abstract works, of which editions may be an expression, and the editions themselves. What it doesn’t do is provide verbs (or properties) for those nouns. URNs will plug quite happily into an RDF-style representation of this information, however. This prompts questions of what kinds of relationship one would want to encode, and how to encode them. Various ontologies exist that do parts of what we’d need, but nothing comprehensive exists that would allow us to indicate, for example, that edition B is a translation of work A.
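The "nouns and verbs" idea can be sketched with plain subject-predicate-object triples. In the example below the URNs act as nouns, while the `ex:` predicates are invented for illustration and do not come from any existing ontology; the English translation URN is likewise hypothetical. A real implementation would more likely use an RDF library and an agreed vocabulary.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

WORK_A = "urn:cts:latinLit:phi0690.phi003"  # the Aeneid as an abstract work
EDITION = "urn:cts:latinLit:phi0690.phi003.perseus-lat1"
# Hypothetical identifier for an English translation ("edition B").
EDITION_B = "urn:cts:latinLit:phi0690.phi003.example-eng1"

# The "ex:" predicates are placeholders, not terms from a published ontology.
TRIPLES: Set[Triple] = {
    (EDITION, "ex:isExpressionOf", WORK_A),
    (EDITION_B, "ex:isTranslationOf", WORK_A),
}


def subjects(triples: Set[Triple], predicate: str, obj: str) -> List[str]:
    """Return every subject linked to `obj` by `predicate`, sorted."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


print(subjects(TRIPLES, "ex:isTranslationOf", WORK_A))
# prints: ['urn:cts:latinLit:phi0690.phi003.example-eng1']
```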

The DTS specification attempts to accommodate CTS-style identifiers and to define a protocol for working with them. It does not require FRBR-style collections, where items are organized by Work and Expression, but since DTS collections can be organized however the hosting system wishes, this type of organization is certainly possible. The group determined that a Handle or Handle-like system would be a desideratum but that the top priority must be the completion of the draft DTS specification.

We discussed metadata and ontologies at some length. A DTS response has fields for both basic Dublin Core metadata and extended metadata following any standard, but DTS has not at this point made any recommendations for what might be put in extended metadata. A particular concern for the working group was how best to represent the “citation model” of a text, i.e. the structural composition and how units within that structure are to be cited. This is often simple, e.g. poem -> line, or book -> chapter -> verse, but may be varied and complex in some cases. One outcome of the discussion has been the development of a proposal for encoding information about citation structure in the TEI Header.
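For the simple cases, a citation model can be thought of as an ordered list of level names against which a reference is read. The sketch below illustrates this in plain Python; it is neither the TEI Header proposal nor the format of a DTS response, and the function name and the dot separator are assumptions made for the example.

```python
from typing import Dict, List


def label_reference(levels: List[str], reference: str, sep: str = ".") -> Dict[str, str]:
    """Pair each part of a reference with its level in the citation model.

    Partial references are allowed: "1" under book -> chapter -> verse
    refers to the whole of book 1.
    """
    parts = reference.split(sep)
    if len(parts) > len(levels) or not all(parts):
        raise ValueError(f"{reference!r} does not fit citation model {levels}")
    return dict(zip(levels, parts))


print(label_reference(["book", "chapter", "verse"], "1.2.3"))
# prints: {'book': '1', 'chapter': '2', 'verse': '3'}
print(label_reference(["poem", "line"], "64.1"))
# prints: {'poem': '64', 'line': '1'}
```

The varied and complex cases mentioned above, such as works with more than one citation scheme, would not fit a single flat list like this one.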

The discussion during the Working Group meeting was wide-ranging, from thinking about how to develop a specification for a resolver system, to annotation and bibliographic ontology, to sustaining work on DTS for the longer term. Thanks to an impromptu presentation by Thibault Clérice, we also had the chance to learn more about how the CTS and DTS APIs are supported in the Capitains suite of tools, and especially the MyCapytain library. The upshot was that we agreed our first priority had to be getting the DTS specification to a first Public Working Draft stage. This was completed in early September and we are now in the process of gathering feedback and working on implementations of the protocol.

One of the objectives of our working group has been to cater for texts from as wide a range of disciplines as possible. Although our starting point was the texts of classical antiquity, the DTS implementations already available expose ancient texts written in Greek, Arabic, Ethiopian and Eritrean, as well as modern texts in French, thus already showing that the text model implemented in the API is by no means tied or specific to classical texts.

Taking Stock & Next Steps

Now that the public working draft of the DTS API is out, together with a bunch of working implementations of this API, we are in a much better position than a few months ago to think about how, concretely, the central resolution system we originally envisaged could be implemented.

It also became very clear that the CTS URN Handle proposal by B. Almas overlaps substantially in functionality with the idea of a resolver. In the coming months we will be studying it more closely in order to identify which elements, concepts and workflows from that proposal could actually be reused.

In general, what we all agreed on at the meeting at Duke is the importance of securing future funding to develop further the ideas and activities that this working group helped to shape and/or consolidate. Implementing, hosting and maintaining such a piece of infrastructure will not happen without commensurate time and resources, and we do now have a much better idea of what things could look like in practice.

As a final side note, the activities of this working group turned out to be a very good way of learning about work on Linked Texts that others have been doing and that we were not yet aware of. We learned about INTRO, an Intertextual Relationships Ontology for literary studies, which would require resolvable URIs for text passages in order to work. We discovered Knora, a software framework for storing, sharing, and working with primary sources and data in the humanities, developed by colleagues at the DH Laboratory in Basel. Knora natively speaks RDF and SPARQL, and is potentially very relevant for our working group, as it could expose the texts it holds in a way that is compliant with what we are envisaging, thus increasing the number of texts out there that could be referred to by means of granular and resolvable URIs. And, finally, we were happy to learn that a Workshop on Scholarly Digital Editions, Graph Data-Models and Semantic Web Technologies will be held next summer in Lausanne, organized by Elena Spadini (University of Lausanne) and Francesca Tomasi (University of Bologna). We have no doubt that many interesting discussions relevant to Linked Texts will take place at this workshop!

Written by Pelagios