Final Report: Linked Texts - More Questions than Answers

Matteo Romanello
Published in Pelagios · May 23, 2019

At the conclusion of our work on Linked Texts under the auspices of Pelagios’s mini-grants program, we have achieved a great deal. Most significantly, our meeting at Duke enabled us to make substantial progress toward the first Public Working Draft of the Distributed Text Services (DTS) API specification. The specification lays the groundwork for achieving the goals of the Working Group, which were to:

  1. Have fully resolvable URIs for citable passages of texts
  2. Agree upon standards for identifiers and APIs
  3. Agree upon ontologies to express metadata and links between texts
  4. Establish a common data source of identified text repositories

One of the main points of discussion at the WG meeting was how to handle the resolution of text URIs, and whether it was feasible to develop a Handle System for DTS, modeled on the one used to resolve Digital Object Identifiers (DOIs). A Handle System works by querying a database for the submitted ID and then, if a matching URL is found, redirecting the requester to that URL. The situation for DTS is more complex, partly because of the legacy of Canonical Text Services (CTS) identifiers, which are URNs. The problems begin with the lack of any central management of CTS IDs. Unlike DOIs, which consist of a namespace identifier and an item identifier, CTS URNs are hierarchical, with meaning at each level.
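To illustrate the resolution pattern, a handle-style lookup can be sketched in a few lines of Python. The identifiers, URLs, and in-memory table below are invented stand-ins for the Handle System’s distributed database:

```python
# Minimal sketch of handle-style resolution: query a table for the
# submitted ID and redirect to the matching URL, if any. The table and
# its entries are invented examples, not real DOIs.
HANDLE_DB = {
    "10.5555/example-article": "https://example.org/articles/42",
}

def resolve(identifier):
    """Return the URL to redirect the requester to, or None if unknown."""
    return HANDLE_DB.get(identifier)
```

A real resolver would of course answer with an HTTP redirect rather than return the URL directly, but the core operation is just this lookup.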

Take urn:cts:greekLit:tlg0012.tlg001, for example, which identifies the Iliad of Homer. Crucially, it identifies the “work” rather than any specific edition of it. Usefully, a CTS URN allows us to denote a passage either within the abstract work (urn:cts:greekLit:tlg0012.tlg001:1.1-1.10, i.e. Iliad book 1, lines 1-10) or within a specific edition (urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1-1.10, i.e. book 1, lines 1-10 of the version of the Iliad hosted by the Perseus Project). This means there is a basis for identifying the common abstract notion of the first line of the Iliad and comparing it across all available editions. But what should a resolver do when given a work ID? One possible answer would be for it to list the available editions (and translations) of that work, but in that case the resolver would already be behaving like a catalog (or a regular DTS Collection endpoint) rather than a resolver. It gets worse when we consider that edition identifiers are uncontrolled, and that there is no reason multiple repositories might not contain copies of urn:cts:greekLit:tlg0012.tlg001.perseus-grc1. If the “resolver” were to function like a handle resolver, it would presumably have to choose one of the options and redirect to it. Worse yet, there can be no guarantee that independently managed copies of a digital edition have not been further edited and therefore diverged. It seems likely that, despite the attractions of a handle resolution service, what is actually needed is a centralized catalog system.
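To make the hierarchy concrete, here is a rough sketch of how such a URN decomposes. It is an illustration of the structure described above, not a conforming CTS parser; in particular it assumes well-formed URNs with at least a text group and a work:

```python
from typing import NamedTuple, Optional

class CtsUrn(NamedTuple):
    namespace: str          # e.g. "greekLit"
    textgroup: str          # e.g. "tlg0012" (Homer)
    work: str               # e.g. "tlg001" (the Iliad)
    version: Optional[str]  # e.g. "perseus-grc1", or None for the abstract work
    passage: Optional[str]  # e.g. "1.1-1.10", or None for the whole text

def parse_cts_urn(urn: str) -> CtsUrn:
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 4:
        raise ValueError("not a CTS URN: %s" % urn)
    namespace = parts[2]
    hierarchy = parts[3].split(".")        # textgroup.work[.version]
    passage = parts[4] if len(parts) > 4 else None
    textgroup, work = hierarchy[0], hierarchy[1]
    version = hierarchy[2] if len(hierarchy) > 2 else None
    return CtsUrn(namespace, textgroup, work, version, passage)
```

Each level carries meaning on its own, which is exactly what makes a flat DOI-style lookup an awkward fit.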

DTS side-steps some of the difficulties around CTS identifiers by separating the concerns of work, edition, and passage identification. Works can be modeled straightforwardly as DTS collections, and passage resolution is handled by the Navigation and Document endpoints, where passage identifiers are passed as parameters rather than embedded in the identifier itself. Ironically, while this provides a useful technical solution to the problems of passage identification, it re-surfaces the potential need for a resolver for CTS URNs that contain passage identifiers. Perhaps the ultimate solution will be a further DTS service capable of decomposing complex identifiers so that their components may be dealt with by the other endpoints.
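The kind of decomposition service imagined here might look something like the following sketch, which splits a passage-bearing CTS URN into a resource identifier and passage parameters. The endpoint path and the `id`/`ref`/`start`/`end` parameter names echo the draft Document endpoint, but should be read as assumptions rather than the final specification:

```python
from urllib.parse import urlencode

def cts_to_dts_request(urn, endpoint="/api/dts/document"):
    """Split a CTS URN into a DTS-style Document request (illustrative only)."""
    parts = urn.split(":")
    resource = ":".join(parts[:4])            # urn:cts:namespace:textgroup.work[.version]
    passage = parts[4] if len(parts) > 4 else None
    params = {"id": resource}
    if passage and "-" in passage:            # a range such as 1.1-1.10
        params["start"], params["end"] = passage.split("-", 1)
    elif passage:                             # a single reference such as 1.1
        params["ref"] = passage
    return endpoint + "?" + urlencode(params)
```

The passage component becomes ordinary query parameters, leaving the remainder of the URN to be looked up as a collection or document identifier.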

Metadata is another area where much work remains to be done. DTS presently mandates no document or collection metadata other than a name and short description. It provides for simple metadata in the form of Dublin Core Terms and reserves a property for defining “extended” metadata in the Collections JSON-LD response. What that extended metadata might look like is currently undefined. One might wish to see some of the following kinds of information around documents:

  • Source Information
    • Author
    • Title
    • Date / Place of Original Creation
    • Repository / Cataloging Info
    • Publication History
    • Comparanda / Related Documents
    • Physical Description
    • Language(s)
    • Provenance
    • Mentioned People / Places / Events
    • Surrogates (Editions, Images, Translations)
  • Edition/Translation Information
    • Editor/Translator
    • Publication Info
    • Language(s)
    • Source(s)
    • Related Editions
    • Revision History

This list is divided into “source” and “edition/translation” data in order to further problematize some of the assumptions surrounding CTS, which was designed to deal with “Canonical” texts, and whose organization, at least for Classical Greek and Latin texts, revolves around the “canons” embodied by the Thesaurus Linguae Graecae (TLG) for Greek and the Packard Humanities Institute (PHI) for Latin. These are not themselves without problems, based as they are on the conventions of 20th-century print publication. Their categorization of texts into Text Groups (largely focused on the author) and Works makes far less sense for the vast array of documents that are not part of any canon, may have no known author, and cannot sensibly be associated with an abstract work, because all the evidence for them derives from a unique source document. Even the top-level categorization of CTS into “greekLit” or “latinLit” might be disrupted by a bilingual Arabic-Greek papyrus.
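As a thought experiment, extended metadata of the kind listed above might be attached to a collection member roughly as follows. Only the `dct:*` Dublin Core terms are an established vocabulary here; the property names under "extensions", and all of the values, are invented for illustration:

```python
import json

# Hypothetical metadata for one edition in a DTS collection. The dct:*
# terms are real Dublin Core properties; everything under "extensions",
# and the values themselves, are invented examples.
member = {
    "@context": {"dct": "http://purl.org/dc/terms/"},
    "@id": "urn:cts:greekLit:tlg0012.tlg001.perseus-grc1",
    "title": "Iliad (Perseus Greek edition)",
    "dublincore": {
        "dct:creator": "Homer",
        "dct:language": "grc",
    },
    "extensions": {
        "editor": "Example Editor",
        "revisionHistory": ["2019-01-01: example revision note"],
    },
}
print(json.dumps(member, indent=2))
```

Sorting out which existing vocabularies should populate a structure like this, and how they interact, is precisely the open question.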

We have really only just begun the project of understanding how best to represent the kinds of document metadata that may be desirable in a DTS system, and while a number of vocabularies accomplish part of the task, sorting out how they should work together will be complex. The organizers will participate in GraphSDE2019: Workshop on Scholarly Digital Editions, Graph Data-Models and Semantic Web Technologies in Lausanne in June, where we will present some of this ongoing work and hope to learn about other tools and vocabularies for modeling texts and their interactions. Watch this space!
