Introduction: Linked Texts

Matteo Romanello, 30 May 2018

An image of a page from the manuscript Marcianus Graecus Z. 454 [= 822] (aka the Venetus A), the oldest complete text of the Iliad in existence. Image derived from the original, Biblioteca Nazionale Marciana, Venezia, Italia, CC BY-NC-SA 3.0

We, Matteo Romanello and Hugh Cayless, are happy to introduce our Pelagios-sponsored working group Linked Texts. This WG aims to increase the presence and coverage of text-related data within the Graph of Ancient World Data. We aim to consolidate ways of minting resolvable URIs for texts and text passages by harvesting existing text repositories via standard APIs. We also aim to agree upon ontologies for representing metadata and links between texts, and to produce a set of guidelines to help others implement the identifiers and ontologies agreed upon by the WG.

Motivation

Over the last 10–15 years, a number of collections of resolvable URIs have emerged and begun to form the Graph of Ancient World Data (GAWD). There are projects that produce resolvable URIs for places (Pleiades), chronological concepts (PeriodO and Chronontology), and people (SNAP:DRGN), among other things.

What does it mean for a URI to be resolvable, and why does that matter? A typical way of resolving a URI is to look it up in a browser: resolving here means retrieving the resource identified by the URI. One of the pillars of the Semantic Web vision is that URIs are used to identify the “things” about which we make assertions. Upon resolution, such URIs should return an RDF description of the resource they refer to.

Let us take as an example the URI for the city of Athens in the Pleiades Gazetteer, <https://pleiades.stoa.org/places/579885>. Looking it up in the browser causes an HTML page to be returned by the server and displayed. But Pleiades provides representations of the same resource in other formats (what are called serializations in jargon), including one in RDF/XML. To ask explicitly for that format, we pass an Accept header as part of our request to the Pleiades server. Using the command-line tool curl, one would specify such a request as follows:

curl -H "Accept: application/rdf+xml" https://pleiades.stoa.org/places/579885

Try it, and you will see that the very same URI also resolves to RDF/XML.

Texts are problematic in their own special ways. They are both concepts (the idea of the Iliad, e.g.) and physical documents (the Venetus A manuscript of the Iliad). They may exist in a variety of physical and digital editions and translations. They may have one or more authors, and the nature and even existence of these (e.g. Homer) may be subject to debate. Further, texts (like physical places) have internal structures which we may want to point to (e.g. line 10 of the second book of the Iliad). And texts themselves refer to other texts, contain internal cross-references, and reuse one another in various ways.

Capturing and representing this kind of information on the web is problematic. Not so much because linking is hard — it is one of the basic functions of the world wide web after all — but because of a lack of agreed-upon standards for referencing even canonical texts with well-understood structures.

What is currently missing — and what we will work towards in this working group — is fully resolvable URIs for citable passages of texts, linked to relevant resources like Perseus (catalog and library), the Classical World Knowledge Base (CWKB) and Papyri.info. Initial work will focus on ancient Greek and Roman texts because of the foundations already laid in that area, but the Working Group will aim to expand beyond this focus.

Background

The good news is that we are not starting from scratch. The Canonical Text Services (CTS) initiative provides us with a framework for identifying both “ideal” texts and editions and translations of those texts.
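To make this concrete, here is a minimal sketch of how a CTS URN breaks down into its components. The URNs shown are the conventional Perseus identifiers for the Iliad; the small parsing helper is purely illustrative and not part of any existing CTS library.

# Sketch: splitting a CTS URN of the form
# urn:cts:<namespace>:<textgroup>.<work>[.<version>][:<passage>]
# into its parts. Illustrative only; not taken from an existing CTS library.
def parse_cts_urn(urn: str) -> dict:
    parts = urn.split(":")
    work_parts = parts[3].split(".")
    return {
        "namespace": parts[2],                                      # e.g. greekLit
        "textgroup": work_parts[0],                                 # tlg0012 (Homer)
        "work": work_parts[1] if len(work_parts) > 1 else None,     # tlg001 (the Iliad)
        "version": work_parts[2] if len(work_parts) > 2 else None,  # perseus-grc2 (a specific edition)
        "passage": parts[4] if len(parts) > 4 else None,            # 2.10 (book 2, line 10)
    }

# The notional work (the "idea" of the Iliad):
print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001"))
# A specific edition, pointing at book 2, line 10:
print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:2.10"))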

For the last two years, a group has been working on a “Distributed Text Services” (DTS) specification, which defines a protocol not only for browsing CTS collections and resolving CTS references, but also for dealing with types of text not catered to by CTS (e.g. documentary papyri, which do not have a canonical citation scheme). Work has also progressed on defining ways to represent citation and text reuse.

Having such APIs and, more importantly, having text repositories that implement them means that we can programmatically process the contents of those repositories and extract information about the texts they contain. For example, one can use the CTS API of Perseus to extract all the canonical text divisions of, say, the Iliad. We would obtain the list of books in the Iliad and, for each book, the list of lines into which it is divided. As each of these elements is also identified by a CTS URN, we could then mint a URI for each text element in the Iliad, linked to the corresponding text in Perseus and resolvable to an RDF description. Deciding which ontology will be used to represent this textual information is precisely one of the aims of our working group.
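As an illustration, here is a rough sketch of that harvesting step in Python. The endpoint URL (a Perseus/Scaife CTS service), the level parameter and the way the XML reply is unpacked are assumptions to be checked against the repository's documentation, and the base URI we mint under is a placeholder.

# Sketch: harvest the citation hierarchy of the Iliad from a CTS endpoint and
# mint local URIs for every citable line. Endpoint and base URI are assumptions.
import requests
import xml.etree.ElementTree as ET

CTS_ENDPOINT = "https://scaife-cts.perseus.org/api/cts"    # assumed endpoint
ILIAD_EDITION = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2"
BASE_URI = "https://example.org/texts/"                     # hypothetical URI namespace

# GetValidReff returns every citable passage reference of the edition,
# down to the requested level of the citation hierarchy (here: book.line).
resp = requests.get(CTS_ENDPOINT, params={
    "request": "GetValidReff",
    "urn": ILIAD_EDITION,
    "level": "2",
})
resp.raise_for_status()

# The reply is XML; each <urn> element carries a full passage URN.
tree = ET.fromstring(resp.content)
passage_urns = [el.text for el in tree.iter() if el.tag.endswith("urn") and el.text]

# Mint a resolvable URI for each passage, keeping the CTS URN as the local part.
minted = {urn: BASE_URI + urn for urn in passage_urns}
print(len(minted), "passages harvested, e.g.", next(iter(minted.items())))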

Goals and Outcomes

Our main goals in convening this working group are:

  • Agree upon standards for identifiers and identifier resolution (e.g. APIs).
    • Which standards and protocols will be supported (CTS, DTS, etc.)?
  • Agree upon standards (e.g. ontologies) to express metadata and links between texts.
    • The LAWD ontology already models some aspects of ancient texts, while aligning itself with other vocabularies like DC, OAC and CIDOC-CRM.
    • The HuCit ontology and knowledge base aim to provide a registry of resolvable URIs specifically for canonical texts, and propose an ontological model for text structures (be they canonical or not).
  • Establish a common data source of identified text repositories/providers.
    • This could be a plain-text document, to which people can contribute on GitHub by issuing pull requests, listing text repositories that support one or more of the agreed-upon APIs. We could then harvest these repositories programmatically and mint resolvable URIs for the texts and text passages they contain (see the sketch after this list).

  • Produce a set of guidelines for various types of stakeholders, to help them implement the identifiers and ontologies agreed upon by the WG.
  • Develop estimates of the ongoing maintenance and curation costs for a centralised service to mint and resolve URIs for texts.
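To give an idea of the registry mentioned in the goals above, the snippet below shows one possible shape for such a file and how a harvester might iterate over it. The file format, field names and endpoints are purely illustrative and not a decision of the WG.

# Sketch: one possible shape for a shared registry of text repositories.
# Format, field names and endpoints are illustrative placeholders.
import csv, io

registry_txt = """\
name,api,endpoint
Perseus Digital Library,CTS,https://scaife-cts.perseus.org/api/cts
Papyri.info,DTS,https://papyri.info/dts
"""

# A harvester could fetch this file from GitHub and query each endpoint in turn.
for row in csv.DictReader(io.StringIO(registry_txt)):
    print(f"{row['name']}: harvest via {row['api']} at {row['endpoint']}")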

CTS identifies texts using a URN syntax, meaning a service is required to resolve a given CTS URN. The Linked Texts Working Group will be looking at the feasibility of a centralised service which would act as a resolver both for CTS identifiers and for other types of text identifiers. We hope the DTS APIs may provide the basis for such a service, and that the work we have done on modelling and parsing citations may serve as the basis for recognising and generating actionable citations in existing texts.
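To give a feel for what such a resolver might involve, here is a very small sketch of one built with Flask. The URL scheme, the routing table and the redirect target are assumptions; a real service would be driven by the shared registry of repositories and would need to speak DTS as well as CTS.

# Sketch: skeleton of a centralised URN resolver. Routing table and redirect
# targets are illustrative assumptions, not an agreed design.
from flask import Flask, abort, redirect

app = Flask(__name__)

# Map URN namespaces to endpoints able to serve them (illustrative values).
NAMESPACE_ENDPOINTS = {
    "greekLit": "https://scaife-cts.perseus.org/api/cts",
    "latinLit": "https://scaife-cts.perseus.org/api/cts",
}

@app.route("/resolve/<path:urn>")
def resolve(urn):
    namespace = urn.split(":")[2] if urn.count(":") >= 3 else None
    endpoint = NAMESPACE_ENDPOINTS.get(namespace)
    if endpoint is None:
        abort(404)
    # See-other redirect to a GetPassage request on the repository holding the text.
    return redirect(f"{endpoint}?request=GetPassage&urn={urn}", code=303)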

A central registry providing resolvable URIs for texts and text passages could be leveraged in a wide range of use cases, for example:

  • 3rd party services like Recogito or CitationExtractor could use this data to support the annotation/extraction process.
  • Online publishers (e.g. CHS online publications, BMCR) could use the data to create links between their digital publications and existing digital libraries/text services.
  • Digital libraries may want to “register” the texts they publish (and their identifiers) so that 3rd parties can e.g. link to them.

Finally, we would like to invite anyone interested to join our activities. We will use a Pelagios Commons forum for discussion, as well as this GitHub repository to share any documents or code we will produce. On June 21–22 we will have a face-to-face meeting at Duke University (do drop us a line asap if you would like to participate!).
