Science in the Making: data model and API

Tom Crane
Published in digirati-ch
10 min read · Feb 9, 2019

This article was first published in July 2017 and updated November 8, 2017

After some productive UX work, we need to start making things happen in software. This means we need to think about how that work informs a data model. We must address the requirements that emerged from the UX workshops as well as The Royal Society’s stated aims for the site.

The model doesn’t have to be 100% right from the start. We need to see how people use the site, and how developers use the API (that means us too). That’s the point of the pilot. But we need to start with a model that works for exploration by end users, for web editorial activity, and for the addition of new knowledge by both the public and The Royal Society. We also need to keep in mind how the model accommodates and connects to other library and archive content, from The Royal Society and elsewhere. The pilot content is from the Royal Society’s Journal Collection archive, but an important aim is to incorporate objects from other collections internal and external to The Royal Society.

Requirements from workshops

The work so far suggests that the model and API need to support user experiences that include:

Navigation of content via aggregating pages

  • Things with the same topic (keyword, subject heading); topics can come from user-generated tags as well as existing descriptive metadata
  • Things in the same date range, or connected with the same place
  • People (and other agents) associated with an object; and from those people, more objects connected with them

Navigation of content through extensive linking

  • Huxley is the referee of this manuscript, Rainey is its author, it became the published journal article at this DOI…
  • Links allow users to follow threads through webs of relationships between content and agents

Visualisations

  • Timelines
  • Maps
  • Graphs of relationships between objects and people (or more generally, agents) — the Web of Discourse

Contribution of new content by end users

  • Transcriptions of individual views (a view is distinct from a whole item: for example, a single page of a letter)
  • Tags (to start with, using Library of Congress Subject Headings as a controlled vocabulary)
  • Comments, narrative

Prototype for the Web of Discourse: the model supports the projection of results into a graph, here showing correspondence for items tagged with the topic “Colour”
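
One way to read "projection of results into a graph" is as a small transformation from tagged correspondence records to weighted edges. The sketch below is only an illustration under assumed names: the record shape, the `project_correspondence` function and the placeholder people are all hypothetical, not the prototype's actual code.

```python
from collections import Counter

# Hypothetical correspondence records; names and topics are placeholders.
letters = [
    {"author": "Person A", "recipient": "Person B", "topics": ["Colour", "Optics"]},
    {"author": "Person A", "recipient": "Person B", "topics": ["Colour"]},
    {"author": "Person C", "recipient": "Person A", "topics": ["Colour"]},
    {"author": "Person B", "recipient": "Person C", "topics": ["Astronomy"]},
]


def project_correspondence(letters, topic):
    """Project letters tagged with `topic` into weighted, directed edges."""
    edges = Counter()
    for letter in letters:
        if topic in letter["topics"]:
            edges[(letter["author"], letter["recipient"])] += 1
    return dict(edges)


# Maps (author, recipient) pairs to the number of matching letters.
graph = project_correspondence(letters, "Colour")
```

The edge weights could drive line thickness in a visualisation, but that is a design choice for the front end rather than something the model dictates.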

Who are the users of the model?

Who benefits from the model? Who are its users and where and how do they use it? It helps to think of three broad categories (but there are more and subtler distinctions).

Users unaware of the model — people interacting with the content, on the web. These users might sense the influence of the model in the design, information architecture and user experience of the site, but have no need for a formal understanding or description of it.

Model-aware users — editorial and other content creation or management activities conducted by Royal Society staff happen in the context of the model, but the abstract model itself is meaningless until expressed in software that staff use — which means editorial or curatorial user experience. We’re not just transforming source metadata direct to the web, we have a CMS as well. The source metadata is augmented and enhanced by editorial processes and user contributions in that CMS. Some content has no augmentation, some content has extensive additional content created.

We don’t want the choice of CMS to dictate a conceptual data model, but if that conceptual data model has no usable alignment with a content scheme in a CMS, it won’t work — it may be beautiful, but you have to build content with it; define content types, make it easy for editors to make new instances of content and link them together. The editorial user sees the model expressed as content management processes. The model finds expression in the content types and fields of the CMS. The CMS appearance of the model doesn’t have to be identical to the published description of the model. Content workflow, organisation and other practical considerations alter the expression of the model in a CMS.

Developers who might explore the APIs or even read some documentation about the model and its expression in APIs. Developers means us, as makers of the site for the first type of user. If the model doesn’t work for us, then why should anyone else be expected to use it? Developers also means other people building things we haven’t even thought of.

Activities, Roles and Agents

Our first decision is about our approach to modelling the relationship between the archival items and people.

Archive item RR/5/75 is a referee report by Lord Kelvin, on Dynamical theory of the electromagnetic field by James Clerk Maxwell, which is item PT/72/7. At least two people are involved with this item. Correspondence has authors and recipients. Submitted material can have multiple referees. Photographs have subjects and photographers. People have roles that describe their relationship to archive items.

So far we have the following roles:

Kelvin and Maxwell are agents that have some role in the life-cycle of the object. Rather than choosing some relationships as direct properties of an object, we adopt an event-driven approach where the relationship between objects and agents is indirect; it goes via an Activity, which is where a person and a role come together in relation to an object. If we are lucky, we might have information on when and/or where the activity took place:

This approach supports the kind of narrative, navigation and visualisations suggested by the UX work. It is important that it works with sparse data, however. A rich web of multiple roles involved with an object is great, but at the other end of the scale we have objects with just one person connected via one role, with no time or place information. This still needs to drive a good user experience.
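
A minimal sketch of this indirection, using the Kelvin and Maxwell example above. The class names, field names and helper functions are illustrative assumptions, not the pilot's actual schema, and treating Kelvin as "author" of his own report is my reading of the example. Note that `when` and `where` are optional, so the sparse case still works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    name: str


@dataclass(frozen=True)
class Unit:
    reference: str  # archive reference, e.g. "RR/5/75"
    title: str


@dataclass(frozen=True)
class Activity:
    """Where an agent and a role come together in relation to a unit."""
    agent: Agent
    role: str
    unit: Unit
    when: Optional[str] = None   # frequently unknown
    where: Optional[str] = None  # frequently unknown


kelvin = Agent("Lord Kelvin")
maxwell = Agent("James Clerk Maxwell")
paper = Unit("PT/72/7", "Dynamical theory of the electromagnetic field")
report = Unit("RR/5/75", "Referee report on PT/72/7")

activities = [
    Activity(maxwell, "author", paper),
    Activity(kelvin, "referee", paper),
    Activity(kelvin, "author", report),  # assumed role, see lead-in
]


def units_for(agent, activities):
    """From a person, find the objects connected with them."""
    return sorted({a.unit.reference for a in activities if a.agent == agent})


def agents_for(unit, activities):
    """From an object, find the people involved and their roles."""
    return sorted({(a.agent.name, a.role) for a in activities if a.unit == unit})
```

Because neither `Unit` nor `Agent` holds a direct reference to the other, adding a new role needs no change to either class, only a new `Activity`.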

Editing the model for the web

All the CMS screen shots in this post are from Omeka S, where the archival information is enriched with editorial content. The names of classes are not the same as in our conceptual model: Unit is ArchiveItem, Agent is Person, Activity is PublicationAction. We’re feeling our way from the specific to the general. The content management experience is important too, otherwise there won’t be much for the end users to look at. Editors don’t write triples to a triple store; they edit content in the context of a content model.

Items:

An item:

The expression of activity as managed content:

An activity (PublicationAction), here in edit mode:

The PublicationAction item in Omeka S links to a Role item and an ArchiveItem item (the Activity has a Role and a Unit). We don’t have the time or the place for this activity, so those fields are blank.

Aboutness, Ofness and Onness: how we use IIIF

One important feature of our approach is a strong opinion about the role of the IIIF Presentation API. It’s far more than a standardised way of delivering pictures of the manuscript pages, drawings and photographs. The IIIF Presentation API is our digital surrogate for the object. IIIF makes this representation a two-way street. Other people can tell us things about the object in IIIF-space. IIIF is for presenting the object and all its content (e.g., images and text transcription), and establishes the shared space in which site users (and anyone else on the web) make assertions (e.g., add content of or about the object).

We think there is a clear division between the Royal-Society-specific data model for a given item (the main subject of this post), and the standardised IIIF representation of the item, its digital surrogate. That’s what people are looking at. It’s clear which model is responsible for what. The bespoke data model provides the aboutness, IIIF provides the ofness and onness. I’m using the terms aboutness and ofness in a particular way that needs clarification:

Content about the object tells you who wrote the letter, when they did it, what size it is, that it’s about physics, and optics in particular, and theory of colour. This is descriptive metadata. It’s what most APIs in cultural heritage are about.

Content of the object, in our terms, is the detailed presentation of a digital surrogate, in shared annotation space. It’s for painting pixels on the screen to show manuscript pages, it’s about the page-level or line-level or word-level transcription of manuscript material, it’s about any other content or data that can play a role in presenting the object. In IIIF terms, this means annotations with the painting motivation; that is, annotations that render content in the shared canvas space established by the IIIF representation.

Content on the object is every other type of annotation, which includes commentary, notes, links to articles, blog posts, translations, editorial content and narrative description. As the representation of the physical object in shared digital space, the IIIF manifest is the integration hub for content. The IIIF canvases are the most obvious carriers of content, but annotations can associate any IIIF resource with content — manifests, ranges, canvases. Painting annotations (ofness) only target canvases; other annotations (onness) can target anything in IIIF.
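
The ofness/onness split can be expressed as a check on an annotation's motivation. This is a rough sketch, assuming annotations arrive as plain dictionaries with a `motivation` key; the `classify` function is hypothetical. It accepts both the prefixed form used by Presentation API 2.x (`sc:painting`) and the bare form (`painting`).

```python
def classify(annotation: dict) -> str:
    """Return "of" for painting annotations and "on" for everything else."""
    motivation = annotation.get("motivation", "")
    # A motivation may be a single string or a list of strings.
    motivations = motivation if isinstance(motivation, list) else [motivation]
    # Strip any namespace prefix so "sc:painting" and "painting" both match.
    names = {m.split(":")[-1] for m in motivations}
    return "of" if "painting" in names else "on"


# Simplified, hypothetical annotations (only the motivation is shown).
page_image = {"motivation": "sc:painting"}
comment = {"motivation": "oa:commenting"}
subject_tag = {"motivation": ["tagging", "commenting"]}

kinds = [classify(a) for a in (page_image, comment, subject_tag)]
```

A transcription attached with the painting motivation would land on the "of" side, while the same text attached as commentary would land on the "on" side, which matches the distinction drawn above.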

Descriptive model (Aboutness) and Content model (Ofness and Onness)

There might well be an overlap between the content linked from the descriptive metadata about the unit, and the IIIF resources. Some information is available in both representations, but with a different purpose. The identification of subject is a link to a topic page; this subject is a property of the unit, and a tagging or classifying annotation on the manifest.

All three (about, of and on) are required to build a rich content environment. A traditional metadata API offers little for the representation of the object it describes. It’s information about the object without the material: the library card, not the book on the shelf. With IIIF, we can load our virtual equivalents of Post-It notes, interpretative essays and commentary on the IIIF object, the carrier of that content in shared web space, addressing it and its parts as precisely as we need.

Allocating content to the IIIF representation makes a lot of modelling head-scratching go away. For example, transcription could range from the single word of a ship name in a photograph of the Discovery (NAE/1/15) to the full text of a printed work. This is some of the content of the object that we wish to make available. The ofness model (the IIIF Presentation API) provides us with the means of associating whatever content culture and context deem important for an object with an abstract representation of the space it occupies.

This approach, crucially, puts all the information you have that can possibly be considered of or on into an interoperable domain, where it can work with other content, and other software such as annotation servers and clients.

Relationship to other models

The starting material for Science in the Making is the Society’s Journal Collection, which is archival material conforming to ISAD(G). The archive hierarchy and its finding aids are completely unused in the launch of the pilot, so the model does not address “part of” relationships except at the item-image level (or manifest-canvas in IIIF terms). This is a conscious omission; we’d like to see more real world use before addressing membership relations to other intellectual entities. We don’t need it straight away because in the pilot, navigation is by topic and other aggregations rather than by archival structure. Whether at the API level or at the user interface level, navigation by both mechanisms is interesting and important to explore.

Roles and Activities could be mapped from relators in MARC21, if we need to integrate library material into the model. There may or may not be enough information to give temporal or spatial coverage to an individual activity, even if more than one can be generated from the record. This is fine; the activities do not require temporal or spatial information to participate in networks of discovery.
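
As a rough illustration of such a mapping, the sketch below turns (name, relator code) pairs into activities. The relator codes `aut`, `rcp` and `pht` are real MARC relator codes (Author, Addressee, Photographer); the choice of which role each maps to, the record shape and the function name are assumptions made for this example.

```python
# Assumed mapping from MARC relator codes to roles in our model.
RELATOR_TO_ROLE = {
    "aut": "author",        # MARC: Author
    "rcp": "recipient",     # MARC: Addressee
    "pht": "photographer",  # MARC: Photographer
}


def activities_from_record(unit_ref, contributors):
    """Build one activity per (name, relator_code) pair with a known mapping."""
    activities = []
    for name, code in contributors:
        role = RELATOR_TO_ROLE.get(code)
        if role is None:
            continue  # no mapping defined for this relator yet
        activities.append({
            "unit": unit_ref,
            "agent": name,
            "role": role,
            "when": None,   # a library record often cannot supply these
            "where": None,
        })
    return activities


# Hypothetical record: identifiers and names are placeholders.
generated = activities_from_record(
    "example/1",
    [("Person A", "aut"), ("Person B", "rcp"), ("Person C", "xyz")],
)
```

Several activities come out of one record, each without time or place, and they can still link agents to units.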

The CIDOC-CRM is echoed in the event-driven approach to describing the material but there is no attempt to use any of it. The use cases for exploration and presentation of the resources suggest an event-driven approach in the data model; it falls out naturally from the UX. This means the model gives access to the processes (in time and space) through which the content came to be. In a static representation (a typical bibliographic description), this information may be present but hard to get at. In contrast, museums are more used to describing objects as dynamic processes, with life-cycles and provenance.

The Europeana Data Model allows for an object-centric description and an event-centric description, even for the same object. There are elements of this approach here.

It would be an interesting exercise to describe the history of an archival unit using ActivityStreams, which is a W3C Recommendation from the Social Web Working Group. ActivityStreams has many applications beyond modelling social interactions on Facebook, but the interactions between humans that make up the lifecycle of an intellectual object are social interactions, and the model could work well for this.
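
To give a flavour of that exercise, here is a tentative sketch that expresses one step in a unit's history as an ActivityStreams 2.0 `Create` activity. The `@context`, `Create`, `Person`, `Document` and `startTime` terms come from the ActivityStreams vocabulary; the function name, the choice of `Create` for authorship and the way the unit is named are assumptions for illustration only.

```python
import json


def as_create_activity(actor_name, unit_ref, unit_title, when=None):
    """Describe 'actor created this unit' as an ActivityStreams 2.0 activity."""
    activity = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Create",
        "actor": {"type": "Person", "name": actor_name},
        "object": {"type": "Document", "name": f"{unit_ref}: {unit_title}"},
    }
    if when is not None:
        # Only included when the archive actually records a date.
        activity["startTime"] = when
    return activity


authored = as_create_activity(
    "James Clerk Maxwell",
    "PT/72/7",
    "Dynamical theory of the electromagnetic field",
)
serialised = json.dumps(authored, indent=2)
```

Roles such as referee have no direct equivalent among the core activity types, so a fuller treatment would need either an extension vocabulary or a looser mapping.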

Next in series: How we made Science in the Making

Acknowledgements

The recasting of the terms aboutness and ofness in IIIF terms was inspired by this comment: https://github.com/IIIF/api/issues/1224#issuecomment-324713788. I have added “on-ness”.
