How we made Science in the Making

Tom Crane
Published in digirati-ch · Feb 11, 2019

This is the final part of a series of posts describing our work on the pilot archive project, Science in the Making.

Our second UX workshop gave us sketches for the key pages of the site. We also had a hard drive full of digitised TIFF masters of the items selected for the pilot, a file share containing some spreadsheets, and the Royal Society’s pattern library for design reference.

These spreadsheets were the first data artefacts provided by the Royal Society. They are the manual enrichment of the existing archive records from the archive management system, CALM, with ideas from the emerging data model. CALM records contain some links to authority data for Fellows of the Royal Society, but nothing like the level of detail or the formal structure we would need to generate the user interface sketched in the workshops, as you can see from the current CALM UI at the item level. The spreadsheets start the process of linking people and items together by enriching the data row by row.

You can’t drive the public web site from spreadsheets, however, and as a collaborative records management solution they leave many things to be desired. We made the decision to use Omeka S as the integration point for the enriched data. Each CALM item would become an Omeka item — but so would the people, roles, activities, published journal articles and other concepts suggested by the data in the spreadsheets. This data model in Omeka S is explored much further in Data Model and API.

We also needed to see these items, as viewable digital objects, as soon as we could; we can’t start prototyping until we have something tangible to work with. IIIF makes this part easy.

We wrote a set of utilities called rs-ingest. Their job:

  • Parse the spreadsheets line by line, building up data as we go, not just for the archival item on each line but also for the people, places, journal articles and other entities it mentions. Each new item line adds information to these other entities, so as we scan through we build up many links between them. We then dump the entities, with their cross-references, to disk as blobs of JSON that we can process more easily in the next step.
  • For each archival item, find the digitised master images on disk that belong to it. For some items this means just one image (e.g., a photograph) but for much manuscript material there can be many images per item. Register each image with the DLCS service, which creates a IIIF Image API endpoint for each image.
  • Construct a IIIF Manifest using these data and the IIIF Image API endpoints, and save the manifests to disk for later use (a sketch of this step follows the list).
  • After a few iterations, when the JSON data coming out looks right and the manifests look right, push the data into Omeka S using the scheme described in Data Model and API.
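
The real rs-ingest code isn’t reproduced here, but as a rough sketch of the manifest-construction step, building a minimal IIIF Presentation 2 manifest in Python from a list of DLCS Image API endpoints looks something like this. All URIs, labels and dimensions below are placeholders; real code would read each image’s size from its info.json.

```python
def make_manifest(item_id, label, image_services):
    """Sketch: build a minimal IIIF Presentation 2.1 manifest for one item.

    image_services is a list of IIIF Image API base URIs, one per digitised
    image, e.g. as registered with the DLCS. Everything here is illustrative.
    """
    base = f"https://example.org/iiif/{item_id}"
    canvases = []
    for i, service in enumerate(image_services):
        canvas_id = f"{base}/canvas/c{i}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": f"Image {i + 1}",
            "width": 2000,    # real values come from the Image API info.json
            "height": 3000,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{service}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": service,
                        "profile": "http://iiif.io/api/image/2/level1.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{base}/manifest",
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }
```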

IIIF First

If you are going to use IIIF to present digital objects, then get them into that IIIF form as soon as you can, so you can look at them. Being able to see the objects this early is an underappreciated benefit of designing a site around IIIF resources.

As well as a pile of JSON data on disk and Image API endpoints in the DLCS, rs-ingest also created a very long HTML page (the top of which is shown here), with a set of image counts and links for every item:

The links let you see the manifest directly, or load it into the Universal Viewer (UV) or Mirador:

While we’re not going to use the UV or Mirador in Science in the Making, we can now see the objects we’re building the site for, and all our prototyping can use these real digital objects — why use placeholder graphics when we can test our evolving UI properly, with actual images, at whatever size we need?

Omeka S

Now we have a populated Omeka S. We can enrich the records further, and add new ones.

We could build the site right in Omeka S, with the DLCS providing the IIIF resources. After all, this is what Omeka S is for — it’s a collection management system with the ease of use and general web-friendliness of WordPress or Drupal. Unlike CALM, it’s designed to drive Web UI from archival records.

As in Drupal or WordPress, we can define custom content types and the fields that belong to them (e.g., a Person has surname, date of birth etc.), but unlike other CMSes, Omeka S lets us do this custom type definition using imported RDF vocabularies — the fields of our Omeka classes are terms from familiar ontologies like Dublin Core, FOAF, BIBO and others. What’s more, we can import any other vocabularies we like and use them to define our domain model within Omeka S. All the data is then available as JSON-LD via the Omeka S API. This is incredibly powerful, and while it’s not quite the OWL-driven semantic CMS you might hope it to be, it’s still doing a huge amount for us.
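
To give a flavour of what this means in practice, here is an illustrative sketch (shown as a Python dict, and not a verbatim Omeka S API response) of the kind of JSON-LD an item of a custom Person class might expose, with properties drawn from imported vocabularies such as FOAF and Dublin Core. The URLs and property values are made up for the example.

```python
# Illustrative only: the shape of an Omeka S item as JSON-LD, not a real response.
person_item = {
    "@context": "https://example.org/api-context",
    "@id": "https://example.org/api/items/101",
    "@type": ["o:Item", "foaf:Person"],
    "o:id": 101,
    "o:title": "Isaac Newton",
    "foaf:name": [
        {"type": "literal", "property_label": "name", "@value": "Isaac Newton"}
    ],
    "dcterms:description": [
        {"type": "literal", "property_label": "Description",
         "@value": "Fellow of the Royal Society, elected 1672."}
    ],
}
```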

However, we have a problem. Our use cases, and the sketch designs emerging from those use cases, are all about connections between objects and people and topics. The UI, adopting the principle of generous interfaces, makes use of every possible connection we know of. Developing directly against Omeka S, every page on the site would trigger an avalanche of subqueries against the API to follow all the possible connections and bring in enough information to encourage further exploration.

We are also going to have user-generated content: we want visitors to transcribe manuscripts, tag items, and leave comments. Are these data going to live in Omeka too? In keeping with the conclusions drawn in Data Model and API, they should be Web Annotations, integrated with IIIF space; shared, interoperable, the onness and ofness of the items. We have a W3C Web Annotation Server designed for just these use cases: Elucidate. If we make annotation and tagging tools, they should save their outputs as Web Annotations, to the annotation server.
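
For example, a transcription of one page of a manuscript might be saved to Elucidate as a W3C Web Annotation along these lines. This is a minimal sketch, shown as a Python dict; the canvas URI, selector and text are placeholders, and a tag or comment would differ mainly in its motivation and body.

```python
# A minimal, illustrative Web Annotation for a page transcription.
transcription_annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "transcribing",
    "body": {
        "type": "TextualBody",
        "format": "text/plain",
        "value": "First line of the transcribed page goes here...",
    },
    "target": {
        # The canvas from the item's IIIF manifest, optionally a region of it.
        "source": "https://example.org/iiif/item-0001/canvas/c3",
        "selector": {"type": "FragmentSelector", "value": "xywh=100,200,1800,300"},
    },
}
```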

We’ve also been asked to include a configurable and specialised search capability that can search across the archival descriptions as well as the user-generated content, with lots of facets. While generous interfaces are intended to give the user something to explore rather than force them to ask questions of the system just to see something, the underlying query requirements of faceted search are similar: many of the automatically generated onward links are in effect two-level constrained facets, queried for you by the page template to entice you with further interesting things to look at.
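
As a hedged sketch of what that implies, a faceted query against the index might look something like this (expressed as a Python dict; the field names title, description, transcription, people_names, topics and decade are illustrative, not the pilot’s actual schema):

```python
# Illustrative Elasticsearch request body: full-text search plus facet counts.
search_body = {
    "query": {
        "multi_match": {
            "query": "comets",
            "fields": ["title", "description", "transcription"],
        }
    },
    "aggs": {
        # Each aggregation supplies the counts behind one facet in the UI;
        # the facet fields are assumed to be keyword-mapped.
        "people": {"terms": {"field": "people_names", "size": 20}},
        "topics": {"terms": {"field": "topics", "size": 20}},
        "decades": {"terms": {"field": "decade", "size": 10}},
    },
}
```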

All this points to one conclusion: the front end should not be driven by Omeka S directly, but by Elasticsearch, using a projection of data from these other sources optimised for page generation.

In Omeka, we wouldn’t want to repeat the same biographical snippet and picture for Isaac Newton on every item page, topic page or person page he’s connected with, just to avoid a subquery. Updates would be hell, if not impossible to manage. Elasticsearch is designed for this kind of scenario; we don’t mind repeating data again and again to reduce the number of overall queries required to build a page. What should be normalised in Omeka S can be denormalised in Elasticsearch.
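
So an item document in the index might carry its own copies of the snippets it needs, something like this sketch (field names, identifiers, titles and URLs are all illustrative):

```python
# One denormalised item document: everything an item page needs in one hit.
item_doc = {
    "id": "item-0001",
    "title": "Letter from Isaac Newton, 1672",
    "description": "Manuscript letter read to the Society...",
    "image_count": 4,
    "manifest": "https://example.org/iiif/item-0001/manifest",
    "topics": ["optics", "astronomy"],
    "people": [
        {
            # Repeated wherever Newton appears, to avoid subqueries at page time.
            "id": "person-101",
            "name": "Isaac Newton",
            "role": "author",
            "snippet": "Natural philosopher and mathematician; Fellow from 1672.",
            "thumbnail": "https://example.org/thumbs/person-101.jpg",
        }
    ],
}
```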

If we drive the site off Elasticsearch, we can index all our annotation content too, and integrate it with archival metadata, archival descriptions and editorial text.

Elasticsearch as Sketch

We have one domain model in Omeka S, well suited to managing the enriched archival content. We have the items as digital objects, in IIIF form. We know we want to store anything made by users on the items as annotations. If we bring this all together in Elasticsearch, we can use it as the back end for a simple prototype site, and start building user interface very quickly to see what these items really feel like on the page. Do the sketches from the workshops come to life on the web page the way we expect them to?

Rather than rounds of wireframes and paper, we can use Elasticsearch with a simple web framework (in this case, Flask) as a sketching tool. IIIF revolutionises the approach here, because the thing that was previously most difficult to build a prototype around is handed to us on a plate; we start with the high fidelity digital objects as IIIF, and we sketch our UI around them, with HTML and (in this case) Python code. We’re not talking about dropping fully featured viewers into our pages, just simple HTML, CSS and images — but informed by the IIIF model of the objects. We can try things out really quickly this way, and we can react to how it feels as a web page. We establish a happy cycle — we’re evolving the Elasticsearch model to support the UI as we sketch it in HTML; we’re coming up with new things to query and therefore creating and refining the Elasticsearch queries we’re going to need; these in turn feed back into the model, the information architecture, and the data layout on the page.
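
The sketching itself needs very little machinery. Here is a hedged, minimal example of the sort of page route involved, assuming an items index shaped like the document sketched above and an item.html template (both names are illustrative):

```python
from elasticsearch import Elasticsearch
from flask import Flask, render_template

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")   # illustrative endpoint

@app.route("/item/<item_id>")
def item_page(item_id):
    # One Elasticsearch hit gives us the item plus all the denormalised
    # snippets (people, topics, related items) the generous interface needs.
    doc = es.get(index="items", id=item_id)["_source"]
    return render_template("item.html", item=doc)
```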

The home page and the people page in the data prototype

We could also test the data sources for maps, timelines and graphs before committing to a final design:

Experimenting with a Google Map on a topic page to test the latitude and longitude data on the enriched archival objects

The prototype also allowed us to experiment with the form of item pages, as described in Web pages and viewers, meet things with content!

We tried out the variants of:

  • item page for object with single image (e.g., a photograph)
  • item page for object with multiple images (e.g., a manuscript)
  • view page for page within a multiple-image item

The development of the data required for each page, the Elasticsearch schema, the information architecture of the site, the functionality of search, and the depth of the generous interface approach could all feed into each other by seeing how the prototype worked. This method was particularly useful in designing the Elasticsearch schema and queries, which are very hard to define up front. As the approach settled down, we started adapting the Royal Society’s style guide, making a new pattern library for the new components we would need for some of the features of the site.

By using the prototype Flask web application to explore the content, we were also able to refine the data model in Omeka S to make a better editorial experience for managing the content. We did run into a few problems here. While we are big fans of Omeka S, the admin interface is lacking when it comes to finding items quickly. You can search, but our model relies on additional classes (Activities) to connect items and people together, and it can be very confusing for the content editor to find and select the right connections. We were working with beta versions of Omeka S, and the admin tools for selecting items have improved since then. We have also made some Omeka S contributions since that give the admin user more options and control in selecting items for linking together (in particular, better filtering by class and other properties). We think that this UI friction in the editorial back end is all solvable beyond the pilot and is not a reason to abandon the Omeka S management of enriched items, which gives us so much benefit in other respects.

Live synchronisation

The initial development of the prototype UI relied on ad hoc scripts to bulk-load the JSON data produced by the rs-ingest utilities into Elasticsearch. As the schema settled down, we could formalise this process. We had two different scenarios:

  • It should be possible to repopulate Elasticsearch from scratch, from source data in Omeka S and the annotation server, Elucidate, by running a bulk population script, even if it takes a while to run (a sketch of this follows the list).
  • That won’t work once the site is live — users and content editors need to see the results of changes on the site in real time. This means Elasticsearch needs to be updated in real time, for example when someone tags an item or an editor makes a change to an item in Omeka S.
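
A hedged sketch of the first scenario, assuming the Omeka S REST API at /api/items (paged with a page parameter) and a hypothetical build_projection() helper that assembles the page-ready, denormalised document for one item from Omeka S and Elucidate:

```python
import requests
from elasticsearch import Elasticsearch, helpers

OMEKA_API = "https://example.org/omeka/api"   # illustrative URL
es = Elasticsearch("http://localhost:9200")

def build_projection(item_id):
    ...  # assemble the denormalised document from Omeka S and Elucidate

def repopulate():
    """Rebuild the whole index from source data; fine to run offline."""
    page = 1
    while True:
        items = requests.get(f"{OMEKA_API}/items", params={"page": page}).json()
        if not items:
            break
        helpers.bulk(es, [{
            "_op_type": "index",
            "_index": "items",
            "_id": item["o:id"],
            "_source": build_projection(item["o:id"]),
        } for item in items])
        page += 1
```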

The second of these, live updating, is quite complex. We’re taking advantage of denormalised data in Elasticsearch: repetition of snippets as nested documents. In the prototype screenshots above, an item page, with all its attendant generous interface links, is built from a single document returned by Elasticsearch. This gives the site its required speed, but at the expense of increased complexity; we have to carefully script cascading updates to Elasticsearch. If I change some aspect of Isaac Newton that appears in nested documents throughout the site (such as the thumbnail URL), I need to update all of those nested documents.
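
A sketch of one such cascading update, assuming the elasticsearch-py client, an items index, and a nested people field as in the earlier document sketch: find every document carrying a snippet for a given person, and re-index it with the new values.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def update_person_snippets(person_id, new_snippet):
    """Re-index every item document whose nested person snippet has changed."""
    query = {"query": {"nested": {
        "path": "people",
        "query": {"term": {"people.id": person_id}},
    }}}
    actions = []
    for hit in helpers.scan(es, index="items", query=query):
        doc = hit["_source"]
        for person in doc.get("people", []):
            if person["id"] == person_id:
                person.update(new_snippet)   # e.g. new thumbnail URL or biography
        actions.append({"_op_type": "index", "_index": "items",
                        "_id": hit["_id"], "_source": doc})
    helpers.bulk(es, actions)

# update_person_snippets("person-101",
#                        {"thumbnail": "https://example.org/thumbs/person-101-v2.jpg"})
```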

This was the initial sketch for how updates work, and how the Elasticsearch that drives the Web UI is kept in sync with the annotation server and the item content in Omeka S:

This developed into the data flow diagram below, which has many moving parts. Content changes in Omeka S trigger webhook events, and content changes in Elucidate broadcast a message to our Iris messenger bus (a feature of our wider DLCS platform). We integrated Omeka S into this system, and developed a process called rs-sync that listens for content changes and makes the appropriate updates to Elasticsearch.
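
rs-sync itself isn’t shown here, but the idea reduces to something like this hedged sketch: receive a change notification (modelled below as a webhook POST; the real system also listens to the Iris bus), rebuild the projection for the affected item, and cascade any shared snippets. The payload shape, endpoint path and helpers are illustrative.

```python
from elasticsearch import Elasticsearch
from flask import Flask, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")

def build_projection(item_id):
    ...  # re-read Omeka S and Elucidate, denormalise into one document

def cascade_person_updates(doc):
    ...  # refresh nested snippets elsewhere (see the earlier cascading sketch)

@app.route("/hooks/omeka", methods=["POST"])
def omeka_changed():
    event = request.get_json()
    item_id = event["item_id"]      # payload shape is illustrative
    doc = build_projection(item_id)
    es.index(index="items", id=item_id, body=doc)
    cascade_person_updates(doc)
    return "", 204
```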

Questions

We learned a lot from the pilot. It will help us start to answer the following:

  • Science in the Making features about 4500 images, across 1300 archival items. How would it scale to deliver 100,000 archival items (the full Royal Society archive)? This is both a UX question and a technical implementation question.
  • We developed the live version of the site as an Omeka S theme. When you visit https://makingscience.royalsociety.org/, you’re looking at an Omeka S site. We benefit from the user accounts, integration with admin and some of Omeka’s web content management features. But… look under the hood and the page generation (the PHP code in the templates in our theme) doesn’t look much like a typical Omeka S site. It isn’t talking to Omeka S to get its data — it’s talking to Elasticsearch. The PHP code for the theme is mostly divorced from Omeka S. More than a few times while building the front end, we asked ourselves “why aren’t we just re-skinning the prototype?” A big question for future development is, would we be better off using Omeka S in a “headless” architecture — where admin users still benefit from Omeka’s content management, API and linked data features, but we can do whatever we want in the application that actually delivers the user experience? There are tradeoffs here, but on balance, and for a large site, we think we would go down the headless Omeka S route.
  • The pilot site is disconnected from the archival records in CALM. If someone updates an archival description in CALM, and wants to see that reflected on Science in the Making, they would have to go and edit the corresponding Omeka S item. That’s fine for a pilot project, but probably not acceptable for the full archive. We need to look at how to make the process automated from end to end, with enrichment at the Omeka stage.
  • We also don’t want to do content migration via spreadsheet! This is very time-consuming and error-prone. In future iterations we would force ourselves to build whatever tools are necessary (such as data model and UI enhancements in the Omeka S admin site) to make the spreadsheet step unnecessary, and no longer the easier option for the content owners — our challenge, if we want to make our own lives easier, is to make something that’s more compelling, for this content, than the familiarity of a spreadsheet.
  • A huge and as yet untapped potential of the site is the ability to integrate with any archival item, even those outside of the Royal Society’s collections. For example, an editor could create a record in Omeka S for Newton’s manuscript of the Principia, MS/69. While this is in the Royal Society’s collection, the digitised version (its IIIF representation) is provided by Cambridge Digital Library. This doesn’t matter to Science in the Making — it will let you view, transcribe and tag it just the same, as long as there’s an Omeka S entry for it. It means that people, works and objects from anywhere in the world can be part of Science in the Making. As long as they have a IIIF representation, they can be part of the site. However, extensive use of this ability puts more dependency on other people’s infrastructure (to supply the images), which is both good and bad.
  • We need a better transcription user interface! Getting this right is hard, and the version on the pilot site is very much a first step.
  • We don’t have a public API for the content, other than the IIIF endpoints. If someone wanted to take the data and make their own graphs and timelines, they couldn’t do that other than by scraping the pages. We need a descriptive metadata API to accompany the presentation data in the IIIF. Ideally, we would “dogfood” the API by requiring that all the user interface on the site is generated from it, rather than by direct processing of Elasticsearch results into templates. The underlying data model that would drive such an API is explored in the Data Model and API article, but we didn’t put the time into developing the API in the pilot because our primary questions were about user interface rather than an API for developers. That is, the budget went on rapid development of UI rather than refinement of an API to support it. The UI is driven by a conceptual model, but expressing that model in an elegant and well-documented JSON-LD API (which we’d like very much) was not possible in the pilot phase. We should return to the API question when considering next steps.

We hope to explore these questions and more soon!
