Matt McGrattan, Head of Digital Library Solutions, Digirati.
The Paul Mellon Centre for Studies in British Art digitised 250 volumes of the Exhibition Catalogue for the Royal Academy Summer Exhibition, from 1769 to 2018, and commissioned in-depth scholarly articles for each year of the exhibition to coincide with the 250th anniversary of the Summer Exhibition. Digirati were commissioned by the Paul Mellon Centre to build the online version of the project.
The resulting website can be found at: https://chronicle250.com.
- Each catalogue should be available online using the IIIF Image and Presentation APIs. See https://iiif.io for details.
- Each catalogue should have searchable full text.
- Exhibitors should be identified in the catalogue text and linked back, via hot-links on the images, to a searchable Index on the main Chronicle250 site.
- Index entries for a given Exhibitor should link to all occurrences of that artist in the corpus of Exhibition catalogues.
- Pages for each year with rich scholarly articles.
- Index entries for authors and artworks.
- Thematic indexes across the curated per-year articles.
In building the site Digirati:
- Provided performant versions of the digitised catalogues and illustrations with deep zoom functionality and support for open APIs (https://iiif.io) using the DLCS.
- Created OCR for these images, including 18th century catalogues with historic typefaces.
- Identified exhibitors within the catalogue text and associated exhibitors with regions of images to create hotlinks between the catalogue and the index.
- Provided a usable search experience both within an individual catalogue and across catalogues.
- Created a usable index of Exhibitors.
- Brought the content — catalogues, indexes, scholarly articles — together following Strick and Williams’ design brief to create the Chronicle250 site.
A more detailed technical version of this information can be found here.
If we had started from scratch, with no existing infrastructure and no existing code base, the Chronicle250 project could have been very costly in terms of both time and budget.
However, Digirati provide a hosted cloud based service, the DLCS, designed to be run as a multi-tenant service shared by users who may be unable to, or may not wish to, run their own image hosting infrastructure. The DLCS uses the IIIF APIs, and is based around open standards, so new projects can be built easily on top of the DLCS. The DLCS can also be optionally enhanced with additional services that can enrich content with tags, transcriptions, and search.
The use of the DLCS was a key requirement for this project, as the existence of the DLCS made many of the core functions required for the site achievable without a large amount of infrastructure work or basic software development. Development time, and thus the budget, for this project could concentrate on front end development and enhancements to existing DLCS services around annotation and natural language processing, and not on core image hosting or text processing and indexing functionality.
The DLCS provides services which:
- Transcode images to JPEG 2000. (Multi-tenant)
- Generate static thumbnails at multiple resolutions. (Multi-tenant)
- Provide a scalable IIIF Image API service. (Multi-tenant)
- Provide basic IIIF Presentation APIs for create, read, update and delete of IIIF collections, sequences, manifests, and canvases. (Project specific)
- Create OCR text from a IIIF Image API source. (Project specific)
- Normalise OCR to a standard common format (to ensure the DLCS is OCR-engine agnostic). (Project specific)
- Provide OCR text as Open Annotation annotations (for display in IIIF Presentation API 2.x clients which do not support the W3C Web Annotation Data Model). (Project specific)
- Do named entity recognition from controlled vocabularies, or from standard neural net models. (Project specific)
- Store W3C and OA web annotations in an annotation server. (Project specific)
- Index W3C and OA annotations alongside OCR text and provide IIIF Content Search API services. (Project specific)
For Chronicle250, we were able to use the shared multi-tenant services as-is and then customise the project-specific services to provide the enhancements we needed to identify, link, and index exhibitors in the digitised versions of the exhibition catalogues.
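To illustrate what the multi-tenant image services expose, here is a minimal sketch of how a client builds IIIF Image API request URLs for a full page, a thumbnail, and a zoomed-in region. The URL pattern comes from the IIIF Image API specification; the base URL and image identifier are hypothetical placeholders, not real DLCS endpoints, and the `max` size keyword assumes Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Identifiers must be URL-encoded, including any slashes they contain.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical base URL and identifier, for illustration only.
BASE = "https://images.example.org/iiif"
PAGE = "catalogue-1769/page-0005"

full_page = iiif_image_url(BASE, PAGE)
thumbnail = iiif_image_url(BASE, PAGE, size="200,")  # 200 pixels wide
detail = iiif_image_url(BASE, PAGE, region="100,200,800,400", size="400,")

print(full_page)
print(thumbnail)
print(detail)
```

Deep zoom viewers issue many requests of the third kind, one per visible tile, which is why a scalable image service matters for a corpus of this size.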
The catalogues for Chronicle250 span 250 years of Royal Academy exhibitions, which introduces particular demands around OCR quality, as the historic typefaces used are not, typically, OCR’d well by off-the-shelf open source OCR engines like Tesseract or Ocropy. In addition, segmentation of images into blocks, paragraphs, and lines is also difficult because the text is often quite heavily skewed with bleed-through from verso pages, and uneven kerning introduces erroneous whitespace throughout.
We evaluated a number of OCR engines, including:
- Tesseract and Ocropy (for open source, locally hosted engines)
- Microsoft Azure Cognitive Services
- ABBYY SDK
- Google Vision Document Text Detection
The DLCS already had integrations for Google Vision and Tesseract, and we found that Google Vision scored well compared to other cloud-based services from Microsoft and ABBYY, and scored significantly higher than Tesseract. A range of typefaces is used throughout the 250 years of catalogues, so training Tesseract on glyphs from particular catalogue years would not have scaled well across the entire project, and would have placed significant additional demands on staff time for results that would not exceed those of the cloud services, which could be used immediately.
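The article does not say which metric was used to score the engines. One common way to compare OCR output is the character error rate (CER): the edit distance between a hand-checked reference transcription and the engine output, divided by the length of the reference. The sketch below is an assumption about how such a comparison could be run, not a description of the evaluation that was carried out, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: a long s misread as "f", a typical historic-typeface error.
reference = "Passions"
engine_output = "Paffions"
print(f"CER: {character_error_rate(reference, engine_output):.2f}")
```

A lower CER is better; averaging it over sample pages from different decades would show how each engine copes with the changing typefaces.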
We were able to use the existing DLCS OCR services as-is to do text extraction and normalisation of OCR text without significant customisation for this project.
Natural Language Processing and Named Entity Recognition
We evaluated the DLCS named entity recognition service using off-the-shelf neural net models untrained on the Royal Academy corpus, and found that the overall quality of the tags produced was not acceptable, both in the number of artists correctly identified and in the number of non-artists falsely identified.
A typical catalogue page might contain entries that look like:
And also, other pages within the same volume that look like:
We had to identify the artist names on each page, but also identify when different occurrences of a name within the catalogue were references to the same artist. Note the different forms in which an artist’s name might appear.
To improve the results, we:
- Wrote code that parsed known sources of artist data: the Getty Union List of Artist Names (ULAN); lists of Royal Academy Academicians provided by the Paul Mellon Centre; and lists of Exhibitors (comprehensive until 1990), also provided by the Paul Mellon Centre.
- Generated variant forms of these artist names so that the system correctly identified that “J. Northcote, R.A.” and “Northcote, James, R.A.” were the same person, and identified that this “James Northcote” was the painter who lived from 1746 to 1831.
- Wrote code to handle (by normalising and/or ignoring whitespace) the kerning and segmentation issues with historic text.
- Wrote code to filter artists by date, to ensure that only the relevant artists for a given catalogue year were in the “pool” for tagging.
- Used the Aho-Corasick algorithm to do fast pattern matching of the OCR text with the known list of artist names.
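The preparation steps in the list above can be sketched roughly as follows. This is an illustrative reconstruction, not the DLCS code: the `Artist` record, the particular set of variant forms, the `min_age` threshold, and the rule that an artist must be alive in the catalogue year are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Artist:
    """Hypothetical record built from ULAN and the Paul Mellon Centre lists."""
    surname: str
    forenames: str
    born: int
    died: int
    academician: bool = False


def name_variants(artist):
    """Return surface forms under which an artist might be listed."""
    initials = " ".join(f"{part[0]}." for part in artist.forenames.split())
    forms = {
        f"{artist.forenames} {artist.surname}",
        f"{initials} {artist.surname}",
        f"{artist.surname}, {artist.forenames}",
        f"{artist.surname}, {initials}",
    }
    if artist.academician:
        # Academicians may also appear with the "R.A." suffix.
        forms |= {f"{form}, R.A." for form in forms}
    return forms


def squash(text):
    """Remove whitespace and case so uneven kerning cannot break a match."""
    return re.sub(r"\s+", "", text).lower()


def pool_for_year(artists, year, min_age=15):
    """Keep artists who could plausibly have exhibited in `year` (assumed rule)."""
    return [a for a in artists if a.born + min_age <= year <= a.died]


northcote = Artist("Northcote", "James", 1746, 1831, academician=True)
pool = pool_for_year([northcote], 1800)
# Map each squashed variant back to the artist it identifies.
lookup = {squash(v): a for a in pool for v in name_variants(a)}
print(lookup[squash("J.  North cote , R. A.")].forenames)
```

Keying the lookup on the squashed form means every variant, however it is spaced in the OCR output, resolves to the same artist record.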
This code was implemented as an enhanced version of an existing Digital Library Cloud Service (DLCS) service, so we did not have to write an entirely new software stack from scratch, and were able to take advantage of existing integrations with OCR services, and annotation servers (for storing the output as annotations on IIIF content).
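For readers unfamiliar with Aho-Corasick, the sketch below is a small pure-Python version of the algorithm: it builds a trie of all patterns, adds failure links, and then finds every pattern in a single pass over the text. It is a teaching sketch rather than the production code, which may well have used an existing library. The patterns and OCR line are invented, and both are stripped of whitespace and lower-cased first, as one way of handling the kerning issues described above.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        # Per trie node: outgoing edges, failure link, patterns ending here.
        self.goto = [{}]
        self.fail = [0]
        self.out = [[]]
        for pattern in patterns:
            self._add(pattern)
        self._link()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _link(self):
        # Breadth-first so that shallower failure links are ready first.
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, text):
        """Return (start_index, pattern) for every match in `text`."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


def squash(text):
    return "".join(text.split()).lower()


names = ["J. Northcote, R.A.", "Northcote, James, R.A."]  # invented pool
matcher = AhoCorasick([squash(name) for name in names])
ocr_line = "123 Portrait of a lady  J. North cote, R.A."  # invented OCR line
hits = matcher.find(squash(ocr_line))
print(hits)
```

The offsets returned refer to the whitespace-stripped text, so a real pipeline would also need to keep a map from stripped positions back to the original OCR words and their image coordinates.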
IIIF Viewing components: Canvas Panel and the ‘PMC’ Viewer
Prior to the Chronicle250 project, Digirati had built CanvasPanel, a lightweight IIIF Presentation API canvas viewing component that supports annotation display. It has been used on projects for the Victoria and Albert Museum, such as their Ocean Liners exhibit.
The PMC viewer can be found on GitHub at: https://github.com/digirati-co-uk/pmc-viewer
Search and Indexing
The full DLCS (Digital Library Cloud Service) provides a IIIF Content Search service, Mathmos, which integrates with the DLCS message bus and indexes both full text (provided by OCR) and annotations (provided by machine-generated tags).
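As a rough illustration of what a IIIF Content Search API (version 1) client does, the sketch below builds a query URL and reads hits out of a response. The `q` and `motivation` parameters and the `resources` / `on` / `chars` fields follow the Content Search 1.0 specification; the service URL is a hypothetical placeholder, and nothing here describes the internals of Mathmos.

```python
from urllib.parse import urlencode


def search_url(service, query, motivation=None):
    """Build a Content Search API request URL for `query`."""
    params = {"q": query}
    if motivation:
        params["motivation"] = motivation
    return f"{service}?{urlencode(params)}"


def hits(response):
    """Yield (canvas, (x, y, w, h) or None, text) for each annotation hit."""
    for anno in response.get("resources", []):
        canvas, _, fragment = anno["on"].partition("#xywh=")
        box = tuple(int(v) for v in fragment.split(",")) if fragment else None
        yield canvas, box, anno["resource"]["chars"]


# Hypothetical service endpoint, for illustration only.
SERVICE = "https://search.example.org/catalogue-1800/search"
url = search_url(SERVICE, "James Northcote", motivation="painting")
print(url)

# Hand-written response in the shape the specification describes.
sample_response = {
    "@type": "sc:AnnotationList",
    "resources": [{
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": "J. Northcote, R.A."},
        "on": "https://example.org/iiif/1800/canvas/12#xywh=100,250,420,36",
    }],
}
for canvas, box, text in hits(sample_response):
    print(canvas, box, text)
```

Because each hit carries a canvas and an `xywh` region, a viewer can highlight the matching text directly on the page image.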
However, for the Chronicle250 project, the vision was not to rely on the DLCS for on-going delivery of textual content or services to the viewer. The DLCS text pipeline could be shut down after processing, leaving just the Chronicle250 website/application, and the DLCS IIIF Image API and IIIF Presentation API services running as active services. In addition, the IIIF Content Search service on the DLCS provides basic/generic search services which would not fulfil the full requirements of the Chronicle250 site.
- Article full text
- Index of Exhibitors (via a bulk ingest of W3C Web Annotations from the annotation server)
- Index of Authors
- Index of Illustrations
- Thematic index
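The bulk ingest of W3C Web Annotations into the Index of Exhibitors can be pictured as follows. This is a sketch under assumptions: the annotation shape uses standard W3C Web Annotation fields (a tagging body, an identifying body, and an `xywh` fragment selector on a canvas), but the exact structure stored in the annotation server is not given in this article, and all URIs are hypothetical.

```python
from collections import defaultdict


def tag_annotation(canvas_id, box, artist_uri, label):
    """Build a W3C Web Annotation tagging a region of a canvas with an artist."""
    x, y, w, h = box
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "tagging",
        "body": [
            {"type": "TextualBody", "purpose": "tagging", "value": label},
            {"type": "SpecificResource", "purpose": "identifying",
             "source": artist_uri},
        ],
        "target": {
            "source": canvas_id,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


def build_index(annotations):
    """Group annotation targets by artist URI to form index entries."""
    index = defaultdict(list)
    for anno in annotations:
        artist = next(body["source"] for body in anno["body"]
                      if body["type"] == "SpecificResource")
        target = anno["target"]
        index[artist].append(f'{target["source"]}#{target["selector"]["value"]}')
    return dict(index)


# Hypothetical URIs, for illustration only.
ARTIST = "https://example.org/artists/northcote-james"
annotations = [
    tag_annotation("https://example.org/iiif/1800/canvas/12",
                   (100, 250, 420, 36), ARTIST, "J. Northcote, R.A."),
    tag_annotation("https://example.org/iiif/1801/canvas/7",
                   (90, 610, 455, 38), ARTIST, "Northcote, James, R.A."),
]
index = build_index(annotations)
for occurrence in index[ARTIST]:
    print(occurrence)
```

Each index entry ends up with one link per occurrence, each pointing at a specific region of a specific catalogue page, which is what allows the index and the image hotlinks to refer to each other.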
The machine identification of exhibitors across the corpus was extremely successful given the relatively short time spent on bespoke software development and R&D.
We were able to successfully identify 318,690 exhibitors across the catalogues. An upper bound for the possible maximum number of exhibitors, assuming each exhibitor only exhibited once in each catalogue, would be 513,068. However, given how commonly exhibitors exhibit more than once in any given year, we can assume the actual total is certainly lower. There were very few tags for the post-1990 catalogues because we lacked any Exhibitor data for those years.
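The two figures above give a floor for the proportion of exhibitor names that were tagged. The short calculation below assumes that all 318,690 tags are correct; since 513,068 is an upper bound on the true total, the real proportion can only be higher than the value printed.

```python
identified = 318_690   # exhibitor tags produced by the pipeline
upper_bound = 513_068  # maximum possible number of exhibitor names

coverage_floor = identified / upper_bound
print(f"At least {coverage_floor:.1%} of exhibitor names were tagged")
```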
Using the techniques described in this article offered a very good return on time invested, compared with the time it would have taken to manually tag 400,000–500,000 names in the corpus. Combining these techniques with the services provided by the DLCS made a resource- and data-heavy project achievable in a relatively short timescale.
Adam Meszaros, Senior Frontend Consultant. Full stack developer. Chronicle250.com; Site indexes and IIIF Content search services; PMC Viewer; Integration.
Stephen Fraser, Front End Technical Lead. AnnotationStudio; CanvasPanel; PMC Viewer.
Matt McGrattan, Head of Digital Library Services. DLCS text pipeline; Natural language processing and tagging; Digirati Product Owner.
Adam Christie, Senior Engineer. DLCS Infrastructure; DevOps.
Ville Vartiainen, Senior UX Consultant. Digirati User Experience.
Ian Farquhar, Head of Project Delivery. Project Management.
Paul Mellon Centre
Tom Scutt, PMC Product Owner and Digital Editor.
Mark Hallett, Sarah Victoria Turner, Jessica Feather, Scholarly Editors.
Baillie Card, Publishing Editor.
Maisoon Rehani, Picture Editor.
Tom Powell, Sean Ketteringham and James Finch, Researchers.
Thérèse Saba, Copyeditor.
Jan Worrall, Indexer.
Strick and Williams
Charlotte Strick, Design.
Claire Williams Martinez, Design.