The Royal Academy Summer Exhibition: chronicle250.com

Matt McGrattan · Published in digirati-ch · Sep 13, 2018

Extended Version

Matt McGrattan, Head of Digital Library Solutions, Digirati.

A shorter version of this article can be found here.

The Paul Mellon Centre for Studies in British Art digitised 250 volumes of the Exhibition Catalogue for the Royal Academy Summer Exhibition, from 1769 to 2018, and commissioned in-depth scholarly articles for each year of the exhibition to coincide with the 250th anniversary of the Summer Exhibition. Digirati were commissioned by the Paul Mellon Centre to build the online version of the project.

The resulting website can be found at: https://chronicle250.com.

Digirati were asked to develop the website from designs by Strick and Williams and to provide the supporting infrastructure for the site using the Digital Library Cloud Service (DLCS).

Requirements

  • Each catalogue should be available online using the IIIF Image and Presentation APIs. See https://iiif.io for details.
  • Each catalogue should have fully searchable full text.
  • Exhibitors should be identified in the catalogue text and linked back, via hotlinks on the images, to a searchable Index on the main Chronicle250 site.
  • Index entries for a given Exhibitor should link to all occurrences of that artist in the corpus of Exhibition catalogues.
  • Pages for each year with rich scholarly articles, with tagging by topic.
  • Index entries for authors and artworks.
  • Thematic indexes across the curated per-year articles.

In building the site Digirati:

  • Provided performant versions of the digitised catalogues and illustrations with deep zoom functionality and support for open APIs (https://iiif.io) using the DLCS.
  • Created OCR for these images, including 18th century catalogues with historic typefaces.
  • Identified exhibitors within the catalogue text and associated exhibitors with regions of images to create hotlinks between the catalogue and the index.
  • Normalised exhibitor names so that artist names that might appear in multiple different forms throughout the corpus of catalogues are identified as the same person.
  • Provided a usable search experience both within an individual catalogue and across catalogues.
  • Created a usable index of Exhibitors.
  • Brought the content — catalogues, indexes, scholarly articles — together following Strick and Williams’ design brief to create the Chronicle250 site.

This article outlines how we solved some of these problems.

N.B. Code samples throughout are intended as simple illustrations of an approach, and are not taken from actual production code.

Catalogues and Illustrations as IIIF

If we had been starting from scratch with no existing infrastructure, and no existing code base, the Chronicle250 project could, potentially, have been very costly in terms of both time and budget.

However, Digirati provide a hosted, cloud-based service, the DLCS, designed to run as a multi-tenant service shared by users who may be unable, or may not wish, to run their own image-hosting infrastructure. The DLCS uses the IIIF APIs and is based around open standards, so new projects can be built easily on top of it. The DLCS can also be optionally enhanced with additional services that enrich content with tags, transcriptions, and search.

The use of the DLCS was a key requirement for this project, as its existence made many of the core functions required for the site achievable without a large amount of infrastructure work or basic software development. Development time, and thus budget, could be concentrated on front-end development and on enhancements to existing DLCS services around annotation and natural language processing, rather than on core image hosting or text processing and indexing functionality.

The DLCS provides both shared multi-tenant services and customisable project-specific services.

For Chronicle250 we were able to use the shared multi-tenant services as-is, and then customise the project-specific services to provide the enhancements we needed to identify, link, and index exhibitors in the digitised versions of the exhibition catalogues.

For ingest of digitised content, the DLCS provides users with a Portal interface where they can manually upload images, and APIs for bulk upload and for the creation of IIIF manifests from source images. The DLCS automatically generates, stores, and delivers JPEG2000 images, along with derivative thumbnails for fast delivery of static images via the IIIF thumbnail service.
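As a brief illustration (a sketch, not DLCS-specific code), any image served this way can then be requested at arbitrary sizes and regions using the standard IIIF Image API URL syntax; the image service used here is the one referenced in the OCR comparison later in this article:

service = "https://dlc.services/iiif-img/48/14/5a1d597e-801d-48cb-b348-8f9fb706661e"

full = f"{service}/full/full/0/default.jpg"            # full-size image
thumb = f"{service}/full/!200,200/0/default.jpg"       # thumbnail fitted within 200x200
region = f"{service}/0,0,1000,500/500,/0/default.jpg"  # cropped region, 500px wide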

DLCS Portal Website

Using the DLCS APIs Digirati created basic manifests on the DLCS for each of the 250 catalogue years.

For an example manifest, shown in the Universal Viewer, click here.

Royal Academy Catalogue, Universal Viewer

For the same manifest shown in the Mirador viewer, illustrating the interoperability of IIIF content, click here.

Royal Academy Catalogue, Mirador

Note that this is a bare manifest, with no metadata, OCR text, annotations, or search service added.

Starting with basic IIIF Presentation and Image API services, Digirati then used OCR (optical character recognition) and named entity recognition to enrich this manifest with metadata, searchable full text, and annotations of text and exhibitors.

Illustrations as IIIF

All of the images of artworks used throughout the site are also provided as IIIF Image API images. For example, click here to view:

Chronicle250.com: 1769

This page shows two works of art, each of which can be opened for high-resolution, deep-zoom viewing using the IIIF APIs:

Francis Cotes, The Young Cricketer

Using IIIF throughout made it easier to build a responsive site that worked at different resolution breakpoints while making use of the same image resources on the backend.

OCR

The DLCS can be provisioned, on a project-by-project basis, with a text pipeline: a suite of micro-services that generate, store, and index OCR text for images.

For the Paul Mellon Centre we provisioned a custom version of this pipeline with specific features to improve the quality of output for the Royal Academy Exhibition catalogues.

Images in the catalogue are not always easily OCR-able:

Sample catalogue page showing skew

In the image above, the text is skewed, there is bleed-through from the verso page, the text contains the long S (ſ), and the spacing/kerning of the typeface is very irregular.

Evaluating OCR options, Digirati tested Google Cloud Vision, Microsoft Azure, ABBYY, and Tesseract.

We tested for OCR accuracy, using a small sample of pages for which we created ground-truth text, and for the number of named entities recognised in the text. Character accuracy was less important than success in identifying named entities, which is a function of both character accuracy and image segmentation.
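As a sketch, the scoring behind the comparison below can be reproduced along these lines; this is illustrative code consistent with the figures shown, not the production evaluation harness:

def compare(truth, found):
    """Score an engine's entity list against a ground-truth list of names."""
    missed = sorted(set(t for t in truth if t not in found))
    extra = sorted(set(f for f in found if f not in truth))
    diff = len(missed) + len(extra)
    # every missed or spurious name counts against the truth total
    accuracy = round(100.0 * (len(truth) - diff) / len(truth), 1)
    return {"missed": missed, "extra": extra, "diff": diff, "accuracy": accuracy}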

For example, comparing Azure and Google on one image:

{
    "image": "https://dlc.services/iiif-img/48/14/5a1d597e-801d-48cb-b348-8f9fb706661e/full/4000,/0/default.jpg",
    "truth": [
        "Arnesby Brown",
        "Arnesby Brown",
        "Arnold Gerstl",
        "Barnard Lintott",
        "Dod Procter",
        "Edna Bahr",
        "F. J. Sedgwick",
        "Frank Eastman",
        "George Harris",
        "Glyn Philpot",
        "Henry Bishop",
        "James A. H. Hector",
        "John Cole",
        "John Simmons",
        "John W. Schofield",
        "Joseph Greenup",
        "Julius Olsson",
        "Kathleen M. Scale",
        "Kenneth Green",
        "L. Campbell Taylor",
        "Laura Knight",
        "Laura Knight",
        "Marjorie Rodgers",
        "Oliver Hall",
        "Owen B. Reynolds",
        "Philip Connard",
        "Richard Einzig",
        "Rowland Hilder",
        "Stanhope A. Forbes",
        "Stanley Grayson",
        "Stanley Spencer",
        "T. Leman Hare",
        "Terrick Williams"
    ],
    "google_missed": [
        "Dod Procter",
        "Rowland Hilder"
    ],
    "google_extra": [],
    "azure_missed": [
        "Dod Procter",
        "Frank Eastman",
        "James A. H. Hector",
        "Rowland Hilder",
        "Terrick Williams"
    ],
    "azure_extra": [
        "James A.",
        "Morning Haze",
        "Pvecious Bane Rowland Ililder",
        "Terriclc Williams"
    ],
    "google_diff": 2,
    "azure_diff": 9,
    "overall best": "google",
    "Accuracy (google)": 93.9,
    "Accuracy (azure)": 72.7
}

Google Vision correctly recognises the long s (ſ), and although on some individual images ABBYY or Azure performed best, in general Google Vision was consistently good. Tesseract was not as accurate as any of the commercial cloud services.

Since the DLCS already has good support for Google Vision, and with no other OCR engine showing improved performance, OCR on the Royal Academy catalogues was done using Google Vision Document Text Detection via a DLCS service which:

  • pulls jobs from the DLCS queue
  • retrieves images using the IIIF Image API
  • pushes images to Google using the Cloud Vision API
  • retrieves Google Vision text output and normalises to a standard internal OCR format

This service also generates OA (Open Annotation) annotations via a proxy service and adds them to the IIIF Presentation API manifest, so they are available in any standard IIIF viewer.
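A minimal sketch of the core OCR step, assuming a recent google-cloud-vision client and configured credentials; queue handling, normalisation to the internal OCR format, and annotation proxying are omitted, and the helper name is illustrative:

import requests
from google.cloud import vision

def ocr_iiif_image(image_service_url):
    """Fetch a full-size image via the IIIF Image API and OCR it with Google Vision."""
    content = requests.get(f"{image_service_url}/full/full/0/default.jpg").content
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=content))
    # full_text_annotation holds pages, blocks, paragraphs, words, and symbols,
    # each with bounding boxes, ready to be normalised downstream
    return response.full_text_annotation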

Cropped image showing annotation boxes in Mirador

Note that the kerning in historic typefaces is often very wide, which results in segmentation that identifies a single word as multiple words.

For example, in this image fragment, the OCR (as OA annotations) can be found here.

Thomas Banks is listed in the OCR as:

"TH O MAS B A N K S, "

Eight separate words, instead of two.

In general, however, character accuracy is high: there are only two or three incorrect characters in the text for the above image.

Named Entity Extraction

The DLCS has a named entity recognition service which uses IIIF, Spacy.io and W3C Web Annotations to tag regions of images with people, places, dates, organisations, and other classes of entity.

The DLCS service is tightly integrated with the DLCS pipeline and with the IIIF APIs and has many DLCS specific features and enhancements. The examples below use simplified code that abstracts away many of these features for clarity.

Basic Approach

For the Royal Academy catalogues we tested Spacy.io’s built in neural models for entity recognition, in combination with DLCS text services for OCR.

DLCS Text Service

The DLCS can provide text for any IIIF Image as either a single block of text, or broken into lines.

For example, the full text for one catalogue page can be found here:

# dedభరతవరరథంభంథంభంతరంథంశంథంభవంతు esgegro assister 00299090909 * - *- - - *-* * -*-*- * - * - CATALOGUE,& c. Note, The Pictures,& c. marked with an(*) are to be diſpoſed of. Το H N BACO N, George- yard, Oxford- road, Soho. A Bas- relief of the good Samaritan, a model. JOHN BAKER, R. A. Denmark- ſtreet. 2* A piece of flowers. TH O MAS B A N K S, New Bird- ſtreet, Oxford- road. Æneas and Anchiſes eſcaping from Troy, a model. The fame ſubject in another point of time. CHRISTOPHER BARBER, Next door to Young Slaughter' s Coffee- houfe, St. Martin' s- lane. 5 A portrait of a lady, a miniature, in oil. 6' Ditto, a head, ditto. GEORGE BARRET, R. A. Orchard- ſtreet, Portman- ſquare. 7. A view in his Grace the Duke of Buccleugh' s Park, Dalkeith; with part of one of the wings of Dalkeith houſe. 4

And the same text can be found as lines here with bounding boxes provided for each line.

Basic Named Entity Recognition

Spacy.io’s built in named entity recognition is illustrated below.

N.B. the version used on the DLCS contains many IIIF-aware enhancements, integration with other DLCS services, and custom pipeline steps (some of which are illustrated in simplified form below) to produce higher quality output.

Spacy install:

pip install spacy 
python -m spacy download en

Spacy.io simple example:

import spacy
import requests

nlp = spacy.load("en")

text = """# dedభరతవరరథంభంథంభంతరంథంశంథంభవంతు esgegro assister 00299090909 * - *- - - *-* * -*-*- * - * - CATALOGUE,& c. Note, The Pictures,& c. marked with an(*) are to be diſpoſed of. Το H N BACO N, George- yard, Oxford- road, Soho. A Bas- relief of the good Samaritan, a model. JOHN BAKER, R. A. Denmark- ſtreet. 2* A piece of flowers. TH O MAS B A N K S, New Bird- ſtreet, Oxford- road. Æneas and Anchiſes eſcaping from Troy, a model. The fame ſubject in another point of time. CHRISTOPHER BARBER, Next door to Young Slaughter' s Coffee- houfe, St. Martin' s- lane. 5 A portrait of a lady, a miniature, in oil. 6' Ditto, a head, ditto. GEORGE BARRET, R. A. Orchard- ſtreet, Portman- ſquare. 7. A view in his Grace the Duke of Buccleugh' s Park, Dalkeith; with part of one of the wings of Dalkeith houſe. 4"""
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Entities found:

dedభరతవరరథంభంథంభంతరంథంశంథంభవంతు GPE
The Pictures,& ORG
Samaritan GPE
JOHN BAKER PERSON
R. A. PERSON
2 CARDINAL
MAS ORG
New GPE
Anchiſes PERSON
Troy GPE
CHRISTOPHER BARBER PERSON
Young Slaughter' PERSON
St. Martin' PERSON
5 CARDINAL
6 CARDINAL
BARRET ORG
R. A. Orchard- PERSON
7 CARDINAL
Grace PERSON
Dalkeith PERSON
one CARDINAL
4 CARDINAL

Spacy.io has identified some non-person entities, such as Troy as a geopolitical entity (GPE), along with cardinal numbers and other categories we are not interested in.

If we restrict to just person entities:

import spacy
import requests

nlp = spacy.load("en")

text = """# dedభరతవరరథంభంథంభంతరంథంశంథంభవంతు esgegro assister 00299090909 * - *- - - *-* * -*-*- * - * - CATALOGUE,& c. Note, The Pictures,& c. marked with an(*) are to be diſpoſed of. Το H N BACO N, George- yard, Oxford- road, Soho. A Bas- relief of the good Samaritan, a model. JOHN BAKER, R. A. Denmark- ſtreet. 2* A piece of flowers. TH O MAS B A N K S, New Bird- ſtreet, Oxford- road. Æneas and Anchiſes eſcaping from Troy, a model. The fame ſubject in another point of time. CHRISTOPHER BARBER, Next door to Young Slaughter' s Coffee- houfe, St. Martin' s- lane. 5 A portrait of a lady, a miniature, in oil. 6' Ditto, a head, ditto. GEORGE BARRET, R. A. Orchard- ſtreet, Portman- ſquare. 7. A view in his Grace the Duke of Buccleugh' s Park, Dalkeith; with part of one of the wings of Dalkeith houſe. 4"""
doc = nlp(text)

for ent in [e for e in doc.ents if e.label_ == 'PERSON']:
    print(ent.text, ent.label_)

We get:

JOHN BAKER PERSON
R. A. PERSON
Anchiſes PERSON
CHRISTOPHER BARBER PERSON
Young Slaughter' PERSON
St. Martin' PERSON
R. A. Orchard- PERSON
Grace PERSON
Dalkeith PERSON

We can see that the results are less than ideal. We have two of a possible five exhibitors, and we also have seven “persons” which are not exhibitors at all.

For a more modern catalogue page, the results can be better:

Claire PERSON 
Andrew PERSON
David Gammon- PERSON
John R. Merton PERSON
Glass Screen PERSON
John Hutton PERSON
Henry Bird PERSON
Percy Brown PERSON
Frederick G. Hughes PERSON
Frederick G. Hughes PERSON
Lipmann Kessel PERSON
M. C. PERSON
M. B. E. PERSON
F. R. C. PERSON
Hertha Köttner PERSON
Welsh Valley PERSON
Joan Williams PERSON
Peter Z. Nel PERSON
830 Cherubim PERSON
Malcolm A. Appleby PERSON
Josephine Pateman PERSON
Valerie E. Orchard 833
Pwllygranant PERSON
Broken Stone — collage PERSON
Doris M. Whitlock PERSON
John S. Hawley PERSON
Caroline C. Thornton PERSON
Valerie Thornton S PERSON

In this instance we have fourteen correct artists out of seventeen. We also have a number of false positives: names correctly identified as people, but which are subjects of works of art rather than exhibitors, along with several ‘junk’ entries.

Spacy.io is generally quite accurate (see here), with a best performance on untrained named entity recognition tasks of around 85%, which compares favourably with other NER software. See also, for example: https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175

The Royal Academy catalogues are challenging, so the measured accuracy for named entity recognition not trained on this specific corpus is much lower than 85%, perhaps as low as 50%, with many false matches.

By the Paul Mellon Centre’s own estimates there are approximately 256,534 exhibits listed in the catalogues, and exhibitors also appear in the directory of exhibitors included in each catalogue. This gives an upper bound of 513,068 potential exhibitor name tokens in the catalogues (two per exhibit: one in the listing and one in the directory). The actual number, however, is lower, as the same exhibitor often appears more than once in a given year, but will only appear once in that catalogue’s directory of exhibitors. Nonetheless, this is a very large number of exhibitor names to capture, and beyond any reasonable number that could be tagged by humans in a short timescale.

To get anywhere close to this number, and to avoid a huge number of false positives, we did a number of things to improve the accuracy and quality of the output.

Controlled Vocabulary

One way to improve accuracy, both by eliminating false positives and by seeding the system with known good entities, is to use a source of controlled vocabulary.

We began the project with two possible sources of names for exhibitors, and added a third part way through.

Royal Academicians

Early on in the project, the Paul Mellon Centre were able to provide a list of Royal Academicians. This covered only a small subset of the artists who exhibited over the 250 years, but it was a useful source of names when validating the output of the named entity recognition.

Getty ULAN

Also early on in the project, we looked at the Getty Union List of Artist Names (ULAN) as a source of artist data. Getty provide their data as downloadable N-Quads, suitable for machine parsing.

We used the Getty ULAN data, along with the Academicians data, and a list of Exhibitors provided by the Paul Mellon Centre to create a union list of potential exhibitors.

Here is a sample of the data we were able to extract for one artist:

{
    "lastname": "Abbey",
    "firstname": "Edwin Austin",
    "canonical": "Edwin Austin Abbey",
    "profession": "Painter",
    "gender": "",
    "academician": true,
    "address": "54, Bedford Gardens",
    "years": [1885, 1890, 1894, 1896, 1897, 1898, 1899, 1900, 1901, 1902, 1903, 1904, 1906, 1910, 1912],
    "matches": [{
        "lastname": "Abbey",
        "firstname": "Edwin Austin",
        "ulan_id": 500010457,
        "canonical": "Edwin Austin Abbey",
        "academician": true,
        "born": 1852,
        "died": 1911,
        "roles": ["artists", "illustrators", "muralists", "painters", "history artists"],
        "biography": "http://vocab.getty.edu/ulan/bio/4000026720",
        "name_match": 1.0,
        "data_source": "Getty ULAN",
        "role_score": 100
    }, {
        "lastname": "Abbey",
        "firstname": "Edwin Austin",
        "ulan_id": 500010457,
        "canonical": "Edwin Austin Abbey",
        "born": 1852,
        "died": 1911,
        "name_match": 1.0,
        "data_source": "Academicians"
    }]
}

The matches field shows the union of the exhibitor data with the Getty ULAN and Academicians lists.

Matching these sources of data was useful as it gave us the date ranges during which someone might have been an exhibitor, which let us rule out candidates found by named entity recognition that only coincidentally matched an exhibitor: for example, where someone had the same name but the artist in question died 50 years before that specific catalogue was published, or where more than one artist with the same name appears in the corpus.

Working with the ULAN data was challenging: the size of the dataset, and the complexity of building a graph from the N-Quad data (scattered across multiple files), meant that building the list was computationally slow. Fortunately, once done, we could work with the data as lightweight JSON.

RA Exhibitors List

Part way through the project, the Paul Mellon Centre were able to provide a digitised list of Exhibitors, current up until around 1990.

This list was by far the most useful in identifying exhibitors, as we were able to filter the named entities to just those artists appearing in the exhibitors list, which almost completely removed false positives and junk personal names from the set of tags we generated. We worked with a union of this data and the ULAN and Academicians data.

Name forms

One issue that we had to deal with is that personal names appear, more often than not, in multiple formats throughout the catalogues.

For example:

J. Northcote, R.A. also appears in the index of the same volume as:

Northcote, James, R.A.

We needed the custom matching that identifies artists from the exhibitors list to be able to handle:

  • Firstname Lastname format
  • Lastname Firstname format
  • Suffixes (such as R.A.)
  • Prefixes (such as Sir or Dame)
  • Forms with initials
  • Forms with a mix of initials and full names
  • Forms with full names only

And successfully identify them as the same named individual.

We wrote some open-source code, personalnames, to automatically generate name formats from a provided name (ideally in some canonical form).

Personalnames can be installed using pip (pip install personalnames).

from personalnames import names
import json

name = "James Northcote, R.A."
formats = names.name_initials(
    name=name,
    name_formats=["firstnamelastname", "lastnamefirstname"],
)
print(json.dumps(sorted(formats), indent=4))

Will output:

[
"J. Northcote",
"J. Northcote, R.A.",
"James Northcote",
"James Northcote, R.A.",
"Northcote, J.",
"Northcote, J., R.A.",
"Northcote, James",
"Northcote, James , R.A."
]

So, given a list of exhibitors from the Paul Mellon Centre, it was possible to match artists in the text not just by the canonical form of their name, but also by the other potential forms that name might take in the catalogues.

Whitespace/kerning/segmentation issues

As shown above, older typefaces, combined with heavily skewed text, can make segmentation and whitespace unpredictable, and an artist’s name might be missed in the OCR text even when that name is known in advance.

If we take another early catalogue page, the OCR text shows:

[ 4] GEORGE B A R R E T, R. A. Weſtbourn- green, near Paddington. A view of a gentleman’ s park, taken from the manfion 9. 10 Its companion, a view of the manfion- houſe, part of the park,& c. from the oppoſite banks of the lake. I A ſtudy from nature, in the mountains of Keſwick, csc Cumberland . JA MES B A R R Y, Queen Ann- ſtreet, Cavendiſh- ſquare. 12 Venus rifing from the ſea. Vid. Lucretius, B. I. and Homer’ s Hymn to Venus. 13. Medea making her incantation after the murder of her children. 14. The education of Achilles. FRANCESCO BARTOLOZZI, R. A. Broad- ſtreet, Carnaby- market. 15 A head of a Madona: a drawing. E. B E L K, Middleton’ s Buildings. 16 Elevation and plan, for a temple in a garden. J O HN. BL A C KB URNE, At Mr. Lipſcomb’ s, near the Pantheon, Oxford- ſtreet. The triumph of mercenary love. 18 The portraits of two children. ANNA BLACK ESLY, Greek- ſtreet, Soho. Portrait of a gentleman .

We can see that the names appear with intrusive whitespace (‘GEORGE B A R R E T’, ‘JA MES B A R R Y’, ‘J O HN. BL A C KB URNE’) that would, in most cases, prevent named entity recognition, or pattern matching, from identifying artists.

Ignore whitespace

One successful approach we adopted was to strip all whitespace out of both the text and the possible artist names, so that intrusive whitespace could not prevent matching. The personalnames formatting code can add a no-whitespace version of each name variant to the set.

OCR text without whitespace:

txt = """[ 4] GEORGE B A R R E T, R. A. Weſtbourn- green, near Paddington. A view of a gentleman' s park, taken from the manfion 9. 10 Its companion, a view of the manfion- houſe, part of the park,& c. from the oppoſite banks of the lake. I A ſtudy from nature, in the mountains of Keſwick, csc Cumberland . JA MES B A R R Y, Queen Ann- ſtreet, Cavendiſh- ſquare. 12 Venus rifing from the ſea. Vid. Lucretius, B. I. and Homer' s Hymn to Venus. 13. Medea making her incantation after the murder of her children. 14. The education of Achilles. FRANCESCO BARTOLOZZI, R. A. Broad- ſtreet, Carnaby- market. 15 A head of a Madona: a drawing. E. B E L K, Middleton' s Buildings. 16 Elevation and plan, for a temple in a garden. J O HN. BL A C KB URNE, At Mr. Lipſcomb' s, near the Pantheon, Oxford- ſtreet. The triumph of mercenary love. 18 The portraits of two children. ANNA BLACK ESLY, Greek- ſtreet, Soho. Portrait of a gentleman ."""

txt_no_ws = "".join(txt.strip().split())

print(txt_no_ws)

Produces:

[4]GEORGEBARRET,R.A.Weſtbourn-green,nearPaddington.Aviewofagentleman'spark,takenfromthemanfion9.10Itscompanion,aviewofthemanfion-houſe,partofthepark,&c.fromtheoppoſitebanksofthelake.IAſtudyfromnature,inthemountainsofKeſwick,cscCumberland.JAMESBARRY,QueenAnn-ſtreet,Cavendiſh-ſquare.12Venusrifingfromtheſea.Vid.Lucretius,B.I.andHomer'sHymntoVenus.13.Medeamakingherincantationafterthemurderofherchildren.14.TheeducationofAchilles.FRANCESCOBARTOLOZZI,R.A.Broad-ſtreet,Carnaby-market.15AheadofaMadona:adrawing.E.BELK,Middleton'sBuildings.16Elevationandplan,foratempleinagarden.JOHN.BLACKBURNE,AtMr.Lipſcomb's,nearthePantheon,Oxford-ſtreet.Thetriumphofmercenarylove.18Theportraitsoftwochildren.ANNABLACKESLY,Greek-ſtreet,Soho.Portraitofagentleman.

We can do the same thing with a list of possible names (in the production site we are using the entire list of hundreds of thousands of artists):

names = [
    "GEORGE BARRET",
    "JAMES BARRY",
    "FRANCESCO BARTOLOZZI",
    "JOHN BLACKBURNE",
    "ANNA BLACKESLY",
]

no_ws_names = [(x, "".join(x.strip().split())) for x in names]

print(no_ws_names)

Which produces:

[
    ("GEORGE BARRET", "GEORGEBARRET"),
    ("JAMES BARRY", "JAMESBARRY"),
    ("FRANCESCO BARTOLOZZI", "FRANCESCOBARTOLOZZI"),
    ("JOHN BLACKBURNE", "JOHNBLACKBURNE"),
    ("ANNA BLACKESLY", "ANNABLACKESLY"),
]

Using these names to match against the text (using FlashText, which provides fast Aho-Corasick-style keyword matching):

import json
from flashtext import KeywordProcessor

txt = """[ 4] GEORGE B A R R E T, R. A. Weſtbourn- green, near Paddington. A view of a gentleman' s park, taken from the manfion 9. 10 Its companion, a view of the manfion- houſe, part of the park,& c. from the oppoſite banks of the lake. I A ſtudy from nature, in the mountains of Keſwick, csc Cumberland . JA MES B A R R Y, Queen Ann- ſtreet, Cavendiſh- ſquare. 12 Venus rifing from the ſea. Vid. Lucretius, B. I. and Homer' s Hymn to Venus. 13. Medea making her incantation after the murder of her children. 14. The education of Achilles. FRANCESCO BARTOLOZZI, R. A. Broad- ſtreet, Carnaby- market. 15 A head of a Madona: a drawing. E. B E L K, Middleton' s Buildings. 16 Elevation and plan, for a temple in a garden. J O HN. BL A C KB URNE, At Mr. Lipſcomb' s, near the Pantheon, Oxford- ſtreet. The triumph of mercenary love. 18 The portraits of two children. ANNA BLACK ESLY, Greek- ſtreet, Soho. Portrait of a gentleman ."""

txt_no_ws = "".join(txt.strip().split())

names = [
    "GEORGE BARRET",
    "JAMES BARRY",
    "FRANCESCO BARTOLOZZI",
    "JOHN BLACKBURNE",
    "ANNA BLACKESLY",
]

no_ws_names = [(x, "".join(x.strip().split())) for x in names]

keyword_processor = KeywordProcessor()

for x in no_ws_names:
    keyword_processor.add_keyword(x[1], x[0])

keywords_found = keyword_processor.extract_keywords(txt_no_ws)

print(json.dumps(keywords_found, indent=4))

Returns:

[
"GEORGE BARRET",
"JAMES BARRY",
"FRANCESCO BARTOLOZZI",
"ANNA BLACKESLY"
]

Which successfully matches four of the five artists, irrespective of whitespace and segmentation issues, and misses the fifth because of the extra period/full stop that appears in John Blackburne’s name.

Ignoring periods/full stops would introduce far more errors (matching across sentence boundaries, for example, or across punctuated lists), so four out of five is about as good as we can expect from this technique.

Lines and blocks

We also parsed the text both as blocks (as above) and as lines, which produces more ‘hits’ than parsing as blocks or lines alone. Text that runs across a line boundary can sometimes create a false positive. For example, if the text was:

A landscape — James Smith

David Collins (a portrait) — Mary Jones

Then the text, parsed only as a block, might mistakenly identify a (fictional) artist called ‘James Smith David’ and miss ‘James Smith’. Similarly, parsing only as lines might miss artists where segmentation problems (caused by skew or whitespace) have mistakenly broken a single line into more than one in the OCR.

The DLCS named entity recognition service was updated to use existing APIs provided by the DLCS service which produces the OCR in order to parse the text as both lines and blocks and merge the results to produce a more comprehensive set of exhibitors.
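A minimal sketch of the merging step, reusing the FlashText keyword processor from the example above (the helper name is illustrative):

def match_lines_and_blocks(block_text, line_texts, keyword_processor):
    """Union of matches found in the whole block and in each individual line."""
    found = set(keyword_processor.extract_keywords("".join(block_text.split())))
    for line in line_texts:
        found.update(keyword_processor.extract_keywords("".join(line.split())))
    return sorted(found)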

Dates

One extremely useful piece of information that can be extracted from the ULAN data, and from the PMC-provided Exhibitors list, is the dates during which an artist either lived and worked (ULAN) or exhibited (PMC).

Given 250 years of artists, it is common to find artists with the same name: for example, a father and son, or simply unrelated artists sharing a first and last name.

Using the dates available from the data, we filtered the list of tags for a given year to just those artists who were either known to exhibit in that year, or who could have exhibited in that year.

We implemented filtering by date in the tagging engine, using the IIIF navDate in each IIIF Manifest to restrict the possibilities.
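A minimal sketch of that date filter, assuming artist records shaped like the union-list JSON shown earlier; the minimum-age cut-off and helper name are illustrative assumptions:

from datetime import datetime

def could_have_exhibited(artist, nav_date, min_age=15):
    """Keep an artist who is known, or at least plausible, for a catalogue year."""
    year = datetime.strptime(nav_date, "%Y-%m-%dT%H:%M:%SZ").year
    if year in artist.get("years", []):  # known exhibition years (PMC list)
        return True
    for match in artist.get("matches", []):  # ULAN life dates
        born, died = match.get("born"), match.get("died")
        if born and died and born + min_age <= year <= died:
            return True
    return False

abbey = {"years": [1885, 1890], "matches": [{"born": 1852, "died": 1911}]}
print(could_have_exhibited(abbey, "1900-01-01T00:00:00Z"))  # True
print(could_have_exhibited(abbey, "1938-01-01T00:00:00Z"))  # False: died in 1911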

Putting it all together

If we take a sample catalogue page, there are 21 possible artists that could be identified on that page. By:

  1. Parsing as blocks and as lines
  2. Parsing with and without whitespace
  3. Using a known list of exhibitors
  4. Using automatically generated name variants
  5. Filtering using the date of the catalogue

we identified:

Wilfred Fairclough (Painter,Engraver)
Sidney E. Huxtable (Painter)
James A. Woodford (Painter,Sculptor)
Robert F. Micklewright (Painter)
James Newton (Painter)
Gordon L. Davies (Painter)
John Doyle (Painter)
Peter H. Harman (Painter)
Hilda Chancellor Pope (Painter)
Donald Bosher (Painter)
Rosemary Allan (Painter)
Robert J. Swan (Painter)
Robert F. Micklewright (Painter)
Patrick D. Nairne (Painter)
Violet Fuller (Painter)
J. Humphrey Spender (Painter)
Noel G. Baguley (Painter)

That is, 17 of the possible 21 artists on the page, with no falsely identified artists: an accuracy of ~81%. Furthermore, we identified these artists together with their role/artist type, extracted from the Getty ULAN data and the list of exhibitors provided by the Paul Mellon Centre.

Using basic neural net named entity recognition, with none of the additional enhancements developed for this project, the list returned would be:

Patrick D. Nairne 234 Albi
Gordon L. Davies
Leonard G. Brammer
James Newton
Noel G. Baguley
Willow Trees
Robert Swan 240 Vines
Rocca
Pietra
Sidney E. Huxtable
John Doyle 242
Peter H. Harman
James Woodford
R. A.
Violet Fuller
Rosemary Allan 246
Robert F. Micklewright 247 Reed Clump- conté
Margaret Roroney
Robert F. Micklewright
Alice R. Boothby
Swanscombe Philip Carroll

That is, 10 correct artists, and 11 where the artist is either not an artist at all, or has been incorrectly identified.

The updated service offers a considerable improvement over basic named entity recognition. This model could be extended for different data sets, using different sources of controlled vocabulary, and we would expect to see similar improvements in overall accuracy.

Interestingly, the out-of-the-box natural language processing picked up one artist missed by the more refined process. This suggests that, with some more time spent on the code which merges vanilla natural language processing with the custom processing pipeline, we could have slightly improved the overall score.

Rejected Options

While assessing our workflow, we considered and rejected a number of additional options.

Italics detection

For a relatively large set of catalogues, there is a consistent pattern, in the lists of exhibits, of using italics to identify artists. We considered using computer vision techniques to identify italic fonts, in order to restrict the text we extracted entities from to just the italic text.

However, we rejected this as the investment in time did not look promising. In particular, artists do not appear in italics in the earlier catalogues, and they also consistently do not appear in italics in the lists of committee members and exhibitors in the front and end matter of the volumes.

Page segmentation / splitting

Similarly, in a large set of catalogues, exhibitors typically appear at the right-hand side of the page with one exhibit per line. We considered splitting the images, or weighting the results based on the position of the text on the page: something we can compute because the OCR service on the DLCS can return coordinates and character positions for text, and we know the overall dimensions of the image and the character count per line.

However, as with italics, this pattern does not apply in the front or end matter of the volumes, in some cases exhibits break across multiple lines, and in earlier volumes artists do not consistently appear at the right. A small test set of images was created and this approach was tested, before being rejected as offering little improvement over parsing the entire text as a mixture of blocks and lines and ignoring the relative position of candidate artists within the line or page.

Training neural nets

Spacy does provide APIs for training the neural net model. For other sources of vocabulary, where the entity class is a new class not already part of the training model, it would potentially make sense to create ground-truth data and train the model, creating a new version of the engine that supports the new entity class.

In the case of the Royal Academy catalogues, we did not have a large enough set of truth data, and the entity class was an existing class that we needed to refine and improve, rather than a new class of data. So adopting methods to filter and improve the existing output was more time-effective than training a new model from scratch.

Implementation in the DLCS

The DLCS has a service (known internally as Montague) which parses the OCR text provided by the DLCS’s OCR service (known internally as Starsky), using Spacy to do named entity extraction (and, if necessary, part-of-speech tagging to, say, restrict named entities to just noun phrases), and serialises the results as W3C Web Annotations which are posted to the DLCS annotation server (aka Elucidate).

Queues

The DLCS is an event driven system, consisting of many services of varying size which receive messages from a queue and act in response to those messages.

The DLCS named entity service watches for new textual content being advertised via messages in the queue from the DLCS OCR service. These messages contain the canvas and manifest @id, which the DLCS named entity service dereferences to fetch IIIF Presentation API content, and then uses the DLCS OCR service API to fetch OCR text for canvases.

Working in this way makes it easy to add new services to the DLCS, or to push updated versions of existing services.
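A minimal sketch of this pattern, with queue consumption and the DLCS service calls stubbed out; the message fields and helper names are illustrative, not the real DLCS APIs:

import json
import requests

def fetch_ocr_text(canvas_id):
    # Stub: in the real system this calls the DLCS OCR service (Starsky)
    raise NotImplementedError

def tag_entities(canvas_id, text):
    # Stub: in the real system this runs the Spacy.io pipeline described below
    raise NotImplementedError

def handle_message(raw_message):
    """React to a queue message advertising new OCR text for a canvas."""
    body = json.loads(raw_message)
    # dereference the manifest @id to fetch IIIF Presentation API content
    manifest = requests.get(body["manifest_id"]).json()
    for canvas in manifest["sequences"][0]["canvases"]:
        if canvas["@id"] == body["canvas_id"]:
            tag_entities(canvas["@id"], fetch_ocr_text(canvas["@id"]))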

Pipelines

Spacy.io can implement extensions as pipeline steps. Our custom enhancements to the entity-extraction process are implemented as Spacy.io pipeline components, for performance and for tight integration with the part-of-speech tagging and named entity extraction provided by Spacy’s built-in neural net models.

We built pipeline components for the following (a simplified sketch of one such component appears after this list):

  • Ingesting custom vocabulary from JSON or CSV.
  • Enriching the custom vocabulary with variant forms (as per the personal name formats described above).
  • Using the Aho-Corasick algorithm to do fast pattern matching of text with custom vocabulary.
  • Filtering entities by type, e.g. to just Person tags, or just Person and Date tags.
  • Filtering entities by date.
  • Interacting with Starsky to fetch XYWH bounding boxes for entities on IIIF Image API images.
  • Combining neural net based entities with pattern matching entities.
  • Ignoring or removing Stopwords from entities.
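As a simplified sketch (using the Spacy 2.x API of the time), here is how a controlled-vocabulary matcher can be added as a pipeline component; the real DLCS components are considerably more elaborate:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class ExhibitorComponent(object):
    """Tag known exhibitor names as PERSON entities from a controlled vocabulary."""

    name = "exhibitors"

    def __init__(self, nlp, exhibitor_names):
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add("EXHIBITOR", None,
                         *[nlp.make_doc(n) for n in exhibitor_names])

    def __call__(self, doc):
        spans = [Span(doc, start, end, label="PERSON")
                 for _, start, end in self.matcher(doc)]
        # merge vocabulary matches with the model's entities,
        # dropping overlaps (the longest span wins)
        doc.ents = spacy.util.filter_spans(spans + list(doc.ents))
        return doc

nlp = spacy.load("en")
nlp.add_pipe(ExhibitorComponent(nlp, ["James Northcote", "George Barret"]),
             after="ner")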

We also modified existing code for serialising entities as W3C Web Annotations to create editable versions of those annotations understood by the Annotation Studio, which Paul Mellon Centre staff could use to correct machine-generated tags or create new tags for missed exhibitors.

Annotation Studio

We built an editing interface for Paul Mellon Centre staff to update content, if required. This editing interface used the Annotation Studio and a bespoke serialisation of the extracted artist data to render editable annotations in the format expected by the Annotation Studio.

Annotation Studio based editor for Chronicle250

The editable annotations were then transformed via a proxy to create the simple highlighting annotations that are made available to the viewer.

{
    "@id": "https://chronicle250.com/pmc/annotationlist/67c30f2efa07b2631ffec5299a27690c/2",
    "@type": "oa:Annotation",
    "motivation": "oa:linking",
    "on": "https://presley-pmc.dlc.services/iiif/raa/Vol7/canvas/c1#xywh=664,2399,756,109",
    "resource": {
        "@id": "https://www.chronicle250.com/index/exhibitors/B#John%20Bacon+%28Sculptor%29",
        "label": "Bacon, John (Sculptor) Exhib. 1769-1799"
    }
}

Overall Results

The machine identification of exhibitors across the corpus was extremely successful given the relatively short time spent on bespoke software development and R&D.

We were able to successfully identify 318,690 exhibitors across the catalogues. An upper bound for the possible maximum number of exhibitor names, assuming each exhibitor exhibited only once in each catalogue, would be 513,068; however, given how commonly exhibitors exhibit more than once in any given year, the actual total is certainly lower. There were very few tags for the post-1990 catalogues because we lacked any exhibitor data for those years.

Using the techniques described in this article offered a very good return on the time invested, versus the time it would have taken to manually tag 400,000–500,000 names in the corpus. Combining these techniques with the services provided by the DLCS made a resource- and data-heavy project possible in a relatively short timescale.

To measure the output, the Paul Mellon Centre produced statistics in Google Data Studio, to show the accuracy and distribution across the entire corpus. Click here to view the data.

IIIF Viewing components: Canvas Panel and the ‘PMC’ Viewer

In order to provide the results of the tagging process alongside the IIIF Image API images, Digirati built a bespoke IIIF Presentation API viewer for the Chronicle250 site.

Prior to the Chronicle250 project, Digirati had built CanvasPanel, a lightweight IIIF Presentation API canvas-viewing component with support for annotation display, which has been used on projects for the Victoria and Albert Museum, such as their Ocean Liners exhibition.

For the Chronicle250 project, we took CanvasPanel and added additional support for the features the site required, including the display of exhibitor hotlinks.

The PMC viewer can be found on Github at: https://github.com/digirati-co-uk/pmc-viewer

Search and Indexing

The full DLCS (Digital Library Cloud Service) provides an IIIF Content Search service (known internally as Mathmos) which integrates with the DLCS message bus and indexes both full text (provided by OCR) and annotations (provided by machine-generated tags).

However, for the Chronicle250 project, the vision was not to rely on the DLCS for delivery of textual content or services to the viewer: the DLCS text pipeline could be shut down after processing, leaving just the Chronicle250 website/application and the DLCS IIIF Image API and IIIF Presentation API services running as active services. In addition, the IIIF Content Search service on the DLCS provides basic, generic search which would not fulfil the full requirements of the Chronicle250 site.

Instead, for Chronicle250, we built a bespoke Elasticsearch-based index containing the OCR full text and the exhibitor tags, which provided both the IIIF Content Search used by the PMC Viewer and the general search and indexing services on the main Chronicle250 site.
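A minimal sketch of how an Elasticsearch query can be wrapped as an IIIF Content Search (v1) response; the index name and document fields here are illustrative assumptions:

import requests

def iiif_search(q, es_url="http://localhost:9200/chronicle/_search"):
    """Turn an Elasticsearch full-text query into a IIIF Content Search result."""
    hits = requests.get(es_url, json={"query": {"match": {"text": q}}}).json()
    return {
        "@context": "http://iiif.io/api/search/1/context.json",
        "@type": "sc:AnnotationList",
        "resources": [
            {
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {
                    "@type": "cnt:ContentAsText",
                    "chars": hit["_source"]["text"],
                },
                "on": hit["_source"]["canvas"] + "#xywh=" + hit["_source"]["xywh"],
            }
            for hit in hits["hits"]["hits"]
        ],
    }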

Digirati

Adam Meszaros, Senior Frontend Consultant. Full stack developer. Chronicle250.com; Site indexes and IIIF Content search services; PMC Viewer; Integration.

Stephen Fraser, Front End Technical Lead. AnnotationStudio; CanvasPanel; PMC Viewer.

Matt McGrattan, Head of Digital Library Services. DLCS text pipeline; Montague: natural language processing and tagging; Digirati Product Owner.

Adam Christie, Senior Engineer. DLCS Infrastructure; DevOps.

Ville Vartiainen, Senior UX Consultant. Digirati User Experience.

Ian Farquhar, Head of Project Delivery. Project Management.

Paul Mellon Centre

Tom Scutt, PMC Product Owner and Digital Editor.

Mark Hallett, Sarah Victoria Turner, Jessica Feather, Scholarly Editors.

Baillie Card, Publishing Editor.

Maisoon Rehani, Picture Editor.

Tom Powell, Sean Ketteringham and James Finch, Researchers.

Thérèse Saba, Copyeditor.

Jan Worrall, Indexer.

Strick and Williams

Charlotte Strick, Design.

Claire Williams Martinez, Design.

User Experience

http://www.unaffiliatedworks.com/
