CIDOC vs Solr — Steamrolling your model

Matt Miller
Apr 10, 2017 · 5 min read

The Villa I Tatti recently launched a new portal for Bernard Berenson’s Drawings of the Florentine Painters, which presents Renaissance drawings from various collections that Berenson recorded across three editions (1903, 1938, 1961). Several people worked on the project; one of my major tasks was writing an ingest process to go from the RDF triples to a Blacklight portal. This was a novel problem for me: going from a SPARQL endpoint to a Solr index presented its own unique difficulties.

Leonardo da Vinci’s Small nude figures swinging axes; notes and diagrams on the theory of light and shade.

Problem Space:

  • 300K triples
  • Data model is CIDOC
  • Four separate graphs representing the three catalog editions and a “project data” graph
  • Data in the triple store is expected to change or be enriched over time
  • External URIs are used (ULAN, VIAF, AAT), and the portal should make use of the remote data behind them

The data ingest process is a Rails rake task that takes the data from the SPARQL endpoint, transforms it into flat Solr documents, and indexes them.
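As a rough sketch of the first half of that pipeline, pulling one named graph out of a SPARQL endpoint could look something like the following. The endpoint URL, the graph URI, and the helper names `graph_dump_uri` and `fetch_graph` are assumptions made for illustration, not the project's actual code; the request follows the standard SPARQL protocol (a GET with a `query` parameter).

```ruby
require 'net/http'
require 'uri'

# Hypothetical endpoint and graph URIs, for illustration only.
SPARQL_ENDPOINT = 'http://example.org/sparql'
EDITION_GRAPH   = 'http://example.org/graph/edition-1903'

# Build a GET URL that asks the endpoint for every triple in one named graph.
def graph_dump_uri(endpoint, graph)
  query = "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <#{graph}> { ?s ?p ?o } }"
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(query: query)
  uri
end

# Fetch the graph as N-Triples text. This performs a network request when called.
def fetch_graph(endpoint, graph)
  uri = graph_dump_uri(endpoint, graph)
  request = Net::HTTP::Get.new(uri)
  request['Accept'] = 'application/n-triples'
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  raise "SPARQL request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Only builds the URL; nothing is sent until fetch_graph is called.
puts graph_dump_uri(SPARQL_ENDPOINT, EDITION_GRAPH)
```

Whether a given triple store returns N-Triples for a CONSTRUCT query depends on the store, so the `Accept` header may need adjusting.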

The major obstacles are the complexity of the data model and how to get the data out of the triple store and manipulate it. There were three options for the latter problem:

  • Query the remote SPARQL endpoint for each resource to acquire all the data needed for each record.
  • Download each graph, create a local triple store in memory and query that with SPARQL.
  • Use native Ruby data types like hashes and arrays to build the lookup tables needed for building each record.
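To make the third option concrete, here is a minimal sketch of a lookup table built from plain hashes and arrays. The triples, the `ex:` identifiers, and the `build_lookup` helper are all made up for illustration; the real data uses CIDOC entities and predicates.

```ruby
# Index triples as subject => predicate => [objects], so that every hop
# through the graph becomes a plain hash lookup.
def build_lookup(triples)
  lookup = Hash.new { |subjects, s| subjects[s] = Hash.new { |preds, p| preds[p] = [] } }
  triples.each { |subject, predicate, object| lookup[subject][predicate] << object }
  lookup
end

# Each triple is a [subject, predicate, object] array (hypothetical data).
triples = [
  ['ex:drawing1', 'ex:hasTitle',   'ex:title1'],
  ['ex:title1',   'ex:value',      'Small figures swinging axes'],
  ['ex:drawing1', 'ex:hasCreator', 'ex:leonardo'],
  ['ex:leonardo', 'ex:label',      'Leonardo da Vinci']
]

lookup = build_lookup(triples)

# Two hops: drawing -> title entity -> literal value.
title_node = lookup['ex:drawing1']['ex:hasTitle'].first
puts lookup[title_node]['ex:value'].first
```

One trade-off of the default-block hash is that reading a missing key creates an empty entry, which is harmless for a throwaway ingest table but worth knowing about.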

These data gymnastics are needed because of the first problem: CIDOC can become fairly complex when modeling the data. For example, to attach an image to a resource you have to link together several entities with multiple predicates and then track down the final literal value that points to the image.

Partial data model diagram by Alexandra Provo
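A chain like the image example, from a drawing through several entities down to a literal, can be sketched as a walk over a list of predicates. The predicate names and URIs below are hypothetical stand-ins rather than the real CIDOC properties, and `follow` is an illustrative helper, not code from the project.

```ruby
# subject => predicate => [objects]. Every identifier below is a
# hypothetical stand-in, not a real CIDOC property name.
INDEX = {
  'ex:drawing1'    => { 'ex:hasRepresentation' => ['ex:visualItem1'] },
  'ex:visualItem1' => { 'ex:isIdentifiedBy'    => ['ex:identifier1'] },
  'ex:identifier1' => { 'ex:value'             => ['http://example.org/images/drawing1.jpg'] }
}.freeze

# Walk a list of predicates from a starting node, collecting every node
# reachable at each hop. Returns an empty array if the chain is broken.
def follow(index, start, path)
  path.reduce([start]) do |nodes, predicate|
    nodes.flat_map { |node| index.fetch(node, {}).fetch(predicate, []) }
  end
end

IMAGE_PATH = %w[ex:hasRepresentation ex:isIdentifiedBy ex:value].freeze
puts follow(INDEX, 'ex:drawing1', IMAGE_PATH).inspect
```

Returning an array rather than a single value keeps the helper usable for properties that legitimately repeat.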

This is more or less the same for all of the properties of a cataloged drawing. In addition, there are three different graphs of data for the three editions, meaning that the same data could be repeated across the three editions or might change from one to another. An example of this is a drawing attributed to one artist in the first edition and then reattributed to another artist in a later one.
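Once each edition's value for a property sits in its own slot, spotting a change such as a reattribution is a simple comparison. The sketch below assumes a hash keyed by edition year; the artist names and the `attribution_changes` helper are placeholders for illustration.

```ruby
EDITIONS = %w[1903 1938 1961].freeze

# Given edition year => attributed artist for one drawing, list each point
# where the attribution differs between consecutive editions that have one.
def attribution_changes(by_edition)
  known = EDITIONS.select { |year| by_edition[year] }
  known.each_cons(2).map do |earlier, later|
    next if by_edition[earlier] == by_edition[later]

    { from: earlier, to: later, was: by_edition[earlier], now: by_edition[later] }
  end.compact
end

# Placeholder names, not real catalog data.
p attribution_changes({ '1903' => 'Artist A', '1938' => 'Artist A', '1961' => 'Artist B' })
```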

At this point I wish I could say I came up with some elegant, beautiful solution, but as with most problems, doing the most straightforward, least complex thing will get you where you need to go. The resulting rake task follows this logic:

  1. Bulk download each of the three edition graphs from the SPARQL endpoint.
  2. Build lookup hashes in native Ruby for all the intermediate entities, so that properties can be assigned directly to the resources rather than to the entities attached to them.
  3. Resolve any external URIs, such as VIAF, and cache that data locally in the file system; expire the cache if it is more than 30 days old.
  4. Download the project data graph (contains image links, museum links, etc).
  5. Build a Solr document from all the in-memory data, cherry-picking the best values (the 1961 title over a missing 1903 title, for example), while also repeating the data elements for each edition within the document.
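The caching in step 3 can be sketched with nothing more than files and modification times. This is a minimal illustration under stated assumptions: the `cached_fetch` helper, the SHA1-based file naming, and the stubbed fetch block are invented here, and only the 30-day expiry comes from the steps above.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

CACHE_TTL = 30 * 24 * 60 * 60 # 30 days, in seconds

# Return the cached body for a URI if it is younger than the TTL;
# otherwise call the block to fetch it, store the result, and return it.
def cached_fetch(uri, cache_dir:, ttl: CACHE_TTL, now: Time.now)
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(uri))
  fresh = File.exist?(path) && (now - File.mtime(path)) < ttl
  return File.read(path) if fresh

  body = yield(uri)
  File.write(path, body)
  body
end

# Demo with a stubbed fetch so nothing touches the network.
Dir.mktmpdir do |dir|
  calls = 0
  2.times do
    cached_fetch('http://viaf.org/viaf/153159158/', cache_dir: dir) do |_uri|
      calls += 1
      'stubbed response'
    end
  end
  puts "remote fetches: #{calls}"
end
```

In a real task the block would perform the HTTP request; keeping the fetch in a block makes the expiry logic easy to exercise without a network.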

Here is the Solr doc for the drawing above:

{
"id": "0001192-Berenson",
"language_facet": [
"By",
"By",
"By"
],
"edition_facet": [
"Berenson 1903",
"Berenson 1938",
"Berenson 1961"
],
"owner_facet": [
"Royal Library (Windsor)"
],
"technique_facet": [
"ink"
],
"technique_recto_t": [
"ink"
],
"technique_recto_uri": [
"http://vocab.getty.edu/aat/300015012"
],
"technique_verso_t": [
"ink"
],
"technique_verso_uri": [
"http://vocab.getty.edu/aat/300015012"
],
"has_image_i": 1,
"museum_url_s": "https://www.royalcollection.org.uk/collection/search#/1/collection/919149/recto-notes-on-optics-etc-with-diagrams-verso-notes-on-optics-etc-with-diagrams",
"inventory_number_s": "19149",
"inventory_number": "19149",
"owners_label_t": [
"Royal Library"
],
"owners_uri_t": [
"http://viaf.org/viaf/153159158/"
],
"owners_geo_label_t": [
"Windsor"
],
"owners_geo_uri_t": [
"http://sws.geonames.org/2633842"
],
"owners_label_display": "",
"contributors_t": [
"Leonardo da Vinci"
],
"contributors_alt_t": [
"Leonardo",
"Vinci, Leonardo da",
"Leonardo, da Vinci",
"da Vinci, Leonardo",
"Léonard de Vinci",
"Léonardo de Vinci",
"Leonardo di Ser Piero da Vinci",
"Leonardo Da Vinci",
"Da Vinci, Leonardo",
"da Vinci Leonardo",
"Lionardo",
"Liyūnārdū Dāvīnshī",
"Vinchi, Leonardo da",
"Leonardo da Vinchi",
"Léonard",
"Lieh-ao-na-to",
"Леонардо да Винчи",
"Леонардо",
"לאונרדו",
"ליאונארדו",
"ליאונרדו דא וינצ׳י",
"ליאורנרדו",
"李奥纳多·达·文西",
"レオナルド・ダ・ヴィンチ",
"Leonardo de Vinza",
"Leonardo D'Vinci",
"Leonardo da Vince",
"Leonardo d'Avinci",
"Leonard Davincy",
"Leonardo De Vinci"
],
"contributors_ulan_t": [
"http://vocab.getty.edu/ulan/500010879"
],
"contrubtor_preflabel": "",
"title_sort": "Appunti di ottica / Piccole figure nude che manovrano scuri; appunti e diagrammi sulla teoria della luce e delle ombre.",
"bcn_t": [
"1192",
"1192",
"1192"
],
"bnc_1903_s": "1192",
"verso_title_1903_t": "--",
"recto_title_1903_t": "Small figures swinging axes. Some manuscript notes.",
"verso_note_1903_t": "--",
"recto_note_1903_t": "--",
"verso_figures_1903_t": [
],
"recto_figures_1903_t": [
],
"page_number_1903_s": "",
"image_plate_display": null,
"image_bm_display": null,
"image_plate_number_s": null,
"image_plate_number_roman_s": null,
"image_thumb_display": null,
"image_page_display": null,
"museum_image_url_display": [
"https://d9y2r2msyxru0.cloudfront.net/sites/default/files/collection-online/0/2/282067-1464772643.jpg"
],
"museum_image_text_display": [
"Recto"
],
"author_1903_display_s": "Leonardo da Vinci (By)",
"bnc_1938_s": "1192",
"verso_title_1938_t": "Small nude figures swinging axes; notes and diagrams on the theory of light and shade.",
"recto_title_1938_t": "--",
"verso_note_1938_t": "--",
"recto_note_1938_t": "Vertically to them in red chalk a nude seated figure bending down, not by Leonardo as Clark correctly affirms.",
"verso_figures_1938_t": [
],
"recto_figures_1938_t": [
],
"page_number_1938_s": "",
"author_1938_display_s": "Leonardo da Vinci (By)",
"bnc_1961_s": "1192",
"verso_title_1961_t": "Piccole figure nude che manovrano scuri; appunti e diagrammi sulla teoria della luce e delle ombre.",
"recto_title_1961_t": "Appunti di ottica",
"verso_note_1961_t": "--",
"recto_note_1961_t": "--",
"verso_figures_1961_t": [
],
"recto_figures_1961_t": [
],
"page_number_1961_s": "",
"author_1961_display_s": "Leonardo da Vinci (By)",
"title_t": "Appunti di ottica / Piccole figure nude che manovrano scuri; appunti e diagrammi sulla teoria della luce e delle ombre.",
"title_display": "Appunti di ottica / Piccole figure nude che manovrano scuri; appunti e diagrammi sulla teoria della luce e delle ombre.",
"subtitle_t": "-- / Small nude figures swinging axes; notes and diagrams on the theory of light and shade.",
"subtitle_display": "-- / Small nude figures swinging axes; notes and diagrams on the theory of light and shade.",
"subject_topic_facet": [
"Leonardo da Vinci"
],
"author_display": "Leonardo da Vinci (By)",
"author_sort": "Leonardo da Vinci (By)",
"author_t": [
"Leonardo da Vinci"
],
"authorsuggest": "Leonardo da Vinci (By)",
"thumbnail_url_s": "https://s3-eu-west-1.amazonaws.com/florentinedrawings/thumbs/0001192-Berenson.jpg"
}

You’ll notice the year suffixes, “verso_title_1938_t” vs “verso_title_1961_t” for example. You will also notice there are recto and verso data elements, since a resource often comprises two sides that can carry conflicting data: one side done in ink and the other in wash, for example. We can even mix in external data, like alternate contributor labels from VIAF or ULAN. View how this record is displayed.
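The flattening itself can be sketched as a loop over editions and sides that writes one suffixed field per combination. The field-name pattern mirrors the document above, but the `title_fields` helper and the rule for choosing a display title (the most recent edition with a recto title) are assumptions for illustration, not necessarily what the portal does.

```ruby
EDITION_YEARS = %w[1903 1938 1961].freeze
SIDES = %w[recto verso].freeze
MISSING = '--'

# editions: year => { 'recto' => title, 'verso' => title }
# Produces one <side>_title_<year>_t field per edition and side, plus a
# single display title picked from the editions.
def title_fields(id, editions)
  doc = { 'id' => id }
  EDITION_YEARS.each do |year|
    SIDES.each do |side|
      doc["#{side}_title_#{year}_t"] = editions.dig(year, side) || MISSING
    end
  end

  # Assumed rule: prefer the latest edition that has a real recto title.
  best = EDITION_YEARS.reverse
                      .map { |year| editions.dig(year, 'recto') }
                      .find { |title| title && title != MISSING }
  doc['title_display'] = best if best
  doc
end

doc = title_fields('0001192-Berenson',
                   '1903' => { 'recto' => 'Small figures swinging axes. Some manuscript notes.' },
                   '1961' => { 'recto' => 'Appunti di ottica' })
puts doc['title_display']
puts doc['verso_title_1938_t']
```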

Fortunately, CIDOC is very capable of modeling such complex data. Unfortunately, when that data gets collapsed into a single document, its dimensionality has to be reduced to work with Solr, resulting in a slightly repetitive hash.

I think this approach works well at this scale: the dataset is small enough to be held in memory, so when data changes in the triple store it can be reindexed in a few minutes. Some opaqueness is introduced in the translation, but that seems unavoidable when going from a complex representation to a flat one.

Browse the portal, and explore the data!
