As a user of the Perl programming language for more than fifteen years, I’m always eager to hear of new ways in which people use Perl. Web development, data munging, system administration — those are the uses I hear of most. But on a recent trip to California I was surprised to hear a friend who is not in the technology field and who did not know of my interest in Perl say, “Oh, Perl! We use Perl all the time!”
In this case Perl is being used by scholars at the Getty Research Institute in Los Angeles who study the German art market in the period from 1933 to 1945. During these years the Nazis looted artworks from Jews and others on a massive scale. Much of that stolen art made its way into the market and appeared in auction catalogs. In recent years efforts have been made to restore those artworks to their proper owners or their descendants. For this restitution to succeed, accurate, accessible data is needed on the German art market during these years.
Assembling that data is the focus of what Getty calls the German Sales Project (GSP). I (JK in the text below) was fortunate to be able to interview Kelsey Garrison (KG) from the Getty’s Collection and Provenance Department, and Suzanne E Michels (SEM) from the Getty’s Information Systems Department about this project. Here’s what I learned about this critical period of twentieth century history — and about Perl’s role in helping us to understand it.
JK: Let me start by asking what the “German Sales Project, 1930–1945” is? How did it come about? And why is it important?
KG: The “German Sales 1930–1945: Art Works, Art Markets, and Cultural Policy” digitization project was a two-year (2011‑2013) collaborative effort among the Getty Research Institute, Heidelberg University Library, the Kunstbibliothek (SMPK) Berlin, and thirty-six contributing libraries and archives in Germany, Austria, and Switzerland whose objective was to create an extensive and easily searchable database of World War II-era German art sale records.
From their ascent to power in 1933 to their defeat in 1945 at the end of World War II, the Nazis looted artwork from Jews, conquered nationalities and others on a massive scale. To address that problem we need ways of determining the provenance of a given artwork; we have to establish its chain of ownership. Establishing provenance is difficult due to the sheer scale of the Nazis’ looting campaigns. Essential source material was scattered in archives and libraries around the world, rarely inventoried, and seldom digitized.
The German Sales Project (“GSP”), generously supported by grants from the National Endowment for the Humanities (NEH), the German Research Foundation (DFG), and the Volkswagen Foundation, sought to address the physical limitations that Nazi-era provenance research presents. It did so in accordance with the Washington Principles, which state that “relevant records and archives should be open and accessible to researchers, in accordance with the guidelines of the International Council on Archives.”
The GSP yielded almost 3,000 unique catalogues from 177 auction houses as well as duplicate catalogues with valuable handwritten annotations. These catalogues generated one million records, 250,000 of which were considered “high art” objects (paintings, sculptures, drawings, etc.) and were extensively edited before being released online in January 2013.
For the Getty Provenance Index®, the GSP was a conscious effort to expand its purview into the 20th century with a period that carries a great amount of interest for scholars and researchers worldwide because of the great dispossession and displacement of art just before and during World War II. It was also a test case for the implementation of a newly developed collaborative workflow that started with Optical Character Recognition (OCR) software used by our collaborators at the Heidelberg University Library. The scanned data was then cleaned up and parsed by our own, in-house-developed Perl program before being imported into the larger, multi-country and multi-century Sales Catalogs database within the Provenance Index®.
Previously, entering this sheer volume of information would have taken decades. The great success of the GSP 1930‑1945 has led the way to a second project, German Sales Phase II: 1901‑1929, which seeks to fill in the gap between the first phase of the project and the beginning of the 20th century.
JK: Who are the people who contribute to or use the GSP?
KG: Besides its obvious use on a case-by-case basis by provenance researchers and scholars interested in Holocaust restitution, art market scholars can also benefit from the GSP. An important part of our project involves review and normalization of the data collected in the database by German Sales editors and, whenever possible, enhancement with buyer information and prices collected from auction results published by contemporary art journals (such as Weltkunst, Internationale Sammlerzeitung, and Pantheon). Art market researchers can take large data sets with these prices along with their corresponding object specifications to examine the convergence of history, art, economics, and the law for this particularly complicated time and place.
JK: Okay, now let’s get an understanding of the workflow by which the data flows from analog historical documents to digitized information.
SEM: A printed catalog is scanned to create a PDF image file and an OCR text file by our partners in Heidelberg. We have to manually “clean” the text file before we can submit it to our Perl program. The OCR file often contains typographical errors and misreads. Most importantly, cleaning makes certain each lot is sequentially numbered. The program generates an Excel spreadsheet with parsed data for each lot. Editors then review and correct the spreadsheets before the data are imported into our Provenance Index database.
Here is a sample of a catalog displaying one type of format:
And here is a sample of the text file associated with this PDF page:
We were lucky in this case that the OCR was of good quality and did not contain errors. Note that although it is accurate, the OCR text file preserves no formatting cues such as bold or underlining. Nothing distinguishes one lot from the next except the aforementioned sequential identifying number.
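As an illustration of why sequential numbering matters, here is a minimal Python sketch of a numbering check. It is not the Getty's code, which is written in Perl and is not reproduced in this interview. The function name `find_numbering_gaps` and the assumption that every lot starts on a line beginning with its number are hypothetical.

```python
import re

# Assumed convention: a lot begins on a line that starts with its number,
# optionally followed by a period (e.g. "12 Landschaft" or "12. Landschaft").
LOT_START = re.compile(r"^(\d+)\.?\s")


def find_numbering_gaps(text):
    """Return (expected, found) pairs wherever lot numbers break sequence."""
    gaps = []
    expected = None
    for line in text.splitlines():
        match = LOT_START.match(line)
        if not match:
            continue  # continuation line of the current lot
        number = int(match.group(1))
        if expected is None:
            expected = number  # the first lot sets the starting point
        elif number != expected:
            gaps.append((expected, number))
        # Assume the lot exists but its number was misread, so keep
        # counting from the expected value rather than the found one.
        expected += 1
    return gaps


# Invented sample in which the OCR has read "3" as "8".
sample = "1 Landschaft\n2 Stilleben\n8 Bildnis einer Dame\n4 Flusslandschaft\n"
print(find_numbering_gaps(sample))  # → [(3, 8)]
```

A description line that happens to begin with a number would be flagged too, so a check like this can only point a human reviewer at suspicious places; it cannot replace the manual cleaning step.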
And here is a snippet of the resulting spreadsheet:
JK: What were your requirements when beginning the project?
SEM: We were given a lot of leeway in our requirements since this hadn’t been done before and we were unsure just what we would be capable of. Quite honestly we did a lot of requirements development as we went along discovering the nature of the data and what we could and couldn’t do with it.
As we progressed we focused our interest on elements that the database users would use for searching. Initially I was told to extract the lot number and the description as a block. I took a lot of license as to what elements I would look for and parse into a spreadsheet and continually found ways to extract more and more data. Eventually I was able to extract over a dozen elements for each lot: PDF page number, text page number, lot number, artist name, artist information, description, object type, materials, dimensions, illustration page number, date of sale, starting price, currency type, catalog section heading. (Note: not all items are represented in the example displayed above.)
I started with one large Perl program, but as I continued to work with other auction houses and found additional formats, it was obvious that I couldn’t expect to modify my large Perl program for each different format variation we encountered. This led me to create “pre-processors” which made it possible to make the data conform to the format the Perl program was looking for instead of trying to change that program for each format. Using this process I was able to easily modify a basic pre-processing scheme to pick up the variations, large or small, between formats and produce a consistently well populated spreadsheet.
Here is an example of the text shown above, now that it has been pre-processed into a format the main Perl program will be able to process:
Line 27: Artist name is now in all caps, on its own line
Lines 28–30: Single CRLF between artist name, artist information and descriptive block
Line 31: Lot number now followed by a period
Lines 33–36: Four CRLFs separate lots
Line 39: Dagger symbol is represented by the letter “f”, which editors will replace
Line 39: Diacritic ü replaced by Perl-friendly representation Ã¼
Lines 41–43: Multiple lots under a single artist
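The pre-processing idea can be sketched in Python as follows. This is an illustration only: the raw input layout (lot number and artist name sharing the first line, a blank line between lots) is a hypothetical example of one catalog format, the sample names are placeholders, and `preprocess` is an invented name. Only the target conventions (artist name in capitals on its own line, lot number followed by a period, four CRLFs between lots) come from the list above.

```python
import re

# Hypothetical raw layout: "<lot number> <artist name>" on the first line,
# description on the following lines, one blank line between lots.
RAW_HEADER = re.compile(r"^(\d+)\s+(.+)$")

LOT_SEPARATOR = "\r\n" * 4  # four CRLFs between lots


def preprocess(raw_text):
    """Rewrite raw lots into the layout the main parser is assumed to expect."""
    lots = []
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        header = RAW_HEADER.match(lines[0])
        if header is None:
            # Unrecognized block: pass it through for an editor to inspect.
            lots.append("\r\n".join(lines))
            continue
        lot_number, artist = header.groups()
        description = " ".join(lines[1:])
        # Artist in capitals on its own line, then "<lot>. <description>".
        lots.append(f"{artist.upper()}\r\n{lot_number}. {description}")
    return LOT_SEPARATOR.join(lots)


raw = "12 Max Mustermann\nLandschaft mit Bäumen.\n\n13 Erika Musterfrau\nStilleben.\n"
print(repr(preprocess(raw)))
```

Under this approach a new catalog format needs only a new header pattern or a new small pre-processor, while the main parser stays untouched.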
JK: What went into your choice of Perl as the programming language?
SEM: I came to the project from a scientific setting, having been in aerospace and working with C++. We knew we needed something much more text friendly! I also recognized I’d need to learn the language as I went along, and Perl was easy for me to pick up and let me structure a program so there was no downtime while I climbed the learning curve. Specifically, we chose Perl for these reasons:
- The superior text-handling capability for which Perl is well known.
- Excellent integration of regular expressions, which made it easy to find very specific elements.
- Easy modularization, allowing for great program flexibility. When designing the program I knew we would need to make it able to handle a wide variety of ever-changing inputs, so everything was implemented as functions that could be called, or ignored, by the main program. I couldn’t continually re-program the main Perl program to fit every variety of data, so instead I created pre-processors to make data conform to a format the main Perl program understands.
- We knew we would have to create and integrate an Access database for artist name recognition. Perl made this step easy and straightforward.
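The "functions that could be called, or ignored" design, together with artist-name recognition, might look roughly like the following Python sketch. Everything here is hypothetical: the in-memory set `KNOWN_ARTISTS` stands in for the Access database mentioned above, the names in it are placeholders, and the function names are invented for illustration.

```python
# Stand-in for the artist-name database; the names are placeholders.
KNOWN_ARTISTS = {"MAX MUSTERMANN", "ERIKA MUSTERFRAU"}


def normalize_whitespace(lot):
    """Collapse runs of whitespace in every line of the lot."""
    lot["lines"] = [" ".join(line.split()) for line in lot["lines"]]
    return lot


def recognize_artist(lot):
    """Move the first line into 'artist' if it matches a known name."""
    if lot["lines"] and lot["lines"][0].upper() in KNOWN_ARTISTS:
        lot["artist"] = lot["lines"].pop(0)
    return lot


def run_steps(lot, steps):
    """Apply only the steps the caller selected, in order."""
    for step in steps:
        lot = step(lot)
    return lot


lot = {"lines": ["MAX   MUSTERMANN", "12. Landschaft mit Bäumen."]}
print(run_steps(lot, [normalize_whitespace, recognize_artist]))
```

Because each step takes a lot and returns a lot, a catalog without set-off artist names can simply leave `recognize_artist` out of the list.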
JK: Can you describe the points in the workflow where you use the Perl programming language, how you do so, and why?
KG & SEM: The entire automated parsing process is accomplished with custom software written in Perl. Before running the Perl program we use the PDF image to manually write out a rough map of the catalog structure and take note of important “landmarks” that will become program input arguments. Text files are reviewed for consistent numbering and obvious OCR errors are manually corrected. Once that is done we turn things over to Perl. We run an appropriate Perl pre-processing script to make the data conform to the format the main program expects. After this is accomplished the program analyzes the structure of each block of OCR text and determines what data elements are present. The main program then delivers an Excel spreadsheet wherein data from each lot has been parsed into any of the 26 available columns.
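To make the parsing step concrete, here is a small Python sketch that turns one pre-processed block into a row of named columns. It rests on several assumptions: the block layout (artist in capitals, then "<lot>. <description>"), the four column names, and the dimension pattern are all hypothetical, and CSV output from the standard library stands in for the Excel spreadsheet the real program produces.

```python
import csv
import io
import re

COLUMNS = ["lot_number", "artist", "description", "dimensions"]  # hypothetical subset
LOT_LINE = re.compile(r"^(\d+)\.\s+(.*)$")
DIMENSIONS = re.compile(r"\d+\s*[x×]\s*\d+\s*cm")  # e.g. "60 x 80 cm"


def parse_lot(block):
    """Split one pre-processed lot into a dict keyed by column name."""
    row = dict.fromkeys(COLUMNS, "")
    description_parts = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        lot_match = LOT_LINE.match(line)
        if lot_match:
            row["lot_number"] = lot_match.group(1)
            description_parts.append(lot_match.group(2))
        elif line.isupper() and not row["lot_number"]:
            row["artist"] = line  # all-caps line before the lot number
        else:
            description_parts.append(line)
    row["description"] = " ".join(description_parts)
    dims = DIMENSIONS.search(row["description"])
    if dims:
        row["dimensions"] = dims.group(0)
    return row


def write_rows(rows):
    """Serialize parsed rows as CSV text (a stand-in for the Excel output)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


block = "MAX MUSTERMANN\r\n12. Landschaft mit Bäumen. Öl auf Leinwand, 60 x 80 cm."
print(write_rows([parse_lot(block)]))
```

Columns that the parser cannot fill stay empty, which leaves an obvious gap for an editor to review in the spreadsheet.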
JK: Can you provide a brief overview of the problems in working with historical data which the GSP has had to face?
KG & SEM: Hands down, the lack of consistent formatting from catalog to catalog was the biggest challenge. It meant that we had to be incredibly flexible in our process to “take all comers” and manage to recognize the pertinent data, however it was being presented.
Two major formats presented themselves for individual lots, one where we had an artist name set off from the descriptive text and another where we had only descriptive text. We dubbed them “format 1” and “format 2”, respectively. We could find multiple iterations of both formats within a single catalog, flipping between format 1 and format 2 numerous times or perhaps not at all.
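A rough Python sketch of telling the two formats apart, and of counting how often a catalog flips between them, is shown below. It assumes the pre-processed convention in which a set-off artist name appears in capitals on the first line of a block; that heuristic and the function names are illustrative assumptions, not the project's actual logic.

```python
def detect_format(block):
    """Return 1 if the block opens with a set-off artist name, else 2."""
    first_line = next(
        (line.strip() for line in block.splitlines() if line.strip()), ""
    )
    # A line that starts with a digit is a lot line, never an artist name.
    starts_with_digit = first_line[:1].isdigit()
    return 1 if first_line.isupper() and not starts_with_digit else 2


def count_switches(blocks):
    """Count how many times consecutive lots change format."""
    formats = [detect_format(block) for block in blocks]
    return sum(1 for a, b in zip(formats, formats[1:]) if a != b)


catalog = [
    "MAX MUSTERMANN\n12. Landschaft.",
    "13. Stilleben.",
    "14. Bildnis.",
    "ERIKA MUSTERFRAU\n15. Interieur.",
]
print([detect_format(block) for block in catalog])  # → [1, 2, 2, 1]
print(count_switches(catalog))  # → 2
```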
Other challenges that complicated the OCR process included antiquated fonts in which similar characters such as 3 and 8 are often confused, as are letters such as “c” and “e”. Shadowing or bleed-through on pages and faded print led to many OCR errors.
Human intervention may have caused problems, such as missing or cut pages from catalogs, or handwritten notes in the margins, which the OCR would read as junk text that required manual deletion.
The scanning process could also introduce problems if pages were curved or poorly scanned, particularly if gutters were not properly set and resulted in text missing from the OCR text files.
JK: The users of the GSP are, presumably, researchers. Can you describe how researchers are going to interact with the data?
KG: The database is set up so researchers can pinpoint specific artworks or can search on entire classes or styles of art and use that information to understand market trends and historical context. The “Sale Description” link takes the user to bibliographic information about the physical catalogue, and the “Sale Contents” link takes the user to records of the other artwork found in the same catalogue. The GSP also links each record to a PDF scan of the source document, taking the user directly to the page of the corresponding PDF where the lot is found. Researchers using the Provenance Index® can download large sets of data into their own customizable spreadsheets.
JK: Where do you expect the GSP to be headed in the next few years, both in terms of its research focus and in its use of information technology?
KG & SEM: Given the success of Phase I of GSP (1930‑1945) and interest in the content, we were able to acquire funding to initiate a second phase covering 1901‑1929.
After moving into Phase II, we have further enhanced the Excel spreadsheet by having the Perl code write live Excel formulas into it. The formulas activate based on keywords in the description. Given the frequency of OCR errors, important keywords are often misread in the description. Once an editor corrects the spelling mistakes in the spreadsheet, the formulas automatically populate the appropriate column.
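The keyword-driven formulas could be generated along these lines. This Python sketch only builds the formula text; the keyword table, the labels, and the cell reference are hypothetical, since the project's real keywords and column layout are not given here. The Excel functions used (`IF`, `ISNUMBER`, `SEARCH`) are standard worksheet functions.

```python
# Hypothetical keyword-to-label table, checked in order of priority.
OBJECT_TYPE_KEYWORDS = [
    ("Gemälde", "painting"),
    ("Zeichnung", "drawing"),
    ("Skulptur", "sculpture"),
]


def object_type_formula(description_cell):
    """Build a nested IF formula that labels a lot by keyword."""
    formula = '""'  # result when no keyword is found
    # Wrap from the last keyword outward so the first keyword is tested first.
    for keyword, label in reversed(OBJECT_TYPE_KEYWORDS):
        formula = (
            f'IF(ISNUMBER(SEARCH("{keyword}",{description_cell})),'
            f'"{label}",{formula})'
        )
    return "=" + formula


print(object_type_formula("F2"))
```

Because the formula is live, a description containing an OCR misreading such as "Gemalde" yields an empty cell at first, and the label appears as soon as an editor corrects the spelling in the description cell.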
Perl has been a constant and reliable tool throughout both phases of the GSP. Whenever we encounter a problem we’re able to fix it and whenever we encounter an opportunity we’re able to seize upon it. We credit the flexibility and ease of use of Perl for this. This means we’re producing more data-rich spreadsheets today and we’re producing them more quickly and with fewer problems than in the earlier phase of this project.
JK: Thank you very much for taking the time to speak with us.
KG & SEM: You’re welcome. If you’d like to find more information about our process and actually visit the Provenance database, we’ve provided you with some links:
- Iris blog: http://blogs.getty.edu/iris/publishing-german-sales-a-look-under-the-hood-of-the-getty-provenance-index/
- Provenance index: http://www.getty.edu/research/tools/provenance/index.html
Thanks to Gail Feigenbaum and William Tronzo for facilitating this interview.