Harvard LIL Fellowship — Part 4

Matt Miller
3 min read · Sep 3, 2017


The fellowship ended in August, but the blog continues! I still have notes from the whirlwind of activity in the last few weeks of the fellowship that need to be written up to complete this series.

As discussed in part 3, I had two problems: extracting messy judicial names from the OCR text and finding an authority file for Illinois appellate court judge names. In addition to these practical hurdles I had a larger conceptual question: what to do with this data, how it could be leveraged, and how these court opinions could fit into the linked data ecosystem.

I built out my library to process the text and extract just the last name of the judge(s) who had authored the opinion. This works with the plethora of different formats and arrangements used over the last hundred years. Now, for the Illinois State Supreme Court, I have the string of the judge’s last name and the possible URIs for that string. I have the URIs because I know when the opinion was issued and who was a State Supreme Court judge at the time. Using this authority file (also linked to Wikidata and courtlistener.com) I can reasonably guess who the judge is, even with problematic OCR.
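The date-based narrowing can be sketched roughly like this in Node. The judges.json file, its field names (lastName, uri, termStart, termEnd), and the example values are hypothetical stand-ins for the actual authority file, not the real data:

```javascript
// Sketch only: judges.json is a stand-in for the authority file, e.g.
// [{ "lastName": "Examplewright",
//    "uri": "https://www.wikidata.org/wiki/Q00000",
//    "termStart": "1900-01-01", "termEnd": "1920-01-01" }, ...]
const judges = require('./judges.json');

// Return the judges who were sitting on the court when the opinion was issued.
function candidatesForDate(opinionDate) {
  const d = new Date(opinionDate);
  return judges.filter(
    (j) => new Date(j.termStart) <= d && d <= new Date(j.termEnd)
  );
}

// e.g. candidatesForDate('1913-02-20') -> the handful of possible judges
// (and their URIs) to compare against the OCR'd last name string.
```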

The process narrows down the candidates for the last name string. If an obvious direct match is not available, it uses Dice’s coefficient (found in many Node string libraries) to find the best possible match. I made a video of this process to show how well/poorly it works:

33 minutes of hot string matching!!!!
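The fuzzy-match step itself is small. Here is a rough sketch assuming the candidate list from the date lookup above; the bigram Dice implementation and the 0.5 threshold are illustrative, not the exact code in my library:

```javascript
// Sørensen–Dice similarity on character bigrams, used to pick the most
// likely judge when the OCR'd name isn't an exact match.
function bigrams(s) {
  const clean = s.toLowerCase().replace(/[^a-z]/g, '');
  const grams = [];
  for (let i = 0; i < clean.length - 1; i++) grams.push(clean.slice(i, i + 2));
  return grams;
}

function diceCoefficient(a, b) {
  const aGrams = bigrams(a);
  const bGrams = bigrams(b);
  if (aGrams.length === 0 || bGrams.length === 0) return 0;
  const counts = new Map();
  for (const g of aGrams) counts.set(g, (counts.get(g) || 0) + 1);
  let matches = 0;
  for (const g of bGrams) {
    const c = counts.get(g) || 0;
    if (c > 0) {
      matches++;
      counts.set(g, c - 1);
    }
  }
  return (2 * matches) / (aGrams.length + bGrams.length);
}

// Given the OCR'd last name and the candidate judges for that date
// (objects with a lastName field), return the highest-scoring candidate.
function bestMatch(ocrName, candidates, threshold = 0.5) {
  let best = null;
  let bestScore = 0;
  for (const judge of candidates) {
    const score = diceCoefficient(ocrName, judge.lastName);
    if (score > bestScore) {
      best = judge;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}

// e.g. bestMatch('Examplewrlght', candidatesForDate('1913-02-20'))
// should still land on the right URI despite the OCR error.
```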

This means I have gone from OCR text to a URI for each judge in the Illinois State Supreme Court opinions. I will demonstrate what this potentially enables in the next installment.

To facilitate the same process for the appellate court opinions I need an authority file. Unfortunately these names, from the late 19th century onward, are not readily available. Last time I mentioned building a tool to extract this information. I expanded that prototype into a reusable application I call the Front-Matter Extractor:

Demo of the front-matter extractor — http://front-matter-extractor.herokuapp.com

The tool reads a batch of images and the corresponding ALTO OCR XML files from an Amazon S3 bucket and presents an interface that lets users click the detected regions and store that information. Github repo.

You can play with this example here: front-matter-extractor.herokuapp.com
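For a sense of what the tool works with, here is a rough sketch of pulling clickable regions out of an ALTO file using the xml2js package; the exact element nesting can vary by OCR output, and the S3 fetching and UI layers are omitted, so treat it as illustration rather than the app’s actual code:

```javascript
const fs = require('fs');
const xml2js = require('xml2js');

// Read one ALTO file and return each String element's text plus its
// pixel coordinates, so a UI can draw clickable boxes over the page image.
async function altoRegions(altoPath) {
  const xml = fs.readFileSync(altoPath, 'utf8');
  const doc = await xml2js.parseStringPromise(xml);
  const regions = [];
  // ALTO typically nests Layout > Page > PrintSpace > TextBlock > TextLine > String.
  for (const page of doc.alto.Layout[0].Page) {
    for (const space of page.PrintSpace || []) {
      for (const block of space.TextBlock || []) {
        for (const line of block.TextLine || []) {
          for (const str of line.String || []) {
            regions.push({
              text: str.$.CONTENT,
              x: Number(str.$.HPOS),
              y: Number(str.$.VPOS),
              w: Number(str.$.WIDTH),
              h: Number(str.$.HEIGHT),
            });
          }
        }
      }
    }
  }
  return regions;
}
```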

Adam did flag the problem that many duplicate front-matter pages likely occur across different volumes. I have a not-yet-implemented solution to this: since we know the OCR text on each page, we can index it once the page is complete. When starting a new volume, the tool can check whether a page similar to the current one has already been worked on. If so, it can prompt the user to verify and skip to the next non-duplicate volume.
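A minimal sketch of that idea, assuming exact matching on normalized text; the Dice measure above could be swapped in to tolerate OCR variation between printings, and the function names here are mine, not the app’s:

```javascript
// Index the normalized OCR text of each finished front-matter page,
// then test new pages against that index before asking for clicks.
const seenPages = new Map(); // normalized text -> { volumeId, pageId }

function normalize(ocrText) {
  return ocrText.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

function recordFinishedPage(ocrText, volumeId, pageId) {
  seenPages.set(normalize(ocrText), { volumeId, pageId });
}

// Returns the earlier page if this one looks like a duplicate, else undefined.
function alreadyWorked(ocrText) {
  return seenPages.get(normalize(ocrText));
}
```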

This tool falls into what I like to call a “staff-sourcing” crowdsourcing application. For example, a group of 20 staff members could sit down and knock out a task like this, which may be hard to market or too esoteric for the general public. I like the idea of building small reusable tools that structure and facilitate manual tasks.

The result of using this tool would be an authority file of appellate court judge names that the opinions could then be mapped to. The major difference here is that these names likely have no Wikidata or other authority URI available to connect to. Fortunately, we can now easily mint Wikidata entities with the basic metadata we know about these individuals (they were Illinois appellate court judges, active years, etc.).
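One way that minting could work is to generate a QuickStatements batch from the extracted names. This is a sketch under assumptions: the judge object shape is hypothetical, the property/item IDs are left as placeholders to look up, and any batch would need review before import:

```javascript
// Build QuickStatements v1 lines for one new judge entity.
// judge: { name, description } -- a hypothetical shape for illustration.
function quickStatementsFor(judge) {
  return [
    'CREATE',
    `LAST\tLen\t"${judge.name}"`,        // English label
    `LAST\tDen\t"${judge.description}"`, // English description
    // `LAST\tP???\tQ???`,               // e.g. position held / occupation,
                                         // IDs to be looked up on Wikidata
  ].join('\n');
}

console.log(
  quickStatementsFor({
    name: 'Example Judgename',
    description: 'Illinois Appellate Court judge',
  })
);
```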

In the next and final installment I will demo what these connections make possible and look at larger plans for integrating this data into the LOD ecosystem.
