Harvard LIL Fellowship — Week 3

Matt Miller
Jul 8, 2017


Belated update of week 3: Name extraction and broken bones.

As discussed in week 2, extracting judge names from the case law corpus and reconciling them to URIs is the first step in my process. While the dataset is structured into fields, those data elements are still free text. With opinions spanning the late 19th century to today, the free text uses a plethora of writing styles. On top of that, there are numerous OCR errors in all fields. For example, here are all the judge names from the Illinois dataset. You’ll notice various problems. I want to get to the point where I can write regular expressions to extract names in various formats, for example:

Multiple names:
"Mr. Justice Underwood, Mr. Chief Justice Solfis­burg, and Mr. Justice Schaefer,"
"Lawrence, Chief Justice, Scott, Justice, and McAllis­ter, Justice"
Various abbreviations for justice positions:
"Perlin, C.J."
"O'Connor, P. J."
"Parmer and Cooke, JJ.:"
Non-name decisions:
"Opinion per Curiam."
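To make the target concrete, here is a minimal sketch of the kind of extraction I have in mind, assuming the OCR has already been cleaned. The function name and the patterns are illustrative only and cover just the formats listed above; the real corpus will need many more.

```python
import re

# Illustrative patterns built only from the formats listed above.
PER_CURIAM = re.compile(r"per\s+curiam", re.IGNORECASE)
TITLES = re.compile(
    r"\b(?:Mr|Mrs)\."            # honorifics
    r"|\b(?:Chief|Presiding)\b"  # rank words
    r"|\bJustices?\b"            # "Justice" / "Justices"
    r"|\b[CP]\.\s?J\."           # "C.J." and "P. J."
    r"|\bJJ?\."                  # "J." and "JJ."
)
SEPARATORS = re.compile(r",|\band\b")


def extract_judge_names(raw):
    """Return the surnames found in a free-text judge field (hypothetical helper)."""
    text = raw.replace("\u00ad", "")  # drop soft hyphens left over from line breaks
    if PER_CURIAM.search(text):
        return []                     # non-name decision, no judge to extract
    text = TITLES.sub("", text)
    parts = (p.strip(" .:;") for p in SEPARATORS.split(text))
    return [p for p in parts if p]


print(extract_judge_names("Lawrence, Chief Justice, Scott, Justice, and McAllister, Justice"))
print(extract_judge_names("Parmer and Cooke, JJ.:"))
print(extract_judge_names("Opinion per Curiam."))
```

One known weakness of this approach: the "J." pattern would also strip a middle initial such as "John J. Smith", so fields with full names would need a stricter rule.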

Before we can get to this task, we need to clean up the OCR where possible. The variety of errors is practically endless, but many repeat because of common OCR mistakes in this corpus: “MB.” or “ME.” instead of “MR.”, or 100 different misreadings of “justice”.
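As a rough illustration of how repeated errors like these might be normalized, the sketch below combines a small lookup table for the honorific misreads with fuzzy matching for “justice”. The 0.8 similarity cutoff and the misspelling “Justiee” are my own assumptions for the example, not values or strings taken from the corpus.

```python
import difflib
import re

# The two honorific misreads mentioned above; extend as new ones turn up.
HONORIFIC_FIXES = {"MB.": "MR.", "ME.": "MR."}
CORRECT_FORMS = {"justice", "justices"}


def fix_token(token, cutoff=0.8):
    """Repair one whitespace-delimited token (cutoff is an assumed threshold)."""
    if token in HONORIFIC_FIXES:
        return HONORIFIC_FIXES[token]
    match = re.fullmatch(r"(\w+)(\W*)", token)  # word plus trailing punctuation
    if not match:
        return token
    core, trailing = match.groups()
    if core.lower() in CORRECT_FORMS:
        return token
    ratio = difflib.SequenceMatcher(None, core.lower(), "justice").ratio()
    if ratio >= cutoff:
        fixed = "JUSTICE" if core.isupper() else "Justice"
        return fixed + trailing
    return token


def clean_ocr(text):
    """Apply fix_token to every token; runs of whitespace collapse to one space."""
    return " ".join(fix_token(t) for t in text.split())


print(clean_ocr("MB. Justiee Underwood"))
```

Fuzzy matching can produce false positives on surnames that resemble “justice”, so the cutoff would need tuning against the real list of judge strings.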

These can be cleaned up in preprocessing using a reusable library. Once the names are extracted, we can link them to URIs, but as I mentioned last time, I only have the complete list of Supreme Court judges for Illinois. This week I added Wikidata Q IDs to that dataset via the Wikimedia API, which now includes Q IDs in its responses, for example:


{
  "batchcomplete": "",
  "query": {
    "normalized": [
      {
        "from": "Joseph_Phillips_(judge)",
        "to": "Joseph Phillips (judge)"
      }
    ],
    "pages": {
      "31025091": {
        "pageid": 31025091,
        "ns": 0,
        "title": "Joseph Phillips (judge)",
        "pageprops": {
          "defaultsort": "Phillips, Joseph",
          "wikibase_item": "Q6286242"
        }
      }
    }
  }
}
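A lookup like the one above could be scripted roughly as follows. This is a sketch, assuming the English Wikipedia endpoint and the standard `action=query` / `prop=pageprops` parameters; the helper names are hypothetical, and the network call itself is left out so the parsing can be checked offline against the response shown above.

```python
from urllib.parse import urlencode

# Assumed endpoint: the response above looks like it came from English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_pageprops_url(titles):
    """Build a query URL asking for the Wikidata item of each page title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",  # only return the Q ID property
        "titles": "|".join(titles),
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def extract_qids(response):
    """Map page title -> Q ID from a decoded API response."""
    pages = response.get("query", {}).get("pages", {})
    qids = {}
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:  # pages without a Wikidata item are skipped
            qids[page["title"]] = qid
    return qids


sample = {
    "batchcomplete": "",
    "query": {
        "pages": {
            "31025091": {
                "pageid": 31025091,
                "ns": 0,
                "title": "Joseph Phillips (judge)",
                "pageprops": {"wikibase_item": "Q6286242"},
            }
        }
    },
}
print(build_pageprops_url(["Joseph_Phillips_(judge)"]))
print(extract_qids(sample))
```

Several titles can be sent in one request by joining them with a pipe, which keeps the number of calls down when working through the whole list of judges.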

This data can also be pulled out of the Wikidata SPARQL endpoint.

But there is still the problem of non-Supreme Court judges. The majority of the opinions are written by the state appellate court, so we need to find these judges' names to serve as the base list against which we match our cleaned, regex-extracted strings.

Adam suggested using the front matter of the digitized volumes, which often lists the judges for the cases appearing in that volume. The problem is that this data is redacted for post-1923 volumes:

Redacted front matter page (or is it a Kazimir Malevich?)

So I can only work with pre-1923 front matter, which is fine because it is the older data that will not be recorded in sources like CourtListener.

The next problem is identifying and extracting these names. There are fewer than 100 non-Supreme Court pre-1923 volumes, so I wrote a little manual tool to identify the front matter page on which the names appear. From that page I can then extract the names from the OCR XML data (ALTO).

Identifying the judge name front matter pages
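Once the right page is known, pulling its text out of the ALTO file is fairly mechanical. The sketch below assumes the usual ALTO layout, where each `TextLine` holds `String` elements whose `CONTENT` attribute carries the recognized word. It ignores the XML namespace because that differs between ALTO versions. The sample fragment is made up for illustration and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine in an ALTO document, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


# Made-up fragment in the shape of an ALTO page.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="JUDGES"/><SP/><String CONTENT="OF"/><SP/>
      <String CONTENT="THE"/><SP/><String CONTENT="APPELLATE"/><SP/>
      <String CONTENT="COURTS"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Parmer,"/><SP/><String CONTENT="J."/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

The resulting lines can then be fed through the same cleanup and name extraction used for the judge fields.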

Using this data I should be able to get a decent handle on the appellate judge name problem and start attaching them to URIs.