Last week I outlined the goal of my fellowship at Harvard’s Library Innovation Labs. tl;dr: try to connect case law data to the wider web. This week I will look at some first steps, using the entire Illinois case law corpus as a test bed.
The case law data is stored in METS format, which includes a lot of technical metadata documenting the digital assets as well as PREMIS events. Part of each METS document is an embedded XML file holding the case law content and its metadata.
This is a pretty interesting approach. I don’t know how common it is, but it allows the technical and content metadata to exist in the same file. For my purposes, right now I only care about the content of the case law, not the technical metadata. So I wrote a parser that extracts all the data I’m interested in and compiles it into a newline-delimited JSON file. This lets me rearrange the elements and make things a little more accessible: instead of opening hundreds of thousands of XML files, I can stream one JSON file. One place where XML does outshine JSON is nested markup elements. In this case that means footnotes, for example:
<p pgmap="329" id="1-19">Common carriers — their liability to transport goods beyond their own lines. When a carrier receives goods to carry, marked for a particular place, he is bound, under an implied agreement from the mark or direction, to carry to and deliver at that place, although it be a place beyond his own line of carriage.<footnotemark>*</footnotemark></p>
The <footnotemark> element holds the symbol that connects the passage to its footnote, which lives in another element:
<p pgmap="389" id="b389-19">* See Ill. Cen. R. R. Co. v. Cowles, 32 Ill. 120.</p>
This is actually not that easy to represent in JSON, and tools that convert XML to a JSON structure will just flatten embedded tags like <footnotemark>. To get around this, I preprocess the footnotes and generate a hybrid element out of each one, which includes the footnote and the context it was referenced in:
value: '* See Ill. Cen. R. R. Co. v. Cowles, 32 Ill. 120.'
context: '5. Common carriers — their liability to transport goods beyond their own lines. When a carrier receives goods to carry, marked for a particular place, he is bound, under an implied agreement from the mark or direction, to carry to and deliver at that placo, although it be a place beyond his own line of carriage.*',
It is an interesting use case when comparing XML and JSON.
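As a rough sketch of that footnote preprocessing (not the actual parser), here is one way to pair each <footnotemark> with its footnote in Python. The <case> wrapper element, the shortened sample text, and the assumption that a footnote paragraph starts with the same symbol as its mark are all mine, purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around two paragraphs modelled on the samples above
# (the text is shortened for readability).
SAMPLE = """<case>
  <p id="1-19">Common carriers. When a carrier receives goods to carry, he is bound to deliver.<footnotemark>*</footnotemark></p>
  <p id="b389-19">* See Ill. Cen. R. R. Co. v. Cowles, 32 Ill. 120.</p>
</case>"""


def link_footnotes(case_xml):
    """Return {value, context} pairs, one per footnote mark that can be matched."""
    root = ET.fromstring(case_xml)
    paragraphs = root.findall("p")
    linked = []
    for p in paragraphs:
        marks = [m.text for m in p.findall("footnotemark") if m.text]
        if not marks:
            continue
        # itertext() flattens the paragraph, keeping the mark symbol inline.
        context = "".join(p.itertext())
        for mark in marks:
            for other in paragraphs:
                # Skip the referencing paragraph and any paragraph that itself
                # contains marks; assume footnotes are plain paragraphs.
                if other is p or other.find("footnotemark") is not None:
                    continue
                text = "".join(other.itertext()).strip()
                if text.startswith(mark):
                    linked.append({"value": text, "context": context})
                    break
    return linked


footnotes = link_footnotes(SAMPLE)
```

Matching on the leading symbol is fragile when a page reuses the same symbol for several footnotes, so a real version would probably also need to take page or position into account.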
This tool then outputs the data I want in a format I can quickly work with. It took about 20 minutes to convert the 180,000 Illinois XML files into a 3GB JSON file.
The next step is parsing the data. In this format it only takes a simple script to access all of the records. While a database would be much faster, I can iterate over the 180K+ cases in less than a minute. For example, a script that tallies the cases by court takes 45 seconds to run:
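To make the conversion step a bit more concrete, here is a minimal Python sketch of pulling embedded case XML out of a METS wrapper and writing it as one JSON object per line. The METS namespace is the standard one, but the sample document, the element names inside <case>, and the output field names are assumptions for illustration, not the real schema or my actual parser.

```python
import io
import json
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical, heavily simplified METS document; the real files carry far
# more technical metadata and a richer case schema.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:dmdSec ID="case">
    <mets:mdWrap MDTYPE="OTHER">
      <mets:xmlData>
        <case>
          <court>Illinois Supreme Court</court>
          <name>Example v. Example</name>
          <p id="1-19">Opinion text.</p>
        </case>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


def mets_to_record(mets_xml):
    """Extract the embedded case XML from a METS wrapper as a plain dict."""
    root = ET.fromstring(mets_xml)
    case = root.find(".//mets:xmlData/case", METS_NS)
    if case is None:
        return None
    return {
        "court": case.findtext("court"),
        "name": case.findtext("name"),
        "paragraphs": [
            {"id": p.get("id"), "text": "".join(p.itertext())}
            for p in case.findall("p")
        ],
    }


def write_ndjson(records, out):
    """Write one JSON object per line so the file can be streamed later."""
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")


# In-memory stand-in for the output file.
buffer = io.StringIO()
write_ndjson([mets_to_record(SAMPLE_METS)], buffer)
```

Writing one record per line is what makes the later streaming cheap: a reader never has to hold more than one case in memory.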
'Illinois Supreme Court': 49694,
'Illinois Appellate Court': 122678,
'Illinois Court of Claims': 10195,
'Illinois Circuit Court': 199
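A tally like the one above only needs a few lines. This is a minimal Python sketch rather than the script I ran; it assumes each line of the file is a JSON object with a "court" field, which is a hypothetical field name.

```python
import io
import json
from collections import Counter


def tally_by_court(lines):
    """Count records per court, reading one JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get("court", "unknown")] += 1
    return counts


# Small in-memory stand-in for the real file. With a file on disk it would be:
#   with open("illinois.ndjson", encoding="utf-8") as fh:  # hypothetical filename
#       print(tally_by_court(fh))
sample = io.StringIO(
    '{"court": "Illinois Supreme Court"}\n'
    '{"court": "Illinois Appellate Court"}\n'
    '{"court": "Illinois Appellate Court"}\n'
)
counts = tally_by_court(sample)
```

Because the file object is iterated line by line, memory use stays flat no matter how large the file is.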
Now we are getting to interesting data! The majority of the cases are from the Appellate Court, followed by the Supreme Court. If we want to start connecting these case opinions to external data, we need something to connect on. I decided to start with judges, as there is some data out there already and it is the lowest-hanging fruit.
We need authority records that include judge names before we can say that an opinion was issued by a particular judge. Often the opinion text only gives the judge’s last name, so we need an index that lets us say the last name “Koerner” is actually this specific judge who was presiding in Illinois during 1885.
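The kind of index I have in mind could look something like the Python sketch below: a lookup by last name, narrowed by the year of the opinion. The Judge fields and every entry in INDEX are made-up placeholders, not real biographical data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judge:
    judge_id: str    # hypothetical local identifier
    full_name: str
    start_year: int  # first year on the bench
    end_year: int    # last year on the bench


# Placeholder entries for illustration only.
INDEX = [
    Judge("judge-001", "Alex Placeholder", 1870, 1888),
    Judge("judge-002", "Blake Placeholder", 1901, 1915),
    Judge("judge-003", "Casey Sample", 1880, 1890),
]


def candidates(last_name, year, index=INDEX):
    """Return judges whose last name matches and who were serving in `year`."""
    key = last_name.strip().lower()
    return [
        judge
        for judge in index
        if judge.full_name.split()[-1].lower() == key
        and judge.start_year <= year <= judge.end_year
    ]


matches = candidates("Placeholder", 1885)
```

If more than one candidate comes back, the last name and year are not enough on their own and something else (the court, for instance) would have to break the tie.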
We need to build these indexes from multiple sources and then try to connect them together. Getting a list of historical state supreme court judges (from the early 1800s to the present) is not easy, but it is much simpler than finding historical appellate judges. I’m using three sources: the Illinois government website, Wikipedia/Wikidata lists, and Courtlistener, with JFC to follow eventually (for circuit court data). All of the sources have their strengths and weaknesses. For example, Courtlistener has very good structured data but does not have the really old historical judges from the 19th century. The government website is very complete but is HTML alphabet soup.
This endeavor lives in its own repository. So far I only have the Illinois Supreme Court linked between Gov/Wikipedia/Courtlistener (with images!), but it is enough to start working with the Illinois Supreme Court opinions next week. Even this simple linking could open up some interesting discovery avenues. For example, you could take the data from Wikidata (via the Wikipedia ID) or Courtlistener, find the political affiliation of each judge, and view opinions by political party by decade.
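For a sense of what linking the sources involves, here is a simplified Python sketch that groups records from several sources under a normalized name key. The source names, record shapes, and identifiers are hypothetical, and the normalization is deliberately naive; this is not the code in the repository.

```python
import re
from collections import defaultdict


def name_key(name):
    """Normalize a name to 'first last': lowercase, no punctuation or initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if len(t) > 1]  # drop single-letter initials
    if not tokens:
        return ""
    return f"{tokens[0]} {tokens[-1]}"


def link_sources(sources):
    """Map each normalized name to the identifier it has in every source."""
    linked = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            linked[name_key(record["name"])][source_name] = record["id"]
    return dict(linked)


# Placeholder records; the names and ids are invented for illustration.
SAMPLE_SOURCES = {
    "gov": [{"name": "Alex Q. Placeholder", "id": "gov-1"}],
    "wikidata": [{"name": "Alex Placeholder", "id": "wd-1"}],
    "courtlistener": [
        {"name": "alex placeholder", "id": "cl-1"},
        {"name": "Casey Sample", "id": "cl-2"},
    ],
}

linked = link_sources(SAMPLE_SOURCES)
```

Name matching alone will merge two different judges who share a name, so in practice the service dates would need to be compared as well before trusting a link.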
Next week I will begin reconciling judge names found within the opinions to these indexes, a classic linked data problem, strings to things.