Harvard LIL Fellowship — Part 5

Matt Miller
Nov 8, 2017


This is the final installment reviewing the work I did as a Fellow at the Harvard Library Innovation Lab this summer. In the last part we looked at some approaches to building up metadata around a case law opinion, using algorithmic matching and a crowdsourcing tool to enhance each record with data that could be linked to external sources. The reason for this preliminary work was to get the records ready to exist in a linked data ecosystem. The overarching thesis is that if you can weave this data into large knowledge systems like Wikipedia/Wikidata, where the users are, you can increase exposure to this newly (and freely) available and powerful dataset. In this part we look at how that might happen, and at some of the discovery avenues that linking this data opens up.

There is already some structured case law data on these platforms. For example, almost all U.S. Supreme Court opinions have an article on Wikipedia. At the state level there are articles for the more important cases; notable cases from the California Supreme Court, for example, have articles. On the structured data side, the federal-level decisions made it into Wikidata in part from Wikisource, a Wikimedia project that hosts the full text of resources. For example, picking a random case from Wikisource and then looking at it in Wikidata:

https://www.wikidata.org/wiki/Q19109323

A fairly sparse record, but it does have an instance of (basically rdf:type) value of “United States Supreme Court decision”. Compare that to a more complete record, like Brown v. Board of Education, which has a corresponding Wikipedia article:

https://www.wikidata.org/wiki/Q875738

If we look at one of the California Supreme Court decisions on Wikidata, we see another sparse record. (Some of these state decisions are represented on Wikisource, but most are not present on that platform.)

https://www.wikidata.org/wiki/Q5397371

Since it did not come from Wikisource, and its data was not mapped from Wikipedia, it has the higher-level class of “legal case.” Knowing this classification, we can get a sense of how many of these resources are on Wikidata. Using the Wikidata Class Browser tool we can see how many entities have this class type. We will need to traverse the class tree a bit to get to our desired destination:

We can see that the case law records that came in via the Wikisource ingest (only U.S. Supreme Court cases) account for around 16K decisions, while examples like our California Supreme Court decision fall into the more general “legal case” class, which has 17K items. There may be some overlap, but it is clear that these types of court decisions should ultimately be assigned their own class, “California Supreme Court decision” for example, and be children of the “court decision” class. The takeaway, however, is that there are not a lot of case law decisions on the Wikidata platform.
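The same kind of count can also be approximated outside the Class Browser with a query against the Wikidata Query Service. The following is a minimal sketch, not the method used for the numbers above: the function names are my own, and the QID of the class you want to count (for example the “legal case” class) is not given in this post, so it has to be looked up and passed in.

```python
import json
import re
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_count_query(class_qid: str, include_subclasses: bool = True) -> str:
    """Return a SPARQL query that counts items that are instances of a class."""
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    # P31 is "instance of"; P279 is "subclass of". Following P279* also
    # counts items whose class sits lower in the class tree.
    path = "wdt:P31/wdt:P279*" if include_subclasses else "wdt:P31"
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE { "
        f"?item {path} wd:{class_qid} . }}"
    )


def count_instances(class_qid: str, include_subclasses: bool = True,
                    timeout: int = 60) -> int:
    """Run the count query over the network and return the count as an int."""
    query = build_count_query(class_qid, include_subclasses)
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # The User-Agent string here is a made-up value for this sketch.
    request = urllib.request.Request(
        url, headers={"User-Agent": "caselaw-class-count-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["count"]["value"])
```

Calling count_instances with include_subclasses=False counts only items assigned the class directly, which is closer to what a single node in the class tree shows. Counts returned today will differ from the 2017 figures quoted above.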

So how can we get the output of the Caselaw Access Project into Wikidata? Here are a couple of options, though of course these are not official plans, just brainstormed possibilities:

  1. Host the full text of the case law decisions on Wikisource. Create a Wikisource resource for each decision with its full text and basic metadata. Then work to have these ingested into Wikidata, as the U.S. Supreme Court decisions were. Once that is complete you could enrich the metadata on Wikidata with judge names, identifiers, etc.
  2. Work to create a Wikidata bot that generates an entity on Wikidata for each decision. With some remediation each decision could be populated with some data, for example:
{
instance of (P31) : Illinois Appellate Court decision (QxNEW),
applies to jurisdiction (P1001) : Illinois (Q1204),
point in time (P585) : 1905-03-23,
legal citation of this text (P1031) : 166 Ill. 2d 1,
label : "The people of the state of Illinois vs xxxxx",
docket number (PxNEW) : 74704,
author (P50) : Joseph Gary (Q6283370),
harvard caselaw id (PxNEW): 12345
}

There would be an opportunity to propose some new properties that better describe case law data, and generating a Harvard Caselaw ID would facilitate linking back to the full text of the opinions on a Harvard Library system.
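To make the bot idea a little more concrete, here is a minimal sketch of the step that turns one decision record into a list of (property, value) statements. The Decision record shape and the function name are hypothetical, and nothing here talks to the Wikidata API. Properties that do not exist yet (the docket number and Harvard Caselaw ID) are only emitted when the caller supplies an ID for them, since those would have to be proposed and approved first.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Decision:
    """Hypothetical record shape; the field names are assumptions."""
    label: str
    decided: str                      # ISO date, e.g. "1905-03-23"
    citation: str
    jurisdiction_qid: str
    author_qid: Optional[str] = None
    docket_number: Optional[str] = None
    caselaw_id: Optional[str] = None


def build_statements(
    decision: Decision,
    court_class_qid: str,
    docket_pid: Optional[str] = None,
    caselaw_pid: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Map a decision onto (property id, value) pairs for a bot to submit."""
    date.fromisoformat(decision.decided)  # raises ValueError if malformed
    statements = [
        ("P31", court_class_qid),              # instance of
        ("P1001", decision.jurisdiction_qid),  # applies to jurisdiction
        ("P585", decision.decided),            # point in time
        ("P1031", decision.citation),          # legal citation of this text
    ]
    if decision.author_qid:
        statements.append(("P50", decision.author_qid))  # author
    # The two properties below do not exist yet, so they are skipped
    # unless the caller passes in the ID assigned to them.
    if docket_pid and decision.docket_number:
        statements.append((docket_pid, decision.docket_number))
    if caselaw_pid and decision.caselaw_id:
        statements.append((caselaw_pid, decision.caselaw_id))
    return statements
```

Keeping the mapping separate from the code that writes to Wikidata means the output can be reviewed, or handed to whatever batch-editing tool the community prefers, before anything is created.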

Both approaches would get this valuable data into the Wikimedia linked data ecosystem.

Let’s pivot and look at a discovery use case this data would enable. For example, by connecting the authoring judge of each opinion to their Wikidata entity, we gain access to information about them that is not found in the document. Using these properties you can start to imagine building a discovery system that leverages a wide range of metadata to improve search.

I loaded about 40,000 Illinois Supreme Court opinions that underwent some of this rudimentary linking into a Blacklight instance to demonstrate this concept.

The server should be live for the next month or two; after that, please refer to the videos below for examples.

Finding Illinois Supreme Court opinions written by women who were also politicians.
Finding opinions written by judges who were also authors and fought in the American Civil War.

While this initial demo may not seem impressive, the important part is that the case law decisions are now connected to the linked data cloud. When more information is added, to a judge for example, that data becomes available to this system. It becomes a powerful feedback loop that can, over time, enhance metadata and improve collective discovery.
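The two example searches above boil down to filtering opinions by facets that live on the judge, not on the opinion. Here is a toy, in-memory sketch of that idea. It is not how the Blacklight demo is implemented; the facet names, judge keys, and sample records are all invented for illustration.

```python
from typing import Dict, List, Set

Facets = Dict[str, Set[str]]

# Invented sample data: facets as they might be pulled from a judge's
# Wikidata entity, keyed by a made-up judge identifier.
JUDGES: Dict[str, Facets] = {
    "judge-a": {"gender": {"female"}, "occupation": {"judge", "politician"}},
    "judge-b": {
        "gender": {"male"},
        "occupation": {"judge", "author"},
        "conflict": {"American Civil War"},
    },
}

OPINIONS: List[Dict[str, str]] = [
    {"title": "Opinion 1", "author": "judge-a"},
    {"title": "Opinion 2", "author": "judge-b"},
    {"title": "Opinion 3", "author": "judge-c"},  # author not linked yet
]


def judge_matches(judge_facets: Facets, required: Facets) -> bool:
    """True when the judge has every required value for every facet."""
    return all(
        values <= judge_facets.get(facet, set())
        for facet, values in required.items()
    )


def filter_opinions(
    opinions: List[Dict[str, str]],
    judges: Dict[str, Facets],
    required: Facets,
) -> List[str]:
    """Return titles of opinions whose author satisfies the facet filter."""
    return [
        op["title"]
        for op in opinions
        if judge_matches(judges.get(op["author"], {}), required)
    ]


women_politicians = filter_opinions(
    OPINIONS, JUDGES, {"gender": {"female"}, "occupation": {"politician"}}
)
civil_war_authors = filter_opinions(
    OPINIONS, JUDGES,
    {"occupation": {"author"}, "conflict": {"American Civil War"}},
)
```

Because the facets are looked up through the judge, adding a new fact to a judge's record changes which opinions match without touching the opinions themselves, which is the feedback loop described above.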

These examples are using enriched data for the opinion authors, but this could be extended to entities found in the opinion text. I ran the 40K full text opinions through Stanford’s Named Entity Recognition tool to pull out possible entities.

Looking at other possible entities extracted and linked in the Illinois Supreme Court opinions

Once these entities are connected to their Wikidata entities, you can ingest any metadata about them and also use it for faceting in the discovery interface. Connecting these data points together and leveraging them really opens up some compelling opportunities.

This wraps up my documentation of the fellowship this summer! All the fellows and interns presented their work at a culminating event in early August. Here are my slides from that presentation:

https://docs.google.com/presentation/d/1X2BCKE44xcc6DytctXcHzibND5hqRFyN5UZ9ggj7WU0/edit?usp=sharing

Also take a look at the write-up of this event (part 1 and part 2) to see what my incredibly bright colleagues worked on this summer. I am very grateful to have had the opportunity to work with the amazing folks at LIL and the other fellows and interns this summer. The Caselaw Access Project is a tremendously important endeavor and I’m glad I could participate in some small way.
