Legal Citation Analysis with CourtListener and Cobaltmetrics

Legal Backlink Analytics is Ripe for Review

Luc Boruta
Jun 6 · 8 min read

Written by Casey Scott McKay and Luc Boruta.

Legal data is an untapped data science gold mine. Recent initiatives to open access to the law now make it possible to track and analyze legal data on a large scale. Cobaltmetrics partnered with CourtListener to explore the potential in tracking and analyzing citations to and from legal data.

Cobaltmetrics provides web scale citation tracking and backlink analytics. CourtListener provides free access to state and federal legal data. Combine the two and you get legal backlink analytics. Evaluating legal data gives insight into how resources are used, how resources influence other courts and other resources, and how different resources are connected across jurisdictions. Indexing CourtListener’s data into Cobaltmetrics for analysis shows the potential value available from further exploring legal citations.

More specifically, Cobaltmetrics crawls the web to index references and citations, primarily in the form of URLs, from a wide range of content including CourtListener. Cobaltmetrics’s citation index covers tens of millions of documents, extracted from publicly available resources.

Powered by a knowledge base containing billions of URIs identifying all types of documents, Cobaltmetrics uses its transmutation API to extract all URLs from all federal and state court opinions in CourtListener. After indexing CourtListener’s data and processing all documents to extract URLs, the real fun begins — analyzing data!

Before discussing the problems at issue, note that, for the purposes of this article, URLs in CourtListener’s data fall into two general categories: primary URLs and secondary URLs. Primary URLs point to primary legal authority (opinions, statutes, rules, regulations, and legislation) and are legal citations made machine readable by CourtListener. Secondary URLs refer to all other websites, including but not limited to journal articles, legal guides, and anything else that is not primary legal authority. Primary domains include courtlistener.com, villanova.edu, and the 420 domains listed for jurisdictions supported by CourtListener, except for wikipedia.org, which is listed as the King’s Bench website in CourtListener metadata.
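The primary/secondary split described above can be sketched as a simple domain lookup. This is a minimal illustration, not Cobaltmetrics's actual implementation: the function name `classify_url` is hypothetical, and `PRIMARY_DOMAINS` is only a small sample of the domains named in this article rather than the full list.

```python
from urllib.parse import urlsplit

# Assumption: a small sample of primary domains mentioned in the article.
# The full list (courtlistener.com, villanova.edu, and the domains of all
# supported jurisdictions) is not reproduced here.
PRIMARY_DOMAINS = {
    "courtlistener.com",
    "villanova.edu",
    "uscourts.gov",
    "mspb.gov",
    "gaappeals.us",
}


def classify_url(url, primary_domains=PRIMARY_DOMAINS):
    """Return "primary" if the URL's host is a primary domain or one of its
    subdomains, and "secondary" otherwise."""
    host = (urlsplit(url).hostname or "").lower()
    for domain in primary_domains:
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "notcourtlistener.com".
        if host == domain or host.endswith("." + domain):
            return "primary"
    return "secondary"


print(classify_url("https://www.courtlistener.com/opinion/1/"))
print(classify_url("https://www.merriam-webster.com/dictionary/law"))
```

Matching on the host suffix rather than on the raw string keeps subdomains with their parent domain while avoiding false matches on unrelated hosts that merely contain the same text.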

The most cited domains pointing to primary legal authority are courtlistener.com (2,838,181 citations, in this case links from court opinions to court opinions), villanova.edu (13,095), gaappeals.us (2,033), mspb.gov (1,898), and uscourts.gov (1,844). The most cited domains pointing to secondary legal authority are merriam-webster.com (827 citations), state.gov (510), nih.gov (440), usdoj.gov (435), and oed.com (414).

Current Legal Analytics and Citation Tracking Is Not Reaching its Full Potential

Current attempts at legal analytics fail to capture the full impact of the resources referenced in legal documents. They are not diverse enough, they are too slow, and they are hindered by technical problems in processing legal data.

Current Legal Analytics are Narrow

Before Cobaltmetrics, no one was tracking or analyzing legal backlinks on a large scale. Focusing on a specific court, jurisdiction, or a small sample of venues is not enough. To solve this lack of diversity, Cobaltmetrics analyzes all state and federal courts. Furthermore, current providers of legal analytics focus on analyzing the impact of primary legal resources. But courts also rely on many secondary resources when interpreting primary resources to reach complex decisions. And no one is analyzing this data on a large scale!

Along those same lines, no one is tracking or analyzing court references to URLs on a large scale. Although the majority of court citations reference primary legal authority, the number of court references to secondary authority is finally large enough to analyze:

  • Total number of URLs in CourtListener: 2,888,282
  • Number of URLs pointing to primary authority: 2,857,334
  • Number of URLs pointing to secondary authority: 30,948
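As a quick sanity check on the figures above, the following sketch verifies that the two categories add up to the total and computes the share of secondary URLs across the whole corpus. The variable names are illustrative only.

```python
# Counts taken from the list above.
total_urls = 2_888_282
primary_urls = 2_857_334
secondary_urls = 30_948

# The two categories should partition the corpus.
assert primary_urls + secondary_urls == total_urls

# Share of each category across the whole corpus, in percent.
primary_share = 100 * primary_urls / total_urls
secondary_share = 100 * secondary_urls / total_urls

print(f"Primary share:   {primary_share:.2f}%")    # 98.93%
print(f"Secondary share: {secondary_share:.2f}%")  # 1.07%
```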

Authors, publishers, legal professionals, and everyone in the communities surrounding the resources used by the courts are interested in what resources courts are using and how courts are using those resources. Limiting analysis of citations and references to a few jurisdictions or primary legal citations leaves valuable, untapped data in the wild.

Current Legal Analytics Are Slow

Law moves slowly. Laws are written to remain robust and unchanged. Complex jurisprudence in important cases often takes years to resolve. And common law in the U.S. legal system is built on the concept of stare decisis, the idea that courts must rely on past decisions as primary authority to support opinions. Consequently, courts frequently reuse the same well established resources and new references are slow to enter legal documents. Because the same citations and references are often used until the law changes, there is less new data to analyze.

On the other hand, courts frequently reference scholarship from fields outside the law. Generally speaking, more progressive fields continually generate new scholarship that replaces older research and findings, which leads to a quicker citation turnover rate. Analyzing the courts’ use of URLs referencing secondary authority from fields other than law provides an additional, faster evolving timeline to evaluate.

Hoarded Data, Dirty Data, and Other Technical Challenges Along the Way

Tracking identifiers requires automated recognition of reference fields and metadata, and successfully linking resources depends on how well structured the data is. Legal data is often subject to many different standards across many jurisdictions. CourtListener includes documents for 419 U.S. state and federal courts. Having so many courts with so many different formatting standards makes natural language processing and full text indexing challenging.

What’s more, legal citations are complicated. Legal publications generally rely on full citations, short citations, and complicated Bluebook citation rules. Legal citations are also used in both text and footnotes, frequently containing explanations and comments, making processing the data even more challenging. Due to complex formatting standards, URLs are often sliced and separated by line breaks or page breaks, requiring sophisticated software to avoid collecting URLs that are invalid or incomplete.

The Supreme Court’s own publishing standards exemplify the challenges of analyzing legal citations automatically. Page 5 of the Court’s recent opinion in Utah v. Strieff demonstrates the difficulties machines face in trying to process the various formats of URLs used by the Court:

See, e.g., Brennan Center for Justice, Criminal Justice Debt 23 (2010), online at https://www.brennancenter.org/sites/default/ files/legacy/Fees%20and%20Fines%20FINAL.pdf.

The citation above shows a URL split by line breaks, meaning it is not machine readable as is. The new URL created by the first half of the original URL is syntactically valid, but does not resolve to any resource. In fact, 6 of the 8 URLs cited in that case are dead.
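The problem can be illustrated with a short sketch that compares naive URL extraction with a heuristic repair step. This is only an assumption about how such a repair might work, not the method used by Cobaltmetrics: the regular expressions and the function names `naive_urls` and `candidate_repairs` are hypothetical, and the heuristic can produce false positives, so repaired URLs should be treated as candidates to verify.

```python
import re

# Naive pattern: a URL runs until the first whitespace character.
URL_RE = re.compile(r'https?://[^\s<>"]+')

# Heuristic pattern: a URL that ends with "/" right before whitespace,
# followed by a token that looks like a path (contains "/" or ".").
SPLIT_URL_RE = re.compile(r'(https?://[^\s<>"]*/)\s+([^\s<>"]*[/.][^\s<>"]*)')

TRAILING_PUNCTUATION = ".,;)"


def naive_urls(text):
    """Extract URLs without any repair; split URLs come out truncated."""
    return [m.group(0).rstrip(TRAILING_PUNCTUATION) for m in URL_RE.finditer(text)]


def candidate_repairs(text):
    """Rejoin URLs that appear to have been split by a line break."""
    return [
        (m.group(1) + m.group(2)).rstrip(TRAILING_PUNCTUATION)
        for m in SPLIT_URL_RE.finditer(text)
    ]


citation = (
    "See, e.g., Brennan Center for Justice, Criminal Justice Debt 23 (2010), "
    "online at https://www.brennancenter.org/sites/default/ "
    "files/legacy/Fees%20and%20Fines%20FINAL.pdf."
)

print(naive_urls(citation))
# ['https://www.brennancenter.org/sites/default/']
print(candidate_repairs(citation))
# ['https://www.brennancenter.org/sites/default/files/legacy/Fees%20and%20Fines%20FINAL.pdf']
```

The naive result is syntactically valid but truncated, which is exactly why it is hard to detect without trying to resolve both the short and the rejoined form.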

On that note, link rot (URLs referencing unavailable resources) is slowly tearing away at the precedential fabric our court system is founded upon. You cannot use legal authority to determine a decision if you cannot find the legal authority. A 2014 Harvard Law School study showed that half of the links cited in Supreme Court opinions were dead:

[M]ore than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.

To solve link rot, Harvard Law School maintains perma.cc, a “forever business” that enables users to create permanent records of the web sources they cite. First appearing in court documents in 2015, perma.cc has grown to become one of the court’s most cited secondary domains.

Short links further aggravate the problem. The Supreme Court recently started using short links (e.g., a bit.ly or goo.gl short URL that replaces the original long URL) to reference authority. The impermanent nature of shortened URLs raises several concerns. Short URLs only survive as long as the third party host keeps them alive. Placing the future of the court’s authority in the hands of a single third party, for-profit corporation is not an ideal way to preserve critical legal resources. Cobaltmetrics can reliably unroll URLs from these domains. Shortened URLs are starting to show up in court opinions more frequently. The most cited URL shorteners in the corpus are goo.gl (62 citations), tinyurl.com (25), and bit.ly (13).
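Unrolling a short link amounts to following its chain of HTTP redirects until a final URL is reached. The sketch below shows the general idea under stated assumptions and is not Cobaltmetrics's implementation: `is_short_link` and `unroll` are hypothetical names, and `get_location` stands in for whatever HTTP client would return the `Location` header of a redirect (or `None` when there is none). A made-up redirect table replaces the network so the example runs offline.

```python
from urllib.parse import urljoin, urlsplit

# The three shorteners named in the article.
SHORTENERS = {"goo.gl", "tinyurl.com", "bit.ly"}


def is_short_link(url):
    """Return True if the URL's host is one of the known shorteners."""
    host = (urlsplit(url).hostname or "").lower()
    return host in SHORTENERS


def unroll(url, get_location, max_hops=10):
    """Follow redirects until there is no further Location, a loop is
    detected, or max_hops is reached. Returns the last URL seen."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:  # redirect loop
            break
        seen.add(url)
        location = get_location(url)
        if not location:  # no redirect: this is the final URL
            break
        # Location headers may be relative, so resolve against the current URL.
        url = urljoin(url, location)
    return url


# Made-up redirect table standing in for real HTTP responses.
fake_redirects = {
    "https://bit.ly/abc": "https://example.org/a",
    "https://example.org/a": "/b",
}

print(is_short_link("https://bit.ly/abc"))
print(unroll("https://bit.ly/abc", fake_redirects.get))
```

Passing the lookup function in as a parameter keeps the redirect-following logic testable without network access. It also makes the fragility visible: once the shortener stops answering, there is nothing left to follow.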

The primary purpose of citations is to identify a source of information so future readers can verify the accuracy of the cited information. What happens when a future judge, lawyer, or learned layman wants to use a court’s argument as primary authority but cannot trace the resource because the URL does not work? How can those using the citation in the future be sure they are using the same resource the court viewed when it decided the case?

Malleable by design, the internet is continuously subject to change. While it enables instant creation and exchange of knowledge, important institutions such as the legal system cannot safely rely on references to authoritative resources that no longer exist. Dead references prevent future readers from accessing the authority used by the court, thereby undermining the integrity of the court.

Analyzing Legal Data Provides Valuable Insight on the Court’s Use of Resources

Analyzing legal data sheds light on the courts’ activity, what resources the courts are using, and what resources are important to them. This information can be used to assess the influence of the resources on judges, courts, and jurisdictions, and then compared across jurisdictions to establish links among resources.

Tracking and analyzing legal data allows examining the courts’ activity over time. The first URL was cited by a court in 1995 and the number of URLs cited in court opinions has steadily increased each year. The most cited domains in the first years of hypertext court opinions are:

  • 1995: courtlistener.com (53,091 citations, i.e. citations between court opinions made machine readable as courtlistener.com URLs), villanova.edu (328), buffalo.edu (1), kun.nl (1)
  • 1996: courtlistener.com (58,258), villanova.edu (255), bna.com (2), ed.gov (2), adult.com (1), ami-med.com (1), gw2k.com (1), heroes.org, nodak.edu (1), playmen.it (1), sf.ca.us (1), usatoday.com (1)

Primary URLs consistently account for the vast majority of the courts’ references: over the last decade, courts referenced primary URLs over 99% of the time and secondary URLs less than 1% of the time. But the gap is narrowing, with the number of secondary URLs cited growing from 2 in 1995 to 21,540 in 2018. The most cited domains by jurisdiction are listed below:

  • Primary domains referenced most by federal courts: courtlistener.com (1,115,429 citations), villanova.edu (13,095), mspb.gov (1,898), uscourts.gov (1,815), va.gov (655)
  • Primary domains referenced most by state courts: courtlistener.com (1,722,742), gaappeals.us (2,033), state.tx.us (1,201), state.oh.us (612), ca.gov (447)
  • Secondary domains referenced most by federal courts: state.gov (476), usdoj.gov (378), merriam-webster.com (319), nih.gov (290), oed.com (266)
  • Secondary domains referenced most by state courts: merriam-webster.com (508), perma.cc (365), texasattorneygeneral.gov (300), nih.gov (150), cobar.org (133)
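Rankings like the ones above boil down to counting cited hosts per jurisdiction. Here is a minimal sketch of that tally, using made-up sample records rather than real CourtListener data. The names `host_of` and `top_domains` are hypothetical, and the host normalization only strips a leading `www.`. Reducing hosts to registered domains such as uscourts.gov would need something like the Public Suffix List, which this sketch does not attempt.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Lowercase the host and strip a leading "www." (a simplification)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(citations, n=5):
    """Given (court_type, url) pairs, return the n most cited hosts
    for each court type."""
    counts = defaultdict(Counter)
    for court_type, url in citations:
        counts[court_type][host_of(url)] += 1
    return {court: counter.most_common(n) for court, counter in counts.items()}


# Made-up sample records for illustration only.
sample = [
    ("federal", "https://www.state.gov/a"),
    ("federal", "https://www.state.gov/b"),
    ("federal", "https://www.merriam-webster.com/dictionary/law"),
    ("state", "https://perma.cc/XXXX-XXXX"),
    ("state", "https://www.merriam-webster.com/dictionary/tort"),
    ("state", "https://www.merriam-webster.com/dictionary/lien"),
]

for court, ranking in top_domains(sample).items():
    print(court, ranking)
```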

Evaluating the court’s behavior shows which resources are important to the court. Moreover, what lawyer would not want to know which resources a court uses to make decisions, in order to predict the court’s behavior more accurately? Knowing the breadth of the court’s own library provides insight concerning the court’s potential legal analysis of future cases.

Legal Backlink Analytics is Ripe for Review

Analyzing legal data enables quantitative analysis of social phenomena associated with the courts. This, in turn, provides important information on court activity that serves as an important feedback tool to individuals, schools, lawyers, courts, and scholars. While much has been done by CourtListener to preserve and index court opinions, much remains to be done to better understand how legal documents interact with the web at large. Cobaltmetrics is dedicated to further tracking and analyzing links between documents, and especially between legal and scientific data.


Interested in learning more about Cobaltmetrics? Try it out, check out the public API, join our newsletter, and reach out at contact@thunken.com!

Thunken

Reflections on data, science, and data science.

Written by Luc Boruta

Chief nerd at Thunken, natural language processor, PhD in computational linguistics. I eat language models for breakfast. 🐐
