A Data Commons for Law

By Margaret Hagan, Jameson Dempsey & Jorge Gabriel Jiménez

[This is Part 1 of 3. See Part 2 on Models for a Legal Data Commons. Part 3 is coming later this week.]

How might we harness available data from legal aid organizations, courts, legal technology companies, and others to enable research and development that promotes access to justice?

What if our community of access to justice researchers, innovators, and policy-makers had a better structure to share data?

We propose to build a “Legal Data Commons” to achieve that goal. In the second part, we will summarize the different models for sharing legal data, and in the third and final part we will propose what is needed to build a legal data commons and how people can get involved.

I. Background: Why a Legal Data Commons?

The rise of cloud computing and advanced analytics has had a dramatic impact on the ability of researchers and policymakers to craft and evaluate policies based on real evidence.¹ Further, access to large and multifarious datasets can improve researchers’ abilities to reveal valuable insights that can drive policy outcomes. Unfortunately, while many policymakers are showing a greater appetite for evidence-based or data-driven policy, attempts to evaluate policies face the barrier of balkanized or inaccessible data.²

In response to these challenges, coalitions of stakeholders have come together to develop “data commons” that share data publicly or semi-publicly for research and other purposes. For example, in 2009, the Open Commons Consortium launched the Open Science Data Cloud as a science cloud with multidisciplinary scientific data for researchers.³ But data commons are not limited to scientific data. In fact, today, there are more than twenty-five controlled data repositories of knowledge commons for the humanities, social sciences, health, and earth/space fields.⁴ Moreover, while the U.S. government has long provided data privately for research purposes, the last decade has seen tremendous growth in government “open data,” first as a matter of federal policy⁵ and today as a matter of law through the OPEN Government Data Act⁶. A similar effort is playing out in the local context.⁷

Despite the tremendous technological advancement in other sectors, the legal world lags behind. This lag is attributable in part to legal culture and professional regulations, and in part to the lack of open access to data. Specifically, while there are several publicly available legal databases,⁸ most of the data from courts, legal aid organizations, and legal tech companies is not readily accessible for reasons including cost, use restrictions, or concerns about confidentiality and consumer protection.

Consequently, open legal datasets are scarce for research and development, including for the development of artificial intelligence (AI)-based legal technology that could advance access to justice. Some examples of data-driven projects to advance access to justice include:

  • Stanford’s Open Policing Project,⁹ which aims to help researchers, journalists, and policymakers investigate and improve interactions between police and the public,
  • Princeton’s Eviction Lab, which gathers data on housing court and eviction to give more insight into the system,
  • Code for America’s Clear My Record project, which both helps people clear their criminal records and works with governments to automate this process,
  • MD Expungement,¹⁰ a web application that lets a person or advocate identify whether someone qualifies for expungement of their criminal record, and
  • Learned Hands,¹¹ a project to train machine learning models to identify people’s legal issues from text or transcripts of their stories about problems.

We believe that a legal data commons — built with privacy and accountability “by design” — could solve the data issue and advance research and innovation objectives while addressing legitimate confidentiality concerns.

II. What is a “data commons”?

A ‘data commons’ refers to knowledge that is collectively owned and managed by a community of users, usually over the Internet. In a data commons, users co-locate data on a cloud computing infrastructure that commonly uses software, tools & apps for managing, harmonizing and analyzing data for the final user.¹²

A data commons represents an evolution from earlier data structures, such as databases and data clouds, and provides unique advantages for researchers, as described in this chart:

Credit: Robert Grossman¹³

In order to succeed, a data commons must have a few important characteristics:

  1. A data commons must engage key stakeholders: (1) a service provider, which operates the data commons; (2) a data contributor, which provides the data for the commons; and (3) data users (e.g., researchers, policymakers, application developers, and others), which access the data commons to advance their work. A single organization may represent one, two, or three of these roles at the same time, and all are required for the data commons to function.
  2. A data commons must be secure to preserve permissions, protect data subjects, and prevent corruption of data. This is particularly important where the data involves sensitive personal information, such as health records, children’s information, financial/credit data, or legal information.
  3. A data commons must be usable, permitting many actors to contribute and use the system. A usable system requires significant investment in back-end and front-end systems, storage space, and clear policies and procedures related to data access and usage.
  4. A data commons must be scalable and extensible to permit new features and datasets over time (including updated data in existing data sets).
  5. A data commons must be interoperable to permit analysis across datasets and through a variety of tools.
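To make the first two characteristics concrete, here is a minimal Python sketch of the three stakeholder roles and a simple permission check for sensitive data. Every name in it (Role, Organization, Dataset, can_read, missing_roles) is a hypothetical illustration, and the access rule is an assumption made for the example rather than a description of how any existing data commons works.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    SERVICE_PROVIDER = "service_provider"  # operates the data commons
    DATA_CONTRIBUTOR = "data_contributor"  # provides data for the commons
    DATA_USER = "data_user"                # researchers, policymakers, developers


@dataclass
class Organization:
    name: str
    roles: set  # a single organization may hold one, two, or all three roles


@dataclass
class Dataset:
    title: str
    contributor: str
    sensitive: bool  # e.g. contains personal legal information
    approved_users: set = field(default_factory=set)


def can_read(org: Organization, dataset: Dataset) -> bool:
    """Assumed rule: any data user may read open data; sensitive data
    is readable only by data users the contributor has approved."""
    if Role.DATA_USER not in org.roles:
        return False
    if not dataset.sensitive:
        return True
    return org.name in dataset.approved_users


def missing_roles(orgs: list) -> set:
    """Return the roles that no participating organization fills."""
    filled = set()
    for org in orgs:
        filled |= org.roles
    return set(Role) - filled
```

In this sketch, a court that contributes a sensitive eviction dataset would add a research lab to approved_users before the lab could read it, and missing_roles would flag a commons that has contributors and users but no one operating it.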

III. Why do we need a legal data commons?

Today, the power of machine learning to drive research and policymaking depends largely on the availability and adequacy of data.

The legal ecosystem (including courts, law firms, legal tech companies, legal aid organizations, and universities) produces a vast amount of data that could be useful for modeling and building intelligent systems, especially for addressing access to justice problems.

Unfortunately, too much of this legal data is unstructured, balkanized, or otherwise unavailable for research and development purposes. As a result, the primary beneficiaries of machine learning in law are those private parties who have access to large databases and can afford to conduct in-house analysis, but rarely share the data for research and development purposes.

Available data sources fall into several general categories, each with its own challenges:

Even if this legal data were accessible, however, it would remain isolated, making discovery and analysis more difficult. A legal data commons could resolve this problem and allow researchers to process and link data and to apply modern analytical approaches to problems that cut across legal organizations. A legal data commons could achieve several critical goals of the access to justice community:

  1. Research that could more quickly lead to access to justice interventions
  2. Comparison across different legal datasets (and attendant network effects)
  3. Access to public records from courts and other government stakeholders
  4. Better machine learning technology to close the justice gap.

In the coming week, we will follow up with more details on how such a Data Commons for Law might come into being, and how it might operate. See Part 2 on Models for Data Commons for Law.

  1. Daniel Esty & Reece Rushing, The Promise of Data-Driven Policymaking, Issues in Science and Technology (2015), https://issues.org/esty-2/ (last visited Feb 12, 2019).
  2. Lauren Greenawalt, What Will It Take to Achieve Truly Data-Driven Policy?, New America (2018), https://www.newamerica.org/weekly/edition-214/what-will-it-take-achieve-truly-data-driven-policy/ (last visited Feb 14, 2019); Jane Yakowitz, Tragedy of the Data Commons, 25 Harvard Journal of Law & Technology 1–67 (2011).
  3. Robert L. Grossman et al., A Case for Data Commons: Toward Data Science as a Service, 18 Computing in Science & Engineering 10–20 (2016).
  4. Kristin R. Eschenfelder & Andrew Johnson, Managing the data commons: Controlled sharing of scholarly data, 65 Journal of the Association for Information Science and Technology 1757–1774 (2014).
  5. Transparency and Open Government, National Archives and Records Administration, https://obamawhitehouse.archives.gov/the-press-office/transparency-and-open-government (last visited Feb 12, 2019).
  6. OPEN Government Data Act, https://www.congress.gov/bill/115th-congress/house-bill/1770
  7. For example, New York City’s Criminal Justice Reform Act (CJRA) allows low-level misdemeanors (e.g., parking violations) to be diverted from the criminal justice system to a civil court, with the future evaluation in mind. The law could potentially reduce racial and geographic disparities among low-level offenses. To measure this, the bill requires the city’s police department to publicly report, each quarter, counts of criminal and civil summonses issued by offense, race, and geography, among other factors. The quarterly reports, with additional policy evaluations, are helping the council and the public to see if the CJRA is meeting its goals, and informing future policy discussions on how to improve or expand the policy. Impact of NYC’s Criminal Justice Reform Act, John Jay College of Criminal Justice (2018), https://www.jjay.cuny.edu/news/impact-nycs-criminal-justice-reform-act (last visited Feb 14, 2019).
  8. 10 Best Legal Datasets for Machine Learning, Gengo AI (2018), https://gengo.ai/datasets/10-best-legal-datasets-for-machine-learning/ (last visited Feb 12, 2019).
  9. The Stanford Open Policing Project, openpolicing.stanford.edu, https://openpolicing.stanford.edu/ (last visited Mar 20, 2019).
  10. Matthew Stubenberg, Maryland Criminal Record Expungement, Maryland Expungement, https://www.mdexpungement.com/ (last visited Mar 20, 2019).
  11. Learned Hands, Learned Hands, https://learnedhands.law.stanford.edu/ (last visited Mar 20, 2019).
  12. Robert L. Grossman et al., A Case for Data Commons: Toward Data Science as a Service, 18 Computing in Science & Engineering 10–20 (2016). The ‘commons’ concept derives from decades of economic, social, and legal research about shared resources, broadly defined to include everything from natural resources (e.g., water, arable land) to knowledge. https://en.wikipedia.org/wiki/Commons. A ‘knowledge commons’, as Elinor Ostrom conceptualized it, includes all intelligible ideas, information, and data in whatever form they are expressed or obtained. Charlotte Hess & Elinor Ostrom, Understanding Knowledge as a Commons: From Theory to Practice (2007). A data commons is a type of knowledge commons.
  13. Robert Grossman, Crossing the Analytics Chasm and Getting the Models You Developed Dep…, LinkedIn SlideShare (2018), https://www.slideshare.net/rgrossman/crossing-the-analytics-chasm-and-getting-the-models-you-developed-deployed?next_slideshow=1 (last visited Feb 19, 2019).