The Key Models for a Legal Data Commons

Jorge Gabriel Jiménez
Legal Design and Innovation
Apr 2, 2019 · 12 min read

By Margaret Hagan, Jameson Dempsey & Jorge Gabriel Jiménez

[This is Part 2 of 3. Stay tuned for part 3 coming later this week. You can read part one on why we need a Legal Data Commons here.]

How might we harness available data from legal aid organizations, courts, legal technology companies, and others to enable research and development that promotes access to justice?

In our previous post, we proposed building a “Legal Data Commons” to achieve that goal. In this article, we summarize the different models for sharing legal data; in a third (and final) post, we will discuss what is needed to build a legal data commons and how people can get involved.

What models could we adopt for legal data sharing?

The idea of building a data commons or sharing data for research is not new. There are data commons for social science, humanities, human health, ecology, chemistry (or molecular) data, and earth and space science.¹ In this section, we highlight a few potential models and lessons learned that could be valuable in developing a Legal Data Commons. Specifically, we describe the following three models: (1) an informal sharing setup, (2) a data cloud, and (3) a data commons.

As more specific data-gathering and -sharing initiatives come together in law, we see potential in a more intentional model for networking these initiatives: gathering data, sharing it, and building research and innovation ecosystems around it. Here we present some models the legal community could adopt to build a more intentional data ecosystem.

Model 1: Informal Data Sharing

In this model, stakeholders — e.g., like-minded universities, legal aid organizations, and LSC groups — would informally share the public datasets they have gathered and encourage others to join the effort. This model would require the least time to launch, but it is not a permanent solution and poses some scaling challenges. For example, some organizations might not feel comfortable hosting or sharing sensitive information in this setup.

To mitigate risk in an informal sharing arrangement, we would need to have a legal infrastructure in place to address governance, basic sharing principles, and intellectual property. This could be done in two ways (separately or blended) — through a consortium with no central entity, or a lightweight nonprofit with a central entity.

Option A for Informal Data Sharing: Consortium (No Central Entity)

In this option, data-holders would set up a new, ad hoc network with a shared set of governance principles enshrined in interlocking agreements among all entities. If an individual or organization wants to share data with ‘the consortium,’ they could sign a memorandum of understanding (MOU) with the consortium. If an individual or organization wants access to the data that the consortium holds, they could sign a standardized agreement to use the data. The consortium model is a stop-gap on the way to a more permanent solution, and as such data owners would retain the right to move their data to another entity at any time. While this proposal arguably involves the least set-up time, it also includes several unknowns. For example, it is not clear where the data would be stored or how access and permissions would be implemented.

Option B for Informal Data Sharing: Lightweight Nonprofit (Central Entity for IP/Governance)

In this option, rather than establish an ad hoc, contract-based consortium, data-holders would establish a nonprofit to handle decisions about governance, management, storage, and distribution of data, as well as intellectual property. The nonprofit board would be composed of different stakeholders (nonprofits, universities, and others) that would manage the organization. However, the data-sharing arrangement would remain primarily contract-based, and the capabilities of the system would be more limited than a full-scale data cloud or data commons.

While the informal sharing setup is not a permanent solution, it could be the first step toward a data commons, enabling stakeholders to share public legal datasets, curate resources, and facilitate collaboration.

Model 2: Data Cloud

Another potential model for legal data sharing is a data cloud. The primary difference between informal sharing and a data cloud is that a data cloud would be centralized and formalized.
Data clouds support big data and intensive computing, with collaborative analysis tools inside the cloud. Below we share two existing models for data clouds that could be helpful for a legal data commons: ICPSR at the University of Michigan and the Dataverse Project based at Harvard University.

Example A of a Data Cloud: ICPSR

The Inter-university Consortium for Political and Social Research (ICPSR) “advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.”² Specifically, ICPSR provides “data access, curation, and methods of analysis” as a service to researchers.³

Overview: ICPSR was established in 1962 at the University of Michigan as a means of sharing political science data; specifically, survey data from the American National Election Studies. At its founding, data sharing in a consortial framework was viewed as a radical act — a sharp departure from past practice where research data was jealously guarded. Since then, the ICPSR has grown into one of the largest data consortia in the world, with over 780 member institutions and support from several large government agencies and foundations. The ICPSR is funded by membership fees, grants, and contracts, with annual revenues of $18.8 million in 2017 (with $17.7 million in annual expenses).⁴ ICPSR is governed by a 12-person council and its work is largely conducted by five standing committees and specialist advisory committees.

Contributing data: Anyone with relevant data can deposit it into ICPSR, and ICPSR provides two ways to deposit data, both of which are free to upload and come with user support, training, and use of the ICPSR Bibliography. The first option is self-publication (via openICPSR)⁵, which provides tools for data contributors to clean their own data for submission. openICPSR is free up to a 2GB contribution limit, after which it requires a subscription. The second option is curated data, in which the ICPSR team helps curate, clean, and present an archive of open-access datasets, which makes data more accessible and usable.⁶ There are three choices for curated data deposits, as shown in the following chart:

Using data: ICPSR provides researchers with means of discovering data on the website, including searching by studies, variables, and publications. The consortium also provides student researchers with helpful tools and educational materials to conduct research.

What can we learn from this project?

  • Managing large repositories of data is time- and cost-intensive, requiring significant resources (which may be obtained through multiple revenue streams).
  • To make a large-scale data cloud useful, it is important to provide robust discovery capabilities.

Example B of a Data Cloud: The Dataverse Project

The Dataverse Project (Dataverse) is a Harvard-based, open source web application that uses containers to share, preserve, cite, explore, and analyze research datasets.⁷

Overview: Dataverse was developed by the Institute for Quantitative Social Science (IQSS) at Harvard University in 2006, and it mainly hosts data created or collected by researchers. Dataverse is open to researchers worldwide in all disciplines and has been widely adopted: it hosts more than 70,000 datasets and sees around 60,000 downloads per month. In addition, because the software is open source, anyone can set up their own dataverse, and 26 institutions around the world currently run their own installations. A Global Dataverse Community Consortium (GDCC) was formed to formalize collaboration among these dataverses and coordinate efforts in a community-oriented fashion.

Contributing data: Dataverse works by enabling researchers and other data contributors to place datasets into “dataverses,” which are containers that store datasets, documentation, code, and metadata. Harvard offers its own dataverse — the Harvard Dataverse — and permits any researcher to upload datasets in any format and receive an academic citation for their research. In addition, any institution or individual researcher can download open source Dataverse software to create their own customized dataverse. The following graphic shows how dataverses work:

Source: Merce Crosas, The Dataverse project, 2017.

Importantly, Dataverse solves a problem that has long dissuaded researchers from contributing data to shared repositories: choosing between (1) receiving credit for their data by controlling distribution themselves without long-term preservation guarantees, and (2) having long-term preservation guarantees by sending the data to a professional archive but without receiving credit for their data. Dataverse removes the need to choose, serving as a virtual archive while giving the researcher full credit, web visibility, and long-term preservation guarantees.⁸

Moreover, to promote privacy, Dataverse uses DataTags, a suite of tools to help researchers share and use sensitive data in a standardized and responsible way.⁹ The DataTags system asks a data provider a series of questions to identify the laws, contracts, and best practices that should be applied to a given dataset. After conducting the analysis, the DataTags system assigns iconic labels that represent a human-readable and machine-actionable data policy, and issues a license agreement that is tailored to the individual dataset. This graphic explains how DataTags works.¹⁰

Source: Merce Crosas, The Dataverse project, 2017.
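
To make the DataTags idea concrete, here is a highly simplified sketch of the kind of interview-to-policy mapping the system performs. The questions, tag names, and handling rules below are our own illustration, loosely inspired by DataTags’ color-coded levels — they are not the actual DataTags specification.

```python
# A simplified, illustrative sketch of DataTags-style decision logic.
# The real DataTags interview covers many more laws, contracts, and
# best practices; the questions and policies below are hypothetical.

from dataclasses import dataclass

@dataclass
class DataTag:
    label: str      # human-readable tag (e.g., "Blue", "Yellow")
    storage: str    # machine-actionable handling requirement
    access: str     # who may access the dataset and how

def assign_tag(contains_personal_data: bool,
               identifiers_removed: bool,
               covered_by_privacy_law: bool) -> DataTag:
    """Map a few interview answers to a handling policy."""
    if not contains_personal_data:
        return DataTag("Blue", storage="clear", access="open to the public")
    if identifiers_removed and not covered_by_privacy_law:
        return DataTag("Green", storage="clear", access="registered users")
    if identifiers_removed and covered_by_privacy_law:
        return DataTag("Yellow", storage="encrypted",
                       access="data use agreement required")
    return DataTag("Red", storage="encrypted",
                   access="approved projects only, secure enclave")

# Example: a de-identified dataset that still falls under a privacy statute.
print(assign_tag(contains_personal_data=True,
                 identifiers_removed=True,
                 covered_by_privacy_law=True))
```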

Finding and using data: Dataverse’s open source software provides easy tools and powerful search functionality for researchers to find data they need. Users may browse, search, and perform analysis on dataverses and published datasets without logging into the system. The Dataverse Project also provides a list of known dataverses around the world. Once a user has found the data they seek, they may view, cite, or download data. Users may also explore data using the TwoRavens interface or visualize data using WorldMap, both of which are open source.
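
For readers who want to script against it, Dataverse also exposes a public Search API. The minimal sketch below queries the Harvard Dataverse for datasets matching a keyword; the search endpoint follows the documented Dataverse Search API, but treat the exact parameters and response fields (such as global_id) as assumptions that may vary by installation and version.

```python
# Minimal sketch: querying the Dataverse Search API for datasets.
# Assumes the public /api/search endpoint on the Harvard Dataverse;
# response field names may differ across Dataverse versions.

import requests

def search_datasets(query: str, per_page: int = 10):
    resp = requests.get(
        "https://dataverse.harvard.edu/api/search",
        params={"q": query, "type": "dataset", "per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json().get("data", {}).get("items", [])
    return [(item.get("name"), item.get("global_id"), item.get("url"))
            for item in items]

for name, doi, url in search_datasets("eviction"):
    print(f"{name}  ({doi})  {url}")
```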

What can we learn from this project?

  • It is possible to provide long-term preservation guarantees and individual credit to researchers who contribute data.
  • DataTags can help researchers share and use sensitive data responsibly.
  • Building and supporting a strong community is essential to a successful project.

Model 3: Data Commons

Example A of a Data Commons: Open Commons Consortium

The Open Commons Consortium (OCC) has created an ecosystem of data commons, managing or collaborating with other data commons for scientific, medical, health, and environmental research data.¹¹

Overview: This ecosystem provides a long-term home for scientific data and specializes in supporting data-intensive research projects for researchers who need large-scale resources. Data commons managed by the OCC include the Open Science Data Cloud¹², the Matsu Data Commons¹³, a data commons for National Oceanic and Atmospheric Administration (NOAA) data¹⁴, and the National Cancer Institute’s Genomic Data Commons¹⁵.

Contributing data: Data commons within the OCC have application processes in place for contributing data. For example, the Open Science Data Cloud provides a public application form that asks for the contributor’s name, information about the dataset, permissions, acknowledgements, and publications.

Finding and using data: Data on the OCC is accessible through an application programming interface (API), unlike the Coleridge Initiative (discussed below), which only permits browsing data through a portal. Users may either download data from the data commons or apply for compute power to conduct research within the commons itself. Once granted access, a grantee is expected to abide by community guidelines and security best practices.

What can we learn from this project?

  1. Datasets in the commons must have stable, persistent digital IDs that control access to the data and metadata and can be resolved to return the metadata (see the sketch after this list).
  2. An API can make it easier for individuals to find and use data, and to authenticate users who are accessing data.
  3. Portability is important to permit sharing and transporting of data across data commons.
  4. Data peering agreements between two data commons service providers are an effective way to transfer data at no cost so that a researcher can access data stored at another data commons.
  5. Allocating, time-restricting, or charging for access to compute resources is a reasonable way of mitigating/offloading the costs of managing a large dataset.¹⁶
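
To make the first two lessons concrete, the sketch below shows what a minimal commons record with a persistent ID, access-control metadata, and an ID-resolution step might look like. The schema, field names, and the example DOI are our own illustration, not the OCC’s actual design.

```python
# Illustrative sketch (not the OCC's actual schema): a commons record
# with a persistent ID and access-control metadata, plus a function
# that "resolves" the ID to return metadata, as lesson 1 suggests.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CommonsRecord:
    persistent_id: str                      # e.g., a DOI or ARK
    title: str
    access_level: str                       # "public" | "restricted"
    allowed_roles: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)

REGISTRY: Dict[str, CommonsRecord] = {}

def register(record: CommonsRecord) -> None:
    REGISTRY[record.persistent_id] = record

def resolve(persistent_id: str, role: str) -> Dict[str, str]:
    """Return metadata for an ID; enforce access control on restricted data."""
    record = REGISTRY[persistent_id]
    if record.access_level == "restricted" and role not in record.allowed_roles:
        raise PermissionError(f"{role} may not access {persistent_id}")
    return {"persistent_id": record.persistent_id,
            "title": record.title,
            **record.metadata}

register(CommonsRecord("doi:10.9999/example.001",   # hypothetical ID
                       "Sample eviction filings dataset",
                       access_level="restricted",
                       allowed_roles=["approved-researcher"],
                       metadata={"format": "CSV", "records": "12000"}))
print(resolve("doi:10.9999/example.001", role="approved-researcher"))
```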

Example B of a Data Commons: Coleridge Initiative/ADRF

The Coleridge Initiative is another successful data commons; its advantage is the ability to link public and private datasets for research or innovation projects.¹⁷

Overview: The Coleridge Initiative manages the NYU Administrative Data Research Facility (ADRF). The ADRF provides (1) a secure environment for data storage from government agencies; (2) administrative tools that allow data to be discovered, documented and used; and (3) training to ensure that the data repository is useful and usable by researchers, government staff, and agency decisionmakers. According to Coleridge, the ADRF has been used by 200 government agency staff and researchers, and has hosted almost 50 confidential government datasets from 12 different agencies at all levels of government. The Coleridge Initiative is supported by donors including the Laura and John Arnold Foundation, the Ewing Marion Kauffman Foundation, the Alfred P. Sloan Foundation, the Overdeck Family Foundation and the Bill and Melinda Gates Foundation.

Contributing data: The primary data contributors to the ADRF are government agencies. To combat strong disincentives that U.S. federal government agencies face with respect to data sharing across programs or with external parties, the Coleridge Initiative provides a secure platform for controlled and protective sharing, certified by FedRAMP. FedRAMP is the government’s process for assessing and authorizing cloud service providers. ADRF provides government agencies with greater control over and visibility into how their data is being used.

Finding and using data: Coleridge understands that getting data from different agencies is not enough. The system enables users to contribute content about the data itself (limitations, variable definitions), preprocessing (code and lookup tables), analysis (code snippets), how the data has been used (research topics), who has used the data, and code that has already been used to process and link the data.

What can we learn from this project? The crucial aspects that have made the initiative successful are these:

  1. Secure environment. Provide a safe environment within which different government agencies (data providers) can place and share their data.
  2. Operational system. Create an efficient and secure system for the people, data, and projects that interact in the data commons.
  3. Training to use the data commons. The training program for the employees of government agencies (data contributors) develops the skills and knowledge necessary to take the best advantage of the wealth of data that is available.
  4. De-identification process. All micro-data added to the data commons has certain elements (e.g., names, Social Security numbers, business identifiers) de-identified before being made available (a minimal sketch follows this list).
  5. A community of users. The data commons service provider has created different tools that facilitate effective communication between teams and users of the system.
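
As a rough illustration of the de-identification step described above, the sketch below drops direct identifiers and replaces them with a salted one-way hash so records can still be linked across datasets. The column names are hypothetical, and real pipelines (including the ADRF’s) involve far more rigorous disclosure review.

```python
# Minimal sketch of a de-identification step: drop or hash direct
# identifiers before data enters a shared environment. Column names
# are hypothetical; real pipelines require formal disclosure review.

import hashlib
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "ssn", "business_id"]
SALT = "replace-with-a-secret-salt"   # kept private by the data provider

def pseudonymize(value: str) -> str:
    """One-way hash so the same person links across datasets without exposing identity."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["person_key"] = out["ssn"].map(pseudonymize)   # stable linkage key
    return out.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in out.columns])

raw = pd.DataFrame({
    "name": ["A. Smith"], "ssn": ["123-45-6789"],
    "business_id": ["B-001"], "case_type": ["eviction"], "outcome": ["dismissed"],
})
print(deidentify(raw))
```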

~~~~~~

Have you seen any other models, forms of governance, or structures that we should add to this list? Please comment below to let us know!

Also, stay tuned for the last article on what is needed to build a Legal Data Commons. (And read our first piece on why we need a Legal Data Commons).

  1. Kristin R. Eschenfelder & Andrew Johnson, Managing the data commons: Controlled sharing of scholarly data, 65 Journal of the Association for Information Science and Technology 1757–1774 (2014).
  2. About ICPSR, ICPSR, https://www.icpsr.umich.edu/icpsrweb/content/about/ (last visited Feb 22, 2019).
  3. Id.
  4. See 2016–2017 ICPSR Annual Report, https://www.icpsr.umich.edu/files/ICPSR/about/annualreport/2016-2017.pdf.
  5. openICPSR: Share your behavioral health and social science research data, https://www.openicpsr.org/ (last visited Feb 25, 2019).
  6. Start Deposit, ICPSR, https://www.icpsr.umich.edu/icpsrweb/deposit/ (last visited Feb 26, 2019).
  7. About, The Dataverse Project — Dataverse.org, https://dataverse.org/about (last visited Feb 26, 2019).
  8. Id.
  9. Harvard University Privacy Tools Project, DataTags, https://privacytools.seas.harvard.edu/datatags (last visited Feb 26, 2019).
  10. Merce Crosas, The Dataverse project, 2017.
  11. Robert L. Grossman et al., A Case for Data Commons: Toward Data Science as a Service, 18 Computing in Science & Engineering 10–20 (2016).
  12. Open Science Data Cloud: A Petabyte-scale Scientific Community Cloud, https://www.opensciencedatacloud.org/ (last visited Feb 19, 2019).
  13. Maria T. Patterson et al., The Matsu Wheel: a reanalysis framework for Earth satellite imagery in data commons, 4 International Journal of Data Science and Analytics 251–264 (2017).
  14. National Oceanic and Atmospheric Administration, https://www.noaa.gov/big-data-project (last visited Feb 19, 2019).
  15. NCI Genomic Data Commons, https://gdc.cancer.gov/ (last visited Feb 19, 2019).
  16. Robert L. Grossman et al., supra note 11, at 5.
  17. The Coleridge Initiative, https://coleridgeinitiative.org/ (last visited Feb 19, 2019).
