
Some Proposed Principles for Interoperating Data Commons

6 min read · Oct 1, 2019

The NCI Genomic Data Commons (GDC) is used by over 100,000 researchers each year to explore interactively over 2.5 PB of cancer genomics and associated clinical, imaging and other data. This approach to accelerating research and discovery by sharing data using cloud-based platforms is having a clear impact, and other projects and organizations are developing data commons and other platforms for sharing biomedical data to accelerate research and improve clinical outcomes.

Other systems and platforms for analyzing, exploring and sharing biomedical research data include: the Kids First Data Resource for pediatric cancer and birth defects data; the DataSTAGE platform being built for TOPMed and other data by NHLBI; and the AnVIL platform being built for genomics data by NHGRI.

About two years ago, several of us wrote a Medium post about four governing principles for building data commons. We proposed that a data commons be:

(1) modular, composed of functional components with well-specified interfaces; (2) community-driven, created by many groups to foster a diversity of ideas; (3) open, developed under open-source licenses that enable extensibility and reuse, with users able to add custom, proprietary modules as needed; and (4) standards-based, consistent with standards developed by coalitions such as the Global Alliance for Genomics and Health (GA4GH). [This paragraph is a direct quote from the post: A Data Biosphere for Biomedical Research.]

For simplicity, we will use the terms data commons, (data) resource, or (data) platform interchangeably here to refer to these and similar systems.

The good news is that, by and large, the data resources mentioned above are all working to follow these principles.

As the number of data commons begins to grow, it is becoming critical to establish some principles so that data commons can interoperate, allowing researchers to access, explore and integrate data from multiple data commons. Interoperating data commons in this way is a critical step towards creating a data ecosystem. In this note, I propose some principles for this purpose.

It’s helpful to distinguish between technical guidelines and operating principles for data resources. Technical guidelines can follow standards, such as those being developed by the Global Alliance for Genomics and Health (GA4GH), or can follow technical best practices, such as those being developed by the NCI Cancer Research Data Commons (CRDC) and other projects that are developing data ecosystems. Of course, over time the hope is that these converge.

Operating principles include questions about which platforms can interoperate, whether a platform will expose an API, whether a platform will be open and support different applications or will be closed and only support a single application, etc.

Today, there is general agreement on the former (technical guidelines), but a lot of continuing discussion about the latter (operating principles).

Before we discuss the operating principles, we need one definition. Let’s define a trusted platform as a data commons, or other data analysis or data sharing platform, that i) operates with an agreed-upon common set of policies, procedures and controls; and ii) is operated by a known or trusted organization. As an example, two data commons that both operate with FISMA Moderate security and compliance and are operated by two different NIH Institutes or Centers would, in general, treat each other as trusted platforms. With this definition, two platforms agree directly to trust each other.

Here are some proposed technical guidelines:

Technical guidelines:

1. Identify data in your resource using persistent Digital IDs, not by physical location or URL, which may change over time. For example, both DOIs and the Data Commons Framework Services use a prefix-suffix format, in which the organization associated with the prefix assigns the suffix and delegates the resolution of the suffix to a service that it designates.
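The prefix-suffix scheme can be sketched as follows. This is a minimal illustration, not a real resolver: the prefix `dg.EXAMPLE`, the resolver URL, and the storage locations are all hypothetical, and the delegation to the prefix's resolver service is simulated with an in-memory lookup rather than an HTTP call.

```python
# Hypothetical registry: prefix -> resolver service operated by the
# organization that assigned the suffixes under that prefix.
PREFIX_RESOLVERS = {
    "dg.EXAMPLE": "https://resolver.example.org",
}

# Hypothetical per-organization index: suffix -> current physical location.
# The location can change over time without changing the persistent ID.
SUFFIX_INDEX = {
    "dg.EXAMPLE": {"0001-aaaa": "s3://example-bucket/object-0001"},
}

def resolve(digital_id: str) -> str:
    """Resolve a persistent ID of the form '<prefix>/<suffix>' to a location."""
    prefix, suffix = digital_id.split("/", 1)
    if prefix not in PREFIX_RESOLVERS:
        raise KeyError(f"no resolver registered for prefix {prefix!r}")
    # In a real system this lookup would be delegated over HTTP to the
    # resolver service associated with the prefix.
    return SUFFIX_INDEX[prefix][suffix]

print(resolve("dg.EXAMPLE/0001-aaaa"))
```

The point of the indirection is that applications hold only the persistent ID; when data moves between clouds, only the resolver's index changes.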

2. Expose your data through an API.

3. Expose your data model through an API.

4. Interoperate with third party authentication and authorization services from other trusted platforms.

5. Interoperate with other trusted resources with similar security & compliance.

6. Process authorized queries, and, more generally, authorized computations, from other systems and return the results.

What this means is that two resources that trust each other as data platforms can interoperate by opening their data via an API, using a common set of authentication and authorization services, identifying data by persistent digital IDs (rather than by the data’s physical location within a particular cloud), and allowing researchers to analyze data using applications of their choice, as long as those applications are within the other platform’s security and compliance boundary.

In practice, this behavior is not always followed by data resources. Here are some worrying tendencies that should be resisted if you operate a data resource:

Please don’t:

bring data from other resources and platforms into your platform while refusing to let your own data out;

refuse to expose any API and instead require all users to use your platform or a particular application in your platform;

refuse to interoperate with other systems with the same or greater security and compliance.

Here are five proposed principles for creating a data ecosystem that may help you navigate some of the issues that arise:

Operating Principles for Resources in an Ecosystem (still a draft):

1. Interoperate with other trusted platforms: if another trusted platform is part of your data ecosystem or wants to create an ecosystem with you, then interoperate with it.

2. Follow the golden rule of data resources: if you take someone else’s data, let them have access to your data (assuming you are operating at the same level of security and compliance).

3. Support the principle of least restrictive access: Provide another trusted platform access to your data in the least restrictive manner possible. With rare exceptions, a data resource should provide an API so that applications in other trusted platforms can access data directly. If this is not possible due to the size or sensitivity of your data, then support the ability for approved queries or analyses to be run over your data and the results returned. Sometimes this is called an analysis or query gateway.

4. Agree on standards, compete on implementations: It is important to open up your ecosystem to competition, lest it stagnate. What this principle means is that a platform should expose its data and resources via APIs so that other applications can be part of your ecosystem. It is not necessary that the sponsor of the data resource fund other systems or applications, but you do not want to implicitly create a monopoly by requiring that all users of your data use a particular application or platform. If you have limited funds, you may not always be able to fund multiple competing applications or platforms, but you should at least keep your system open so applications can compete to provide the best experience for researchers. Remember that not all researchers have the same requirements, or the same preferences, and in general a mix of applications, systems and platforms is better than a single one.

5. Plan to support patient partnered research: Plan for a future in which individuals can provide their data and have control over it within your system. Even better, provide such a capability today within your platform.
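The analysis or query gateway mentioned in principle 3 can be sketched as below. This is an illustrative toy, assuming data too large or sensitive to export: the query name, the records, and the allowlist are hypothetical, and a real gateway would sit behind authentication and authorization checks. Only pre-approved analyses run, and only their results leave the platform.

```python
# Hypothetical in-platform records that must not leave the platform.
RECORDS = [
    {"case_id": "c1", "disease": "LUAD", "age": 61},
    {"case_id": "c2", "disease": "LUAD", "age": 54},
    {"case_id": "c3", "disease": "BRCA", "age": 47},
]

# Allowlist of approved analyses; arbitrary queries are rejected so raw
# rows are never returned, only aggregates.
APPROVED_QUERIES = {
    "count_by_disease": lambda rows: {
        d: sum(1 for r in rows if r["disease"] == d)
        for d in {r["disease"] for r in rows}
    },
}

def run_query(name: str):
    """Run an approved query over the platform's data; return only results."""
    if name not in APPROVED_QUERIES:
        raise PermissionError(f"query {name!r} is not approved")
    return APPROVED_QUERIES[name](RECORDS)

print(run_query("count_by_disease"))
```

The design choice is that the boundary moves from the data to the computation: the requesting platform never sees the records, only the output of a vetted analysis.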

Trusted Platforms

A trust relationship between two resources in a data ecosystem requires agreements between two organizations about a number of matters, including: security; compliance; liability; data egress charges; and infrastructure costs.

For this reason, a formal agreement between two different organizations or a memo between two different units within an organization or agency is usually required. As an example, an Interconnection Security Agreement (ISA) between two platforms would serve this purpose.

For simplicity in the discussion above, we assumed that two platforms must explicitly agree to trust each other as trusted platforms, and that if Platform A trusts Platform B, and Platform B trusts Platform C, then a trust relationship between Platform A and Platform C is not automatic unless Platform A and Platform C establish it.
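The pairwise, non-transitive trust model above can be made concrete in a few lines. This is a sketch, not a protocol: the platform names are illustrative, and each agreement stands in for a formal instrument such as an ISA. Trust is symmetric within an agreed pair but does not chain through intermediaries.

```python
# Explicit agreements between pairs of platforms (e.g. signed ISAs).
# frozenset makes each agreement symmetric: {A, B} == {B, A}.
AGREEMENTS = {
    frozenset({"A", "B"}),
    frozenset({"B", "C"}),
}

def is_trusted(p: str, q: str) -> bool:
    """Two platforms trust each other only via a direct agreement."""
    return frozenset({p, q}) in AGREEMENTS

assert is_trusted("A", "B") and is_trusted("B", "A")  # explicit and symmetric
assert is_trusted("B", "C")
assert not is_trusted("A", "C")  # trust does not chain through B
```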

Appendix: Some definitions

By a data commons we mean a software platform that co-locates: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, harmonizing, analyzing and sharing data with a scientific community. You can find more details here.

A simple data ecosystem can be built when a data commons exposes an API that can support a collection of third party applications that can access data from the commons.

More complex data ecosystems arise when multiple data commons and data clouds can interoperate and support a collection of third party applications by using a common set of core services (called framework services) that provide support for authentication, authorization, digital IDs, metadata, importing, exporting and harmonization of phenotype data, etc.

Written by Robert Grossman

I'm a data scientist at the University of Chicago and a Director of the Open Commons Consortium.