Advancing a National Cancer Knowledge System

By Tony Kerlavage, Ph.D., Chief of the CBIIT Cancer Informatics Branch and lead for the CGC Pilots

We’re certainly in the midst of the era of big data, and it’s not a stretch to say that big data is changing many aspects of our lives: how we work, how we communicate, and how we learn.

Given the immense complexity of cancer, from the molecular circuitry of a cancer cell to the systems of care needed by those diagnosed with the disease, big data also has the potential to transform what we know about cancer and how we approach it.

But the power of big data extends only as far as our ability to ensure access for those who can maximize its potential.

That concept, expanding cancer data access and sharing, is a foundational tenet of the Cancer Moonshot being spearheaded by Vice President Biden. Earlier today, when the Vice President presented the report of the Cancer Moonshot Task Force, he also announced several new efforts that will advance the Cancer Moonshot.

Among those announcements are new public-private data sharing collaborations between the National Cancer Institute (NCI) and two major commercial cloud providers. Through these collaborations, we aim to further our commitment to the development of a national cancer knowledge system — a unified effort to collect, integrate, and share cancer datasets generated by researchers across the country — which has the potential to greatly accelerate precision medicine in oncology.

Overcoming Obstacles

Cancer research that links molecular information, like genetic data, with patient data, such as demographics, treatment, and outcomes, can greatly inform precision medicine-based treatment strategies and therapeutic approaches. However, such research can be hindered by obstacles investigators face when attempting to access and analyze large datasets. Data from related cancer genomics studies, for example, may be stored in different databases and formats, preventing researchers from combining and analyzing the data as an integrated dataset. And many researchers do not have the capacity to download and store the petabytes of data generated by genomic studies on their institution’s local computers, which restricts their ability to perform comprehensive analyses.

It was with these obstacles in mind that, earlier this year, NCI launched two groundbreaking programs as part of the President’s Precision Medicine Initiative: the Genomic Data Commons (GDC) and the Cancer Genomics Cloud (CGC) Pilots.

The GDC collates and unifies several massive genomic datasets, including those from two NCI-supported research programs, The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). But the data in the GDC are not generated exclusively by NCI-supported research; institutions and researchers everywhere have the ability to add cancer data to the GDC. In fact, the Multiple Myeloma Research Foundation has pledged to add data from more than 1,000 patients with multiple myeloma, and earlier this year, Foundation Medicine, Inc. declared its intent to contribute data from 18,000 patients. The GDC provides researchers with secure access to these datasets in ways that protect patient privacy.

The CGC Pilots, meanwhile, provide innovative methods to query, visualize, and analyze cancer data. These cloud-based systems eliminate the need for researchers to download and store extremely large datasets by allowing them to bring analysis software to the data, instead of following the traditional process of bringing data to the software. They also offer the computational capacity needed to analyze these data, capacity that is otherwise available only at supercomputer centers.

Looking to the Clouds

To make cancer data in the GDC available to the widest possible audience, NCI intends to collaborate with two major commercial cloud providers: Amazon Web Services and Microsoft. Each commercial cloud expects to host a copy of the genomic data maintained by the GDC, at no cost for two years. NCI intends to make the data on these clouds accessible to researchers through the GDC and CGC Pilots.

During the two-year collaborations, NCI aims to work with our new cloud partners to understand data usage patterns so that we can develop a sustainable strategy for optimizing data storage and utility while limiting costs. And in the future, NCI anticipates collaborating with other commercial cloud providers to broaden access to the data even further. These kinds of collaborations seek to make readily available to researchers the data that can expedite the discovery of new cancer treatments, and to help NCI determine the best ways to sustain that accessibility.

NCI is also extending funding of the CGC Pilots for an additional year and broadening their scope to include additional types of data. The three CGC Pilots were individually built by the Broad Institute, the Institute for Systems Biology, and Seven Bridges Genomics. Each CGC Pilot currently hosts a copy of the TCGA dataset, but a major goal of the expansion is tighter integration with the GDC, allowing the incorporation of other datasets stored in the GDC. The CGC Pilot expansion also aims to allow the developers to pioneer methods for accessing and analyzing other data types, such as medical imaging data and protein (proteomic) data.

The commercial cloud collaborations and the CGC Pilot expansion seek to enhance cancer researchers’ abilities to make new discoveries from an unprecedented range of datasets and data types — a range that, we hope, will continue to diversify. Research stemming from analysis of these data has the potential to influence methods for better prevention, detection, diagnosis, and treatment of many kinds of cancer. In other words, it could significantly change cancer outcomes and benefit patients around the world.