A Data Scientist’s View into the Cancer Moonshot Project: Part 3, People and Skills

by Ola Topczewska

Civis Analytics
The Civis Journal
Sep 15, 2016


Over the past few months, a team at Civis led by our CEO, Dan Wagner, has worked on researching and writing a report for Vice President Biden’s Cancer Moonshot Initiative. In this report, we analyzed how cancer research might benefit from better use of data and analytics and provided recommendations around three topic areas:

  • Data infrastructure
  • Data sharing
  • People and skills

This post summarizes our recommendations in the final category, people and skills. You can read more about our work with the Moonshot here, as well as our recommendations for infrastructure and data sharing. The full report is also available here.

Much of the research that informed our final recommendations took the form of in-depth interviews with a wide range of individuals in the field. Our conversations with researchers were particularly helpful in developing our recommendations about technical skill development. We spoke with scientists at different institutions, across various career stages, and in different subfields (basic research, clinical research, and biostatistics). It was immediately clear that the cancer research community contains talented, skilled, and driven individuals and that incredibly promising research is being done. However, we saw room for greater communication and interdisciplinary collaboration in the subfields that involve analysis of large volumes of data, particularly genomic data.

While many researchers in the field have biostatistics and data science experience, the specific technical skills needed to work with data at a large scale are far from universal. Individuals we talked to indicated that graduate biology programs have only recently started to provide students with robust data science and statistical inference training that meets the needs of the big data era. Consequently, it can be hard for research teams without that training to take advantage of the large-scale analysis made possible by large datasets like The Cancer Genome Atlas (TCGA). This has an especially negative impact on research institutions outside of the leading universities and cancer research centers, and on teams that are not primarily computational biology labs. One researcher we spoke to described an analytical bottleneck that results from a lack of skilled analysts with the technical capabilities to process data at the rate at which it is being generated.

We recognize that not everyone in the cancer research space needs to (or realistically should) be able to do advanced data science. Moreover, pure data science experience without any foundational understanding of biology is not especially helpful either. The field needs subject matter experts in biology who are able to understand relevant data science at a high level, as well as biostatisticians who are able to work with large datasets and understand the biological context and implications of their analyses. This will allow teams to collaborate more effectively and broaden the pool of researchers who can benefit from and contribute to big data-driven cancer research.

In an ideal future, domain experts with training in biological methods and technical experts with training in biostatistics and data science would work together. At present, these two bodies of knowledge are somewhat siloed; researchers at either end of the spectrum don’t always know what questions to ask of one another or how existing methods might help them answer their research questions.

We see two important steps to bring data science into all aspects of cancer research:

  • There need to be more highly skilled biologist-data scientists who can pioneer methodologies and build tools for the rest of the field to use. To increase collaboration between the different methodological subfields of cancer work, we need more training that helps students and mid-career professionals in cancer biology research get up to speed and become comfortable working more independently with data. We think that this training should expose students to statistics and computer science, ideally with a healthy dose of machine learning and model-building so that the methods of cutting-edge data science can be brought to bear against cancer.
  • The best tools and methods need to be made accessible to non-technical users so that everyone in the field can use data more effectively in their day-to-day work. In other words, we need more tools backed by data science best practices that are accessible to users with a wide range of technical backgrounds. One of the best ways to increase engagement with data is to make it easy for users to explore and ask questions of the data independently within a framework that allows exploration and collaboration, and which defaults to statistical best practices.

As more highly skilled people enter the cancer field, today and in the future, we also need to foster a better set of career options for them so they’re incentivized to remain in the field. Traditionally, individuals doing advanced biostatistics in an academic context have filled a supporting role that assists the work of other researchers. People we interviewed suggested that academia should offer more full-time “staff scientist” positions: roles for people with domain experience that come with more autonomy and a viable long-term career path that is competitive with available alternatives in the private sector. These positions, which already exist in other academic fields like physics, play an important part in distributing methodological knowledge across the field, advocating for data and the correct use of statistical methodology, and doing independent work to advance the methodological cutting edge.

Additionally, the government should support and encourage partnerships between the cancer research community and the tech industry to encourage the sharing of information and skills between these two sectors.

Training the next generation of biological researchers, creating new interdisciplinary partnerships, and establishing enduring and valuable channels of communication between the tech industry and the cancer research community will take time and effort. Ultimately, the people working in this field will be the number one determinant of the pace at which new therapies are developed. The cancer research space is already filled with talented, hard-working researchers; the next critical step is to create the conditions for their success by enabling more researchers to participate in the cancer big data revolution.

This post was co-authored by Angelo Mancini, Ola Topczewska, Katie Malone and Todd Harris.
