Data Science at the NIH and in healthcare

The National Institutes of Health (NIH) are engaged in an ambitious effort to harness advances in data science, machine learning, and artificial intelligence (AI) to support programs like the Precision Medicine Initiative, the Cancer Moonshot, and the BRAIN Initiative. To accelerate progress, the NIH issued a public Request for Information (RFI) on the proposed Strategic Plan for Data Science. I submitted a letter, and a number of people asked me to make it public. Since a submission becomes part of the public record as soon as it is filed, and the submission period has now closed, I’m putting the full text below.

While this letter is specific to the NIH, many parts are salient to the broader questions of ethics, security, and how we need to think about data going forward. You’ll notice many similar themes, including the need for data scientists to take increased responsibility and the opportunity to use “data for good”.

Most of all, I’d like to hear from you. What do you think?

26 March 2018

Dear National Institutes of Health:

Following the Request for Information on the National Institutes of Health (NIH) Strategic Plan for Data Science, I would like to offer my recommendations on the draft plan. This advice stems from a career in academia as a mathematician; in industry, leading data efforts at companies like LinkedIn, where we are credited with co-coining the modern instantiation of the term “data scientist”; in government as the U.S. Chief Data Scientist with responsibility for the Precision Medicine Initiative (PMI) and parts of the Cancer Moonshot; and as a former advisor to the NIH on data science initiatives.

Let me begin by commending the team on an outstanding draft framework. It is impressive how far the NIH has come since initial discussions of data science in NIH activities more than three years ago. In particular, let me applaud the efforts of the NIH staff who continue to champion the opportunity to improve Americans’ lives through science. Kudos to you all.

As with all drafts, there are opportunities to iterate. With that in mind let me offer the following thoughts:

1. Ethics and security. As was called out in the White House Artificial Intelligence and Automation Report, every training program on data needs to have ethics and security integrated into the curriculum. Currently, students who learn about database design are not taught about basic attacks that can compromise user data. Additionally, there is little discussion of how to design and architect technical solutions that limit access when systems are compromised and breached.

Today’s students are rarely taught about the ethical implications of data collection, analysis, etc. With data, as with many things, just because we can doesn’t mean we should. Today’s current events demonstrate that increased regulation is likely, and the NIH can, and should, be a leader on this front. The NIH has a rich history of leadership in this domain, similar to the way that the biomedical field has led on bioethics.

As we continue to use data and other data techniques, such as machine learning and AI, it is critical to investigate how bias mitigation (for the data and the model) and model transparency can be implemented effectively. These are open questions in AI and machine learning research, but we are seeing the impact of bias in the algorithms and data in other fields (e.g., criminal justice bail risk assessment technologies).

The NIH should consider ensuring all training grants require that ethics and security be taught as part of the integrated curriculum (not just as electives). It should also ensure that these courses are well integrated not just within the data science community, but also within the traditional biomedical components of academia. This is because the majority of future experts in the country will need to have some training in data science.

The NIH should insist on investing in new models of security such as bug bounty programs, which have been incredibly successful in the Federal Government (e.g., Hack the Pentagon, which enabled the discovery of critical vulnerabilities within 13 minutes). Given that AI is being used to create new attacks, it will be essential that the NIH find new ways to educate researchers about new threats. This should include closer collaboration with the Department of Justice, the Department of Homeland Security (DHS), and the National Institute of Standards and Technology, in the same way that industry collaborates.

Finally, the NIH needs to invest in understanding how model and data bias may affect research and clinical care. This should also address questions of reproducibility and the “black box” nature of these techniques.

2. Law enforcement and access to data. The NIH should take progressive action to ensure that data contributed by volunteers cannot be accessed by law enforcement. Trust is consistency over time, and the NIH has worked hard to address the wrongs of the past (e.g., the Lacks family). If law enforcement obtains any sensitive information from medical data sets, it will cripple efforts such as the All of Us campaign. While this may seem out of the realm of possibility, census data was used during WWII to identify Japanese-Americans for internment. Also, there are discussions of DHS using the database of location information submitted by DACA recipients to track them down and deport them. And now the 2020 Census is slated to ask about citizenship. These trends risk undermining public confidence in NIH data and research efforts, especially with the inclusion of more sensitive data like genetic and genomic data.

3. Common Rule. While great progress was made in reforming the Common Rule, it is just the beginning. Already, the updates to the Common Rule are lagging behind technology and public sentiment on consent. Additionally, updates are being delayed further by the Office for Human Research Protections.

Unfortunately, the current model of Institutional Review Boards (IRBs) is not sufficient to keep up with the rate of change in technology. For example, part of the reason for the delay of the All of Us Program’s launch was the conflict between rapid, agile iteration of technology and a cumbersome IRB process that had to review each and every minor change in participant-facing language. In some cases, IRB review of a simple wording edit on a web-based platform could take weeks. Additionally, the research community needs to be able to do ad hoc “mining” of large, combined datasets to find correlations that can lead to insights and “traditional” clinical research. The NIH should train and support IRBs as they consider applications of new technology to research (both benefits and risks).

Given the pace of these changes, it is critical that the federal government more regularly update the Common Rule across the 18 agencies involved, including engaging the Office of Management and Budget and proactively developing consensus externally with the broader research community. Agencies should consider mandating regular updates, on a two- or three-year timeline, to ensure that we don’t leave an economically significant rule and industry without important guidance for another 20+ years.

4. Machine Learning (ML) and AI are going to change the game. As the draft strategic plan describes, ML and AI are transforming every industry. To make sure that the NIH capitalizes on these advancements, it is essential that the NIH think outside of its traditional models of funding and thinking. The most aggressive investments in ML/AI are taking place outside of typical NIH grantees. This includes computer science and data science departments as well as industry (Google, Facebook, Microsoft, and Amazon). The NIH should consider new models for partnership with these groups, as there is limited incentive for them to collaborate with the NIH due to the size of the data they collect and their lack of need for funding.

The NIH should recognize the lessons learned from the DARPA Grand Challenge that kick-started the self-driving car movement and, in particular, the lesson that the general consumer is likely to benefit sooner than the Department of Defense.

It is industry (primarily consumer internet and e-commerce) that has been driving technology innovation in data science. This is due to its investments in hardware and its support of the open source movement (Kafka [which was created at LinkedIn], Hadoop, Spark, etc.). As such, these technologies are optimized for industrial problems rather than the problems that support the NIH mission.

There is an applicable lesson from the National Weather Service (NWS), as the investments of hardware companies are increasingly aimed at supporting the Internet instead of other domains such as weather forecasting. The U.S. has fallen behind the Europeans and the Japanese in its investments in supercomputing for the applications required to improve weather forecasting. This could potentially have been remedied with increased collaboration with industry.

Finally, there needs to be better investment in the “cleaning” and extract, transform, and load (ETL) of data. As I have pointed out many times, as CrowdFlower has validated, and as many NIH researchers already know, cleaning data is 80% of the work. The tools of today are still subpar and limit the ability to bring large data sets together in a timely and cost-efficient manner. The investment in this technology is happening in industry, through startups and larger corporations, and in Federal Agencies such as the Department of Defense and the National Science Foundation. To ensure that these technologies also benefit the broader needs of the NIH, the NIH should actively engage in joint partnerships for research and development.

5. Increasing Federal Collaboration. One of my greatest concerns when I was the U.S. Chief Data Scientist was the lack of collaboration between Federal Agencies. This is why the Data Cabinet, which includes more than 40 Federal Chief Data Officers/Scientists, was created with the goal of improving Federal data collaboration.

The NIH should make sure to participate in those meetings to learn and share best practices. Additionally, the primary funding efforts in data science, machine learning, and AI are taking place at the National Science Foundation (NSF) and the Departments of Defense, Energy, and Commerce (NOAA and Census). The NIH should continue to find new models for partnership with the Food and Drug Administration (FDA), the Centers for Disease Control and Prevention (CDC), and the Department of Veterans Affairs. Each of these organizations has key data that, when brought together with health data, has the potential to revolutionize medicine.

Of note, other governments have recognized the value of this approach and are aggressively investing in a multidisciplinary approach that leverages data and biology to gain a competitive advantage (e.g., China’s $9 billion investment in precision medicine, the U.K. Biobank, etc.).

6. Increasing access to data. There cannot be a one-size-fits-all approach to data. In some use cases there need to be large data sets, and in others there need to be APIs. A good example of this is clinical trial data and the projects to improve API access to the data. The NIH should find ways to continue to open up access to data to a broader set of users. As we’ve seen in industry, this will fuel further innovation as the public learns to build new things with the data.

One of the powerful assets that the NIH uniquely has is its data sets. Examples include dbGaP and the All of Us program. These are national treasures that enable a unique level of citizen science and allow academics to efficiently leverage the NIH. To deliver on the data science mission, the NIH should continue to support M-13-13 and the principle that all data should, by default, be open and machine readable.

Let me conclude by reiterating my gratitude. Thank you for continuing to recognize the opportunity for data to improve the lives of every American.

DJ Patil, Former U.S. Chief Data Scientist

*Note: Parts of the letter were underlined, but Medium doesn’t support that functionality. The full underlined letter can be found here