Using Open Data to Fight COVID-19

How we used Meta to empower computational scientists to mine coronavirus research

By Alex D. Wade

Creative rendition of SARS-CoV-2 virus particles. Note: not to scale. Credit: National Institute of Allergy and Infectious Diseases, NIH

Accelerating Research Amidst a Global Crisis

Earlier this year, as scientists raced to understand COVID-19, immediate access to research and clinical findings became critical so that researchers could share with and learn from each other. Open science has become pivotal in addressing the pandemic.

Open science is an effort to embrace each step in the research process as an opportunity to share, improve, collaborate, and accelerate the cycles of science. More broadly, open science is a set of practices that aims to make science more equitable, reproducible, verifiable, and efficient. Open sharing of software code, protocols, and data allows researchers to reuse resources and reproduce experiments, or to formulate new hypotheses and experiments. The sharing of preprints on platforms such as bioRxiv and medRxiv can shorten the time to discovery, and it is already accelerating efforts to combat COVID-19. Early sharing of preprints on COVID-19 grew in ways not seen in other disease outbreaks, such as Zika and Ebola.

In March, the White House Office of Science and Technology Policy (OSTP) issued a call to the AI/ML community, inviting them to apply state-of-the-art Natural Language Processing (NLP) approaches to the growing volume of research related to COVID-19, as well as to earlier research on the broader coronavirus family (including the SARS and MERS outbreaks). In support of this effort, the Chan Zuckerberg Initiative’s Meta team joined Georgetown University’s Center for Security and Emerging Technology, the Semantic Scholar team at the Allen Institute for AI (AllenAI), Google’s Kaggle, and Microsoft Research in a collaboration to help accelerate our collective ability to analyze and better understand the vast and growing amounts of research and clinical reports related to coronaviruses. The COVID-19 Open Research Dataset (CORD-19) is the result of this collaboration.

COVID-19 Open Research Dataset

CORD-19 was designed to remove the impediments to analyzing a large corpus of COVID-19 papers — providing a refinery and a pipeline for the AI/ML community to easily analyze the COVID-19 literature and provide potentially actionable insights for researchers and clinicians. Our goal with the CORD-19 project was to supply AI/ML and biomedical researchers with the largest collection to date of publications and preprints on COVID-19, SARS, and MERS in a consistent and machine-readable format. The National Institutes of Health’s National Library of Medicine also worked with publishers to expand the open-access coverage to as many relevant scientific publications as possible.

This team of collaborators held its first call to discuss the idea on March 9th, and we released the first version of the dataset one week later, on March 16th. The data were refreshed with new publications every week through May; since then, they have been updated daily. The initial release contained nearly 29,000 records, 44% of which (~13,000 records) included the full text of the publication. Today, the dataset has grown to over 242,000 records, 42% of which (~102,000) include the full text.
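The full-text figures above follow directly from the rounded record counts and percentages. A minimal sketch of that arithmetic is shown here; the dictionary keys and the helper name full_text_count are illustrative choices, not part of the dataset, and the inputs are the rounded numbers quoted in the text rather than exact counts.

```python
# Rounded record counts and full-text shares as quoted in the article.
# Key names are illustrative only; they are not dataset fields.
releases = {
    "initial": {"records": 29_000, "full_text_share": 0.44},
    "current": {"records": 242_000, "full_text_share": 0.42},
}


def full_text_count(records: int, share: float) -> int:
    """Estimate how many records include full text, rounded to a whole record."""
    return round(records * share)


# Estimated full-text records per release.
estimates = {
    name: full_text_count(r["records"], r["full_text_share"])
    for name, r in releases.items()
}

# Overall growth in record count between the two snapshots.
growth_factor = releases["current"]["records"] / releases["initial"]["records"]

for name, count in estimates.items():
    print(f"{name}: about {count:,} full-text records")
print(f"record count grew roughly {growth_factor:.1f}x")
```

Because the inputs are rounded, the estimates land close to, rather than exactly on, the ~13,000 and ~102,000 figures given above.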

This work is core to Meta’s goal of helping biomedical researchers keep up to date with the latest research. Meta aggregates, mines, and indexes content from many sources, including academic publishers, the US National Library of Medicine, and preprint repositories, and presents it to users in continuously updated feeds.

Distribution of coronavirus papers by publication year (source: CORD-19 preprint)

CORD-19 carries short- and long-term promise for COVID-19 response efforts and machine-learning applications more broadly. With global cases nearing 28 million and growing, research breakthroughs are urgently needed. Upon launch, the data were made available via Amazon Web Services as well as on Google’s Kaggle platform. Through Kaggle’s Tasks feature, we invited artificial intelligence experts around the globe to apply data- and text-mining approaches to high-priority research questions, and to share their code and data back to the community.

CORD-19 enables rapid analyses of the latest coronavirus literature, helping to identify new trends and insights as our understanding of the disease expands. It would take a single researcher years to read and extract the body of knowledge contained in the deluge of research papers published so far in 2020. Unlike human-driven literature reviews, CORD-19 updates daily to incorporate novel research to fuel the text-mining tools supplied by the AI community.
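To give a sense of what the simplest kind of text mining over such a corpus looks like, here is a minimal sketch that counts how many abstracts mention each of a few terms. Everything in it is an assumption made for illustration: the three records are hand-written placeholders rather than real dataset entries, the "title" and "abstract" field names are hypothetical, and the function count_papers_mentioning is not taken from any contributed notebook. Real analyses shared by the community are far more sophisticated.

```python
import re
from collections import Counter

# Hand-written placeholder records (NOT real dataset entries).
# The "title" and "abstract" field names are assumptions for this sketch.
SAMPLE_RECORDS = [
    {"title": "Paper A", "abstract": "Transmission of the virus in households and hospitals."},
    {"title": "Paper B", "abstract": "Vaccine candidates and antiviral treatment options."},
    {"title": "Paper C", "abstract": "Household transmission and vaccine response in adults."},
]


def count_papers_mentioning(records, terms):
    """For each term, count the abstracts containing it as a whole word (case-insensitive)."""
    counts = Counter()
    for record in records:
        # Tokenize into lowercase alphanumeric words; a set avoids double counting.
        words = set(re.findall(r"[a-z0-9]+", record["abstract"].lower()))
        for term in terms:
            if term.lower() in words:
                counts[term] += 1
    return counts


mention_counts = count_papers_mentioning(
    SAMPLE_RECORDS, ["transmission", "vaccine", "treatment"]
)
for term, n in mention_counts.most_common():
    print(f"{term}: mentioned in {n} abstract(s)")
```

Whole-word matching is a deliberate simplification: it keeps the example short but would miss plurals and synonyms, which is one reason practitioners reach for NLP models instead.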

Since CORD-19’s launch, this collective effort has produced a free and open dataset of more than 242,000 scholarly articles on the coronavirus family of diseases, with 2.6 million views, 135,000 downloads, and over 1,600 shared code contributions back to the Kaggle community. While the dataset is merely an aggregation of what has already been published, CORD-19 simplifies data-mining efforts and offers encouraging lines of inquiry for the biomedical research community as it seeks treatments and a cure. We should acknowledge, however, that due to licensing restrictions fewer than half of the articles in the dataset are available as full text, preventing researchers from getting the complete picture of COVID-19 and its impacts. Opening up scientific content for data- and text-mining analyses was key to the success of this project, and this model may have enduring implications. However, this approach will be constrained until all research content has been made fully open access.

CORD-19 demonstrates the potential of collaboration between government, industry, and academia. Its launch six months ago offered a quick solution to the lack of open, up-to-date, and machine-readable data for coronavirus literature. Driven by a global crisis, it has succeeded because of the OSTP’s request and the collaborative efforts of all involved to quickly bring this resource to fruition. As CORD-19 shows, the research community has the will to mobilize in service of important and difficult scientific questions. It’s now incumbent upon all of us to provide the direction and unleash the power of open data for machine learning to tackle its next set of problems.

To learn more about CORD-19, read the preprint. To access the dataset, visit the Semantic Scholar download page.

As a project of the Chan Zuckerberg Initiative, Meta is free — and accessible to researchers everywhere. Sign up for your account at

Meta is a free research discovery tool from the Chan Zuckerberg Initiative, providing a faster way to understand and explore science through personalized feeds.