SciSight: Helping scientists visualize and explore COVID-19 literature with AI
By Tom Hope and Jonathan Borchardt
Doctors and scientists worldwide are joining forces in an unprecedented concerted effort to understand and treat COVID-19. Racing against the exponentially growing number of infections, researchers are beginning to make advances. Creating proteins tailor-made to help stop the virus, identifying viral genome sequences, and gathering evidence on viral transmission routes are some prominent examples.
However, researchers in biology and medicine have long been wrestling with a very different kind of exponential growth — the flurry of over a million biomedical research papers getting published every year, at a rate that continues to rapidly increase. In a matter of several months, a few thousand papers have been published on the new coronavirus, with over one thousand appearing on the open preprint sites bioRxiv and medRxiv alone.
There are many studies on previous coronaviruses and other closely related areas in virology, epidemiology, and biology, containing potentially valuable knowledge for connecting the dots across the complex network of science. This wealth of existing and new ideas represents not only an opportunity but also a challenge: keeping up with the literature and making sure the most pertinent knowledge is readily available. This is especially critical in times when new information is both rapidly emerging and urgently needed, and knowledge sharing across labs is required.
To help address this problem, our team at the Allen Institute for AI and Semantic Scholar launched the COVID-19 Open Research Dataset (CORD-19), a corpus of tens of thousands of papers related to past and present coronaviruses. The corpus continues to be updated on a weekly basis and is already being used by multiple research groups.
Today we take another step and launch SciSight, an AI-powered graph visualization tool enabling quick and intuitive exploration of associations between scientific concepts in the CORD-19 corpus. It initially focuses on proteins, genes, cells, drugs, and diseases, which are fundamental to the study of the virus. Our goal is to give researchers a clearer picture of what information the CORD-19 dataset contains, and also to help them discover new and relevant knowledge. To extract relevant information from papers, we used AI2’s SciBERT, enhanced by training on a larger corpus of papers and fine-tuned on several biomedical entity recognition tasks. SciSight is already showing promising preliminary results, and in the coming weeks we hope to enhance it with finer-grained relations between entities, better disambiguation to handle the richness of scientific language, and more features to dynamically visualize and explore the emerging literature network around COVID-19.
Discovering the unknown unknowns
Users of SciSight can search for a term or concept of interest, or get suggestions based on important COVID-19 topics. Searching for a term displays a network of top related terms mined from the corpus. For a timely example, let’s look at chloroquine, the malaria drug that recently created some controversy regarding its potential repurposing. Users searching for chloroquine can see its network of associations, such as its potential connection to liver damage.
Clicking an edge shows all related papers, with links to the full papers for users who want to dig deeper. Clicking a term opens up its own network of associations, letting users traverse and explore the network of scientific concepts.
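To make the interaction concrete, here is a minimal sketch of a co-mention graph that supports the two operations described above: looking up the top related terms for a query, and listing the papers behind an edge. It is not SciSight's actual implementation. The toy paper IDs, the entity sets, the function names, and the choice to rank related terms by raw co-mention count are all assumptions made for illustration, and entity extraction is assumed to have already happened upstream.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: paper ID -> entities assumed to be already extracted.
PAPERS = {
    "paper-1": {"chloroquine", "liver damage", "malaria"},
    "paper-2": {"chloroquine", "malaria"},
    "paper-3": {"chloroquine", "liver damage"},
    "paper-4": {"disulfiram", "coronavirus"},
    "paper-5": {"chloroquine", "malaria", "coronavirus"},
}


def build_graph(papers):
    """Map each unordered term pair (an edge) to the papers mentioning both terms."""
    edges = defaultdict(set)
    for paper_id, terms in papers.items():
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(terms), 2):
            edges[(a, b)].add(paper_id)
    return edges


def related_terms(edges, term, top_k=3):
    """Return up to top_k (neighbor, co-mention count) pairs for a term.

    Ranking by raw count is an assumption; ties are broken alphabetically.
    """
    counts = {}
    for (a, b), paper_ids in edges.items():
        if term == a:
            counts[b] = len(paper_ids)
        elif term == b:
            counts[a] = len(paper_ids)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def papers_for_edge(edges, term_a, term_b):
    """Return the sorted paper IDs behind the edge between two terms."""
    key = tuple(sorted((term_a, term_b)))
    return sorted(edges.get(key, set()))


graph = build_graph(PAPERS)
neighbors = related_terms(graph, "chloroquine")
evidence = papers_for_edge(graph, "liver damage", "chloroquine")
```

In this sketch, searching corresponds to `related_terms`, clicking an edge corresponds to `papers_for_edge`, and clicking a term amounts to calling `related_terms` again with the neighbor as the new query.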
In the process of developing SciSight, we’ve begun conducting interviews and preliminary user studies with practitioners and experts, who have provided valuable insights about their needs. For example, Dr. Lia Schmitz, PharmD, said that SciSight “offers a different and potentially powerful new way to sort through publications. The need for such a tool has become exceptionally obvious in the recent rush of information surrounding the viral pandemic. I am frequently asked my clinical opinion on combinations of medication therapies that, until about three weeks ago, were virtually unheard of.” Dr. Schmitz added that requests from the healthcare facilities she serves are often vague, such as “what are the latest recommendations for medication management?”
To help her answer such questions, “it is helpful to start with keywords which I know are relevant, and see which co-mentioned terms come up as I search. For example, I learned that Disulfiram had been studied in vitro to fight the virus. When you don’t know what you don’t know, SciSight can help reveal connections that publications are starting to make. It has capabilities beyond the search engine I currently use, PubMed’s MeSH search.”
In addition to exploring within a field, connecting the dots between fields is known to be a major catalyst for innovation. Noa Granot, MD and postdoctoral research fellow at the Fred Hutchinson Cancer Research Center, was aware of hematological manifestations of COVID-19 and found related papers she considered interesting by using SciSight. “This way of presenting the information, in the form of concepts and links between them, is an intuitive and convenient way to look into research, and also dovetails with how we think as medical professionals. It’s nice to have such a tool to complement the usual search engines we use in our field.”
In the future, we plan to enhance SciSight with finer-grained relations and links. One important example, according to Dr. Schmitz, would be to help explore risk factors. “For example, I was interested in finding if Ibuprofen was a risk factor for coronavirus. This is muddy because ibuprofen is a risk factor for multiple unrelated things. Ibuprofen is commonly used to treat fevers and various viral effects. It is really powerful to have the ability to search relationships such as ‘is a risk factor for,’ it weeds out the papers irrelevant to the search.”
As a postdoctoral researcher at the University of Washington and on the Semantic Scholar team at the Allen Institute for AI, Tom Hope specializes in boosting innovation and scientific discovery with NLP and data mining. Jonathan Borchardt is a senior software engineer on the ReViz team at the Allen Institute for AI who specializes in user experience, data visualization, and UI best practices.