Combating COVID-19 with Data Science

A perspective from an unexpected participant

Photo by Martin Sanchez on Unsplash

This story covers my experience using machine learning and data science to help researchers find answers in the COVID-19 Open Research Dataset (CORD-19). CORD-19 was released “to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease”. What started as an effort to chip in and help has led to my work being covered in a Wall Street Journal article and cited on the COVID-19 Kaggle community contributions page.

This article gives a background on how I got involved, along with the evolution of my technical approach to the dataset.

Starting a new company

Last year, Data Works, a company I co-founded and helped build from the ground up, went through an acquisition. It was a good time to move on to new adventures, and we were extremely fortunate to find a buyer who would take the company to the next level while treating people the right way.

After a transition period, it was time to think about the next adventure. This year, I founded NeuML, a machine-learning software and services company. The first concept for NeuML was neuspo, a real-time sports event tracking and analytics site, discussed in this article.

I was extremely excited and looked forward to the NCAA Tournament. Tournament predictions were being published on neuspo and a model was ready to be used in the Kaggle March Madness Competition. It was also exciting to see how neuspo would handle tracking the craziness that is March Madness.

On March 7th, NeuML had the first version of its website online. I exchanged text messages with my Mom to show her the site and we discussed bracketology; she was really into sports. It was the type of conversation I had had thousands of times. A few hours later, I received a call that my Mom had passed away unexpectedly.

Moving Forward

Photo by Ross Parmly on Unsplash

I spent the next week traveling and going through the process of losing a parent, and in that time the world rapidly came to a standstill. The week of March 9th, 2020 will be a week we all remember, as we watched COVID-19 spread, infecting people in all walks of life, including sports players and movie stars. COVID-19’s indiscriminate targeting led all major sports to put their seasons on hold. March Madness was cancelled. Sports, like all other aspects of life, were put on hold.

neuspo was now also on hold. With life being in quarantine, I wanted to find a way to help and found the CORD-19 dataset on Kaggle. Despite having no medical background and being unfamiliar with both the fields of epidemiology and medical literature, I thought I could help by applying my background in data engineering and analytics.

Searching CORD-19

Photo by João Silas on Unsplash

My initial approach was to download the dataset and build a semantic similarity search index. The cord19q project was started on GitHub. cord19q built on a similar project, codequestion, which finds similar answers to technical questions.

Given the raw, unstructured nature of the data, transformation processes were needed to get the data into a format that could be utilized by machine learning models. A similarity index was built over the transformed data. The advantage of a similarity index vs a keyword index is that it has support for phrase and term variations, allowing us to find not only exact but similar content.
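To make the difference between a keyword index and a similarity index concrete, here is a minimal sketch. It is not the cord19q implementation: it swaps in character trigram counts and cosine similarity as a simple stand-in for learned sentence embeddings, and the function names and sample sentences are made up for illustration.

```python
import math
from collections import Counter

def trigrams(text):
    """Count character trigrams per word, padding each word with < and >."""
    grams = Counter()
    for word in text.lower().split():
        padded = f"<{word}>"
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_search(query, docs):
    """Exact phrase match: only returns documents containing the query verbatim."""
    return [doc for doc in docs if query.lower() in doc.lower()]

def similarity_search(query, docs, limit=1):
    """Rank documents by similarity to the query, so term variations still match."""
    query_vector = trigrams(query)
    ranked = sorted(docs, key=lambda doc: cosine(query_vector, trigrams(doc)), reverse=True)
    return ranked[:limit]

# Made-up sentences standing in for transformed CORD-19 text
docs = [
    "The median incubation period was estimated at five days",
    "Hospital capacity was strained in several regions",
    "Masks reduced transmission in household settings",
]
query = "incubation periods"

exact_matches = keyword_search(query, docs)
similar_matches = similarity_search(query, docs)
```

Here the exact phrase search finds nothing because the text says "period" rather than "periods", while the similarity search still ranks the incubation sentence first. A real index would use embeddings that also capture meaning, not just spelling overlap.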

cord19q was integrated into a Kaggle notebook titled CORD-19 Analysis with Sentence Embedding, which builds reports to answer task questions as part of the CORD-19 research challenge. The notebook started to gain traction as the Kaggle team shared results from the challenge with the medical community. It was a good start, but much more was needed.

Developing a skepticism of research data

My initial mindset going into this challenge was that search was key and that we needed to bring text matches to the attention of researchers. Much of my work was focused on finding the most concise matches to a query and bringing those to a researcher’s attention.

With a limited medical background, my understanding of how medical literature is evaluated was limited. A snippet of text in a medical article isn’t as helpful without the context of how those conclusions were drawn. For example, a text match in a review article isn’t as important as a conclusion drawn from a large medical study. A study with a sample size of 5 participants holds much less weight than a study with 500 participants. The methodology of how data is collected or patients enrolled is also scrutinized.

This is more clear now but it was not on my mind at the start of this effort. More needed to be done to extract this information and guide researchers.

Study Design

Photo by Alfons Morales on Unsplash

Extracting the backing study metadata became the primary focus of my work on CORD-19. The effort on Kaggle now has a team of curators who are going through to aggregate the best results over a collection of sources. The more that can be done to help curators and medical researchers quickly triage an article, the better.

My initial approach was a rules-based approach using a pre-defined vocabulary to label studies with a study design type. A rules-based approach was also used to extract study sampling methodology, size and study statistics. This approach generated decent results but wasn’t going to scale. It did allow me to learn about the domain and data more, which is critical in teaching a computer to learn. We can’t teach what we don’t know.
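A rules-based pass of this kind can be sketched in a few lines. The vocabulary, labels and regular expression below are hypothetical examples chosen for illustration, not the rules actually used in the project.

```python
import re

# Hypothetical vocabulary: study design label -> trigger phrases
DESIGN_VOCABULARY = {
    "systematic review": ["systematic review", "meta-analysis"],
    "randomized controlled trial": ["randomized controlled", "randomised controlled"],
    "retrospective cohort": ["retrospective cohort", "retrospective study"],
    "case series": ["case series", "case report"],
}

# Hypothetical pattern: a number followed by a word that names the sample
SAMPLE_PATTERN = re.compile(
    r"\b(\d[\d,]*)\s+(?:patients|participants|subjects|cases)\b", re.IGNORECASE
)

def label_design(text):
    """Return the first study design whose trigger phrase appears in the text."""
    lowered = text.lower()
    for label, phrases in DESIGN_VOCABULARY.items():
        if any(phrase in lowered for phrase in phrases):
            return label
    return None

def extract_sample_size(text):
    """Return the first sample size mentioned in the text, or None."""
    match = SAMPLE_PATTERN.search(text)
    return int(match.group(1).replace(",", "")) if match else None

# Made-up sentence for illustration
sentence = "We conducted a retrospective cohort study of 1,099 patients admitted in January."
design = label_design(sentence)
sample_size = extract_sample_size(sentence)
```

The scaling problem shows up quickly with this style: every new phrasing ("were enrolled", "n = 41", sample sizes written as words) needs another hand-written rule.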

The next approach was to move to a machine learning backed approach. A dataset with labeled study design metadata was assembled. Much of this work required going line by line, reading sentences and articles. It also drew on studies that were manually labeled as part of the curation effort. Two models were built to classify study design and study attributes; the technical approach is discussed here. These models increased accuracy while providing a repeatable process for further improvement through additional labeling and training.
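As a rough illustration of learning labels from examples instead of writing rules, here is a small multinomial Naive Bayes text classifier using only the standard library. This is an assumption-laden stand-in: the article does not say which model types were used, and the class name, training sentences and labels below are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token
            score = math.log(count / total)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented labeled sentences standing in for the hand-labeled dataset
examples = [
    ("patients were randomly assigned to treatment or placebo", "randomized controlled trial"),
    ("a double blind randomized trial of the drug", "randomized controlled trial"),
    ("we retrospectively reviewed medical records of admitted patients", "retrospective cohort"),
    ("records from the hospital database were analyzed retrospectively", "retrospective cohort"),
]

model = NaiveBayes()
model.train(examples)
first = model.predict("participants were randomly assigned to placebo")
second = model.predict("medical records were reviewed retrospectively")
```

The appeal over hand-written rules is the repeatable loop: label more sentences, retrain, and the model picks up new phrasings without anyone editing a vocabulary.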

What’s next

This is an ongoing effort and new people are joining every day to help. Data is updated weekly, with systems being reviewed to highlight the most relevant COVID-19 research across a wide array of topics. If you’re interested in getting involved, take a look at this discussion on Kaggle.

I am honored to work with many smart and talented individuals, seeing medical experts join together with technical experts to help beat COVID-19. I would like to thank Anthony Goldbloom and his team at Kaggle for organizing and leading this effort. They’ve been extremely helpful and supportive. The Kaggle infrastructure has enabled me to quickly build analytics products and widely share the output.

I look forward to a return to normalcy. It certainly is a unique period for me as it is for many. Someday we’ll reflect back on the uncertainty and instability of these times. I’m looking forward to the day when we’re talking about sports, bracketology and how Tom Brady ended up on the Bucs. Until then, we have work to do to help get us there, in whatever capacity we can.

Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.