Published in Kaggle Blog

With sports (and everything else) cancelled, this data scientist decided to take on COVID-19 | A Winner’s Interview with David Mezzetti

When his hobbies went on hiatus, Kaggler David Mezzetti made fighting COVID-19 his mission.

Photo by Clay Banks on Unsplash

David Mezzetti is the founder of NeuML, a data analytics and machine learning company that develops innovative products backed by machine learning. He previously co-founded Data Works and built it into a well-respected software services company of 50+ people. In August 2019, Data Works was acquired and Dave worked to ensure a successful transition.

David: My technical background is in ETL, data extraction, data engineering and data analytics. I spent over a decade of my career developing large-scale data pipelines to transform both structured and unstructured data into formats that can be utilized in downstream systems. I also have experience in building large-scale distributed text search and Natural Language Processing (NLP) systems.

I’ve worked in the data analytics space for 15+ years but did not have prior knowledge of medical documents or the medical industry.

I’ve participated in a couple of March Madness competitions. I was looking forward to the 2020 tournament and had a model I was very excited about. The way the season went was perfect for the strengths of the model, but we’ll never know how it would have performed.

When the 2020 March Madness competition was cancelled and COVID-19 was really starting to hit hard, I wanted to find a way to get involved and help. NeuML was working on a real-time sports event tracking application, neuspo, but sports, along with everything else, were being shut down and there were no sports to track.

With sports and life on a hiatus, I saw the Kaggle CORD-19 challenge and felt I had the background to be able to contribute. On top of everything going on, my Mom passed away in early March. She was a high school biology teacher and would have been happy to know I was involved. This effort was also a good distraction from everything going on and a way to feel like I could do my part to help beat COVID-19.

Let’s get technical

The solution consisted of two main parts: a sentence-embeddings-based search index and a custom BERT QA model that extracts column-based answers, known as summary tables, for a specific list of questions.

For each query, an embeddings query identifies the list of best matching documents. Common fields including date, title, authors and the reference URL are stored as search result columns.
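
As a rough illustration of what an embeddings query does (a minimal sketch, not the actual cord19q code), the snippet below ranks documents by the cosine similarity between a query vector and each document's sentence vectors. The index layout, the `search` helper and the toy three-dimensional vectors are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, limit=2):
    """Rank documents by their best-matching sentence.

    index: list of (doc_id, sentence_vector) pairs (hypothetical layout).
    """
    best = {}
    for doc_id, vec in index:
        score = cosine(query_vec, vec)
        if score > best.get(doc_id, float("-inf")):
            best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:limit]

# Toy 3-dimensional vectors standing in for real sentence embeddings
index = [
    ("doc-a", [1.0, 0.0, 0.0]),
    ("doc-a", [0.0, 1.0, 0.0]),
    ("doc-b", [0.7, 0.7, 0.0]),
    ("doc-c", [0.0, 0.0, 1.0]),
]
results = search([1.0, 0.1, 0.0], index)
```

Here `results` holds the two best documents with their scores; the metadata columns (date, title, authors, URL) would then be looked up by document id.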

A custom BERT QA model was developed to add additional columns to the list of search results. For example, given a search of the CORD-19 dataset for “hypertension”, a separate column is added with the extracted answer to the question “What is the risk factor of developing severe symptoms for patients with hypertension?”
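
The idea of adding a question-driven column can be sketched as follows. This is only an illustration: `add_answer_column`, the `Severe` column name and the rows are made up, and `toy_answer_fn` is a trivial placeholder standing in for the BERT QA model.

```python
def add_answer_column(rows, column, question, answer_fn):
    """Return copies of rows with an extra column holding the extracted answer.

    answer_fn(question, text) is any callable returning an answer string,
    for example a wrapper around a QA model. It is a placeholder here.
    """
    return [{**row, column: answer_fn(question, row["text"])} for row in rows]

def toy_answer_fn(question, text):
    """Placeholder extractor: returns the first sentence that mentions 'risk'."""
    for sentence in text.split(". "):
        if "risk" in sentence.lower():
            return sentence.strip(". ")
    return ""

# Made-up search results, for illustration only
rows = [
    {"title": "Study A", "text": "We enrolled 100 patients. Hypertension raised the risk of severe symptoms."},
    {"title": "Study B", "text": "This paper models transmission."},
]
question = "What is the risk factor of developing severe symptoms for patients with hypertension?"
table = add_answer_column(rows, "Severe", question, toy_answer_fn)
```

Rows where the extractor finds nothing simply get an empty cell, which keeps the table shape stable across queries.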

Much of the search logic was based on a prior project, codequestion (https://github.com/neuml/codequestion). codequestion builds a sentence embeddings index over coding questions to match developers’ questions with previously asked questions and answers. Given that I already had that code base, I took that approach when starting with the CORD-19 dataset, and much of the code is still derived from codequestion today.

The CORD-19 dataset has a metadata CSV file with the full list of documents along with full-text stored in separate JSON files. An ETL process was built to take the CSV, find the corresponding text articles and load the data into a SQLite database. The text is then broken down into sentences per document, and those sentences are mapped to sentence embeddings using a BM25 + fastText method described in this Medium article.
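
A minimal sketch of that ETL flow, assuming a simplified layout: the `articles` and `sections` table names, the two metadata columns and the regex sentence splitter are illustrative choices, and a plain dict stands in for the per-article JSON files.

```python
import csv
import io
import re
import sqlite3

def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (a stand-in for a real NLP tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def load(metadata_csv, full_text, db):
    """Load article metadata and per-sentence text into SQLite.

    metadata_csv: file-like object with columns id,title (hypothetical subset).
    full_text: dict mapping article id -> body text (stands in for the JSON files).
    """
    db.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE sections (article TEXT, position INTEGER, text TEXT)")
    for row in csv.DictReader(metadata_csv):
        db.execute("INSERT INTO articles VALUES (?, ?)", (row["id"], row["title"]))
        sentences = split_sentences(full_text.get(row["id"], ""))
        for position, sentence in enumerate(sentences):
            db.execute("INSERT INTO sections VALUES (?, ?, ?)", (row["id"], position, sentence))
    db.commit()

# Made-up input, for illustration only
metadata = io.StringIO("id,title\na1,Example article\n")
texts = {"a1": "Patients were enrolled in March. Outcomes were recorded at 30 days."}
db = sqlite3.connect(":memory:")
load(metadata, texts, db)
sentence_count = db.execute("SELECT COUNT(*) FROM sections").fetchone()[0]
```

Storing one row per sentence makes it straightforward to compute an embedding per sentence later and to trace a match back to its article.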

All search and question answering was unsupervised, using fastText + BM25 for search and a BERT-based model for QA.
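
One plausible reading of "BM25 + fastText" is a sentence vector computed as the BM25-weighted average of its word vectors, so that rare, informative words dominate common ones. The sketch below works under that assumption; the tiny two-dimensional vectors stand in for real fastText vectors, and the `k1` and `b` values are common BM25 defaults rather than values taken from the project.

```python
import math
from collections import Counter

def bm25_weights(tokens, doc_freq, total_docs, avg_len, k1=1.2, b=0.75):
    """Per-token BM25 weight for one tokenized sentence."""
    weights = {}
    for token, tf in Counter(tokens).items():
        df = doc_freq.get(token, 0)
        idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
        weights[token] = idf * tf * (k1 + 1) / norm
    return weights

def embed(vectors, weights):
    """BM25-weighted average of word vectors; tokens without a vector are skipped."""
    known = [t for t in weights if t in vectors]
    total = sum(weights[t] for t in known)
    if not known or total == 0:
        return None
    dims = len(vectors[known[0]])
    return [sum(weights[t] * vectors[t][i] for t in known) / total for i in range(dims)]

# Toy corpus and toy word vectors, for illustration only
sentences = [["the", "virus", "spreads"], ["the", "patients", "recovered"], ["the", "virus", "mutates"]]
doc_freq = Counter(token for sentence in sentences for token in set(sentence))
avg_len = sum(len(s) for s in sentences) / len(sentences)
vectors = {"the": [1.0, 1.0], "virus": [1.0, 0.0], "spreads": [0.0, 1.0]}

weights = bm25_weights(sentences[0], doc_freq, len(sentences), avg_len)
sentence_vector = embed(vectors, weights)
```

In this toy corpus "the" appears in every sentence and gets the smallest weight, so the resulting vector leans toward "spreads", the rarest word.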

An important concept discovered early on was the importance of study design. Not all articles are considered equal, and the medical community puts more weight behind certain study types. For example, studies with a larger sample size (i.e. more patients) or systematic reviews are held in higher regard than mathematical modeling/forecasting articles. A Random Forest classifier was built to determine each article's study design based on the word tokens and named entities within it.
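
A study design classifier along these lines could be sketched with scikit-learn, assuming it is installed. The training snippets and the two labels are made up, only word-token counts are used as features (the named-entity features mentioned above are left out), and the hyperparameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training snippets and labels, for illustration only
texts = [
    "we searched databases and performed a systematic review and meta-analysis",
    "a systematic review of published randomized trials was conducted",
    "we simulate transmission with a compartmental model to forecast cases",
    "the model forecast predicts epidemic growth under simulated scenarios",
]
labels = ["systematic review", "systematic review", "modeling", "modeling"]

# Bag-of-words token counts feeding a Random Forest
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(texts, labels)

prediction = model.predict(["we performed a systematic review of published studies"])[0]
```

A real classifier would need far more labeled articles per design type than this toy set; the pipeline shape is the point here.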

The CORD-19 dataset is dynamic and growing. I saw that almost everyone, including myself, took the first approach of building a search index that allowed finding documents based on matching tokens. Summarization was also seen as a way to add value, the thought being to show researchers all data on a particular term or concept. But, building on the previous point about study design, not all documents are of equal value. Labeling documents with a study design proved to be greatly beneficial in helping researchers decide which documents to review, compared with just showing documents with matching tokens.
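
To show how a study design label can change what a researcher sees first, here is a small sketch that orders results by design before match score. The priority values are an assumption for the example; the only ordering taken from the text is that systematic reviews rank above modeling articles.

```python
# Lower number = shown first. Hypothetical ordering for illustration.
DESIGN_PRIORITY = {"systematic review": 0, "modeling": 2}
DEFAULT_PRIORITY = 1  # any design not listed above

def rank(results):
    """Sort by study design priority first, then by descending match score."""
    return sorted(
        results,
        key=lambda r: (DESIGN_PRIORITY.get(r["design"], DEFAULT_PRIORITY), -r["score"]),
    )

# Made-up search results, for illustration only
results = [
    {"title": "Forecast of case counts", "design": "modeling", "score": 0.92},
    {"title": "Review of risk factors", "design": "systematic review", "score": 0.81},
    {"title": "Cohort of 500 patients", "design": "cohort", "score": 0.85},
]
ordered = [r["title"] for r in rank(results)]
```

With a pure token-match ranking the forecast article would come first; with the design label it drops to the bottom despite its higher score.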

Additionally, where tokens show up in a document is important. Some articles reference a concept in the introduction or discussion sections without actually covering it. Most medical articles have methods and results sections, and matches in those sections are more important.
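
One simple way to act on this is to scale each match by the section it appears in. The weights below are illustrative values, not ones from the project, and `document_score` is a hypothetical helper.

```python
# Illustrative weights: matches in methods/results count fully, others are discounted
SECTION_WEIGHTS = {"methods": 1.0, "results": 1.0, "introduction": 0.5, "discussion": 0.5}
DEFAULT_WEIGHT = 0.75  # sections not listed above

def document_score(matches):
    """Best section-weighted match for one document.

    matches: list of (section_name, similarity) pairs.
    """
    return max(
        (SECTION_WEIGHTS.get(section.lower(), DEFAULT_WEIGHT) * score
         for section, score in matches),
        default=0.0,
    )

covers_topic = document_score([("Results", 0.80)])
mentions_topic = document_score([("Introduction", 0.90), ("Discussion", 0.70)])
```

Even though the second document has the higher raw similarity, the first one scores better because its match sits in the results section.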

I had little to no expectations entering this competition, so I wouldn’t say I was surprised by anything. It was great to see so many smart and capable people all working together to try to help in whatever way they could.

All of the work is driven by the Kaggle platform. The list of notebooks covers all the submissions for Round 1 and Round 2 of the CORD-19 challenge. All of the notebooks are in Python.

Sentence Embeddings Notebook

Round 1:

Round 2:

There is also a separate Python project on GitHub, cord19q. cord19q has the logic for ETL, building the embeddings index and running the custom BERT QA model.

The early days of the effort were spent on EDA and exchanging ideas with other members of the community. Before models could be built, we needed to understand the data, the strengths and weaknesses of the dataset, and what researchers were looking for from CORD-19. I was fortunate enough to find like-minded data scientists who were willing to roll up their sleeves and write code to help discover what we wanted from the data. It wasn’t until 1–2 months into the effort that machine learning models and feature engineering were even considered. Most of Round 1 was focused on data extraction, parsing, requirements analysis and building a system to search for documents.

The work of Round 1 led to discovering that building summary tables with extracted answers to a series of questions would be most beneficial to the medical community. Fortunately, a team of medical experts manually curated a dataset that could be used to help build machine learning models. In Round 2, a BERT-based QA model was developed to extract answers from medical documents. This required building a custom question-answer dataset to teach a model how to answer medical questions. In Round 2, the majority of time was spent on building this model.
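
The format of the custom question-answer dataset is not described here. As an assumption, the sketch below builds one record in the widely used SQuAD-style layout, where the answer is a span of the context located by its character offset. The context, question and answer are made up.

```python
def make_example(example_id, context, question, answer):
    """Build one SQuAD-style record; answer_start is the character offset of the answer."""
    start = context.find(answer)
    if start < 0:
        raise ValueError("answer must appear verbatim in the context")
    return {
        "id": example_id,
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }

# Made-up record, for illustration only
example = make_example(
    "q1",
    "The study enrolled 120 adults between January and March.",
    "What is the sample size?",
    "120 adults",
)
start = example["answers"]["answer_start"][0]
```

Requiring the answer to appear verbatim in the context is what makes the task extractive: the model learns to point at a span rather than generate free text.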

All of the submissions were built on the Kaggle platform as CPU Notebooks. Development was done on a quad-core laptop with an 8GB GPU and 32GB of RAM. The fastText embeddings, study design models, and custom BERT QA models were built offline using this laptop.

Given that the data is continually updated, there is a recurring job that runs on each update (using kernelpipes). It takes about 6 hours to fully ETL, build the models and run all the solution notebooks on Kaggle.

Words of wisdom

This challenge was unique for a number of reasons. First, there was no known answer. This was a real-world problem like you would encounter in industry, where someone has a large dataset and isn’t sure what to do with it. This kind of problem requires an iterative process of exploring the data, sharing feedback with experts and building a workflow to solve the problem. The data scientists involved in this effort were extremely fortunate to be guided by Savanna Reid, an epidemiologist volunteering her time. We were also fortunate that Kaggle was heavily involved, with Paul Mooney and Anthony Goldbloom helping guide the effort. I was fortunate to be able to bounce ideas off other data scientists working on the effort, specifically Ken Miller and Andy White.

It was an honor to volunteer, and while I’ll never know the true impact these contributions made, I like to think they did a small part to help.

Entering the competition, my first instinct was to use sentence embeddings since I had an existing similar project. If starting over, I would have explored different methods to search the documents to see if any other methods performed better.

Much of your time will be spent on data preparation and feature engineering. The best way to learn data science is to solve a problem you’re interested in. Sports analytics is how I got started in data science. This was an engaging way for me to stay focused not only on the algorithms but also on the data itself.

Additional Medium posts by David Mezzetti:

Combating COVID-19 with Data Science
Building Analysis Pipelines with Kaggle



Kaggle Team

Official authors of Kaggle winner’s interviews + more! Kaggle is the world’s largest community of data scientists. Join us at kaggle.com.