With sports (and everything else) cancelled, this data scientist decided to take on COVID-19 | A Winner’s Interview with David Mezzetti

When his hobbies went on hiatus, Kaggler David Mezzetti made fighting COVID-19 his mission.

Kaggle Team
Jul 29, 2020 · 8 min read
Photo by Clay Banks on Unsplash

Let’s learn about David!


David, what can you tell us about your background?

David: My technical background is in ETL, data extraction, data engineering and data analytics. I spent over a decade of my career developing large-scale data pipelines to transform both structured and unstructured data into formats that can be utilized in downstream systems. I also have experience in building large-scale distributed text search and Natural Language Processing (NLP) systems.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I’ve worked in the data analytics space for 15+ years but did not have prior knowledge of medical documents or the medical industry.

How did you get started competing on Kaggle?

I’ve participated in a couple of March Madness competitions. I was looking forward to the 2020 tournament and had a model I was very excited about. The way the season went was perfect for the strengths of the model, but we’ll never know how it would have performed.

What made you decide to enter this competition?

When the 2020 March Madness competition was cancelled and COVID-19 was really starting to hit hard, I wanted to find a way to get involved and help. NeuML was working on a real-time sports event tracking application, neuspo, but sports, along with everything else, were being shut down and there were no sports to track.

Let’s get technical

Tell us about the overall architecture or approach to the problem.

The solution consisted of two main parts: a sentence-embeddings-based search index and a custom BERT QA model that extracts column-based answers, known as summary tables, for a specific list of questions.

Did any past research or previous competitions inform your approach?

Much of the search logic was based on a prior project, codequestion (https://github.com/neuml/codequestion). codequestion builds a sentence embeddings index over coding questions to match developers’ questions with previously asked questions and answers. Given that I already had that code base, I took that approach when starting with the CORD-19 dataset, and much of the code is still derived from codequestion today.
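To make the idea of matching a query against a sentence embeddings index concrete, here is a minimal sketch. It is not codequestion’s implementation: the `cosine` and `search` helpers and the three-dimensional toy vectors are hypothetical, and a real index would hold model-generated embeddings and likely use an approximate nearest-neighbor library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vector, limit=3):
    """Return (id, score) pairs for the entries most similar to the query."""
    scored = [(uid, cosine(vector, query_vector)) for uid, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy embeddings (assumption); each key stands for a previously indexed sentence
index = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.0, 1.0, 0.2],
    "q3": [0.7, 0.3, 0.1],
}

# The query vector would come from embedding the user's question the same way
results = search(index, [1.0, 0.0, 0.0], limit=2)
print([uid for uid, _ in results])
```

Because the query and the indexed sentences live in the same vector space, a match does not require the two to share exact tokens.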

What preprocessing and feature engineering did you do?

The CORD-19 dataset has a metadata CSV file with the full list of documents, along with the full text stored in separate JSON files. An ETL process was built to take the CSV, find the corresponding text articles and load the data into a SQLite database. The text is then broken down into sentences per document, and those sentences are mapped to sentence embeddings using a BM25 + fastText method described in this Medium article.
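The ETL step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pipeline: the `id` and `title` columns, the `articles` and `sections` tables, and the `load` and `split_sentences` helpers are all hypothetical, and the JSON layout is simplified to a `body_text` list of text blocks.

```python
import csv
import io
import re
import sqlite3

def split_sentences(text):
    """Naive sentence splitter; a real pipeline would use an NLP tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def load(metadata_file, fulltext, db):
    """Load metadata rows and their sentences into SQLite."""
    db.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE sections (article TEXT, position INTEGER, text TEXT)")
    for row in csv.DictReader(metadata_file):
        uid = row["id"]
        db.execute("INSERT INTO articles VALUES (?, ?)", (uid, row["title"]))
        # Articles without full text are stored with metadata only
        body = fulltext.get(uid, {}).get("body_text", [])
        text = " ".join(block["text"] for block in body)
        for position, sentence in enumerate(split_sentences(text)):
            db.execute("INSERT INTO sections VALUES (?, ?, ?)", (uid, position, sentence))
    db.commit()

# In-memory stand-ins for the metadata CSV and the parsed JSON files (assumptions)
metadata = io.StringIO("id,title\na1,Toy study\na2,No full text\n")
fulltext = {"a1": {"body_text": [{"text": "First finding. Second finding!"}, {"text": "Third one?"}]}}

db = sqlite3.connect(":memory:")
load(metadata, fulltext, db)
rows = db.execute("SELECT article, position, text FROM sections ORDER BY position").fetchall()
print(rows)
```

Storing one row per sentence keeps the later embedding step simple, since each row can be embedded and indexed independently.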

What supervised learning methods did you use?

All search and question-answering were unsupervised, using fastText + BM25 for search and a BERT-based model for QA.
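One common way to combine BM25 with word vectors is to average a sentence’s word vectors, weighting each token by its BM25 score so that rare, informative terms dominate. The sketch below illustrates that general idea and is not the author’s implementation: it uses only the IDF component of BM25, the `bm25_idf` and `embed` helpers are hypothetical, and the two-dimensional toy vectors stand in for fastText vectors.

```python
import math
from collections import Counter

def bm25_idf(documents):
    """BM25-style inverse document frequency for each token in the corpus."""
    n = len(documents)
    freq = Counter(token for tokens in documents for token in set(tokens))
    return {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in freq.items()}

def embed(tokens, vectors, idf):
    """Weighted average of word vectors, skipping tokens without a vector or weight."""
    known = [t for t in tokens if t in vectors and t in idf]
    total = sum(idf[t] for t in known)
    if not total:
        return None
    size = len(vectors[known[0]])
    return [sum(idf[t] * vectors[t][i] for t in known) / total for i in range(size)]

# Tokenized toy corpus and toy word vectors (assumptions)
documents = [
    ["the", "virus", "spreads"],
    ["the", "vaccine", "works"],
    ["the", "virus", "mutates"],
]
vectors = {"the": [1.0, 0.0], "virus": [0.0, 1.0]}

idf = bm25_idf(documents)
vector = embed(["the", "virus"], vectors, idf)
print(vector)
```

Here "the" appears in every document and gets a small weight, so the resulting sentence vector sits much closer to the vector for "virus".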

What was your most important insight into the data?

The CORD-19 dataset is dynamic and growing. I saw that almost everyone, myself included, first took the approach of building a search index that finds documents based on matching tokens. Summarization was also seen as a way to add value, the thought being to show researchers all data on a particular term or concept. Building on the previous point on study design, not all documents are of equal value. Labeling documents with a study design proved to be greatly beneficial in helping researchers review a document, versus just showing them documents with matching tokens.

Were you surprised by any of your findings?

I had little to no expectations entering this competition, so I wouldn’t say I was surprised by anything. It was great to see so many smart and capable people all working together to try to help in whatever way they could.

Which tools did you use?

All of the work is driven by the Kaggle platform. The list of notebooks covers all the submissions for Round 1 and Round 2 of the CORD-19 challenge. All of the notebooks are in Python.

How did you spend your time on this competition?

The early days of the effort were spent on EDA and exchanging ideas with other members of the community. Before models could be built, we needed to understand the data, the strengths and weaknesses of the dataset, and what researchers were looking for from CORD-19. I was fortunate enough to find like-minded data scientists who were willing to roll up their sleeves and write code to help discover what we wanted from the data. It wasn’t until 1–2 months into the effort that machine learning models and feature engineering were even considered. Most of Round 1 was focused on data extraction, parsing, requirements analysis and building a system to search for documents.

What does your hardware setup look like?

All of the submissions were built on the Kaggle platform as CPU Notebooks. Development was done on a quad-core laptop with an 8GB GPU and 32GB of RAM. The fastText embeddings, study design models, and custom BERT QA models were built offline using this laptop.

What was the run time for both training and prediction of your winning solution?

Given that the data is continually updated, there is a recurring job that runs on each update (using kernelpipes). It takes about 6 hours to fully ETL, build the models and run all the solution notebooks on Kaggle.

Words of wisdom

What have you taken away from this competition?

This challenge was unique for a number of reasons. First, there was no known answer. This was a real-world problem like you would encounter in industry, where someone has a large dataset and isn’t sure what to do with it. This kind of problem requires an iterative process of exploring the data, sharing feedback with experts and building a workflow to solve the problem. The data scientists involved in this effort were extremely fortunate to be guided by Savanna Reid, an epidemiologist volunteering her time. We were also fortunate that Kaggle was heavily involved, with Paul Mooney and Anthony Goldbloom helping guide the effort. I was fortunate to be able to bounce ideas off other data scientists working on the effort, specifically Ken Miller and Andy White.

Looking back, what would you do differently now?

Entering the competition, my first instinct was to use sentence embeddings since I had an existing similar project. If starting over, I would have explored different methods to search the documents to see if any other methods performed better.

Do you have any advice for those just getting started in data science?

Much of your time will be spent on data preparation and feature engineering. The best way to learn data science is to solve a problem you’re interested in. Sports analytics is how I got started in data science. It was an engaging way for me to stay focused not only on the algorithms but also on the data itself.
