Gaining a sense of control over the COVID-19 pandemic | A Winner’s Interview with Daniel Wolffram

Kaggle Team
Jul 23, 2020 · 8 min read

How one Kaggler took top marks across multiple Covid-related challenges.

Photo by Markus Spiske on Unsplash

Today we interview Daniel, whose notebooks earned him top marks in Kaggle’s CORD-19 challenges. Kaggle hosted multiple challenges built on the CORD-19 dataset, and Daniel won 1st place three times, including by a huge margin in the TREC-COVID challenge. (He scored 0.9, while 2nd place overall scored 0.75 and 2nd place on Kaggle scored 0.6.)

Let’s meet Daniel!

Daniel, tell us a bit about yourself.

As part of the Kaggle CORD-19 challenge, I developed discovid.ai, a search engine for COVID-19 literature. Right now, I’m working on the German COVID-19 forecast hub and writing my master’s thesis about building and evaluating forecast ensembles for COVID-19 death counts.

Well, it’s no surprise you took top marks in the CORD-19 Challenge! That’s quite relevant!

During my time as a student assistant, we also consulted for a company that works with a lot of text data. That’s where I gained my first experience in NLP and came across the idea of finding similar documents with the help of a topic model. At that time, our client wanted to stick with another approach, so I never really got to try out the LDA approach, but it always stayed in the back of my mind.

How did you get started competing on Kaggle?

What made you decide to enter this particular competition?

Moreover, when the competition was launched, Covid cases were climbing in Germany, where I live. The first protective measures to flatten the curve were taken here: all restaurants, shops (except supermarkets and drugstores) and leisure facilities were closed. My university was closed and all exams were cancelled. More shocking were the numbers from Italy and elsewhere. It was a very intimidating and uncertain atmosphere, so this challenge was actually a way to gain back some control by facing the crisis head-on and putting my skills to good use. I was aware that it might not have the biggest impact, but what kept me going was the thought that if even one medical researcher used my model and stumbled upon something useful, my efforts would already be worth it.

Let’s get technical

What preprocessing and feature engineering did you do?

For the topic model to work properly, it was also necessary to perform language detection and remove non-English documents.

All the details can be found in my preprocessing notebook: https://www.kaggle.com/danielwolffram/cord-19-create-dataframe.
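To make the language-filtering step concrete, here is a minimal sketch in Python. It is not the code from the notebook above: it swaps in a simple stopword-counting heuristic so that it runs with the standard library only, whereas a real pipeline would more likely rely on a dedicated language-detection library. The function names (`guess_language`, `keep_english`) and the tiny stopword lists are assumptions made for illustration.

```python
import re

# Tiny illustrative stopword lists (an assumption for this sketch).
# A production pipeline would use a proper language-detection library.
STOPWORDS = {
    "en": {"the", "and", "of", "in", "to", "is", "with", "for", "that", "was"},
    "de": {"der", "die", "und", "bei", "mit", "von", "eine", "ist", "werden", "zu"},
    "fr": {"de", "les", "des", "en", "une", "est", "dans", "du", "par", "un"},
    "es": {"de", "en", "el", "los", "que", "se", "con", "las", "por", "un"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Split into runs of letters (Unicode-aware, so accents are kept).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def keep_english(docs):
    """Keep only the documents guessed to be English."""
    return [doc for doc in docs if guess_language(doc) == "en"]
```

Calling `keep_english` on a list of abstracts would drop any text whose most frequent stopwords are German, French or Spanish. The heuristic is crude on very short texts, which is one reason a dedicated detector is the safer choice in practice.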

To further augment the data, I also searched each article for clinical trial IDs to link the document to the WHO International Clinical Trials Registry Platform (ICTRP). This required handcrafting several regular expressions; the details can be found in https://www.kaggle.com/danielwolffram/cord-19-match-clinical-trials.
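As a rough illustration of this kind of matching, the sketch below searches text for two well-known identifier formats: ClinicalTrials.gov IDs (`NCT` followed by eight digits) and ISRCTN IDs (`ISRCTN` followed by eight digits). These two patterns, the `TRIAL_ID_PATTERNS` dictionary and the `find_trial_ids` helper are assumptions chosen for the example; the expressions actually used are in the linked notebook and cover more registries.

```python
import re

# Illustrative patterns only (an assumption for this sketch); the linked
# notebook contains the expressions that were actually used.
TRIAL_ID_PATTERNS = {
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
}


def find_trial_ids(text):
    """Return a dict mapping registry name to the sorted unique IDs found."""
    found = {}
    for registry, pattern in TRIAL_ID_PATTERNS.items():
        ids = sorted(set(pattern.findall(text)))
        if ids:
            found[registry] = ids
    return found


# Placeholder IDs, not references to real trials.
sample = "Registered as NCT01234567 (see also ISRCTN12345678 and NCT01234567)."
print(find_trial_ids(sample))
```

The word boundaries (`\b`) keep the patterns from matching inside longer alphanumeric strings, and the set removes IDs that an article mentions more than once.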

What machine learning methods did you use?

On discovid.ai the topic model is now used to find related articles — the idea is that each article is composed of a set of underlying topics and if we find articles with a similar topic mixture or an overlap in topics, they might be interesting for the reader and could spark new insights.
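The idea of comparing topic mixtures can be sketched in a few lines of Python. The interview does not say which similarity measure discovid.ai uses, so the Jensen-Shannon divergence below is an assumption picked because it is a common way to compare probability distributions. The helper names and the toy mixtures are likewise hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def most_similar(query_id, mixtures, k=3):
    """Return the ids of the k documents whose topic mixture is closest."""
    query = mixtures[query_id]
    scored = [
        (js_divergence(query, mixture), doc_id)
        for doc_id, mixture in mixtures.items()
        if doc_id != query_id
    ]
    return [doc_id for _, doc_id in sorted(scored)[:k]]


# Toy topic mixtures over three topics (made-up numbers for illustration).
mixtures = {
    "paper_a": [0.7, 0.2, 0.1],
    "paper_b": [0.6, 0.3, 0.1],
    "paper_c": [0.1, 0.1, 0.8],
    "paper_d": [0.0, 0.0, 1.0],
}
print(most_similar("paper_a", mixtures, k=2))
```

A lower divergence means two articles draw on the topics in more similar proportions, so the closest articles are the natural candidates to show as related reading.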

Topic mixture of a selected paper and of a related article
Click through for interactivity: https://dwolffram.github.io/cord19_lda_topics/

Here you can explore 50 topics that our model found within the corpus — each topic is a distribution over words and each document can then be seen as a mixture of these topics.

What was your most important insight into the data?

  • Topic #46: der die und bei mit von eine ist werden zu für sind oder einer des den nicht das als nach zur auf durch auch ein
  • Topic #40: de les des en une est dans du par un ou sont pour plus au que avec chez sur d’une qui cas être pas ces
  • Topic #32: de en el los que se con las por un es para pacientes como más virus son tratamiento su infección puede ha casos enfermedad entre
  • Topic #7: un che con sono nel alla più ha tra gli degli come rischio ed pazienti nella nei osteonecrosis ad essere stato studio salute anche have

As you can see, there was one topic each for German, French, Spanish and Italian. To me this was very encouraging, because it demonstrates how powerful LDA is at learning hidden structures and that it actually learns something meaningful.

Were you surprised by any of your findings?

How did you spend your time on this competition?

What was the run time for both training and prediction of your winning solution?

Teamwork

How did your team form?

How did your team work together?

How did competing on a team help you succeed?

Just for fun

What is your dream job?

Words of wisdom

What have you taken away from this competition?

Do you have any advice for those just getting started in data science?

Also, I think it’s always important to first get a clear understanding of the problem you are trying to solve before throwing the most complex machine learning models at it.

You can find Daniel’s winning submission for CORD-19 here: https://www.kaggle.com/danielwolffram/discovid-ai-a-search-and-recommendation-engine

Kaggle Blog
