Our Favorite Kaggle Kernels using COVID-19 Datasets

Eugene Olkhov
CompassRed Data Blog
3 min read · Apr 1, 2020

A couple of weeks ago, a collaboration of research groups released a huge dataset of scholarly literature (CORD-19) that can be used to gain insights into the current COVID-19 pandemic. Additionally, Johns Hopkins maintains a dataset that tracks confirmed cases, fatalities, and recoveries daily. The data community was called on to help join the fight against the virus.

Two competitions were created on Kaggle, and many people started sharing their approaches to working with the datasets and the things they have uncovered. We at CompassRed have been fascinated by much of the work shared. Here are our favorites so far from each of the competitions.

1. COVID Global Forecast: SIR model + ML regressions

The first kernel, published by Patrick Sánchez, uses the Johns Hopkins dataset linked above.

This is a fantastic Python notebook that the author updates frequently. It contains a thorough exploration of how the virus is trending across different countries and provides a good overview of its recent history.

The author's first attempt at forecasting future cases uses an SIR (Susceptible, Infected, Recovered/Deceased) model. This is a well-known model in epidemiology that simulates how a virus will spread through a population given its contagion and recovery rates. The notebook covers the model in more depth, showing how to estimate those rates and build the model.
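To make the mechanics concrete, here is a minimal SIR sketch (not the author's code; the population size and rates below are illustrative placeholders) that integrates the model's three differential equations with SciPy:

```python
import numpy as np
from scipy.integrate import odeint

def sir_deriv(y, t, N, beta, gamma):
    """SIR derivatives: beta is the contagion rate, gamma the recovery rate."""
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return dS, dI, dR

N = 1_000_000            # population size (illustrative)
beta, gamma = 0.3, 0.1   # illustrative rates; the kernel fits these to real data
y0 = (N - 1, 1, 0)       # start with a single infection
t = np.linspace(0, 160, 160)

S, I, R = odeint(sir_deriv, y0, t, args=(N, beta, gamma)).T
print(f"Peak infections: {I.max():,.0f} on day {I.argmax()}")
```

The ratio beta/gamma is the basic reproduction number R₀; when it is above 1, the infected curve grows before burning out, which is exactly the hump these simulations produce.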

Following the SIR model, the author took a regression approach. Lag and date-based variables were created to turn this into a supervised time-series problem, and the data was enriched by joining in country-level information. While the early predictions were not bad for many countries, they degraded quickly the farther they stretched into the future. To remedy this, the author fed the early predictions back into the training data, which improved the results by a large margin.
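The general pattern looks something like the sketch below: a toy linear model on synthetic counts (not the author's actual features or learner), where lag features turn the series into a supervised problem and the recursive forecast feeds each prediction back in as an input to the next step, much like the kernel's trick of folding early predictions back into the training data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily case counts for one country (illustrative only).
cases = pd.Series([10, 14, 20, 28, 39, 55, 77, 108, 151, 211, 295, 413])

def make_lags(series, n_lags=3):
    """Turn a univariate series into a table of lag features plus a target."""
    df = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, n_lags + 1)})
    df["target"] = series
    return df.dropna()

train = make_lags(cases)
model = LinearRegression().fit(train.drop(columns="target"), train["target"])

# Forecast recursively: each prediction becomes a lag feature for the next day.
history = list(cases)
for _ in range(7):
    X_new = pd.DataFrame([{f"lag_{i}": history[-i] for i in range(1, 4)}])
    history.append(float(model.predict(X_new)[0]))

print(history[-7:])  # the 7-day recursive forecast
```

A real version would add the date-based features and country joins the author describes, and would likely swap the linear model for something more flexible.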

We highly recommend checking this kernel out!

2. CORD-19: Explore Drugs Being Developed

This second kernel used the literature dataset, attempting to uncover the effectiveness of drugs being used and developed to treat COVID-19 patients. The dataset consists of over 29,000 scholarly articles related to different coronaviruses; of these, approximately 14,000 include full text.

The key part of this analysis is that the authors used NLP (Natural Language Processing) techniques to uncover correlations between the molecular structures of different drugs. They were then able to connect to the public ChEMBL database to search for existing drugs similar to the ones mentioned in the current literature.
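The underlying comparison is typically done on molecular fingerprints. Here is a minimal sketch of that idea (not the authors' exact pipeline; the two drug structures are arbitrary examples) using RDKit's Morgan fingerprints and Tanimoto similarity:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Two illustrative drug structures as SMILES strings; in the kernel,
# candidate structures are mined from the paper texts with NLP.
aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

# Morgan (circular) fingerprints encode local structure as bit vectors.
fp_a = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(caffeine, 2, nBits=2048)

# Tanimoto similarity: the fraction of structural features shared (0 to 1).
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))
```

ChEMBL itself exposes a similarity search through its web services (for example via the chembl_webresource_client Python package), which is how a mined structure can be matched against its catalog of known drugs.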

This type of work is a great first step toward ensuring that researchers can quickly find key information on drugs being considered for treatment without having to spend hours combing through papers manually.

Bonus

While not a Kaggle kernel, an excellent resource for an overview of the literature dataset is David Robinson’s Screencast Series.

In the screencast, David shows how to ingest the data and conduct exploration in R. He also utilizes the scispaCy package, which provides models trained on medical terminology, making the NLP tasks he conducts more meaningful.
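For reference, core scispaCy usage looks like this minimal Python sketch (scispaCy itself is a Python library; the sentence below is an invented example, and en_core_sci_sm is scispaCy's small biomedical model, installed separately):

```python
import spacy  # pip install scispacy, plus the en_core_sci_sm model

# The biomedical model tags spans of medical terminology as entities,
# rather than the generic PERSON/ORG labels of spaCy's default models.
nlp = spacy.load("en_core_sci_sm")

text = ("Chloroquine and hydroxychloroquine have been evaluated as "
        "inhibitors of SARS-CoV-2 replication in vitro.")

for ent in nlp(text).ents:
    print(ent.text)
```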
