Building Analysis Pipelines with Kaggle

Do more than compete

David Mezzetti
Towards Data Science

--

Photo by Luke Chesser on Unsplash

Kaggle is one of the most popular places to get started with data science and machine learning. Most in the data science world have used or at least heard of it. Kaggle is well-known as a site that hosts machine learning competitions and while that is a big part of the platform, it can do much more.

This year with the COVID-19 Open Research Dataset (CORD-19), I had the chance to use the platform more consistently. Honestly, Jupyter notebooks and GUI-based development haven’t been my preferred approach (Vim is often good enough for me). But over the last few months, I’ve been impressed with the capabilities of the platform. This article gives an overview of Kaggle Notebooks and the Kaggle API, and demonstrates a way to build automated analysis pipelines.

Notebooks

Kaggle Notebooks

Kaggle Notebooks is a cloud-hosted Jupyter notebook environment. Notebooks can be built in Python or R and execute within Docker containers; we can think of each notebook as a bundle of logic. A notebook can contain all the logic for a data analysis project, or notebooks can be chained together to build modular components. Notebooks can be publicly shared or kept private.

Notebooks have access to multiple CPU cores and a healthy amount of RAM. Additionally, GPUs and TPUs can be added, which can accelerate the training of deep learning models. The resources available are extremely impressive for a free service; spinning up a comparable host on one of the big cloud providers comes at a sizeable cost.

Notebooks read data in a couple of different ways. The main way is through datasets. Anyone with an account can upload data and create their own datasets, and there are also a large number of publicly available datasets already on Kaggle. As with notebooks, datasets can be publicly shared or private. Notebooks can have one or many datasets as inputs. Additionally, the output of other notebooks can be used as input, allowing a chain of notebooks to be constructed.
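For example, inside a running notebook, attached datasets are mounted read-only under /kaggle/input and can be read like local files. A minimal sketch, using a hypothetical dataset slug and file name:

import pandas as pd

# Attached datasets appear under /kaggle/input/<dataset-slug>/
# "example-dataset" and "metadata.csv" are placeholders for a real attached dataset
df = pd.read_csv("/kaggle/input/example-dataset/metadata.csv")
print(df.shape)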

Creating a new notebook

Kaggle Notebook Copy

Blank notebooks can be created using the “New Notebook” button shown in the previous image. Once a notebook is created, an editor is available to build logic. This example copies an existing notebook in order to focus on ways to run notebooks.

We’ll use the CORD-19 Report Builder notebook. After following the referenced link, we can copy the notebook via the “Copy and Edit” button. This will create a notebook as <your Kaggle user name>/cord-19-report-builder. Please note that the following links show davidmezzetti as the user name; substitute your own Kaggle user name.

Executing a notebook

Kaggle Notebook Edit

Copying the notebook will bring us to the interface to edit the notebook. From this screen, we can also add/remove inputs, modify settings and save/run a notebook.

The example above shows importing a utility script from another notebook. This is a powerful feature that allows sharing functionality across notebooks, avoiding the need to copy/paste boilerplate code. The example has two input datasets and one notebook as an input.
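Once a utility script is attached, it can be imported by its module name like any local module. A quick sketch (the script and function names here are hypothetical, not the ones used in the report builder):

# "cord19utils" is a hypothetical utility script attached to this notebook;
# attached utility scripts can be imported by their module name
from cord19utils import build_report

build_report("/kaggle/input/example-dataset")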

Clicking the “Save Version” button will execute the logic in the notebook and save a new version. There are multiple options:

Kaggle Notebook Save

If the notebook has been fully run while editing, “Quick Save” works well. Otherwise “Save & Run All” should be used.

Kaggle API

If we only have a couple of notebooks to update occasionally as additional logic is added, the workflow described so far is often good enough.

For more complex cases or frequent use, Kaggle has a full-featured Python API available. The Kaggle API is available via PyPI:

pip install kaggle
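Before the CLI can be used, an API token needs to be configured. A minimal sketch of the typical setup, assuming a kaggle.json token has been downloaded from the Kaggle account page (the download path below is just an example):

# place the downloaded API token where the Kaggle CLI expects it
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json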

The API documentation covers how to set up authorization and create access keys in more detail. Once the API is set up, a simple example of running a notebook via the API is shown below:

kaggle kernels pull davidmezzetti/cord-19-report-builder -p cord-19-report-builder -m
kaggle kernels push -p cord-19-report-builder

This runs two commands. The first pulls the cord-19-report-builder notebook and stores it, along with its metadata, in a directory called cord-19-report-builder. The second runs the notebook.
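The status of the new run can also be checked from the command line with the kernels status subcommand:

kaggle kernels status davidmezzetti/cord-19-report-builder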

Once the commands above are run, we can go to Kaggle to monitor the progress of the job. The image below shows the versions screen, where we can see that a new version of the notebook is running. This screen can be brought up by clicking on the version text (highlighted in the right corner).

Kaggle Notebook New Version

Automated Pipelines

The Kaggle API is powerful and allows running notebooks outside the main web interface (along with many other features). Scripts can be built around it to enable more complex functionality and interactions with external processes.
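As a sketch of what such a script might look like, the snippet below pushes a notebook and then polls kaggle kernels status until the run finishes. The kernel path, directory and poll interval are placeholders, and the status parsing is intentionally simplistic:

import subprocess
import time

KERNEL = "davidmezzetti/cord-19-report-builder"  # placeholder kernel path

# Push the local copy of the notebook, which starts a new run on Kaggle
subprocess.run(["kaggle", "kernels", "push", "-p", "cord-19-report-builder"], check=True)

# Poll the latest run status until it reports complete or error
while True:
    status = subprocess.run(["kaggle", "kernels", "status", KERNEL],
                            capture_output=True, text=True).stdout.lower()
    if "complete" in status or "error" in status:
        break
    time.sleep(150)  # wait 2.5 minutes between checks

print("Run finished:", status.strip())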

With the CORD-19 dataset, the effort grew to the point where there were 10+ notebooks that needed to be refreshed each time new data came in. On top of that, the dataset eventually was updated every day, and the notebooks had dependencies on each other (i.e. one needed to run before another could start).

For this use case, it became apparent that full automation would be necessary. To enable automated pipelines, NeuML created the kernelpipes project.

kernelpipes can be installed via pip:

pip install git+https://github.com/neuml/kernelpipes

kernelpipes uses the Kaggle API to execute a series of notebooks sequentially or in parallel. Checks can be added so that a pipeline only runs when a source dataset has been updated. Additionally, pipelines have a built-in cron scheduling feature to enable continuous execution. The following is a simple example pipeline in YAML:

# Pipeline name                       
name: pipeline
# Pipeline execution steps
steps:
- kernel: davidmezzetti/cord-19-report-builder
- status: 2.5m

Assuming the above content is saved in a file named pipeline.yml, it can be run as follows:

pipeline.yml execution

This simple pipeline executes a notebook and checks for completion status every 2.5 minutes. Once the kernel is complete, the process will exit.

Basic pipeline configuration

name

name: <pipeline name>

Required field, names the pipeline

schedule

schedule: <cron string>

Optional field to enable running jobs through a scheduler. System cron can be used instead, depending on preference. One advantage of the internal scheduler over system cron is that new jobs won’t be spawned while a prior job is still running. For example, if a job is scheduled to run every hour and a run takes 1.5 hours, the internal scheduler will skip the second run and start again on the third hour.
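For example, a schedule that runs the pipeline at the top of every hour would look like this (standard cron syntax):

# Run at minute 0 of every hour
schedule: "0 * * * *"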

Steps

check

check: /kaggle/dataset/path

Allows conditionally running a pipeline based on dataset update status. Retrieves dataset metadata and compares the latest version against the last run version and only allows processing to proceed if the dataset has been updated. If there is no local metadata for the dataset, the run will proceed.
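As a sketch, placing a check step ahead of the kernel steps makes the whole pipeline conditional on a dataset update (the paths below are taken from the CORD-19 example later in this article):

steps:
- check: allen-institute-for-ai/CORD-19-research-challenge
- kernel: davidmezzetti/cord-19-report-builder
- status: 2.5m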

kernel

kernel: /kaggle/kernel/path

Runs the kernel specified at /kaggle/kernel/path.

status

status: <seconds|s|m|h>

Checks the status of preceding kernel steps at the specified interval.

Example durations: 10 for 10 seconds, 30s for 30 seconds, 1m for 1 minute and 1h for 1 hour.

More Complex Example

To give an idea of a complex use case, below is a full pipeline used for processing the CORD-19 dataset.

# Pipeline name
name: CORD-19 Pipeline
# Schedule job to run @ 12am, 10am, 3pm local time
schedule: "0 0,10,15 * * *"
# Pipeline execution steps
steps:
- check: allen-institute-for-ai/CORD-19-research-challenge
- kernel: davidmezzetti/cord-19-article-entry-dates
- status: 1m
- kernel: davidmezzetti/cord-19-analysis-with-sentence-embeddings
- status: 15m
- kernel: davidmezzetti/cord-19-population
- kernel: davidmezzetti/cord-19-relevant-factors
- kernel: davidmezzetti/cord-19-patient-descriptions
- kernel: davidmezzetti/cord-19-models-and-open-questions
- kernel: davidmezzetti/cord-19-materials
- kernel: davidmezzetti/cord-19-diagnostics
- kernel: davidmezzetti/cord-19-therapeutics
- kernel: davidmezzetti/cord-19-risk-factors
- status: 2.5m
- kernel: davidmezzetti/cord-19-task-csv-exports
- kernel: davidmezzetti/cord-19-study-metadata-export
- kernel: davidmezzetti/cord-19-most-influential-papers
- kernel: davidmezzetti/cord-19-report-builder
- kernel: davidmezzetti/cord-19-forecasting-articles
- kernel: davidmezzetti/cord-19-mice-trials
- kernel: davidmezzetti/cord-19-bcg-vaccine
- kernel: davidmezzetti/cord-19-digital-contact-tracing-privacy
- status: 2.5m

The example above runs 3 times a day. Before execution, it compares the current version of the dataset to the version from the previous run; if it’s unchanged, the process exits. Otherwise, notebooks are started until a status step is reached. At that point, kernelpipes waits for all running notebooks to complete before continuing.

In the configuration above, cord-19-article-entry-dates starts and kernelpipes checks every minute until it’s complete, then starts the cord-19-analysis notebook and checks for completion every 15 minutes. Once that is complete, the next series of notebooks is started in parallel, kernelpipes waits for all of them to complete, and so on.

This pipeline refreshes each time the CORD-19 dataset updates, without any user action. It effectively enables a series of “living” notebooks that are continually updated as new COVID-19 articles are added.

Conclusion

The Kaggle platform brings a lot to the table in a number of different areas; this article just scratched the surface (micro-courses also look great). Overall, I’ve been very impressed with the suite of capabilities and was able to engineer a complex, fully automated data analysis pipeline. Keep these features in mind when building with Kaggle in the future!

--

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.