How to Build Machine Learning Pipelines with Airflow & Papermill

Learn to scale your machine learning workflows at will.

Published in

AI³ | Theory, Practice, Business

5 min readApr 7, 2020

I work at Yodo1, a leading mobile game platform company where we value streamlined efficiency and achieve it through automation. In this post, I am going to introduce two tools I have used on the job that can help you build efficient and scalable machine learning pipelines.

1. Papermill

We all love Jupyter Notebooks — interactive programming is fun and powerful — but it is difficult to use on its own in production. Stuffing everything into a single, massive notebook or running multiple notebooks with hacky shell scripts are not scalable practices.

However, we can refactor our notebooks with the help of Papermill, a tool that can parameterize the execution of notebooks.

To get started, simply create a cell at the top and add the parameters tag; then you can pass the parameters as you run the notebook with Papermill.

papermill local/input.ipynb s3://bkt/output.ipynb -p alpha 0.6 -p l1_ratio 0.1

As you can see, neither the input nor the output has to be local. Additionally, you can provide a cloud storage location like AWS S3 or Google Cloud Storage, which is super convenient.

And of course, you can run notebooks like we do at Yodo1, by invoking execution with code:

import papermill as pmpm.execute_notebook(input_path=’yodo1_games.ipynb’,output_path=’/data/yodo1_games_output.ipynb’,parameters=params,progress_bar=True, report_mode=True)

Sure, this code is neat, but how can we run it alongside scrapers and data cleaning jobs? Should we add another shell script to the mix?

Nah. We can use a scheduler for that.

2. AirFlow

Apache Airflow is an open-source workflow management platform. It allows you to define workflows using DAGs.

In Airflow, a DAG — or Directed Acyclic Graph — is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, where its structure (including tasks and dependencies) are represented as code.

A DAG represents a workflow and is composed via one or many AirFlow Operators. There are lots of predefined operators from which to choose; alternatively you can create your own custom operator.

Let’s look at a DAG in action. Here is a simple example to demonstrate how to construct one with operators.

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

args = {
    'owner': 'airflow',
    'start_date': days_ago(2),
}

dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    dagrun_timeout=timedelta(minutes=60),
    tags=['example']
)

run_this_last = DummyOperator(
    task_id='run_this_last',
    dag=dag,
)

# [START howto_operator_bash]
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    dag=dag,
)

run_this >> run_this_last

for i in range(3):
    task = BashOperator(
        task_id='runme_' + str(i),
        bash_command='echo "{{ task_instance_key_str }}" && sleep 1',
        dag=dag,
    )
    task >> run_this

# [START howto_operator_bash_template]
also_run_this = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag,
)
# [END howto_operator_bash_template]
also_run_this >> run_this_last

if __name__ == "__main__":
    dag.cli()

The code is pretty straightforward. After we define the DAG and its operators, DAG dependencies are declared via the >> symbol. Then, you can use the Xcom interface to exchange messages.

With the abstraction of DAGs, you can build complicated workflows and you don’t have to worry about parallelism and scheduling. AirFlow takes care of that.

AirFlow Kubernetes

In production, we run AirFlow in Kubernetes clusters. This can easily be installed with helm. To install with customizations, you have to provide your own values.yaml, like this:

helm install — namespace “airflow” “airflow” stable/airflow.

Default AirFlow Docker images do not come with database libraries; however, they are necessary for dealing with databases in AirFlows DAGs. Therefore, you have to build your own custom image like this one.

Another thing worth mentioning is the AirFlow Kubernetes Operator. Kubernetes is able to abstract away the infrastructure and scaling problems on its own. The AirFlow Kubernetes Operator makes AirFlow scheduling more dynamic: you can run workflows and scale resources on the fly. It’s like adding a jet engine to the falcon.

KubernetesPodOperator(namespace='default',
                      image="python:3.6",
                      cmds=["python", "-c"],
                      arguments=["print('hello world')"],
                      labels={"foo": "bar"},
                      name="passing-test",
                      task_id="passing-task",
                      get_logs=True,
                      dag=dag)

This bridges the gap between task scheduling and resource management. Resources are released after the task is done, which is pretty important since machine learning tasks tend to consume a lot of resources.

Check out this excellent insightful article by the guys over at Bloomberg, who created and contributed the Kubernetes operator to the AirFlow codebase.

AirFlow Papermill

There is a predefined Papermill operator in AirFlow. Unfortunately, it is buggy and has not been fixed in current stable AirFlow versions (1.x). However, you can wrap your Papermill task within a Python or Bash operator instead.

AirFlow Pitfalls

It’s not always rainbows and butterflies, at least not when dealing with dynamic languages like Python. Every choice comes with a tradeoff. Unfortunately, some tradeoffs are not well documented.

Before potentially hitting a wall with some unfathomably weird issues, check out common pitfalls on the official AirFlow wiki. It will save you time in the long run. (I wish I had done so before starting my project.)

In Summary

I’ve introduced how to use AirFlow and Papermill to facilitate machine learning pipelines. With the help of these tools, you can build robust and scalable machine learning workflows.

One more thing worth mentioning that can supercharge your machine learning pipeline is Weights & Biases, a platform with features like experiment tracking, hyperparameter optimization, and model management. It’s not fully open-source, but the free version offers a lot for tuning machine learning models.

In the next post, I will get back to my Anomaly Detection series. To keep this post concise, I’ve omitted quite a few background details. Please leave a message if you have any questions.

Thanks for reading! If you enjoyed this article, please hit the clap button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.

Feel free to share your questions and comments here, and follow me so you don’t miss the latest content!