Airflow DAG Deployment With S3

Tomas
4 min read · Jul 4, 2022

Overview

Scheduling and executing tasks with Airflow is done through Python code. To get that code to the Airflow execution environment (webserver/scheduler/workers), it needs to be on a filesystem those components can read from.

When running Airflow on Kubernetes, two common ways of doing this are baking the DAG code into a Docker image or mounting it from a local filesystem.

This is a quick guide to a third option: deploying the DAGs to S3 and having Airflow pick them up from there.

Getting Files onto S3

A quick and easy way to get the DAG files onto S3 is to run the AWS CLI as part of a CI/CD pipeline:

aws s3 sync --delete /repo/dags s3://airflow/dags
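
As an example, here's a minimal sketch of a CI step that runs that command, assuming GitHub Actions with long-lived AWS credentials stored as repository secrets; the workflow name, region, bucket, and secret names are placeholders, and any CI system that can run the AWS CLI works the same way.

```yaml
# .github/workflows/deploy-dags.yml (hypothetical; bucket, region, and secret names are placeholders)
name: deploy-dags
on:
  push:
    branches: [main]
jobs:
  sync-dags:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # Mirror the repo's dags/ directory to S3, removing files that were deleted from the repo
      - run: aws s3 sync --delete dags/ s3://airflow/dags
```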

From S3 to Airflow

This part is a little trickier, since each of the three Airflow components needs access to the DAG files.

Workers

Airflow Kubernetes workers are the simplest to deal with because updates aren’t a concern. When the pod starts it needs the latest DAG files to begin executing its task, but once it is up and running it doesn’t need them again.

Getting the files from S3 just means running an `aws s3 cp --recursive` before the worker starts executing.

Here’s a snippet that does that as an init container in the worker pod template:
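
The snippet below is a minimal sketch rather than a drop-in definition: it assumes the KubernetesExecutor pod template format, an `emptyDir` volume shared between the init container and the worker container, and placeholder image, bucket, and path names; AWS credentials are assumed to come from the node role or IRSA.

```yaml
# pod_template_file sketch for KubernetesExecutor workers (images, bucket, and paths are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker
spec:
  initContainers:
    - name: fetch-dags
      image: amazon/aws-cli:2.7.0        # any image that ships the AWS CLI works
      # Pull the full DAG folder down once before the worker starts
      args: ["s3", "cp", "s3://airflow/dags", "/mnt/dags", "--recursive"]
      volumeMounts:
        - name: dags
          mountPath: /mnt/dags
  containers:
    - name: base
      image: apache/airflow:2.3.2
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags   # must match the dags_folder Airflow is configured with
  volumes:
    - name: dags
      emptyDir: {}
```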

Scheduler

For the Airflow scheduler, the same init container can be used when the scheduler pod comes up.

To handle updating the DAG files when there is a change, use a sidecar container to periodically run the following:

aws s3 sync --exact-timestamps --delete s3://airflow/dags /mnt/dags

It’s important to include the `--exact-timestamps` flag. Without it, the sync only looks at file sizes and can end up ignoring some updates.

Here’s an abbreviated sidecar definition for the scheduler:
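
As with the worker snippet, this is a minimal sketch rather than the exact definition: it assumes the same shared `dags` volume, an image that ships the AWS CLI, and a placeholder 60-second sync interval.

```yaml
# Sidecar for the scheduler pod (image, bucket, mount path, and interval are placeholders)
containers:
  - name: dag-sync
    image: amazon/aws-cli:2.7.0
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Re-sync the DAGs from S3 every 60 seconds
        while true; do
          aws s3 sync --exact-timestamps --delete s3://airflow/dags /mnt/dags
          sleep 60
        done
    volumeMounts:
      - name: dags
        mountPath: /mnt/dags
```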

Even though the init container and the sidecar do the same thing, it helps to include the init container so the scheduler pod doesn’t start running until all the DAG files have been downloaded.

Webservers

With DAG serialization (Airflow v1.10.10+), the webserver pulls the serialized DAGs from the metadata database instead, so it no longer needs the DAG files at all.
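
On the 1.10.x releases where serialization is opt-in it has to be switched on explicitly; here's a minimal sketch using Airflow's environment-variable config convention (this applies to 1.10.10+, and the setting goes away in Airflow 2, where serialization is always on):

```yaml
# Webserver/scheduler container env for Airflow 1.10.10+ (assumes serialization isn't already enabled)
env:
  - name: AIRFLOW__CORE__STORE_SERIALIZED_DAGS
    value: "True"
```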

Pre-1.10.10, the same init and sidecar containers from the scheduler can be used for the webserver pods.

Costs

- S3 API: this is usually the bulk of the cost, with most of it coming from the List/Get calls the Airflow workers make when they spin up and pull down the files.
- S3 storage and transfer: should be minimal, since there’s only a set of Python files to store and copy.

A couple options for minimizing the costs further:
- Bake the infrequently changing code (utils, custom operators) into a Docker image, then only have DAG and task definitions deployed through S3.
- Dynamically figure out which files each worker needs based on the task it has been assigned.

Alternatives

Bake DAGs into a Docker image
This has the benefit of tidying things up into a single artifact: all of the DAGs and their dependencies are in a self-contained image. A couple of drawbacks are longer deploy times to build and upload the image, as well as having to roll the scheduler/webserver pods to get the changes deployed.

Git Sync and Shared Filesystems
This is one of the other options supported by the Airflow Helm chart. In this setup, git-sync is used as an init container to pull the current DAG definitions from a git repo. As the git history grows and the number of Airflow workers increases, this can start to strain GitHub/GitLab in the form of API throttling and longer start-up times.

One workaround is to use a shared filesystem on Kubernetes, such as AWS EFS or GCP Filestore, and run git-sync in only one place. This prevents API throttling and ensures every component sees the same files on disk.

Some of the drawbacks with a shared filesystem solution:
- If one isn’t already in place, a new Kubernetes provisioner needs to be set up and deployed.
- Git(Hub/Lab) permissions need to be managed in addition to AWS permissions.
- Git(Hub/Lab) API throttling and shared-filesystem throttling both need to be watched for.

S3 Fuse
It’s possible to mount the S3 DAGs on the Airflow pods through a FUSE filesystem. In practice it isn’t a good fit for this use case because of performance and Airflow’s disk access patterns: it was taking over a minute for the scheduler to parse/load all the DAG files, and workers were similarly slow to start executing.

Wrap up

Hope this helps and gives everyone yet another option for deploying their Airflow DAGs.
