Orchestrating Airflow tasks with Docker Swarm

Scaling your Cron jobs to multiple nodes

Akshesh Doshi
Agoda Engineering & Design
5 min read · Oct 30, 2019


It’s quite common for the word “Cron” to pop up in a programmer’s mind when something needs to be run regularly. Whether it is daily report generation, a cleanup script or a trigger for data flow in your pipeline — add a line to the crontab file and voila, your script will now run every day at 4 PM!
While maintaining the data pipelines at Agoda, we realised that as the number of scheduled tasks grows, crons become unmanageable for several reasons.

WHAT IS THE PROBLEM?

  1. Single point of failure: Crons often end up on some arbitrary application server just because the developer happened to like that server. If that server goes down, your cron scripts go down along with the application (for no good reason). It is simply not that random server’s responsibility to schedule your scripts.
  2. Cron failures are not easy to monitor/debug: Cron has no good mechanism for failure alerting. And even if you somehow find out that your script has been failing for the past month, there is no easy way to debug it (yes, even I hate digging through /var/mail!).
  3. Limited by a single node’s resources: Crons aren’t scalable either. You cannot run tasks simultaneously if together they need more resources than that single node provides; there is no way to scale your scripts across multiple nodes.

HOW DO WE SOLVE IT?

Solution for managing Crons — Airflow

Task scheduling has always been an interesting problem in Computer Science, and many tools have emerged to solve it. Among them, Airflow has become particularly popular because it lets programmers define these tasks as (Python) code and instantiate them dynamically. It is also packed with useful features like automatic retries on failure, authentication and integration with various other tools.
Airflow also ships with a web portal with a beautiful UI, which makes monitoring and debugging very easy. (This solves problem #2.)

Airflow management portal

Solution for scaling tasks — Container Orchestration

Soon after Docker revolutionised the way software is packaged, orchestration tools like Docker Swarm, Kubernetes and Mesosphere became developers’ favourite way to scale applications. They keep an application highly available and deploy it wherever resources are free in your “pool” of servers. So if one application frees up resources on a server in the pool, another application can reuse them, i.e. more work with fewer servers! :) (This solves problem #3.)

With orchestration tools making scaling so easy, it is quite natural to scale Airflow tasks (a.k.a. your cron jobs) with them by running each task as a container.

Solution for high availability — Airflow+Container Orchestration

As mentioned earlier, one of Airflow’s features is automatic retries on failure. This means that if a job does not complete successfully, Airflow will run it again (up to the number of retries you have configured). Coupled with orchestration, this provides resilience against unexpected node failures. (This solves problem #1.) The diagram below illustrates how:

Airflow + Orchestration
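
Retries are configured per task in the DAG code itself; a minimal sketch of the relevant arguments (the values below are illustrative assumptions, not Airflow defaults):

    from datetime import timedelta

    # Illustrative retry settings: retry a failed task up to 3 times,
    # waiting 5 minutes between attempts. These are typically passed to
    # an operator directly or via a DAG's default_args.
    retry_settings = {
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
    }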

THIS IS AWESOME — HOW CAN I DO THIS?

Before we dive into the code, there are a few terms we should familiarise ourselves with:

Airflow operators

Airflow executes its tasks via “operators”, which tell Airflow what has to be executed and how. For instance, the PythonOperator lets you run Python functions on your desired schedule, the BashOperator lets you execute bash commands, and so on.
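
As a quick, hedged illustration (the DAG id, schedule and callable below are made up for this example; the import paths are the Airflow 2.x ones), a BashOperator task and a PythonOperator task look roughly like this:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator      # Airflow 2.x import path
    from airflow.operators.python import PythonOperator  # Airflow 2.x import path


    def generate_report():
        # Stand-in for whatever your scheduled Python function should do.
        print("report generated")


    with DAG(
        "operator_examples",              # illustrative DAG id
        start_date=datetime(2019, 10, 1),
        schedule_interval="0 16 * * *",   # every day at 4 PM
    ) as dag:
        # Run a shell command on the worker node.
        cleanup = BashOperator(task_id="cleanup", bash_command="echo 'cleaning up'")
        # Run a Python callable on the worker node.
        report = PythonOperator(task_id="report", python_callable=generate_report)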

Docker Swarm (orchestration tools)

In my team, we use Docker Swarm to manage our cluster of nodes and orchestrate our containers on them. Docker Swarm (or simply Swarm) is Docker’s native clustering engine. It is an open-source container orchestration platform popular for scenarios which require fast deployments and simplicity.

Besides Docker Swarm, Airflow also supports Kubernetes and Mesos in a very similar fashion.

DockerSwarmOperator

To achieve the capabilities mentioned in the previous section we wrote an Airflow operator for Docker Swarm, named the DockerSwarmOperator. This operator enables Airflow to communicate with Docker Swarm to run containers.

For those curious about the operator’s internals: in swarm mode, Docker runs containers as part of Docker services. This operator therefore starts a Docker service with a single container (replica) based on the given Docker image and runs your desired command in that container. Once the command finishes, Docker removes the container. The service itself can be cleaned up automatically as well, by setting the operator’s auto_remove parameter. In effect, the operator executes a command (runs your application) as an ephemeral Docker Swarm service.
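
For illustration only, this is roughly the flow, sketched with the high-level Docker SDK for Python (the operator’s actual implementation uses Docker’s lower-level API client; the image, command and service name here are assumptions):

    import docker

    client = docker.from_env()

    # Start a one-replica Swarm service that runs the desired command once
    # (restart condition "none" keeps Swarm from rescheduling it after it exits).
    service = client.services.create(
        image="alpine:latest",        # illustrative image
        command="sleep 10",           # illustrative command
        name="airflow-adhoc-task",    # illustrative service name
        restart_policy=docker.types.RestartPolicy(condition="none"),
    )

    # ... poll the service's single task until it reaches a terminal state ...

    service.remove()                  # the cleanup that auto_remove takes care of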

The operator (now available in Airflow’s master branch) will be released as part of v2.0, and it is the one we refer to in the examples below.

USAGE CODE SNIPPET

The following code lets you run your desired container on Docker Swarm via Airflow (and, of course, schedule it):

Airflow DAG code to run containers in Docker Swarm
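
A minimal sketch of such a DAG (the import path is the Airflow 2.x Docker-provider one; the DAG id, image, alert address and schedule are illustrative assumptions):

    from datetime import datetime

    from airflow import DAG
    # Airflow 2.x import path; in the 1.10.x contrib tree the module name differs.
    from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator

    # 1. Configuration: alerting and scheduling defaults shared by the DAG's tasks.
    default_args = {
        "owner": "airflow",
        "depends_on_past": False,
        "start_date": datetime(2019, 10, 1),
        "email": ["alerts@example.com"],   # illustrative alert address
        "email_on_failure": True,
    }

    # 2. Create a DAG: the entity that knows when to trigger the task.
    dag = DAG(
        "docker_swarm_sample",             # illustrative DAG id
        default_args=default_args,
        schedule_interval="0 16 * * *",    # every day at 4 PM
    )

    # 3. Create the task: ask Docker Swarm (via the local Docker daemon) to run
    #    a container from the given image and have it sleep for 10 seconds.
    t1 = DockerSwarmOperator(
        task_id="sleep_in_swarm",
        image="alpine:latest",             # illustrative image
        command="sleep 10",
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,                  # remove the ephemeral service afterwards
        dag=dag,
    )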

Here is what the above code is doing:
  1. Configuration — Specify the configuration, such as whether you want to send alerts and what the task schedule is.
  2. Create a DAG — The most important thing about the DAG here is that it is the entity which knows when to trigger your task (i.e. the schedule). You can go through the official documentation to learn more about DAGs.
  3. Create the task — This is the part which makes Airflow say “Please run task t1” to Docker Swarm via the DockerSwarmOperator. Here the task is to sleep for 10 seconds.

And that’s it. Now you have scalable, manageable Cron jobs!

BONUS TIP — DEPLOY AIRFLOW VIA THE ORCHESTRATOR!

Looking carefully, there is still a single point of failure in this system: what if the node running Airflow itself goes down?
To make Airflow highly available (HA), we simply ran Airflow in a Docker container on (the same) Swarm cluster, so that if the Airflow node goes down, Swarm restarts Airflow on another node. This is what it looks like:

Orchestrating Airflow via Docker Swarm for HA
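
In Swarm terms, this just means running Airflow itself as a replicated service. In practice you would typically use docker stack deploy with a compose file, but a hedged Docker SDK sketch (the image tag and service name are assumptions, and the scheduler and metadata database would be additional services) conveys the idea:

    import docker

    client = docker.from_env()

    # Run the Airflow webserver as a one-replica Swarm service. With restart
    # condition "any", Swarm restarts the container on another node if the
    # node currently hosting it dies.
    client.services.create(
        image="apache/airflow:2.0.0",   # illustrative image tag
        command="webserver",
        name="airflow-webserver",       # illustrative service name
        mode=docker.types.ServiceMode("replicated", replicas=1),
        restart_policy=docker.types.RestartPolicy(condition="any"),
    )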

Conclusion

We built a scalable task-scheduling system, based on Airflow and Docker Swarm, that is resilient to node failures.
If you’d like to see this feature in action, I have written another article that serves as a step-by-step guide to setting it up: https://medium.com/@akkidx/setting-up-airflow-to-run-with-docker-swarms-orchestration-b16459cd03a2
