Upgrading Airflow with Zero Downtime
At Flatiron Health, we use Airflow to orchestrate the pipelines necessary to build the mission-critical datasets we use to accelerate cancer research. Airflow provides us a platform to author, schedule and monitor workflows. We extended the Airflow platform into a system which provides multi-tenant capabilities for resource and workflow management.
As our system’s adoption increased across the organization, Airflow workers were constantly busy with vital tasks, interfering with our ability to release updates that required worker restarts. The analysis and insight we provide to our network of oncology practices requires fresh data, making daily, successful data retrieval jobs a high priority.
The mission-critical nature of these tasks’ completion made it unacceptable for us to schedule downtime for common updates to the platform. In this post, we’ll explore our platform and considerations around finding a solution to rollout upgrades without interrupted tasks or downtime.
Our Airflow Infrastructure & Upgrade Problems
This diagram roughly depicts our Airflow infrastructure. Each team manages their own set of worker nodes, and we use a separate leader node for the webserver and scheduler. Our design allows teams to manage their own worker nodes and resources, because each team knows the needs of their tasks best.
A worker node consists of a Docker container running an Airflow worker process with a Celery executor back-end. We use systemd, a Linux process manager, to run the Docker container. The systemd unit file shuts down the container, pulls the new image, and runs the new Docker container. With this containerized design, upgrading our worker node is simple — we just have to push a new image (with new code or dependencies) and restart the container using systemd. Unfortunately, a consequence of this design was the interruption of running tasks in the Airflow worker.
An update to our system required restarting all of our Docker containers running Airflow worker processes. When systemd orchestrated a `docker kill` on restart, all Airflow worker Docker containers were forcefully shut down, interrupting any running tasks in the process. Interrupted tasks often resulted in agreement violations, failure of vital daily workflows, and other downstream consequences.
Considerations & Challenges
In our exploration for a solution, we established important criteria to help vet our ideas. By aligning on these criteria, we could quickly run through a check-list of the requirements for our solution:
- Running tasks must not be interrupted — when workers are restarted, any currently running tasks continue to run, and complete successfully.
- No downtime releases. Further scheduled tasks should be handled as soon as possible.
- Easy rollbacks — we should be able to iterate quickly, rolling back to an old version if needed, safely and easily.
- No stakeholder impact. The release process should be invisible to stakeholders — they should only need to be informed of critical changes that affect their workflows and should not be impacted during a regular release.
- New worker nodes are not created. With our current infrastructure, our host provisioning was still a relatively manual process — our hosts were spun up using the configuration management tool Ansible, but these processes were not in a state where we could reliably, programmatically trigger and configure new nodes.
- No dependency on a separate orchestration system. Our dockerized workers were easy to work with because they could be encapsulated and run anywhere. In order to maintain this, we required that our solution did not involve a separate, overarching system monitoring running tasks on the host. Our workers should be managing themselves.
Our implemented solution incorporated usage of the Celery API for fine-grained task management control, combined with a signal handler to provide a graceful shutdown of the Airflow worker. Using systemd, we initiate a set of commands to shut down our worker only when running tasks are complete, and simultaneously spin up a fresh worker to handle new scheduled tasks.
On `systemd restart`, systemd runs Docker kill, which then sends a TERM signal to the Airflow worker container. The Airflow worker container is sent a TERM signal, which is captured by a signal handler. The signal handler execution has two important functions that solve the challenges above:
- Stops worker from consuming & running any new tasks.
- Waits for currently running tasks to complete before shutting the worker down.
Our Airflow system uses Celery (with RabbitMQ), a distributed task queue to schedule tasks. Celery provides an API with monitoring and queue management endpoints. The trap handler uses these endpoints to orchestrate a graceful shutdown. The bash file below is the entrypoint run by the Airflow worker Docker container:
When our Docker container runs, it prepares a trap handler (1), and waits for a signal to be sent to the Airflow worker it runs in the background (2). The trap handler uses the `cancel_consumer` endpoint to stop our Airflow worker from consuming from the queue — this prevents the Airflow worker from picking up any new scheduled tasks to run (3). The Celery API also provides a `inspect active` endpoint, which returns a list of tasks still running by the specified worker. The trap handler uses this endpoint to monitor the running tasks on the Airflow worker, waiting for the list to empty before shutting itself down (4).
Systemd starts a new Airflow worker (with a unique Celery worker name) without waiting for the previous Airflow worker to shut down. The new Airflow worker automatically begins consuming from the Celery queue, picking up any new scheduled tasks, preventing any downtime.
With this solution, we avoid any task interruption — the Celery API stops the old Airflow worker from consuming new tasks, and the trap handler gracefully waits for task completion before shutting down the worker. Rollbacks, or multiple deploys, are simple, because the trap handler encapsulates the shutdown to the Airflow worker, without requiring any additional external orchestration. Stakeholders and teams working with the system see no impact to their workflows, as the immediate start-up of a new worker prevents any perceived downtime.
Our solution works well and fits the criteria we set out with, but it does come with a couple of drawbacks. While rolling out new versions is easy, frequent releases could be a burden on the statically resourced worker nodes. Running many Airflow workers on a single host could drain its resources. Additionally, our release process now has a dependency on Celery, so if we wanted to change our executor, we would have to make larger changes to the system to maintain this capability. Better infrastructure around host creation would enable us to forgo a release process on the individual node, in favor of spinning up a new worker node altogether with updated container versions. We’re investigating technologies like Kubernetes, to help orchestrate resource allocation, to diminish many of these concerns and to build a more scalable system.