Unsticking Airflow: Stuck Queued Tasks are No More in 2.6.0

Consolidating logic for stuck queued tasks and making this pesky bug a thing of the past

RNHTTR
Apache Airflow
3 min read · Apr 19, 2023

Hi there! I’m Ryan, an Airflow engineer on the Customer Reliability Engineering team at Astronomer, the commercial developer for Apache Airflow. In this blog post, I’ll share my recent contribution to Airflow, which tackles the issue of stuck queued tasks starting in Airflow 2.6.0. Let’s dive in!

Introduction

Airflow has long been a powerful and flexible platform for orchestrating complex workflows. However, workflows have a nasty habit of getting stuck, especially when using the CeleryExecutor. In this blog post, I’ll explore how Airflow 2.6.0 addresses this issue and consolidates a few executor-specific configuration settings for a more reliable Airflow experience.

Oiling the Gears: Airflow 2.6.0 Keeps Tasks Moving

The Stuck Queued Task Phenomenon

In Airflow, tasks are queued for execution based on their dependencies and scheduling constraints. A task’s lifecycle typically progresses from scheduled to queued to running. First, the Airflow scheduler determines that it’s time for a task to run and that its other dependencies are met (e.g. upstream tasks have completed); at this point, the task enters the scheduled state. When the task is assigned to an executor, it enters the queued state. When the executor finally picks up the task and a worker starts performing the task’s work, the task enters the running state. Sometimes, though, tasks get stuck in the queued state. This can be intentional (e.g. a parallelism limit has been reached) or unintentional, when something fails between the executor and the scheduler.

A great way to observe this phenomenon is Airflow’s Gantt chart view. Take a look at the next two Gantt chart screenshots: the first shows the normal execution of tasks in Airflow, and the second shows the transform task stuck in queued.

Normal, healthy Gantt chart
Gantt chart showing a long gap between task executions, indicating a task stuck in queued

With the CeleryExecutor in particular, stuck queued tasks have historically been a common occurrence despite the celery configuration setting AIRFLOW__CELERY__STALLED_TASK_TIMEOUT. While stalled_task_timeout resolved most cases of tasks stuck in queued, some would nonetheless slip through and remain in the queued state for hours, until the task’s scheduler died and a different scheduler could subsequently “adopt” the task.
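
For reference, that old timeout was configured like any other Airflow option, either in airflow.cfg or through the environment variable above. A minimal sketch, with a purely illustrative value:

# airflow.cfg (value is illustrative)
[celery]
stalled_task_timeout = 300

# equivalent environment variable:
# AIRFLOW__CELERY__STALLED_TASK_TIMEOUT=300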

At Astronomer, we observed data suggesting that many Airflow users are plagued by tasks stuck in queued, even though most don’t notice. I spent quite some time alongside my colleagues trying to reproduce the bug, to no avail. In a pinch, I wrote a DAG that queried the Airflow database for stuck queued tasks on behalf of an acutely affected Airflow user. The query looked something like this:

SELECT
    dag_id,
    run_id,
    task_id,
    queued_dttm
FROM
    task_instance
WHERE
    queued_dttm < current_timestamp - interval '30 minutes'
    AND state = 'queued';
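
For the curious, here’s a minimal sketch of what such a monitoring DAG might look like, written with the TaskFlow API and Airflow’s ORM session rather than raw SQL. The DAG id, schedule, and 30-minute threshold are illustrative, not the exact DAG we ran:

from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    dag_id="stuck_queued_task_monitor",  # illustrative name
    schedule="*/15 * * * *",             # illustrative schedule
    start_date=datetime(2023, 1, 1),
    catchup=False,
)
def stuck_queued_task_monitor():
    @task
    def report_stuck_queued_tasks():
        # Query the metadata database via Airflow's ORM for task instances
        # that have been sitting in the queued state for more than 30 minutes.
        from airflow.models import TaskInstance
        from airflow.utils import timezone
        from airflow.utils.session import create_session

        cutoff = timezone.utcnow() - timedelta(minutes=30)
        with create_session() as session:
            stuck = (
                session.query(TaskInstance)
                .filter(TaskInstance.state == "queued")
                .filter(TaskInstance.queued_dttm < cutoff)
                .all()
            )
            for ti in stuck:
                print(f"Stuck in queued: {ti.dag_id}.{ti.task_id} (run {ti.run_id})")

    report_stuck_queued_tasks()


stuck_queued_task_monitor()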

It eventually became clear that this is the most logical way to detect stuck queued tasks: querying the Airflow database from the scheduler is far simpler and more intuitive than trying to implement bespoke logic for each executor. So, that’s how Airflow will resolve tasks stuck in queued moving forward:

  1. Query the database for tasks that have been queued for longer than scheduler.task_queued_timeout.
  2. Send any such tasks to the executor, which handles failing them.

This new setting, scheduler.task_queued_timeout, is currently supported only by the CeleryExecutor and the KubernetesExecutor.
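
If you want to tune this behavior, the timeout can be set in airflow.cfg or through the corresponding environment variable. A minimal sketch, with an illustrative 10-minute value:

# airflow.cfg (value is illustrative)
[scheduler]
task_queued_timeout = 600

# equivalent environment variable:
# AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT=600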

Consolidating Configuration Settings

Previously, three different configuration settings were all more or less responsible for doing the same thing:

  1. kubernetes.worker_pods_pending_timeout
  2. celery.stalled_task_timeout
  3. celery.task_adoption_timeout

By deprecating these settings and moving the logic for detecting stuck queued tasks into the scheduler, Airflow now provides a single mechanism (scheduler.task_queued_timeout) to detect and handle stuck queued tasks, regardless of the executor used.

Conclusion

Airflow 2.6.0 deprecates kubernetes.worker_pods_pending_timeout, celery.stalled_task_timeout, and celery.task_adoption_timeout in favor of a single configuration setting, scheduler.task_queued_timeout, and solves the problem of tasks stuck in queued once and for all!

I encourage Airflow users to upgrade to version 2.6.0 and take advantage of these improvements. We at Astronomer are committed to the ongoing development and enhancement of Airflow, and your feedback plays a huge role in that, so sound off in the comments below!
