Fixing SubDagOperator Deadlock in Airflow

Twine Labs
3 min read · May 17, 2018


At Twine, we use Airflow. It’s the only out-of-the-box data pipeline solution we trust for managing datasets as big and diverse as the ones we work with every day.

Unfortunately, our initial deployment of Airflow was complicated by its deadlocking SubDagOperator. On the brink of implementing a costly, roundabout workaround, we stumbled across an excellent write-up of the problem courtesy of Bluecore’s engineering team.

Here’s a less technical explanation: your distributed Airflow deployment (e.g. Airflow + Celery) works like your local McDonald’s. Customers (tasks) place orders at the counter until every cashier (worker slot) is occupied; the remaining customers form a queue.

Here, the SubDagOperator is a parent whose children are the operators of the sub-DAG. In real life, parents bring their kids to the counter and make sure each child’s order is properly placed. In Airflow, the SubDagOperator leaves its children in line and insists on occupying its cashier until every child’s order has been processed by another cashier. This becomes a problem when these irresponsible parents occupy the entire counter: no child can ever reach a cashier, so no parent ever leaves, and the whole line deadlocks.

In restaurant terms, the solution is simple. We add a line to our McDonald’s just for parents. This ensures counter space for children and prevents parents from holding everyone up.

Convoluted analogies aside, here’s what we actually did in Airflow. First, we assigned each SubDagOperator to a special Celery queue:

# Before
SubDagOperator(
    subdag=dag,
    task_id=dag_name
)

# After: route the parent operator to its own Celery queue
SubDagOperator(
    subdag=dag,
    task_id=dag_name,
    queue="subdag_queue"  # the special line for parents
)
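For context, here’s a minimal, self-contained sketch of how a SubDagOperator like this might sit inside a parent DAG. The DAG names, the schedule, and the build_subdag helper are illustrative (they aren’t from our actual pipelines), and the imports follow the Airflow 1.x API this post was written against:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {"owner": "airflow", "start_date": datetime(2018, 5, 1)}


def build_subdag(parent_dag_name, child_dag_name, args):
    # The sub-DAG's own tasks stay on the default queue; only the
    # SubDagOperator below is routed to the dedicated queue.
    subdag = DAG(
        dag_id="%s.%s" % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval="@daily",
    )
    DummyOperator(task_id="child_task", dag=subdag)
    return subdag


parent_dag = DAG(
    dag_id="parent_dag",
    default_args=default_args,
    schedule_interval="@daily",
)

SubDagOperator(
    subdag=build_subdag("parent_dag", "section", default_args),
    task_id="section",
    dag=parent_dag,
    queue="subdag_queue",  # picked up only by the dedicated worker
)

Any task that doesn’t specify a queue lands on Airflow’s default_queue, so only the SubDagOperator itself ends up waiting in the parents-only line.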

Next, in our deployment configuration files, we gave one worker sole responsibility for that queue:

Box1: airflow worker   ->   Box1: airflow worker -q default
Box2: airflow worker   ->   Box2: airflow worker -q subdag_queue

That’s it! This was an easy solution, but a hard problem to pin down, due in part to Airflow’s messy abstraction of sub-DAGs. Subgraphs are graphs themselves, so we expect the abstraction of a subgraph to play by the same rules as the equivalent abstraction of a graph. Instead we have the SubDagOperator, which abstracts a graph but behaves like a vertex.

Because we trust abstractions to explain how systems work under the hood, a misleading abstraction like the SubDagOperator can eat up a lot of developer time by assuring us we’re on the right track when we’re really not.

Despite these hiccups, we still love Airflow. The visibility and pipeline flexibility it provides make it well worth the setup time, especially when managing large and diverse datasets.

If this article helped you, or you rolled your own solution for sub-DAG deadlock, we’d love to hear from you. Reach out at engineering@twinelabs.com!

For the sake of completeness, it’s worth noting that this solution can still run into problems with nested SubDagOperators. Each nesting level takes up a slot in the sub-DAG queue until all of its children (and grandchildren, great-grandchildren, etc.) have completed, so you can run into a similar deadlock situation.

However, in practice, this isn’t a problem you’re likely to run into. For one thing, it’s pretty uncommon to nest sub-DAGs, and when there is nesting it’s rarely more than two levels deep. For another, you can set the thread-level concurrency on the sub-DAG queue a lot higher than on a normal worker queue: all a SubDagOperator does is wait for its children to finish, so it doesn’t take up many resources. As a benchmark, we have ours set to 32 threads per worker (celeryd_concurrency = 32).
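To make the nesting caveat concrete, here’s a rough sketch (again with illustrative names, using the same Airflow 1.x API as above) of two levels of SubDagOperators routed to the same queue. While the innermost tasks run, both the outer and inner operators sit on subdag_queue, each holding one of its slots:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {"owner": "airflow", "start_date": datetime(2018, 5, 1)}


def build_inner(parent_dag_name, child_dag_name, args):
    inner = DAG("%s.%s" % (parent_dag_name, child_dag_name),
                default_args=args, schedule_interval="@daily")
    DummyOperator(task_id="grandchild_task", dag=inner)
    return inner


def build_outer(parent_dag_name, child_dag_name, args):
    outer = DAG("%s.%s" % (parent_dag_name, child_dag_name),
                default_args=args, schedule_interval="@daily")
    SubDagOperator(
        subdag=build_inner("%s.%s" % (parent_dag_name, child_dag_name),
                           "inner", args),
        task_id="inner",
        dag=outer,
        queue="subdag_queue",  # second slot held while grandchildren run
    )
    return outer


top_dag = DAG("top_dag", default_args=default_args,
              schedule_interval="@daily")

SubDagOperator(
    subdag=build_outer("top_dag", "outer", default_args),
    task_id="outer",
    dag=top_dag,
    queue="subdag_queue",  # first slot held while the whole sub-DAG runs
)

With the dedicated worker running 32 threads, you’d need a very deep pile of concurrently running sub-DAGs before those slots ran out.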
