In Praise of Digdag, an Alternative to Airflow

Scott Arbeitman
Jan 12, 2019


At Culture Amp, we’ve been customers of Treasure Data for about a year. We’ve been able to get some great momentum around product analytics with their stack of tools, which includes solutions for data ingestion, data querying, workflow management, and data export.

Treasure Data are very good open-source citizens, and many of the tools that power their stack are open-sourced. Their most popular is probably fluentd, but their bulk data movement tool, Embulk, is pretty good. And as I’ll describe below, so is their workflow management tool, Digdag.

As we looked to move some of this infrastructure internally using mostly AWS tools (Spark on EMR, Presto via Athena, S3 for storage), Amazon didn’t offer a compelling solution for data pipeline orchestration. Step Functions and Simple Workflow Service were briefly considered, but we decided to opt for a cloud-agnostic solution with a more familiar programming paradigm.

The solution du jour for workflow orchestration is Airflow. My line has repeatedly been that when starting a new project, Airflow is the sensible default. To the extent data engineering is “mainstream”, so is Airflow. Google Cloud even offers a hosted Airflow: Google Cloud Composer.

As we began executing on this project, we spun up Airflow and got to work. Another team had already set up Airflow on Fargate, so we jumped in and started writing code. We noticed a few things:

  • Airflow DAGs are verbose and hard to read. I chalk some of this up to my discomfort with Python (my background is more with Ruby), but the way tasks are wired together in Airflow also makes it difficult to understand what is happening at a glance (see the sketch after this list).
  • Airflow’s included operators are buggy and immature. We ended up having two data engineers send pull requests to Airflow to fix a couple of bugs. While having Airflow contributors is something we are proud of, it reduced our confidence in the codebase.
  • Sometimes the value of a tool lies in the strong opinions it imposes: helpful constraints on what you do with it. With Airflow, we were getting mixed messages. Should Airflow handle data processing with oddly specific Source-To-Destination operators, or should the processing be offloaded to another tool like Spark (our preference)?
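
To make the readability point concrete, here is a minimal sketch of the sort of DAG we were writing. It is illustrative only: the pipeline name, task IDs, and scripts are hypothetical, and it uses the Airflow 1.x BashOperator API that was current at the time. Note how the dependency wiring lives apart from the task definitions:

```python
# A minimal, hypothetical two-step pipeline in Airflow (1.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "start_date": datetime(2019, 1, 1),
}

dag = DAG(
    dag_id="daily_pipeline",
    default_args=default_args,
    schedule_interval="0 2 * * *",  # run daily at 02:00
)

# Each task is declared separately...
process = BashOperator(
    task_id="process",
    bash_command="./run_spark_job.sh {{ ds }}",  # offload the heavy lifting to Spark
    dag=dag,
)

export = BashOperator(
    task_id="export",
    bash_command="./export_results.sh {{ ds }}",
    dag=dag,
)

# ...and the ordering is wired together at the end, away from the tasks themselves.
process >> export
```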

Having worked with Treasure Data’s Digdag for nearly a year, I felt something wasn’t right with Airflow. Digdag addresses the three key flaws of Airflow:

  • DAGs are written in YAML and are easy to understand at a glance (see the sketch after this list).
  • Digdag operators, while scant, are reliable and simple.
  • Digdag’s architecture and included operators strongly imply you should handle your data processing in another tool.
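
For comparison, here is roughly the same hypothetical two-step pipeline as a Digdag workflow. This is a sketch with made-up task names and scripts, but the shape is real: the whole thing is one YAML file, tasks run top to bottom by default, and there is no separate wiring step:

```yaml
# daily_pipeline.dig — a hypothetical two-step workflow.
timezone: UTC

schedule:
  daily>: 02:00:00   # run daily at 02:00

+process:
  sh>: ./run_spark_job.sh ${session_date}   # offload the heavy lifting to Spark

+export:
  sh>: ./export_results.sh ${session_date}
```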

We do give up quite a lot with Digdag compared to Airflow. Among the trade-offs:

  • Airflow is far more widely adopted than Digdag, so it is easier to find help, plugins, and tutorials from the community.
  • Airflow’s web UI is far superior to Digdag’s.
  • Documentation on Digdag is poor, especially on how to configure your server and build custom operators.

Nevertheless, after making the decision to drop Airflow and use Digdag, I think the trade-offs have been worthwhile. The bottom line for us is that these DAGs should be as dumb as possible, offloading the vast majority of interesting work to systems designed to handle it. The tool should be optimised for that task, and for us, Digdag makes it all much easier.
