Workflow automation using Apache Airflow 2.0

Sai Charan Chinta
Published in Unibuddy
Apr 20, 2022

We use Celery to run asynchronous tasks at Unibuddy, and it performed fairly well for some time. As we evolved and started supporting more use cases, we needed a robust mechanism to stitch tasks together into processes (workflows), track individual tasks, scale up to run more of them in parallel, and reliably run jobs that take a very long time.

So we started looking for alternatives with these objectives in mind:

  • Track individual tasks and overall workflow status
  • Support long-running tasks
  • Handle failures gracefully
  • Self-managed and automated scaling
  • Seamless integration with multiple internal & AWS services

We decided to experiment with AWS Step Functions and Apache Airflow. You can read more about our experience with AWS Step Functions here.

Airflow is a platform to programmatically author, schedule and monitor workflows.

Workflows are DAGs in Airflow

A DAG, or Directed Acyclic Graph, represents a workflow in Airflow. It is a collection of tasks organized to reflect each task's relationships and dependencies. You can have as many DAGs as you want, and Airflow will execute them according to those relationships and dependencies.

This is what they look like:

Writing a DAG

A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code. For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10pm, but shouldn't start until a certain date.
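To make that concrete, here is a minimal sketch of how those constraints map onto DAG and task arguments in Airflow 2.x. The tasks A, B and C are placeholder BashOperators, not real workloads:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_constraints",
    schedule_interval="0 22 * * *",   # run every night at 10pm...
    start_date=datetime(2022, 5, 1),  # ...but not before this date
    catchup=False,
) as dag:
    a = BashOperator(
        task_id="A",
        bash_command="echo A",
        execution_timeout=timedelta(minutes=5),  # A times out after 5 minutes
    )
    b = BashOperator(
        task_id="B",
        bash_command="echo B",
        retries=5,  # B can be retried up to 5 times if it fails
    )
    c = BashOperator(task_id="C", bash_command="echo C")

    a >> b  # A must succeed before B; C has no upstream dependency
```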

Each task within a workflow, ideally idempotent, is represented by an operator. Operators determine what actually executes when your DAG runs. Airflow has a very extensive set of operators available, some built into the core and others shipped with pre-installed providers.

In the following code, we define two tasks (with DummyOperator) which do nothing but mark the start (t_start) and end (t_end) of the workflow. Assuming we already have a Lambda deployed, we then define a PythonOperator, which calls an arbitrary Python function, to trigger the Lambda with some payload.
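A minimal sketch of such a DAG is shown below. The DAG id, the Lambda function name, the payload and the invoke_report_lambda helper are illustrative placeholders rather than our production code:

```python
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator


def invoke_report_lambda(**context):
    # Hypothetical helper: invoke an already-deployed Lambda asynchronously.
    client = boto3.client("lambda")
    client.invoke(
        FunctionName="generate-report",  # assumed function name
        InvocationType="Event",          # fire-and-forget
        Payload=json.dumps({"report_date": context["ds"]}),
    )


with DAG(
    dag_id="report_generation",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # triggered manually or via the REST API
    catchup=False,
) as dag:
    t_start = DummyOperator(task_id="t_start")
    t_trigger_lambda = PythonOperator(
        task_id="t_trigger_lambda",
        python_callable=invoke_report_lambda,
    )
    t_end = DummyOperator(task_id="t_end")

    t_start >> t_trigger_lambda >> t_end
```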

The relationships within a DAG are defined by “>>”. The pictorial representation of the above DAG looks like this:

[Figure: the workflow as interpreted by Airflow from the DAG definition]

You can invoke the workflow through the UI, over the REST API, or on a schedule attached to the DAG. We simulated some of our use cases with Airflow, such as report generation through Lambdas and long-running migrations through AWS Fargate, and the pros and cons below are what we found.
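For instance, a run of the DAG above can be triggered over Airflow 2.0's stable REST API. The host, credentials and run configuration below are placeholders, and the snippet assumes the basic-auth backend is enabled on the webserver:

```python
import requests

# Trigger a run of the "report_generation" DAG via the stable REST API.
response = requests.post(
    "http://localhost:8080/api/v1/dags/report_generation/dagRuns",
    auth=("airflow", "airflow"),                   # placeholder credentials
    json={"conf": {"report_date": "2022-04-20"}},  # optional run configuration
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```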

Pros

  • Great UI
  • Robust integrations — Airflow supports integrations with different services on AWS, GCP, etc. In fact, Airflow provides hooks that made it possible to interact with AWS services like Lambda & Fargate.
  • Pure Python — most people at Unibuddy already know the language
  • Code sharing — Airflow lets us write common plugins/libraries that can be shared across DAGs
  • Open source & community
  • Easy local development

Cons

  • Static workflows — many of our processes need dynamic configurations. We could achieve them through Airflow with a more complicated setup, but it isn't designed for that and the process seemed really hacky
  • Lack of data sharing — Airflow does provide some level of data sharing between tasks, but it's recommended to use external data stores
  • Learning curve
  • No versioning of workflows
  • Cost — it seems like Airflow will keep running the task itself even while it sits idle waiting for the external system to respond, thereby running up costs

Conclusion

Although Airflow achieves most of the objectives we set out with, we decided not to move forward with it. It works best as an orchestrator, offloading most of the heavy lifting to external systems. We instead chose Step Functions because it also meets our objectives while providing flexibility in defining dynamic state machines, ease of data sharing, incredible scaling, and easy provisioning of the functions with AWS CDK.

This POC was done in two weeks of personal project time given to every engineer at Unibuddy. If you want to work at Unibuddy, we are hiring.
Check out the link — https://grnh.se/3583cca53us
