Batch orchestration on Azure flowchart

Coussement Bruno
Published in datamindedbe · Apr 13, 2022

Managed solutions vs building it yourself

Image by author

Batch data lake architectures often require a component that orchestrates data pipelines and ingestion flows. On Azure, the go-to service for this is Azure Data Factory (ADF). Although it is a great option to start with, there is more to be said.

Most orchestrators will provide a way to trigger or run a task (which can be anything) on the same or another platform. Tasks can be linked together in a directed acyclic graph (DAG). That’s where the similarities end.

The ideal orchestration platform should have:

  • off-the-shelf connectors to all (possible future) services you want to run or schedule
  • DAGs defined as code or YAML
  • no management overhead of the platform
  • easy to install and deploy
  • easy access to task logs
  • easy alerting and notification in case of failed DAGs
  • support for the authentication and authorisation model used by your organisation
  • a clear overview UI that shows DAG success rate, last run time, etc. across all DAGs

There is no option on Azure today that offers all of this. You will have to make trade-offs. The flowchart below focuses on an essential aspect, in my opinion: overhead cost and ease of integration.

Flowchart made by the author

I will briefly comment on the different options below.

Azure Data Factory

Azure Data Factory should be your first go-to option if the DAGs your team will manage are few (<10), simple (no dynamic DAG magic), and only use services for which ADF has a connector. The good news is that there are enough connectors to cover most cases. It is child’s play to create an ADF workspace, make your first DAG, and schedule it to run every night at 2 am. You don’t need to be a senior engineer for this.

There are a few considerations to be made:

  • Suppose you need to connect to an unsupported service, platform, or system. In that case, you start a journey of creating and maintaining a custom web activity, webhook activity, Azure Function, custom container on Azure Batch, etc. There goes the “off the shelf” experience.
  • Everything in ADF is JSON, which Git can track. Combine this with a CICD flow to promote a DAG from the dev to the production environment, and you have a real time saver in the long run. But here’s the thing: this Git repo is polluted with ARM templates, weird branch names, and an odd CD strategy. Did you want to keep your ADF DAG definitions next to your actual data pipeline code? Good luck. The DAG definitions and the pipeline code will need to live in separate repositories.
  • I quickly lose the overview when the number of DAGs in a single ADF workspace grows. You could build a Power BI dashboard for this, but that’s not trivial either.
  • Creating a single complex DAG (>20 tasks, many dependencies) can tax your brain’s working memory, whether you click it together in the visual editor or type JSON blocks in an IDE. That’s why generating DAGs programmatically is the only sensible way once things get complex (see the sketch after this list).
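To make that last point concrete, below is a minimal sketch of what generating a DAG programmatically could look like for ADF: a small Python script that emits a pipeline definition for a list of tables and writes it into the ADF Git repo. The field names only approximate ADF’s JSON pipeline format, and the table list and output path are invented for the example; treat it as an illustration, not a verified template.

```python
import json

# Hypothetical list of tables to ingest every night.
TABLES = ["customers", "orders", "invoices"]


def copy_activity(table, depends_on=None):
    """Build one pipeline activity; the structure approximates ADF's JSON format."""
    activity = {
        "name": f"copy_{table}",
        "type": "Copy",
        "typeProperties": {},  # source/sink configuration would go here
    }
    if depends_on:
        activity["dependsOn"] = [
            {"activity": depends_on, "dependencyConditions": ["Succeeded"]}
        ]
    return activity


# Chain the copies sequentially so each table waits for the previous one.
activities = []
previous = None
for table in TABLES:
    activities.append(copy_activity(table, depends_on=previous))
    previous = f"copy_{table}"

pipeline = {"name": "nightly_ingest", "properties": {"activities": activities}}

# Write the generated definition next to the other ADF resources in the Git repo.
with open("pipeline/nightly_ingest.json", "w") as f:
    json.dump(pipeline, f, indent=2)
```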

Managed Airflow

Even though there are a few extra steps to get a managed Airflow running on Azure, it will alleviate the main pain points of ADF. A managed Airflow such as Astronomer offers the same breadth of integrations. In Astronomer, the latest Airflow features and fixes are available immediately. Astronomer also scales to thousands of DAGs and tasks and makes it easy to set up a CICD flow: upload a DAG to a specific container in a storage account corresponding to a particular Airflow DAG collection.
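As an illustration of that CICD step, here is a minimal sketch of a CI job uploading a DAG file to a storage-account container with the azure-storage-blob library. The container name, file paths, and the connection string coming from an environment variable are assumptions made for the example, not something prescribed by Astronomer or Azure.

```python
import os

from azure.storage.blob import BlobServiceClient

# Assumed: the CI pipeline exposes the storage account connection string as a secret.
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
service = BlobServiceClient.from_connection_string(connection_string)

# Hypothetical container that the Airflow deployment picks its DAG files up from.
blob = service.get_blob_client(container="airflow-dags", blob="nightly_ingest.py")

# Upload (or overwrite) the DAG file produced in this repository.
with open("dags/nightly_ingest.py", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```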

The main drawback of choosing Airflow is setting it up and maintaining it yourself. It has multiple moving parts: the UI, the scheduler, the executors (workers), the metadata DB, and the message broker (needed if you run multiple workers and schedulers). If you are aiming for a fully fledged production-grade setup, you are going to need all of this. It’s easy to fall into the trap of thinking that this is manageable with a few people. I’ve seen so much human effort wasted on functionality that can basically be obtained off the shelf.

Why Airflow in the first place? Simply because it is one of the most popular orchestration tools available today. Before version 2, it had its problems, which drove people away towards other opinionated orchestrators like Prefect. As long as you use it the way it was designed for, you’re good. A good post on this is https://blog.locale.ai/we-were-all-using-airflow-wrong-and-now-its-fixed/.

So basically, you give up two things compared to ADF: it is a bit harder to install, and DAGs must be defined as code instead of dragged and dropped. How much these implications matter will depend on your situation.
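For readers who have never worked with DAGs as code: a minimal Airflow DAG looks roughly like the sketch below, written against the Airflow 2 API. The DAG id, task names, and the nightly 2 am schedule are placeholders that mirror the ADF example earlier.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A tiny DAG that runs every night at 2 am: extract -> transform -> load.
with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> transform >> load
```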

Both AWS and GCP also offer managed Airflow. Compared to Astronomer, they often lag in the Airflow version but are easier to install in that specific cloud. If you want to go hybrid cloud, this might also be a good option.

Airflow on AKS

Ok, so you’re a Kubernetes (k8s) kind of person? Knock yourself out with the Airflow Helm chart. I’ll see you in a year, when the scalable, production-tested, multi-team, multi-environment, company-wide setup is ready.

Jokes aside, deploying a basic Airflow on a k8s platform is fairly simple if you’ve done it before. If that is all you need, and you don’t want any vendor lock-in, this is the best way forward. Just know that if you walk the dark path of managing infra and platforms yourself, it will become more challenging over time because of technical debt piling up.

Airflow on ACI

If all you want is a basic Airflow that is kept up to date with the latest version, and you don’t want to manage k8s clusters and deployments, then have a look at the Bitnami Docker images. You can quickly deploy them on Azure Container Instances. Just know this setup is not made for heavy workloads. You’ll also need a few more Azure services to actually make it work, as shown in this blog post.

Make sure you’re using a tool like Terraform to deploy instead of clicking it together through the UI. That will save you some pain when updating and creating new environments in the long run.

Airflow via Azure Marketplace

The last viable option is deploying Airflow through an ARM template offered on the Azure Marketplace. Super easy to set up, not many pricing options, production-ready. Note that you’re running it on bare VMs, with all the hassle that comes with them.

This option is more expensive in terms of cloud costs than deploying it yourself on ACI or AKS. On the other hand, you simply deploy it with the click of a button. Keep in mind that because of the VMs and other low-level services you are gluing together, it is more complex to manage (keep up to date) in the long run. That’s why I would only use this option as a temporary stopgap before moving to one of the options above.

Bottom line

Go managed. Don’t try to do it yourself if you can avoid it. Your future self will thank you.

Further

If you are looking for active guidance in batch or streaming architecture on Azure, feel free to contact me or https://dataminded.be.
