AIRFLOW / BEST PRACTICES

Airflow 2: Managed vs Non-Managed Solution?

Based on your requirements, which flavour of Airflow should you go for? Astronomer? AWS MWAA? GCP Cloud Composer? Or a self-managed install?

Paul Fry
Published in Geek Culture
9 min read · Jun 6, 2022

Note: the views described below are my own and don't represent my employer.

Agenda

  1. Background
  2. Options Available for Airflow 2 Installation
  3. Assessment of Options Against Evaluation Criteria
  4. Summary of the Airflow 2 Options
  5. Recommendation

1. Background

Version 1 of Airflow reached End of Life (EOL) in June 2021 and is no longer supported by the Apache Airflow project (see: Airflow supported versions | airflow.apache.org). As a result, we needed to understand the options available for setting up and installing Airflow 2, based on our current needs and pain points.

The result: not a straightforward decision! To aid our selection process, we established our critical business criteria for the evaluation after lots of trial and error. This post shares our findings to help with your own decision-making.

2. Options Available for Airflow 2 Installation

Four options are available for setting up Airflow 2 in a production-grade environment. The first three options listed below are managed-service offerings:

  1. Astronomer: 'Any cloud' managed service for Airflow
  2. AWS MWAA: AWS-managed service for Airflow
  3. Google Cloud Composer: GCP-managed service for Airflow
  4. Self-managed Airflow Install

The fourth option is to install and configure your own Airflow environment on a Kubernetes cluster.

Evaluation Criteria

As mentioned, establishing the criteria wasn't straightforward. Each option has pros and cons, and the requirements will be unique to your scenario. However, to help shape the evaluation, we first established our critical business criteria for the review. These were as follows:

Criteria 1: ML workload requirements? Is the solution capable of supporting Machine Learning/Data Science workloads? Does it need to?

Criteria 2: Flexibility — how much can the managed-service offering be customised? What Airflow executor types are available, i.e., are the K8s/CeleryK8s executors available? Are Airflow releases frequently made available?

Criteria 3: K8s expertise — how much K8s in-house knowledge do you have to support any potential solution? What is their bandwidth?

Criteria 4: Value proposition — what does one Airflow implementation offer that others don't?

Criteria 5: Setup effort required — would the value justify the complexity of the setup needed for the solution?

Criteria 6: 'Local' Development environment — can a standalone version of the environment be easily configured and installed on a user's machine?
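One way to make criteria like these comparable is a simple weighted decision matrix. The snippet below is only a minimal sketch of that idea, not the process we actually followed: the criterion keys mirror the six criteria above, but the weights, the 1 to 5 scores and the option names (`option_a`, `option_b`) are all hypothetical placeholders that you would replace with your own assessment.

```python
from typing import Dict, List, Tuple


def weighted_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Return the weighted average of per-criterion scores for one option."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank_options(
    weights: Dict[str, float], options: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Rank options from best to worst; ties are broken alphabetically."""
    scored = [
        (name, round(weighted_score(weights, scores), 2))
        for name, scores in options.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical weights: how much each criterion matters to *your* team.
WEIGHTS = {
    "ml_workloads": 3,
    "flexibility": 3,
    "k8s_expertise": 2,
    "value_proposition": 1,
    "setup_effort": 2,
    "local_dev": 1,
}

# Hypothetical scores (1 = poor, 5 = excellent) for two made-up options.
EXAMPLE_OPTIONS = {
    "option_a": {
        "ml_workloads": 5, "flexibility": 5, "k8s_expertise": 2,
        "value_proposition": 4, "setup_effort": 2, "local_dev": 3,
    },
    "option_b": {
        "ml_workloads": 2, "flexibility": 2, "k8s_expertise": 5,
        "value_proposition": 3, "setup_effort": 4, "local_dev": 4,
    },
}

if __name__ == "__main__":
    for name, score in rank_options(WEIGHTS, EXAMPLE_OPTIONS):
        print(f"{name}: {score}")
```

The numbers matter less than the exercise: agreeing on the weights up front forces the team to state which criteria are genuinely critical before looking at any vendor.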

3. Assessment of Options Against Evaluation Criteria

So without further ado, let's see how the four Airflow options stack up against the evaluation criteria.

Criteria 1: ML Workload requirements? Is the solution capable of supporting Machine Learning/Data Science workloads? Does it need to?

Airflow Executor Types

The choice of executor type is critical to the workloads you want to orchestrate using Airflow. Executors significantly impact your task throughput, scalability, and maintenance. The benefits of the different types of Airflow executors are summarised below:

  • Speed of task execution: the K8sExecutor takes more time since it needs to spin up a pod, assign it to a node, run a Docker image, etc. So it may not be the best choice if you have a lot of small tasks.
  • Simplicity: the LocalExecutor is the simplest one. It is very easy to set up and run, but it doesn't scale well. The K8sExecutor is more challenging since it relies on Kubernetes.
  • Isolation: the K8sExecutor creates a pod for each task. This avoids dependency conflicts and increases the reliability of the task execution.
  • Custom resources: if you genuinely want to manage your resources, the K8sExecutor is the way to go. You can assign resources (CPU, memory, etc.) to a task according to its specific needs.
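The trade-offs above can be sketched as a small rule of thumb in Python. This is an illustrative heuristic under my own assumptions, not an official Airflow recommendation: the `Workload` flags and the `suggest_executor` function are hypothetical names, and falling back to the Celery executor when you need scale without isolation is an assumption on my part. The returned strings are the standard Airflow executor class names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what the Airflow tasks need."""
    many_small_tasks: bool = False        # pod start-up overhead would hurt
    needs_isolation: bool = False         # conflicting dependencies per task
    needs_custom_resources: bool = False  # per-task CPU/memory sizing
    needs_to_scale: bool = False          # more than one machine's worth of work


def suggest_executor(workload: Workload) -> str:
    """Map workload traits onto an executor, following the bullets above."""
    wants_k8s = workload.needs_isolation or workload.needs_custom_resources
    if wants_k8s:
        # Assumption: with lots of small tasks, a hybrid executor keeps the
        # cheap tasks off the slower pod-per-task path.
        if workload.many_small_tasks:
            return "CeleryKubernetesExecutor"
        return "KubernetesExecutor"
    if workload.needs_to_scale:
        # Assumption: Celery is the scale-out choice when isolation isn't needed.
        return "CeleryExecutor"
    # Simplest option; fine until it stops scaling.
    return "LocalExecutor"


if __name__ == "__main__":
    ml_pipeline = Workload(needs_isolation=True, needs_custom_resources=True)
    print(suggest_executor(ml_pipeline))
```

An ML pipeline typically sets both the isolation and custom-resource flags, which is why it lands on one of the Kubernetes-based executors.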

As a result, if you're looking to use Airflow to orchestrate ML pipelines, you need to use the Kubernetes (or CeleryKubernetes) executor. Of the four options, only Astronomer and a self-managed Airflow install make the K8s or CeleryK8s executors available, which makes them the only viable options for using Airflow to run MLOps pipelines.

Criteria 2: Flexibility — how much can the managed-service offerings be customised? What Airflow executor types are available, i.e., are the K8s/CeleryK8s executors available? Are Airflow releases frequently released?

Ability to select the executor type?

The same comparison applies here, this time asking which offerings allow which executor types. The AWS MWAA and Cloud Composer managed services restrict you to the Celery executor.

As you can see, only the Astronomer and self-managed install options allow you to select the type of executor used. As a result, if you're looking to use Airflow to orchestrate ML pipelines, you'll need to use one of these two options.
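The executor availability described in this post can be captured as a small lookup, which is handy if you want to filter the options by the executor your workload needs. This is a sketch reflecting the state of play as described in this article at the time of writing; the `EXECUTORS_BY_OFFERING` mapping and the `offerings_supporting` helper are hypothetical names, and the offerings may well have changed since.

```python
from typing import Dict, Iterable, List, Set

# Executor availability as described in this article (mid-2022).
# Assumption: "no restriction" for self-managed is modelled as the four
# common executors; Astronomer is assumed to offer Celery as well.
EXECUTORS_BY_OFFERING: Dict[str, Set[str]] = {
    "Astronomer": {
        "CeleryExecutor", "KubernetesExecutor", "CeleryKubernetesExecutor",
    },
    "AWS MWAA": {"CeleryExecutor"},
    "Google Cloud Composer": {"CeleryExecutor"},
    "Self-managed": {
        "LocalExecutor", "CeleryExecutor",
        "KubernetesExecutor", "CeleryKubernetesExecutor",
    },
}


def offerings_supporting(acceptable_executors: Iterable[str]) -> List[str]:
    """Return offerings that provide at least one of the acceptable executors."""
    wanted = set(acceptable_executors)
    return sorted(
        name
        for name, available in EXECUTORS_BY_OFFERING.items()
        if available & wanted
    )


if __name__ == "__main__":
    # ML pipelines: either Kubernetes-based executor would do.
    print(offerings_supporting(["KubernetesExecutor", "CeleryKubernetesExecutor"]))
```

Filtering on the two Kubernetes-based executors leaves only Astronomer and the self-managed install, which is the same conclusion reached above.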

The very nature of a managed service results in platform engineers having limited options available for configuring installs per their environment requirements.

Frequency of Airflow Version Releases

The other thing worth noting here is the frequency of releases for the managed service offerings:

  • Astronomer is top of the shop here, offering very regular releases.
  • Cloud Composer offers regular/frequent versions, but not as many as Astronomer.
  • AWS MWAA is pretty poor in this department. At the time of writing, it only offers three versions of Airflow as part of its managed service offering: v1.10, 2.0.2 & 2.2.2.

As you can see, the very nature of a managed service results in limited flexibility in configuring the installation. In contrast, option 4 (self-managed install) provides total flexibility as a non-managed service option.
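To make the version constraint concrete, here is a minimal sketch of checking whether a managed service offers an Airflow version that meets your minimum requirement. The version list is the AWS MWAA one quoted in this post at the time of writing (v1.10, 2.0.2 and 2.2.2); `parse_version` and `newest_at_least` are hypothetical helper names, and the parsing assumes plain dotted numeric versions.

```python
from typing import Iterable, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v2.2.2' or '2.2.2' into a comparable tuple such as (2, 2, 2)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def newest_at_least(available: Iterable[str], minimum: str) -> Optional[str]:
    """Return the newest available version >= minimum, or None if there is none."""
    floor = parse_version(minimum)
    candidates = [v for v in available if parse_version(v) >= floor]
    return max(candidates, key=parse_version) if candidates else None


# Versions offered by AWS MWAA at the time of writing, per this article.
MWAA_VERSIONS = ["1.10", "2.0.2", "2.2.2"]

if __name__ == "__main__":
    print(newest_at_least(MWAA_VERSIONS, "2.0"))    # a 2.x release is available
    print(newest_at_least(MWAA_VERSIONS, "2.3.0"))  # nothing new enough
```

If your DAGs depend on a feature from a newer Airflow release than the service offers, the check comes back empty and the managed service is ruled out until the provider catches up.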

Criteria 3: K8s expertise — how much in-house K8s expertise do you have to support any potential solution? What is their bandwidth?

A bit of a no-brainer: for the most part, only option 4 requires K8s expertise. That said, GCP Cloud Composer provides great flexibility in how K8s is deployed for your Airflow cluster, so some K8s knowledge is also (in part) required for option 3.

Criteria 4: Value proposition —what does one Airflow implementation offer that others don’t?

GCP

  • If you're not already using GCP, the value provided by GCP Cloud Composer over the other options isn't enough to justify the initial GCP account setup required (infra/networking/IAM).

Astronomer

  • 'Any cloud' managed solution provides flexibility to pivot the Airflow implementation from one cloud provider to another within the Astronomer ecosystem.
  • Astronomer, on paper, seems to be a great candidate, providing similar levels of flexibility to a self-managed install. However, enterprise-level support was an essential requirement for our use case, and we required assurances that the service offered in APAC (from overseas resources) would meet our needs. Without testimonials from multiple industries in APAC demonstrating this, we couldn't entertain this option.

AWS MWAA

  • Remaining within the AWS ecosystem significantly simplifies integration considerations, in particular, secrets and logs management.
  • However, the managed AWS Airflow service itself is a compromised solution. See 'Criteria 5: Setup effort required' below.
  • Furthermore, the Airflow version provided by AWS MWAA doesn't allow us to develop an Airflow RBAC implementation.

Self-managed Airflow Install

  • A self-managed Airflow 2 environment provides the greatest flexibility, allowing you to: a) select the executor type required, b) configure bespoke Airflow RBAC roles, and c) easily install any additional libraries needed on the Airflow instances.

Criteria 5: Setup effort required — would the value justify the complexity of the required setup for the solution?

  • GCP Cloud Composer: this isn't a candidate option if you're not already using GCP. Cloud Composer alone can't justify the amount of basic GCP account setup required.
  • Astronomer: I can't talk about the setup effort required because this would require a vendor agreement. As a result, a technical spike is needed for this evaluation.
  • AWS MWAA: a poor setup experience, tied in with infrequent releases and a lack of customisation, makes this option only suitable if you're looking for a quick 'no frills' Airflow 2 environment.
  • Self-managed Airflow: this could be the best option if you have the K8s expertise available. However, if not, I'd recommend investigating Astronomer further.

Criteria 6: ‘Local’ Development environment —can a standalone version of the environment be easily configured and installed on a user’s machine?

A 'local' Airflow development environment can be created for all four options. However:

  • AWS MWAA: the local development environment is community-supported (GitHub repo link). As such, it's pretty buggy and prone to errors.
  • Self-managed Airflow using K8s: a local environment is possible, but some of the setup and config isn't trivial and requires K8s expertise.

4. Summary of the Airflow 2 Options

Google Cloud Composer

  • If you're not already using GCP as a cloud provider, the Cloud Composer service isn't a compelling enough reason to switch.
  • The effort required to stand up the foundation GCP environment, org setup, IAM, security, networking, etc., doesn't merit using Cloud Composer.
  • It's a more attractive option than AWS MWAA, but only if you're already using/within the GCP ecosystem.

Astronomer

Without assurances about the support model and the level of support presence in APAC, I couldn't recommend Astronomer as a candidate for our use case. However, if we were confident about the enterprise-level support, Astronomer would become a desirable option.

AWS MWAA

Despite the convenience of an AWS-managed service offering, MWAA feels like a rushed-to-market offering, particularly given the limited/workaround customisations, infrequent releases and poor setup experience. See my previous blog post on the setup experience:

If AWS can provide more frequent version releases, resolve the executor type limitations and improve its general user guide documentation(!), it would go a long way towards making AWS MWAA an attractive proposition.

Self-managed Airflow using K8s

Pros:

  • Complete flexibility to design the Airflow build you need
  • Complete flexibility of installation — no limitations, re: what you can & can't configure
  • Cloud-native — Kubernetes itself is a cloud-agnostic technology
  • Executor flexibility — no restriction on the choice of executor you wish to use

Cons:

  • Requires you to have platform engineers with K8s expertise (and bandwidth!)

A self-managed Airflow environment guarantees that the environment can be customised and built per your needs. However, this comes with the caveat that your team needs access to platform engineers with K8s expertise.

5. Recommendation

Ultimately, the Airflow managed-service space still has room for improvement:

  • Astronomer is a strong contender but needs more of a presence in APAC and greater assurances around its enterprise support model.
  • If you're not already a GCP user, Cloud Composer isn't a compelling enough reason to move over (I'm not sure you can justify the prerequisite cloud networking/security/RBAC setup required). This is especially true if you're already an AWS user, given both offerings limit you to just using the Celery executor.
  • AWS MWAA is a pretty 'no frills' option, more suitable for those who want a basic Airflow install on AWS (to stay on AWS). AWS MWAA needs to offer more flexibility in its solution and more frequent Airflow version releases (i.e., not just v1.10, 2.0.2 & 2.2.2). In addition, AWS MWAA constrains you to only being able to use the Celery executor.

If you have Kubernetes expertise in-house, I'd recommend going for the non-managed install. This results in the following:

  • Having the design spec you require for your workload
  • A non-compromised solution, re: setup and configuration
  • A cloud-native solution

This recommendation comes with two disclaimers:

  1. You need to have access to K8s expertise. Engineers with these skills are typically scarce and in demand, so you also need guarantees about their availability.
  2. Your Kubernetes developers will require some understanding of Airflow concepts to build the K8s cluster you need (and to use the correct executor).

I hope this all helps. Astronomer could also be a great option on paper, but I'm keen to see/hear about more use cases in APAC.

Paul Fry
Geek Culture

Welsh data architect, based in Dublin. Certified in dbt, Airflow, Snowflake & AWS