Why we use Cloud Composer

Benefits and costs of using Airflow in a cloud-native environment

François Blanchard
Tinyclues Vision
8 min read · Dec 15, 2022

Apache Airflow has long been a top choice for developers to manage and orchestrate complex data workflows in a structured way. Google Cloud Composer brings Airflow to the cloud and offers it in a fully-managed cloud-native environment.

What advantages does this bring to developers, and what are the potential downsides?

At Tinyclues, we’ve long been using Apache Airflow to orchestrate scheduled data jobs and client-facing applications, all within a multi-tenant setup. The following article builds on our experience moving Airflow workflows to GCP and how it enabled us to streamline our Airflow use, save resources, and monitor workflows more effectively.

Resource Provisioning in Airflow

Workflows in Airflow are organized into Directed Acyclic Graphs (DAGs) that run either on a pre-defined schedule or are triggered by an external event. Airflow allows all DAGs to be created, modified, and monitored in a unified way and run on shared resources. But what resources do you need to ensure all workflows are executed without ever incurring significant delays?

Different types of triggers for Airflow DAGs

In a multi-tenant setup like Tinyclues, each client’s setup will require multiple DAGs, which can be triggered either by a scheduler (data loading and transformation) or by an event. Events, in turn, include manual triggers (backfills and other data jobs) as well as client input when interacting with the platform.

Airflow UI displaying DAGs in a multi-tenant setup with multiple DAGs per client

Each time a DAG is triggered, no matter if via a scheduler or an event, it is added to a queue of tasks to be executed by a particular number of workers.

If you are only dealing with scheduled data jobs, workflows can easily be spread out over time and run during off-peak hours to avoid unnecessary resource use. It would likely be enough to start up a worker at midnight, let it run for a couple of hours, and shut it down in the morning. Even if jobs started to queue up, it wouldn’t be an issue as they are not time-sensitive.

But things are different when it comes to client-facing jobs. First, you want to guarantee that your clients can access all platform functionality around the clock. Thus, some workers (at least one) should be up and running at all times, regardless of whether any scheduled jobs are running.

More importantly, however, if a user requests an action on the platform, it should be executed instantaneously to keep clients satisfied, which implies that it should never be queued behind countless other jobs.

You could, of course, mitigate this problem by provisioning additional workers during working hours to account for the increased use of your platform, but this is unlikely to be effective once your platform is used by hundreds of companies in various time zones, each with individual working hours and holiday schedules.

Clearly, you can never fully predict when users will be using the platform and will likely have to permanently over-provision workers to cover peak demand.

Instant Scalability with Cloud Composer

Let’s now look at how you can use Google Cloud Composer to scale your environment to the current demand automatically.

Being a fully managed service, Cloud Composer can auto-provision workers to match the current workload. That is, users of Cloud Composer can set a threshold for the job queue of a given worker, after which a new one will be added.

This means that in off-times with low demand, it will be enough to keep only one worker running to handle hundreds of client setups, while during peak hours, dozens of workers could jump in almost instantaneously.
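The decision rule behind this kind of queue-based autoscaling can be sketched in a few lines of Python. Note that the threshold and worker range below are illustrative values we chose for the example, not Composer defaults.

```python
import math


def workers_needed(queued_tasks: int, per_worker_threshold: int,
                   min_workers: int, max_workers: int) -> int:
    """Queue-based autoscaling sketch: add workers until each worker's
    share of the queue drops below the threshold, staying within the
    user-configured [min_workers, max_workers] range."""
    if queued_tasks == 0:
        target = min_workers
    else:
        target = math.ceil(queued_tasks / per_worker_threshold)
    return max(min_workers, min(max_workers, target))


# Off-peak: a handful of tasks are handled by a single worker.
print(workers_needed(3, per_worker_threshold=16, min_workers=1, max_workers=12))    # → 1

# Peak hours: a burst of client requests fans out to many workers.
print(workers_needed(150, per_worker_threshold=16, min_workers=1, max_workers=12))  # → 10

# Extreme burst: the configured maximum caps the scale-out.
print(workers_needed(500, per_worker_threshold=16, min_workers=1, max_workers=12))  # → 12
```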

As a result, the tasks queue remains under control, and delays on the client side are avoided without provisioning excessive resources and incurring unnecessary costs.

Even better, the configuration setup is extremely straightforward in the Cloud Composer UI. Users pick the server configuration and set a range for the number of workers — that’s it!

Made for GCP

Cloud Composer’s advantages are by no means limited to dynamic scaling, so let’s discuss three more benefits that stem from its integration with the broader GCP ecosystem.

Integration with Cloud Log Explorer

All logs produced by Airflow are automatically tagged and categorized in a unified UI, making it easy to browse through them and spot any potential irregularities.

Additionally, logs are pushed into GCP’s Cloud Log Explorer, where they can be queried in the same manner as logs produced by other GCP services, which massively simplifies any deeper analysis.
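For example, a Cloud Logging filter along the following lines narrows the view to errors from a single Composer environment (the environment name is a placeholder):

```
resource.type="cloud_composer_environment"
resource.labels.environment_name="my-environment"
severity>=ERROR
```

The same filter syntax works across all GCP services, so Composer logs can be joined with, say, GKE or networking logs in a single query.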

Native Monitoring UI

Cloud Composer further provides a native monitoring UI, which gives you an overview of your system in one glance.

Health monitoring view
Example screenshot from a real Composer environment

Statistics about the number of completed runs and tasks are easily available via pre-built dashboards, which can be customized to your needs.

Not only does this simplify monitoring and eliminate the need to build custom data dashboards, but it will also lower the barrier to getting started with Cloud Composer when you first set up the service.

Integration with GCP IAM

Speaking of onboarding, Cloud Composer also integrates with Google’s Identity and Access Management tools, allowing you to add team members and set up custom roles easily.

We believe that this has two distinct benefits. First, it can simplify onboarding, as access to Composer can be granted centrally from GCP IAM alongside access to other services. Second, especially for larger organizations, it is easy to grant access to specific features within Composer. For example, some users can be given full access while others are restricted to analytics and monitoring.

Room for Improvement

Before migrating your workflows to Cloud Composer, here are some things you should consider.

First released in 2021, Cloud Composer 2 is still a relatively young service, and it relies heavily on Google Kubernetes Engine, specifically GKE Autopilot, for dynamic scaling.

This means that if you happen to encounter any bugs (which is always possible, especially for a young service), it might not be immediately clear which of the two teams, Composer or GKE, to consult to resolve the issue. As such, certain problems can take a little longer to clear up, even if Google Customer Engineers of both teams are always quick to respond.

This coordination across teams might also be the reason that once a new version of Airflow is released, you typically do not immediately have access to it, and it could take several weeks before its features are ported over to Cloud Composer. So, before you rush to build on the newest Airflow feature, you might want to double-check that it is indeed already available and plan your dev schedule accordingly.

Naturally, as the adoption of Cloud Composer increases and Google dedicates more resources to the project, we expect these issues to disappear gradually.

Fully Managed ≠ Fully Configured

As noted above, Composer does a marvelous job of scaling workers according to current demand. That does not mean, however, that there is no configuration left to do at all.

While the number of workers scales dynamically within a set range, the number of schedulers still has to be set in advance, alongside the machine configuration for schedulers, workers, and the web server.

Getting this right might take some trial and error, but if you have hosted Airflow without Cloud Composer before, you likely won’t have too hard a time doing so.

Upgrades

That said, if there is one thing we would wish for in upcoming versions of Cloud Composer, it would be improved handling of upgrades once an update is released. Looking at the Composer environment configuration, you will notice an upgrade button that could give you the illusion of some kind of magical upgrade process that smoothly transitions from one version to another.

The reality is somewhat more pragmatic. Once you hit upgrade, your Airflow environment will stop accepting new tasks and finish up its job queue until it shuts down, installs the upgrade, and restarts. There will inevitably be some downtime, which could last between a few seconds and a few minutes.

Is that an issue? That will depend on your application. For DAGs triggered by schedulers and manual triggers, there is no problem at all. For those triggered by external events such as user input, you will have to accept some downtime, warn users beforehand, and schedule the maintenance outside working hours.

Most likely, it’ll be fine. Yet, as the adoption of Cloud Composer increases, the Google team might want to work its magic and find a way to smooth out this process so that upgrades become near-instantaneous.

Conclusion

Cloud Composer can help you significantly cut the time spent managing Airflow, prevent bugs, and save resources. In particular, its autoscaling of workers means you don’t have to bother with Kubernetes cluster management, while its added features allow for easier monitoring, logging and access control.

That said, if you want to keep full control over the underlying Kubernetes cluster, have access to all Airflow upgrades from day one, and want to custom-manage your upgrade cycle to guarantee zero-downtime, you might want to wait for a more mature version of the service before porting over your organization’s workflows.

For us at Tinyclues, the benefits have long outweighed the costs, and we are excited to see what else Composer might have to offer in the future.
