Ephemeral Environments for Apache Airflow

Hugo Sykes
Go City Engineering
Aug 2, 2023

What? Why? And how?

Typical use of Airflow

Development environments for Airflow are tricky. The necessary components for a working solution can be hard to assemble. In a previous post I walked through how to set up Airflow to run locally on a Kubernetes cluster using kind. In this article I will set out a different approach. We, in the data team here at Go City, have recently been migrating our pipelines from Hitachi Pentaho to Apache Airflow and have found that, for various reasons, having a local Airflow setup per developer just didn’t cut it. We needed more. With a little help from our Platform team and my rusty Golang skills, we set up something I’m rather proud of: ephemeral environments.

What?

Firstly, if you’re reading this article, you’re most likely familiar with Apache Airflow but here’s a refresher just in case. Apache Airflow is a workflow orchestration tool for writing, scheduling and monitoring workflows or pipelines encoded in Python as DAGs (or Directed Acyclic Graphs). It is an exceptionally flexible tool capable of orchestrating more or less anything that Python allows you to do. Here at Go City, we have many DAGs that take data from APIs, databases or our data warehouse and perform tasks with it: sending it out to third parties, making it available for reporting, et cetera.
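To make the DAG idea concrete, here is a toy sketch in plain Python (not Airflow itself) of tasks running in dependency order; in a real Airflow DAG the dependencies are declared in Python and the scheduler handles the ordering:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_pipeline(tasks: dict, deps: dict) -> list:
    """tasks: name -> callable; deps: name -> set of upstream task names.
    Runs each task only after its upstream tasks, like a scheduler would."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order

# A classic extract -> transform -> load chain.
order = run_pipeline(
    {"extract": lambda: "rows", "transform": lambda: "clean", "load": lambda: "done"},
    {"extract": set(), "transform": {"extract"}, "load": {"transform"}},
)
```

The task names and callables are purely illustrative; the point is the acyclic dependency graph that determines execution order.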

Ephemeral environment is a fancy way of saying a development environment that doesn’t last very long. In the case of Airflow, that means the usual concert of services (scheduler, webserver, etc.) exists only for a short time. In our specific case, that time is the lifetime of a feature branch in the git repository containing the code that defines our DAGs. The ephemeral environments themselves consist of Airflow services running in a development Kubernetes cluster, and each environment is “namespaced” so that it is identifiable as belonging to a particular branch or feature.
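As an illustration of that namespacing, branch names can be normalised into Kubernetes-safe identifiers. The prefix and length cap below are assumptions for the sketch, not Go City’s actual scheme:

```python
import re

def release_name(branch: str, prefix: str = "airflow") -> str:
    """Turn a git branch name into a Kubernetes-safe release name:
    lowercase, alphanumerics and hyphens only, bounded length."""
    slug = re.sub(r"[^a-z0-9-]+", "-", branch.lower()).strip("-")
    # Helm caps release names at 53 characters.
    return f"{prefix}-{slug}"[:53]
```

A branch like `feature/new-etl` then maps to a release (and namespace) of its own, distinct from every other branch’s environment.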

Why?

In the data team at Go City, we believe our development environments should be:

  • isolated - one developer’s work or testing should not affect another’s;
  • representative - environments should consistently look, feel and act like our production environment. Since each environment is built from identical configuration, we should avoid any “works on my machine” moments;
  • connected - each environment should connect to our real dev data warehouse and the rest of the AWS services we use.

With our ephemeral environments, each engineer’s development is kept separate while still allowing us to interact with the real AWS services we need for testing. Our individual Airflow setups behave just like our production installation. We have our own world in which to develop our DAGs without interference but with full connectivity.

There are several other benefits to ephemeral environments, to name but a few:

  • resource efficiency - environments are only running when they’re actively being worked on;
  • isolated configuration - testing application configuration won’t disturb other developers, as the changes are confined to one environment;
  • CI/CD - build an environment, run all your tests against it and destroy it.

How?

Consider the experience of a data engineer starting a new piece of work. A new feature is on the horizon. It turns out a DAG is needed for a new ETL process. How does the engineer spin up their environment? It’s as simple as pushing a new git branch. How is that possible?

It starts with the idea of a repository watchdog with a single task: responding when branches change in our git repository. If a new branch is pushed to the remote repository, it spins up a new environment; if a branch is deleted, it tears down the environment associated with that branch.

We did this using webhooks. GitHub makes it easy to set up rules whereby events are sent to a given URL whenever a change occurs in a repository.
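A minimal sketch of such a receiver, in Python rather than the Go used for repowatchdog. GitHub does send the event name in the `X-GitHub-Event` header, and its branch create/delete events carry `ref` and `ref_type` fields; the handler wiring is otherwise illustrative:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def branch_action(event: str, payload: dict):
    """Map a GitHub webhook event to ('create'|'delete', branch), or None."""
    if payload.get("ref_type") != "branch":
        return None  # ignore tag creation/deletion
    if event in ("create", "delete"):
        return (event, payload["ref"])
    return None

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        action = branch_action(self.headers.get("X-GitHub-Event", ""), payload)
        if action is not None:
            pass  # spin up or tear down the environment here
        self.send_response(204)
        self.end_headers()

# To run: HTTPServer(("", 8080), WebhookHandler).serve_forever()
```

A real deployment would also verify the webhook signature (the `X-Hub-Signature-256` header) before acting on the payload.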

The repowatchdog application is written in Golang and is deployed in our development Kubernetes environment. It clones the git repository containing all of our helm charts for Airflow and uses the branch name to name each release, so it knows which release to remove when a branch is deleted. It also instructs the git-sync sidecar containers to clone the particular branch for which the environment was spun up, so the engineer using it sees only their own changes.
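The helm side of that can be sketched as building commands from the branch name. The `dags.gitSync.branch` value follows the official Airflow Helm chart’s conventions, but the chart path and the release-doubling-as-namespace layout here are assumptions:

```python
import subprocess

def helm_command(action: str, branch: str) -> list[str]:
    """Build the helm invocation for creating or destroying a branch's
    environment. The release name doubles as the namespace."""
    release = "airflow-" + branch.replace("/", "-")
    if action == "create":
        return ["helm", "upgrade", "--install", release, "./charts/airflow",
                "--namespace", release, "--create-namespace",
                # point the git-sync sidecar at this branch
                "--set", f"dags.gitSync.branch={branch}"]
    return ["helm", "uninstall", release, "--namespace", release]

def apply(action: str, branch: str) -> None:
    subprocess.run(helm_command(action, branch), check=True)
```

Using `helm upgrade --install` makes environment creation idempotent: repeated pushes to the same branch simply upgrade the existing release rather than failing.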

In this way, we spin up a distinct development environment per feature. The application is fairly simple: a small webserver that receives the webhooks and responds with a few helm commands.

Conclusion

I thought the cost of setting up ephemeral environments would outweigh the benefits but, given how easy it turned out to be, I was wrong. The most difficult bit was wrangling the Golang, which I’ve used only sparingly. Ultimately, that choice was unnecessary: I could have written it in Python and it would have taken less time. You live and you learn.

I hope this article has been a helpful guide to ephemeral environments. They are fantastic when they work well, giving the freedom to try things out without affecting the rest of the team’s work. I hope you give them a go!
