Keeping up with Airflow releases

Adilson Mendonca
GumGum Tech Blog

TL;DR

Since 2019, Apache Airflow has shipped more than ten releases a year from its GitHub repository. When you are maintaining a stable working environment from development to production, keeping up with all these releases can be challenging and time-consuming. In our experience, it is essential not to delay updates: staying current gives you access to the latest features, improves your workflows, and keeps you out of the deprecation traps that appear when releases are far apart from each other. Finding a suitable balance for updating the Apache Airflow environment is crucial to staying up to date with the latest innovations, features, libraries, and security updates.

The beginning

I have been working with Apache Airflow since its incubation period at the Apache Software Foundation. The project has grown substantially since then, with considerable community support and contributions. As a consequence of the features and methodologies added along the way, the core of Airflow is quite different from what it was in its early days.

The community’s efforts have resulted in a significant amount of code, both new lines and a substantial amount of churn. That code needs to be released at a certain cadence, which requires careful management. However, not everything always functions as intended, and that makes keeping up with Airflow releases and tags challenging. In the past, some releases were only days apart because severe bugs in the core forced a quick fix release. Nonetheless, the community collaborates effectively, testing, improving, and raising issues that are taken seriously by all members.

The old experiences

I still remember the early days of using Airflow, when the documentation was not as comprehensive as it is today; back then, the code was everything. On certain occasions, we had to deploy from a specific commit hash to overcome issues with the release available at the time. There were also instances when I had to patch a few lines of code to ensure our deployment worked in production. All of this happened because code standardization and test coverage best practices were still being applied to Apache Airflow during its incubation, and it could take considerable time for GitHub PRs in the Apache repository to be reviewed, approved, and merged into the codebase.

We decided to patch the code, which was both a fun and technically challenging task. However, keeping track of future changes proved to be difficult while maintaining Airflow across multiple environments. Some of our engineering team members disagreed with this approach, but in the end, we had to keep moving to ensure an operational Airflow for processing data in the pipeline.

All of these experiences led us to a final question: which version of Python should we use? On one hand, there was Python 2.7, which some of our dependencies still required because they had not been ported to Python 3. On the other hand, there were libraries that only worked on Python 3. Resolving this at the time was complex, but we managed to isolate the Python 2.7 workload behind a shell script and inside the beloved Python virtual environment. We adopted Python 3 as our target version from that point forward, patiently waiting for Google and others to release Python 3 versions of some of their libraries, a process that took some time.

Major releases

Transitioning from Airflow version 1 to 2 was a substantial task. Certain features that worked in one version did not work seamlessly in the other, causing a few hiccups along the way. We chose to skip the database schema upgrade and start afresh, discarding the logs and any history, which we viewed as mere dead weight. The improved code structure greatly simplified release management, and the per-Python-version constraints files proved to be a lifesaver: they made handling Python packages in Airflow much easier for us.
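
For anyone who has not used them: the constraints files pin every transitive dependency for a given combination of Airflow and Python versions, so installs stay reproducible. A typical install looks something like this (the Airflow and Python versions below are only illustrative, so substitute your own):

    pip install "apache-airflow==2.6.1" \
      --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.1/constraints-3.10.txt"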

The new core and its features made version 2 a welcome addition to the data processing stack early on. However, subsequent database upgrades and the numerous deprecations across the version 2 releases presented some challenges.

Minor releases

Every Airflow release inevitably brings bug fixes and new features. While the current release may resolve an issue or bug, it can also introduce a new bug or instability in certain areas of the product. During one of our upgrades, the ‘airflow tasks test’ command stopped working entirely, and the issue was only addressed in the second or third minor release that followed. As a result, we had to tolerate a broken feature for a while. Fortunately, it only affected development, not production.
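
For context, that command runs a single task instance locally, without checking dependencies and without recording state in the metadata database, which is exactly why losing it hurt development rather than production. The DAG ID, task ID, and date below are just placeholders:

    airflow tasks test my_dag my_task 2023-05-01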

Airflow as a service

Airflow as a service, offered by the major cloud providers and other companies, helps with managing Airflow versions, libraries, and packages. But (yep, there’s always a but) we often find ourselves needing something that isn’t present in the standard platform, be it a different version of a library or an additional feature.

At Playground XYZ, we hit an intriguing problem with a specific version of Google Cloud Composer, which did not behave as expected with a Google Airflow operator we needed. Fortunately, the product allowed us to pin the provider package in question to a different version. That adjustment did the trick, letting us run the version we needed for our workload and make use of Airflow’s dynamic task mapping.
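
To give a flavour of why we wanted task mapping at all, here is a minimal sketch of a dynamically mapped task, written against a recent Airflow 2.x release (dynamic task mapping itself arrived in 2.3). The DAG name and file list are invented for the example:

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
    def mapped_example():
        @task
        def list_files():
            # in a real pipeline this list would come from GCS, a database, an API, etc.
            return ["a.csv", "b.csv", "c.csv"]

        @task
        def process(path: str):
            # one mapped task instance is created per file at runtime
            print(f"processing {path}")

        process.expand(path=list_files())

    mapped_example()

At runtime Airflow expands process into one task instance per element returned by list_files, so the amount of parallel work follows the data instead of being hard-coded in the DAG.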

We also installed extra packages to extend Airflow for various workloads. That kind of flexibility proved advantageous in overcoming library management challenges.
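
On a managed service such as Composer this is done through the environment’s PyPI packages settings rather than a direct pip install; on a self-managed deployment the equivalent is simply pinning the extra provider you need, along the lines of the following (the provider and version number are illustrative):

    pip install "apache-airflow-providers-google==10.1.0"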

Upgrade/update cadence

Apache Airflow releases arrive roughly every five weeks, including minor releases and patches. Each release brings numerous changes, and updating Apache Airflow for every one of them can be time-consuming, almost a full-time job once you account for side effects, deprecated libraries, and other issues. It is crucial, however, to find a rhythm that keeps the team that uses and develops the workflows healthy and happy.

It’s not feasible to indefinitely lock a specific version of Apache Airflow in a data platform. The introduction of new features, stability improvements, bug fixes, and enhancements are highly beneficial for developers, as these updates can enhance their efficiency in developing workflows. They’re also advantageous for infrastructure and operations, as they provide better monitoring and process management.

Updates and upgrades are also crucial for maintaining security. Outdated libraries can harbor significant issues, potentially leading to exploitable vulnerabilities. For security reasons alone, upgrading your Apache Airflow should be on the roadmap, just like any other system, to prevent potential problems.

A good cadence for updating and upgrading your platform depends on a blend of factors: opportunities, development requirements, business workload, DevOps workload, roadmap priorities, team size, technical capabilities, and so on. Establishing that cadence can be tricky, since each upgrade is time-consuming and challenging. Updating two to four times a year is feasible for some teams; at the very least, once or twice a year keeps you clear of a significant upheaval in platform changes, configuration, and feature deprecation.

Learnings/Conclusion

Like any product, Apache Airflow has undergone an extensive process to enhance development workflows, standardize the code, and establish an architecture that is flexible, stable, scalable, and configurable. Plenty of exceptional features have been developed for widespread use across a range of workloads and platforms. However, there have been numerous bug fixes and library version mismatches over time that required attention, and such issues will continue to occur as the product evolves.

The most challenging aspect of dealing with Apache Airflow is managing Python library versions. Because Airflow is intertwined with so many libraries, dependency issues inevitably crop up. However, Apache Airflow’s current design and architecture is a winner here thanks to its flexibility: you can simply override the dependency versions pinned in the original release to deal with such conflicts. That flexibility, carried over into Airflow as a service, makes updating and upgrading the platform even easier. Managing Python libraries is already a headache, let alone the additional complexities of Kubernetes, databases, instances, and OS versions, depending on where Apache Airflow runs. Its flexibility is extensive enough that it operates across a very wide range of platforms.

I certainly don’t miss the old days of patching code or using specific commit hashes to deploy a platform that served as a core data processing layer.
