Today, the Apache Software Foundation (ASF) welcomed Apache Airflow, a popular open-source workflow scheduling platform, to its ranks as its 200th active Top-Level Project (TLP). This caps a 2.5+ year journey through the Apache Incubator. This milestone could only have been achieved through the tireless efforts of a community of users, contributors, maintainers, and PMC members dedicated to improving the lives of fellow data scientists, data engineers, and ML/AI engineers who need to manage complex workflows.
Apache Airflow, if you are unfamiliar with it, is a workflow, or DAG (Directed Acyclic Graph), orchestration system that allows users to author workflows in Python. This “DAG-as-code” paradigm was pioneered by Spotify with the advent of Luigi, which brought the power and goodness of software development best practices (e.g. version control, peer-reviewed code, CI/CD) to the world of workflow management.
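To give a flavor of the “DAG-as-code” idea, here is a minimal sketch in plain standard-library Python. This is illustrative only, not Airflow’s or Luigi’s API: the task names are made up, and a real scheduler does far more, but the core notion is the same: the dependency graph lives in ordinary, reviewable code.

```python
# A minimal sketch of "DAG-as-code" (illustrative only -- plain
# standard-library Python, not Airflow's API). Each task maps to
# the set of tasks it depends on; hypothetical task names.
from graphlib import TopologicalSorter

pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "train_model": {"transform"},
    "publish_report": {"transform"},
}

# Because the DAG is code, a scheduler can derive a valid
# execution order directly from the definition above.
order = list(TopologicalSorter(pipeline).static_order())
```

Since the pipeline is just code, it can be version-controlled, peer-reviewed, and tested like any other software artifact, which is precisely the appeal of the paradigm.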
Apache Airflow is the brainchild of Maxime Beauchemin, an engineer from Airbnb who now calls Lyft home. In the summer of 2015, I found myself seated in the audience of Max’s talk at Hadoop Summit. As Agari’s Data Architect, I was in dire need of a cloud-friendly and developer-friendly workflow solution to manage our predictive batch data pipelines. As an ex-LinkedIn engineer, I was familiar with both Azkaban and Apache Oozie. Both of those frameworks, while mature, relied on config files (e.g. XML) to bundle dependent code together, and for workflows of reasonable complexity, they made managing DAGs very cumbersome.
Luigi, while both mature and supporting “DAGs-as-code”, didn’t offer the attractive UI that Apache Airflow did. Airflow’s beautiful and intuitive UI, an engineer’s first introduction to Apache Airflow, is a key reason for its popularity and rapid adoption.
Airflow’s path to Apache
In the fall of 2015, as more companies adopted Airflow, Maxime found himself burning the candle at both ends to keep up with new bug reports and a growing stream of feature requests. It was clear that Max was nearing burnout. With 30 companies depending on Airflow for critical business needs, it was essential that we scale the project beyond the resources of a single company, namely Airbnb.
Airbnb, at the time, was new to the Apache Way and had not yet signed over any of its software to the ASF. After a few emails with Max and others at the company, Airbnb was bought in: joining the ASF would boost Airbnb’s tech brand by attracting engineers interested in making lasting, widely impactful software contributions, and it would safeguard other companies using Airflow from any personnel changes at Airbnb.
Fast-forward to March 2016: our incubation proposal was voted in, and the initial committers, with the help of mentors Jakob Homan and Hitesh Shah, set out to learn the vast codebase and add more integrity controls, all while supporting a growing user base.
Airflow by the Numbers
Over the past 2.5 years, we have added 9 committers and PMC members to round out our cadre of 17 committers/PMC members. We grew from 30 companies at the start of incubation to 234 companies officially using Airflow today. We added 600+ contributors and merged ~3k pull requests (a.k.a. PRs). We have active weekly participation on various mailing lists and Slack channels to the tune of 800+ people.
Today, Apache Airflow has grown in multiple dimensions. It supports 20+ hooks and 30+ operators that bind it to a wide range of third-party systems. It is the scheduler that underlies Google’s Cloud Composer service. It’s used for critical data movement and ETL needs at companies such as PayPal, my current employer. If you haven’t used it yet, we welcome your usage, feedback, and contributions. Come join the movement!