Airflow: a workflow management platform
Airbnb is a fast-growing, data-informed company. Our data teams and data volume are growing quickly, and so is the complexity of the challenges we take on. Our growing workforce of data engineers, data scientists and analysts uses Airflow, a platform we built to let us move fast and keep our momentum as we author, monitor and retrofit data pipelines.
Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform.
DAGs are blooming
As people who work with data begin to automate their processes, they inevitably write batch jobs. These jobs need to run on a schedule, typically have a set of dependencies on other existing datasets, and have other jobs that depend on them. Throw a few data workers together for even a short amount of time and you quickly have a growing, complex graph of batch computation jobs. Now consider a fast-paced, medium-sized data team working for a few years on an evolving data infrastructure, and you have a massively complex network of computation jobs on your hands. This complexity can become a significant burden for the data teams to manage, or even comprehend.
These networks of jobs are typically DAGs (directed acyclic graphs) and have the following properties:
- Scheduled: each job should run at a certain scheduled interval
- Mission critical: if some of the jobs aren’t running, we are in trouble
- Evolving: as the company and the data team mature, so does the data processing
- Heterogeneous: the stack for modern analytics is changing quickly, and most companies run multiple systems that need to be glued together
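To make the DAG idea concrete, here is a minimal sketch in plain Python (standard library only, not Airflow code). The job names are hypothetical; the point is only that a dependency graph without cycles always has a valid run order, while a cycle cannot be scheduled at all.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "load_events": set(),
    "load_users": set(),
    "sessionize": {"load_events"},
    "engagement_metrics": {"sessionize", "load_users"},
    "email_targeting": {"engagement_metrics"},
}


def run_order(deps):
    """Return one valid execution order, upstream jobs first."""
    return list(TopologicalSorter(deps).static_order())


def is_dag(deps):
    """True when the dependency graph has no cycle."""
    try:
        run_order(deps)
        return True
    except CycleError:
        return False


order = run_order(dependencies)
print(order)

# Two jobs that depend on each other form a cycle, so no run order exists.
print(is_dag({"a": {"b"}, "b": {"a"}}))
```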
Every company has one (or many)
Workflow management has become such a common need that most companies have multiple ways of creating and scheduling jobs internally. There’s always the good old cron scheduler to get started, and many vendor packages ship with scheduling capabilities. The next step forward is to have scripts call other scripts, and that can work for a short period of time. Eventually simple frameworks emerge to solve problems like storing the status of jobs and dependencies.
Typically these solutions grow reactively as a response to the increasing need to schedule individual jobs, and usually because the current incarnation of the system doesn’t allow for simple scaling. Also note that people who write data pipelines typically are not software engineers, and their mission and competencies are centered around processing and analyzing data, not building workflow management systems.
Considering that internally grown workflow management systems are often at least one generation behind the company’s needs, the friction around authoring, scheduling and troubleshooting jobs creates massive inefficiencies and frustrations that divert data workers from their productive path.
After reviewing the open source solutions, and leveraging Airbnb employees’ insight about systems they had used in the past, we came to the conclusion that there wasn’t anything in the market that met our current and future needs. We decided to build a modern system to solve this problem properly. As the project progressed in development, we realized that we had an amazing opportunity to give back to the open source community that we rely so heavily upon. Therefore, we have decided to open source the project under the Apache license.
Here are some of the processes fueled by Airflow at Airbnb:
- Data warehousing: cleanse, organize, quality-check and publish data into our growing data warehouse
- Growth analytics: compute metrics around guest and host engagement as well as growth accounting
- Experimentation: compute our A/B testing experimentation framework’s logic and aggregates
- Email targeting: apply rules to target and engage our users through email campaigns
- Sessionization: compute clickstream and time spent datasets
- Search: compute search ranking related metrics
- Data infrastructure maintenance: database scrapes, folder cleanup, applying data retention policies, …
Much like English is the language of business, Python has firmly established itself as the language of data. Airflow is written in idiomatic Python from the ground up. The code base is extensible, documented, consistent, linted and has broad unit test coverage.
Pipeline authoring is also done in Python, which means dynamic pipeline generation from configuration files or any other source of metadata comes naturally. “Configuration as code” is a principle we stand by for this purpose. While YAML or JSON job configuration would allow for any language to be used to generate Airflow pipelines, we felt that some fluidity gets lost in the translation. Being able to introspect code (IPython!, IDEs), subclass, meta-program and import libraries to help write pipelines adds tremendous value. Note that it is still possible to author jobs in any language or markup, as long as you write Python that interprets these configurations.
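As a rough illustration of dynamic pipeline generation, the sketch below builds a small pipeline from a JSON configuration using plain Python. The `Task` class, the `build_pipeline` function and the configuration keys are all hypothetical stand-ins made up for this example, not Airflow's own classes; the idea is simply that a loop over metadata can produce tasks and wire up their dependencies.

```python
import json
from dataclasses import dataclass, field

# Hypothetical configuration; it could just as well come from a file or a database.
CONFIG = json.loads('{"tables": ["users", "listings", "bookings"]}')


@dataclass
class Task:
    """Stand-in for a pipeline node (not Airflow's own class)."""
    task_id: str
    upstream: list = field(default_factory=list)

    def set_upstream(self, other):
        """Record that this task must wait for `other`."""
        self.upstream.append(other.task_id)


def build_pipeline(config):
    """Create a load -> quality_check -> publish chain for each configured table."""
    tasks = {}
    for table in config["tables"]:
        load = Task(f"load_{table}")
        check = Task(f"quality_check_{table}")
        publish = Task(f"publish_{table}")
        check.set_upstream(load)
        publish.set_upstream(check)
        for task in (load, check, publish):
            tasks[task.task_id] = task
    return tasks


pipeline = build_pipeline(CONFIG)
for task in pipeline.values():
    print(task.task_id, "<-", task.upstream)
```

Adding a table to the configuration adds three tasks to the pipeline without touching the code, which is the kind of fluidity the paragraph above describes.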
While you can get up and running with Airflow in just a few commands, the complete architecture has the following components:
- The job definitions, in source control.
- A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your DAGs.
- A web application, to explore your DAG definitions, their dependencies, progress, metadata and logs. The web server is packaged with Airflow and is built on top of the Flask Python web framework.
- A metadata repository, typically a MySQL or Postgres database that Airflow uses to keep track of task and job statuses and other persistent information.
- An array of workers, running the jobs’ task instances in a distributed fashion.
- Scheduler processes that fire up the task instances that are ready to run.
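The interplay between the metadata repository and the scheduler can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Airflow's implementation: an in-memory SQLite table stands in for the MySQL or Postgres database, and the table name, state names and `ready_tasks` function are invented for illustration.

```python
import sqlite3

# Hypothetical dependency map: task -> upstream tasks it waits on.
UPSTREAM = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}


def make_metadata_db():
    """In-memory stand-in for the metadata repository."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_state (task_id TEXT PRIMARY KEY, state TEXT)")
    db.executemany(
        "INSERT INTO task_state VALUES (?, ?)",
        [(task_id, "pending") for task_id in UPSTREAM],
    )
    return db


def set_state(db, task_id, state):
    """Persist a new state for one task."""
    db.execute("UPDATE task_state SET state = ? WHERE task_id = ?", (state, task_id))


def ready_tasks(db):
    """Pending tasks whose upstream tasks have all succeeded."""
    states = dict(db.execute("SELECT task_id, state FROM task_state"))
    return sorted(
        task_id
        for task_id, state in states.items()
        if state == "pending"
        and all(states[up] == "success" for up in UPSTREAM[task_id])
    )


db = make_metadata_db()
first = ready_tasks(db)              # only "extract" has no unmet dependencies
set_state(db, "extract", "success")
second = ready_tasks(db)             # "transform" becomes runnable
print(first, second)
```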
While Airflow comes fully loaded with ways to interact with commonly used systems like Hive, Presto, MySQL, HDFS, Postgres and S3, and allows you to trigger arbitrary scripts, the base modules have been designed to be extended very easily.
Hooks are abstractions of external systems and share a homogeneous interface. Hooks use a centralized vault that abstracts host/port/login/password information and expose methods to interact with these systems.
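The hook idea can be sketched as follows. Everything here is an illustrative assumption rather than Airflow's API: the `Connection` record, the `CONNECTIONS` registry standing in for the centralized vault, and the `Hook` and `SqliteHook` classes with their `get_records` method. Callers only name a connection; credentials and connection details stay in one place.

```python
import sqlite3
from abc import ABC, abstractmethod
from contextlib import closing
from dataclasses import dataclass


@dataclass
class Connection:
    """Connection details kept out of pipeline code (hypothetical record)."""
    host: str
    port: int = 0
    login: str = ""
    password: str = ""


# Hypothetical central registry playing the role of the vault.
CONNECTIONS = {
    "warehouse": Connection(host=":memory:"),
}


class Hook(ABC):
    """Shared interface: every hook is built from a connection id."""

    def __init__(self, conn_id):
        self.conn = CONNECTIONS[conn_id]

    @abstractmethod
    def get_records(self, sql):
        """Run a query and return the resulting rows."""


class SqliteHook(Hook):
    """Illustrative hook for SQLite; the 'host' field holds the database path."""

    def get_records(self, sql):
        with closing(sqlite3.connect(self.conn.host)) as db:
            return db.execute(sql).fetchall()


rows = SqliteHook("warehouse").get_records("SELECT 1 + 1")
print(rows)
```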
Operators leverage hooks to generate a certain type of task that becomes a node in a workflow when instantiated. All operators derive from BaseOperator and inherit a rich set of attributes and methods. There are 3 main types of operators:
- Operators that perform an action, or tell another system to perform an action
- Transfer operators that move data from one system to another
- Sensors, a type of operator that keeps running until a certain criterion is met
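The three operator types above can be sketched in plain Python. Only the name BaseOperator comes from the description above; the `execute` and `poke` methods, the constructor arguments and the three subclasses are simplified assumptions for illustration, with Python lists standing in for external systems.

```python
import time


class BaseOperator:
    """Illustrative base class; the real BaseOperator carries far more."""

    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self):
        raise NotImplementedError


class AppendOperator(BaseOperator):
    """Action operator: performs an action itself (appends a row to a list)."""

    def __init__(self, task_id, target, row):
        super().__init__(task_id)
        self.target = target
        self.row = row

    def execute(self):
        self.target.append(self.row)


class ListTransferOperator(BaseOperator):
    """Transfer operator: copies every row from one 'system' to another."""

    def __init__(self, task_id, source, destination):
        super().__init__(task_id)
        self.source = source
        self.destination = destination

    def execute(self):
        self.destination.extend(self.source)


class MinRowsSensor(BaseOperator):
    """Sensor: keeps checking until the watched list holds enough rows."""

    def __init__(self, task_id, watched, min_rows, poke_interval=0.01, timeout=1.0):
        super().__init__(task_id)
        self.watched = watched
        self.min_rows = min_rows
        self.poke_interval = poke_interval
        self.timeout = timeout

    def poke(self):
        """Return True once the criterion is met."""
        return len(self.watched) >= self.min_rows

    def execute(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{self.task_id}: criterion not met in time")
            time.sleep(self.poke_interval)


staging, warehouse = [], []
AppendOperator("extract", staging, {"id": 1}).execute()
ListTransferOperator("transfer", staging, warehouse).execute()
MinRowsSensor("wait_for_rows", warehouse, min_rows=1).execute()
print(warehouse)
```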
Executors implement an interface that allows Airflow components (CLI, scheduler, web server) to run jobs remotely. Airflow currently ships with a SequentialExecutor (for testing purposes), a threaded LocalExecutor, and a CeleryExecutor that leverages Celery, an excellent asynchronous task queue based on distributed message passing. We are also planning on sharing a YarnExecutor in the near future.
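A minimal sketch of the executor idea, assuming a single hypothetical `run` method that takes a mapping of job names to callables. The class names echo the ones mentioned above, but the bodies are toy versions written for illustration and do not reflect how Airflow implements them.

```python
from concurrent.futures import ThreadPoolExecutor


class BaseExecutor:
    """Shared interface: callers hand over jobs without caring where they run."""

    def run(self, jobs):
        raise NotImplementedError


class SequentialExecutor(BaseExecutor):
    """Runs jobs one at a time in the calling thread; handy for testing."""

    def run(self, jobs):
        return {name: job() for name, job in jobs.items()}


class LocalExecutor(BaseExecutor):
    """Runs jobs concurrently on a local thread pool."""

    def __init__(self, parallelism=4):
        self.parallelism = parallelism

    def run(self, jobs):
        with ThreadPoolExecutor(max_workers=self.parallelism) as pool:
            futures = {name: pool.submit(job) for name, job in jobs.items()}
            return {name: future.result() for name, future in futures.items()}


# Hypothetical jobs: any zero-argument callable will do.
jobs = {"square": lambda: 7 * 7, "greet": lambda: "hello"}

sequential_results = SequentialExecutor().run(jobs)
local_results = LocalExecutor(parallelism=2).run(jobs)
print(sequential_results == local_results)
```

Because both classes honor the same interface, the calling code does not change when the execution strategy does.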
A Shiny UI
While Airflow exposes a rich command line interface, the best way to monitor and interact with workflows is through the web user interface. You can easily visualize your pipelines’ dependencies, see how they progress, get easy access to logs, view the related code, trigger tasks, fix false positives/negatives, analyze where time is spent, and get a comprehensive view of the time of day at which different tasks usually finish. The UI is also a place where some administrative functions are exposed: managing connections and pools, and pausing progress on specific DAGs.
To put a cherry on top of this, the UI serves a Data Profiling section that allows users to run SQL queries against the registered connections, browse through the result sets, and create and share simple charts. The charting application is a mashup of Highcharts, Flask Admin’s CRUD interface and Airflow’s hooks and macros libraries. URL parameters can be passed through to the SQL in your chart, and Airflow macros are available via Jinja templating. With these features, queries, result sets and charts can be easily created and shared by Airflow users.
As a result of using Airflow, the productivity and enthusiasm of people working with data have multiplied at Airbnb. Pipeline authoring has accelerated and the amount of time spent monitoring and troubleshooting has dropped significantly. More importantly, this platform allows people to execute at a higher level of abstraction, creating reusable building blocks as well as computation frameworks and services.
We’ve made it extremely easy to take a test drive of Airflow while powering through an enlightening tutorial. Rewarding results are a few shell commands away. Check out the quick start and tutorial sections of the Airflow documentation; you should be able to have an Airflow web application loaded with interactive examples in just a few minutes!
Check out all of our open source projects over at airbnb.io and follow us on Twitter: @AirbnbEng + @AirbnbData
Originally published at nerds.airbnb.com on June 2, 2015.