Airflow Evolution at Snap

Published in Apache Airflow · 14 min read · Jan 18, 2024

Yuri Desyatnik, Zhengyi Liu, Han Gan, Nanxi Chen, Jun Gao

Introduction

There comes a time in every tech company's history when a large chunk of the old has to be replaced with something new. The changing landscape, which includes developments in engineering, infrastructure, security and regulations, forces internal company requirements to change as well, stretching legacy systems close to their breaking point. Snap, Inc. did not escape this destiny with Apache Airflow, the open source service behind the majority of Snap's data orchestration use cases. Airflow also served as a jump-off point for a number of other data and ML technologies, which meant even more applications depended on this crucial service. Let's dive into the story of Airflow at Snap and describe how we transformed this service into a performant, scalable and secure enterprise-grade data-orchestration ecosystem.

It's useful to provide the reader with a brief overview of Airflow's history at Snap, but first, a quick introduction to our company. We are the parent company of the mobile app Snapchat, which around 750 million users across the world use to talk to their close friends and family. We continue to invest heavily in augmented reality, where users try on sneakers and clothes and express themselves with funny face lenses. Another interesting product we recently launched is MyAI, a Gen AI-powered chatbot. We believe it is the largest consumer chatbot available today, with 150 million users who have sent over 10 billion messages.

Snap uses Airflow to power the business needs of all our products. We have over 3,000 DAGs that run over 330 thousand task instances every day, covering ETL, reporting/analytics, some ML workloads and more. Snap runs in a hybrid cloud environment, where 200+ operators are needed to support our services and use cases, and we have 1,000+ active Airflow users. To see Snap's evolution with Airflow, let's rewind back a few years. Before 2016, most of our data workflows were scheduled on GKE and Google App Engine crons. In 2016, we set up a single Airflow deployment and started building DAGs on it. As the platform expanded, our single cluster became provisioned with too many permissions, exposing a huge security risk. So in 2018, we decided to break it down into multiple clusters, each with its own Airflow deployment. Within each cluster, we implemented a task-level permission model to add more granular security, but this did not land well with engineering teams. The fleet soon grew from 10 to 50+ clusters, which was difficult to support and maintain, and managing task-level permissions profoundly degraded developer experience. We realized that our next attempt at improving Airflow had to be done right, with little room for error.

Airflow timeline at Snap

Data Orchestration V2

When our engineering leaders converged on the decision to uplift Airflow, our team was given a clear set of requirements to follow:

  1. Improve usability and developer experience
  2. Minimize security risks to the organization
  3. Create a data orchestration system that scales to 10x the current workloads
  4. Improve maintainability and supportability
  5. Future-proof data orchestration technology for the next 5–10 years

To create a solid foundation for this new platform, we had to select and deploy an architectural design that addressed all five requirements above. The original single-tenant cluster design was adequate for Snap's more basic early needs, but as the diversity and number of workloads increased over time, this design became strained and overloaded. It also did not provide the means to isolate access or segregate workloads, which became a security risk. At the time, the most expedient way out of this situation was to replicate the existing solution once for each team that used Airflow. Ultimately, this resulted in over 60 isolated Kubernetes clusters running Airflow 1.10. This proliferation of Airflow environments became a maintenance nightmare for our team, and the multiple-cluster structure no longer added value as the organization changed over time. As teams were renamed, merged with other teams, or eliminated completely, the ossified list of Airflow clusters did not change to follow the organization: new teams shared old clusters to run their DAGs, renamed teams used clusters with their old team names, and many DAGs continued to run with no team owners. We also couldn't discover new DAGs as they were created. This lack of visibility was both a maintenance and a security problem, since we had to build and maintain special services that monitored the cluster fleet for changes, which added to the complexity of the stack. This design no longer worked for the company as it had evolved.

Airflow architecture before our changes

Our solution to meet the new requirements was one multi-tenant Kubernetes cluster, configured to run in our latest managed service mesh environment. Kubernetes scalability allowed us to schedule many thousands of tasks in parallel, satisfying our scaling requirement. The Kubernetes Executor provides pod-level customization and makes task-level isolation possible, effectively extending our access control benefits down to the runtime level. This ultimately allowed us to create functional multi-tenancy in a single-cluster environment by integrating our in-house role-based access control solution into the UI and backend of the Airflow system, and extending it to pod-level resource access. In summary, the overall structure was greatly simplified while providing improved end-to-end isolation for access and execution. As a testament to this design improvement, the number of tenants more than doubled after all migrations were completed.
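To make this pod-level customization concrete, here is a minimal sketch of an Airflow 2 DAG whose task overrides its pod spec through the Kubernetes Executor's executor_config. The DAG ID, Kubernetes service account name and resource values are illustrative placeholders, not our production configuration.

```python
# Minimal sketch: per-task pod customization under the Kubernetes Executor.
# All names and resource values below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

pod_override = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        # Hypothetical Kubernetes Service Account derived from the DAG ID.
        service_account_name="example-reporting-dag",
        containers=[
            k8s.V1Container(
                name="base",  # the task container created by the Kubernetes Executor
                resources=k8s.V1ResourceRequirements(
                    requests={"cpu": "500m", "memory": "1Gi"},
                ),
            )
        ],
    )
)

with DAG(
    dag_id="example_reporting_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="build_report",
        python_callable=lambda: print("running in an isolated pod"),
        executor_config={"pod_override": pod_override},
    )
```

Because every task runs in its own pod, the service account attached to that pod, rather than a cluster-wide identity, determines what the task can reach, which is what makes the per-DAG isolation described above possible.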

Airflow architecture after our changes

Our Airflow infrastructure story would not be complete without describing the new development environment changes that we introduced together with the new platform. Originally, teams would spin up Docker Desktop to start a local (laptop) instance of Airflow to test new DAGs or changes to existing DAGs. This would quickly max out their laptop's compute capacity, and their tests would take a long time, slowing down development velocity. When their laptop fans sounded like jet turbines, other work had to be paused until the tests completed. The team created a remote development cluster to liberate engineering laptops from this condition, which was another clear improvement to developer experience. Additionally, local Airflow testing required engineers to use their personal credentials to access DAG resources, which violated our security best practices; the remote instance of Airflow uses the same security model as production, providing the same level of isolation and least privilege for resource access. Lastly, local testing required a slightly different code path and runtime environment, which occasionally resulted in inconsistent behavior when DAGs were deployed in production. Our remote development environment leverages Skaffold, an open source project from Google that makes Kubernetes CI/CD fast and easy, for faster development iteration: local code changes are automatically synced to the remote instance within a few seconds.
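To give a rough sense of how this file sync works, a Skaffold configuration along these lines copies changed DAG files straight into the remote deployment instead of rebuilding the image. The image name, paths, manifest and schema version below are placeholders; this is a sketch, not our actual configuration.

```yaml
# Illustrative skaffold.yaml: sync local DAG edits into a remote Airflow dev deployment.
# apiVersion may differ depending on the Skaffold release in use.
apiVersion: skaffold/v4beta6
kind: Config
build:
  artifacts:
    - image: example-registry/airflow-dev
      context: .
      sync:
        manual:
          # Copy changed DAG files into the running container without a rebuild.
          - src: "dags/**/*.py"
            dest: /opt/airflow/dags
manifests:
  rawYaml:
    - k8s/airflow-dev.yaml
deploy:
  kubectl: {}
```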

Remote development and backfill environment

Airflow and Access

One of the key requirements for our new Airflow system was to reduce security risk, and access control needed to be addressed. We started with an environment riddled with over-provisioned access, where employees could manage any DAG in the Airflow environment and DAGs had access to resources they did not need. This section describes how we automated and secured DAG access management. We created a stand-alone Job Access Manager service with deep Airflow integration, which allowed us to decouple resource ownership from employees by removing legacy service account administration. After a new DAG is created, users leverage the Job Access Manager's UI to define a job profile, which includes details such as the DAG ID, contact information, workload staging environment and authorized users. This triggers the creation of an associated service account that is seamlessly linked to the DAG. Permission management follows a request-approval process, where users request all required permissions for their jobs; after these access requests are approved, the service account is granted the permissions automatically. There is always one service account per DAG, which is a reasonable balance between usability, maintainability and security. We adopted the one-service-account-per-DAG model because task-level access segmentation was too granular, adding significant maintenance and complexity overhead for a marginal improvement in security: the same team would have to manage hundreds of service accounts instead of a few dozen, and would spend valuable time waiting for hundreds of access requests to be approved instead of building our products. For very large DAGs, we asked our customers to break them down into smaller DAGs.
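To give a feel for what a job profile captures, here is a purely hypothetical example using the fields described above; the format and field names are invented for illustration and do not reflect JAM's real schema.

```yaml
# Hypothetical JAM job profile (field names and values are illustrative only).
dag_id: example_reporting_dag
staging_environment: prod
contacts:
  - data-eng-oncall@example.com
authorized_users:
  - alice@example.com
  - bob@example.com
# Permissions go through the request-approval flow before being granted
# to the DAG's dedicated service account.
requested_permissions:
  - resource: projects/example-project/datasets/reporting
    role: roles/bigquery.dataViewer
```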

Workload Isolation

To ensure that our security controls are maintained through the entire Airflow stack, we leveraged Kubernetes Workload Identity to extend security isolation to the task pod level. Workload Identity is the technology that allows us to seamlessly integrate Airflow with our service account management service, the Job Access Manager (JAM). All DAGs are executed within the same GKE cluster, and each DAG has a unique DAG ID. When a job profile is created in JAM, users must specify the DAG ID and the staging environment where this identity will be used. Based on this information, the Job Access Manager derives a Kubernetes Service Account (KSA) from the DAG ID. As a next step, JAM invokes the GKE API to create the KSA in the target GKE cluster, while simultaneously binding the KSA to the corresponding GCP service account, as shown in steps 2 and 3 of the diagram. When a DAG is scheduled for execution, its GKE containers are equipped with the KSA derived from the DAG ID. Because of the binding between the KSA and the GCP service account, DAGs are able to obtain GCP service account credentials and use them for resource access. This seamless integration allows us to achieve a more streamlined design and deeper integration with Airflow. Notably, there is no need for an explicit connection ID within Airflow, significantly reducing the risk of credential leakage through service account keys.
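Under the hood this follows the standard GKE Workload Identity pattern. While the exact automation lives inside JAM, the equivalent manual steps look roughly like the following; the project, namespace and account names are placeholders.

```bash
# 1. Create the Kubernetes Service Account derived from the DAG ID (placeholder names).
kubectl create serviceaccount example-reporting-dag --namespace airflow

# 2. Allow that KSA to impersonate the DAG's GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
    example-reporting-dag@example-project.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:example-project.svc.id.goog[airflow/example-reporting-dag]"

# 3. Annotate the KSA so pods running as it obtain the GCP service account's credentials.
kubectl annotate serviceaccount example-reporting-dag --namespace airflow \
    iam.gke.io/gcp-service-account=example-reporting-dag@example-project.iam.gserviceaccount.com
```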

Airflow and Job Access Manager workflow

Access in a Nutshell

Identity and permission management can be a significant challenge for engineers who are not well-versed in access security best practices. The primary design goal of the Job Access Manager was to simplify this process and provide a great user experience for DAG developers. In the Airflow UI, users have a convenient direct link on each DAG's status page, which redirects them to the job profile mentioned above. This single page also seamlessly interfaces with GCP IAM, AWS IAM, and internal and external services, allowing authorized users to manage permissions for the DAG without navigating between different systems. Because of this convenience, users in most cases do not need to know the service account associated with a DAG, which decouples service account access from employee access and strengthens secure access.

Thanks to this all-in-one IAM approach, we can simplify permission management by consolidating certain permissions into abstract roles. For instance, when a cross-project service account is used in a Dataflow job, multiple roles must be provisioned for the Dataflow service account and the Dataflow worker service account. We simplify this use case by offering a single role to the user and handling the complicated configuration in the background. We take a similar approach for Dataproc and Spark in other scenarios. This simplification is deeply appreciated by our Airflow users.
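As a hypothetical illustration of this consolidation (the schema is invented and the role list is indicative rather than exact), an abstract "Dataflow runner" role exposed to the user might expand into several underlying IAM bindings that are applied automatically in the background:

```yaml
# Hypothetical abstract-role mapping (illustrative only; not JAM's actual schema).
abstract_role: dataflow-runner
expands_to:
  - member: dag_service_account
    roles:
      - roles/dataflow.developer      # submit and manage Dataflow jobs
      - roles/iam.serviceAccountUser  # act as the cross-project worker service account
  - member: dataflow_worker_service_account
    roles:
      - roles/dataflow.worker         # run the job's workers
```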

User experience of Airflow access management

Access Trimming

User experience is undoubtedly important, but security holds equal significance. In the past, service accounts were shared among multiple DAGs, resulting in the accumulation of permissions, and certain super-powerful service accounts tended to be over-provisioned. The isolation of DAG identities on the new platform makes it possible to enforce the principle of least privilege, which means a job should possess only the minimum permissions necessary to fulfill its specific purpose.

Given the periodic nature of DAGs, if a permission remains unused after several DAG runs, it is highly probable that it is no longer required. As a result, we implemented an automated permission reduction system that looks into access audit logs. It eliminates the need to manually ascertain which permissions a DAG needs, and makes permission and security reviews of DAGs much more efficient.
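The core idea can be sketched in a few lines of Python. The data sources here are stubbed out for illustration; in practice they would be backed by IAM policy lookups and audit-log queries, and removals would still go through review.

```python
# Illustrative sketch of audit-log-based permission trimming (not our production tooling).
from datetime import timedelta

LOOKBACK = timedelta(days=30)  # long enough to cover several DAG runs


def granted_permissions(service_account: str) -> set[str]:
    """Stub: permissions currently granted to the DAG's service account."""
    return {"bigquery.jobs.create", "bigquery.tables.getData", "storage.objects.delete"}


def used_permissions(service_account: str, lookback: timedelta) -> set[str]:
    """Stub: permissions observed in access audit logs within the lookback window."""
    return {"bigquery.jobs.create", "bigquery.tables.getData"}


def permissions_to_trim(service_account: str) -> set[str]:
    """Permissions that are granted but never exercised recently are trim candidates."""
    return granted_permissions(service_account) - used_permissions(service_account, LOOKBACK)


if __name__ == "__main__":
    # In this stubbed example, storage.objects.delete would be flagged for removal.
    print(permissions_to_trim("example-dag@example-project.iam.gserviceaccount.com"))
```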

Roles per principal reduction due to access trimming

Role Based Access Control

Controlling service account access is just one part of the Airflow security equation. The other component is managing employee access to the platform. This is achieved with our in-house Role-Based Access Control (RBAC) system, which is integrated into both the Airflow UI and the Job Access Manager to synchronize DAG control permissions with the corresponding DAG service account stewardship. Essentially, RBAC extends Airflow's multi-tenancy into the area of employee access. When an employee is a member of an IAM access group configured in RBAC, they have the privileges to control their team's DAGs in the Airflow UI and to request access to new resources for the service accounts of their team's DAGs. Even when those service account permissions are granted, the employees do not have direct access to the underlying resources, which is the goal of this design: the decoupling of human and machine access.

RBAC relationship between UI access and service account permissions

CI/CD Security

With all of the above controls deployed to lock down the approaches to Airflow and its data, we then turned our attention inward, to the DAGs themselves. Many security vulnerabilities can be introduced through DAG Python code: hard-coded secrets, container escape vulnerabilities and inline connection parameters are just some examples of undesired content in these DAGs. Our static analysis tool performs checks at three points in the CI/CD pipeline. First, the tool analyzes DAG code at branch commit. After other checks and peer code review, the tool makes another pass on the commit to the main branch. Finally, there is a daily main branch check in case a DAG was modified through other deployment methods. As new potential vulnerabilities are added to our database, the static analysis tool checks for them as well.
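As a toy illustration of the kind of checks such a tool can run (the patterns, paths and exit behavior below are invented for this sketch, not our actual scanner), a minimal pre-merge script might flag obvious hard-coded credentials and inline connection strings in DAG files:

```python
# Minimal sketch of a DAG static-analysis check for hard-coded credentials.
import re
import sys
from pathlib import Path

SUSPICIOUS = [
    re.compile(r"(?i)(password|secret|api[_-]?key)\s*=\s*['\"][^'\"]+['\"]"),
    re.compile(r"BEGIN (RSA |EC )?PRIVATE KEY"),
    re.compile(r"(?i)(postgresql|mysql)://\w+:[^@\s]+@"),  # credentials embedded in URIs
]


def scan(dag_dir: str) -> list[str]:
    findings = []
    for path in Path(dag_dir).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if any(pattern.search(line) for pattern in SUSPICIOUS):
                findings.append(f"{path}:{lineno}: possible hard-coded credential")
    return findings


if __name__ == "__main__":
    issues = scan(sys.argv[1] if len(sys.argv) > 1 else "dags")
    print("\n".join(issues))
    sys.exit(1 if issues else 0)  # fail the CI step when anything suspicious is found
```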

DAG SDLC with vulnerability static analysis

Migration

After our team created the new Airflow environment, we had to think about how to move over a hundred teams with thousands of DAGs from legacy Airflow to the new system. These old DAGs were written in Python 2 and were running in production with different access profiles and numerous cross-dependencies on other V1 DAGs. Engineering teams were wary of migrating to the new platform, since some of them had difficult experiences with similar migrations in the past. The way to an engineer's heart is to give them usability improvements and to save them time. Fortunately for us, Airflow V2 offered major usability improvements over the previous version, with a nicer UI, better diagnostic features and a faster scheduler, to name a few. Despite this appeal, it still took work to migrate the DAGs. Our team launched Airflow V2 together with several migration tools, which included a Python 2 to Python 3 conversion tool and Job Access Manager functionality to automatically convert the old over-provisioned service accounts to new DAG-specific credentials in Airflow V2. Another migration challenge was operator availability: Airflow V1 DAGs utilized over 200 different operators that stitched together many different systems and technologies. We launched Airflow V2 with new secure operators that covered 80% of the DAGs requiring migration.

Our goals for the migration were a good user experience, strong support for our customers, and a seamless migration with minimal production interruptions. To achieve these objectives, we defined the following process:

  1. Freeze the V1 DAG from any further code changes.
  2. Run our flowcli command, which creates a new DAG in Python 3 with most of the changes necessary for migration.
  3. Run our Job Access Manager tool, which extracts the access permissions of the V1 DAG's service account(s) and creates a dedicated service account for the new DAG with trimmed permissions.
  4. Test the new DAG on our remote Airflow instance to make sure that it works and that the render comparison matches.
  5. Create a PR, have it approved and merge it to the main branch.
  6. Turn off the V1 DAG and turn on the V2 DAG.

As we moved more and more customers to the new Airflow, we continually sought ways to accelerate and improve the migration experience. One of our engineering colleagues, Jun Gao, invented a clever method for fast-tracking the migration of DAGs with simpler structures and operators. We called this new process and tooling "auto-gen DAGs". One example of these simplifications is that the DAGs are "flattened": for loops that generate tasks are converted into static tasks. The process involves three Python scripts and Airflow V2 DAG task code templates. The first two scripts collect the V1 DAG metadata, while the V2 RBAC policy creator script creates the RBAC policy for the DAG. The V2 Code Generator then ingests the metadata and task code templates to produce the V2 DAG files. The final product is a working DAG provisioned with all the permissions and access policy needed to run correctly from the start. This approach works well for simpler DAGs without much custom logic, and was applicable to around 40% of the DAGs we had to migrate, shortening migration time from 40 minutes to 5 minutes on average.
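To illustrate the flattening step, here is a simplified before-and-after; the DAG, tables and callable are placeholders rather than real Snap pipelines, and the "before" version is shown as comments.

```python
# Illustrative example of flattening a loop-generated DAG into static tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def export_table(table: str) -> None:
    print(f"exporting {table}")


with DAG(dag_id="example_exports", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # Before: tasks generated dynamically in a loop.
    # for table in ["users", "events", "sessions"]:
    #     PythonOperator(task_id=f"export_{table}", python_callable=export_table,
    #                    op_kwargs={"table": table})

    # After (auto-generated): the loop is unrolled into static, individually named tasks,
    # which a code generator can emit directly from templates.
    export_users = PythonOperator(
        task_id="export_users", python_callable=export_table, op_kwargs={"table": "users"}
    )
    export_events = PythonOperator(
        task_id="export_events", python_callable=export_table, op_kwargs={"table": "events"}
    )
    export_sessions = PythonOperator(
        task_id="export_sessions", python_callable=export_table, op_kwargs={"table": "sessions"}
    )
```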

Three Takeaways

To summarize our project, there are three main areas where we performed a significant uplift of the Airflow platform at Snap. In infrastructure, we converted many independent clusters into one multi-tenant cluster, and we created a dedicated remote server for DAG testing and backfill operations. In the security domain, we deployed RBAC both in front of the Airflow UI and to control DAG service account access; to maximize DAG execution isolation, we enforce one service account per DAG, which is mapped to the Kubernetes pod workload identity; and we enabled static analysis of DAG code for security vulnerabilities. For DAG migrations, we focused on customer experience and maximum automation, maintained flexibility in our approach for different customers by taking into account their constraints and the current business environment, and obtained executive support in case of pushback. Through all these efforts, we were able to address all of our requirements by transforming Airflow into a performant, scalable and secure enterprise-grade system, which will be a strong foundation for data and ML technologies at Snap in the future.
