Transitioning from self-hosted to AWS managed workflows for Apache Airflow

Swapna Kategaru
affinityanswers-tech
7 min read · Dec 22, 2023
[DALL·E] AI generated image — Transition to cloud service

We’re a small group of data engineers, and our data workflows comprise ~30 Airflow DAGs that drive the core of our operations. As engineers, a worry naturally follows: what if a problem causes downtime, followed by delays or disruption to the scheduled workflows?

In this blog, we’re going to delve into how adopting a cloud-native strategy helped us tackle maintenance challenges as we shifted to MWAA (AWS Managed Workflows for Apache Airflow). We’ll focus on understanding what this change meant for our team and aim to provide insights on how we prepared to make this transition seamless and disruption-free.

Current architecture

Let’s begin by discussing the architecture of our existing self-hosted Airflow setup. While it has effectively supported our data processing operations, we explored MWAA to gain scalability and easier manageability.

Steps in workflow execution

Components in our architecture

It is composed of five core components -

  1. RabbitMQ - The first key component, acting as a message broker. It handles task queuing and inter-component communication, ensuring that tasks are queued efficiently.
  2. Worker - Responsible for executing tasks. It fetches queued tasks from RabbitMQ, processes them as per the defined workflows, and writes the status back to the central database.
  3. Scheduler - Monitors the state of tasks and schedules them based on dependencies and predefined schedules. It ensures that the workflow adheres to the planned sequence and timing.
  4. Web server - Provides a graphical interface through the browser for real-time monitoring.
  5. Postgres Database - The central repository for all workflow metadata, credentials, and state information, keeping track of workflow executions. The database is crucial for maintaining integrity and tracking the progress of the various tasks.

Challenges with Self-Hosted Airflow

Setting up Airflow alone isn’t the end of the effort; we need to revisit and manage it as the need arises. Let me highlight a few hurdles that can come up along the way -

  1. Maintenance and scalability issues
    Scheduler Overload During Peak Times
    RabbitMQ Bottlenecks
    Unexpected Traffic Surge
  2. Resource Management and Overhead
    Database Performance and Scaling
    Worker Resource Allocation
  3. Security Concerns
    Database and Webserver Security

Because many workflows run regularly, disk space can fill up; when that happens, the worker or the scheduler gets disturbed and fails to handle upcoming scheduled tasks. So, balancing all the components with the right amount of resources is crucial.

These challenges highlight the need for careful planning, resource allocation, and technical know-how when opting for a self-hosted Airflow solution. The effort of addressing and scaling around them prompted us to seek a solution that tackles these issues for us.

Here comes the most awaited part!!! Why we might prefer a cloud-based solution to avoid these complexities.

Transition to cloud service

Exploration on MWAA

With MWAA (AWS Managed Workflows for Apache Airflow), you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure.

Advantages of MWAA

Automatic Airflow setup & scaling
Built-in authentication & security
Streamlined upgrades and patches
Workflow monitoring & Integrations with services

Exploration on setup

Creating environment

To set up Airflow, we first create an Airflow environment through the AWS Console using the Managed Apache Airflow service. Creating an environment was a breeze, offering the flexibility to choose the Airflow version, S3 storage for DAG files, networking, scaling of resources for the Airflow components (scheduler, web server, worker), monitoring and logging, additional custom configuration, and permissions that fit our workflow needs.
All of these are customizable while creating the environment, as shown in the images below -

Step 1 : Specify mandatory details
Step 2 : Configuring advanced settings

“With this, our Airflow setup is complete.” How easy was that!!

S3 as Storage for airflow files

Now it’s time to add some DAGs. To store the DAG and supporting files for the Airflow environment, we use the S3 bucket dedicated to Airflow (the one selected while creating the environment), with the required access granted. The bucket needs three directories and follows this structure for picking up DAGs and installing requirements:

  1. Dags — All Python DAG files should be stored in the “dags” directory
  2. Plugins — If DAGs depend on custom plugins to run, those should be stored as a single zipped file (plugins.zip) in the same S3 bucket. More on plugins
  3. Requirements — Additional DAG dependencies to be installed are listed in a requirements text file.
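As an illustration of that layout, here’s a tiny stdlib-only sketch (the helper name and file names are our own, not an MWAA API) that maps a file in the Airflow repository onto the S3 key the environment reads:

```python
from pathlib import PurePosixPath
from typing import Optional

def mwaa_s3_key(local_path: str) -> Optional[str]:
    """Map a repo file to the S3 key MWAA expects, or None if MWAA ignores it.

    Hypothetical helper following the bucket layout described above:
    dags/ for DAG files, plugins.zip for custom plugins, requirements.txt
    for extra pip dependencies.
    """
    path = PurePosixPath(local_path)
    if path.parts and path.parts[0] == "dags" and path.suffix == ".py":
        return f"dags/{path.name}"       # Python DAG files go under dags/
    if path.name == "plugins.zip":
        return "plugins.zip"             # custom plugins, as one zip
    if path.name == "requirements.txt":
        return "requirements.txt"        # additional pip dependencies
    return None                          # everything else is not picked up
```

Anything outside these three locations (READMEs, tests, CI config) simply never reaches the environment.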

Monitoring & Logging :

MWAA uses Amazon CloudWatch by default to send the logs related to scheduling, DAG processing, task execution, the web server, and the workers. Here’s how the Apache Airflow logs are stored in CloudWatch.

Cloudwatch Logs for individual components
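As a sketch of how those per-component log groups are named (we assume the `airflow-<env>-<LogType>` pattern we observed; verify against your own account):

```python
# One CloudWatch log type per Airflow component, as surfaced by MWAA.
LOG_TYPES = ["DAGProcessing", "Scheduler", "Task", "WebServer", "Worker"]

def mwaa_log_groups(env_name: str) -> list:
    """Return the assumed CloudWatch log-group names for an MWAA environment.

    Assumption: groups follow "airflow-<environment-name>-<LogType>".
    """
    return [f"airflow-{env_name}-{log_type}" for log_type in LOG_TYPES]
```

This makes it easy to point dashboards or metric filters at each component’s stream.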

What are the underlying services used by MWAA ?

Besides cloudwatch for logging and S3 for storage, it uses the following AWS services -

  1. Simple Queue Service (SQS) - Queues up work for the workflows
  2. Elastic Container Registry (ECR) - Stores the container images that the Fargate-based workers and schedulers run
  3. Key Management Service (KMS) - Securely generates and manages the AWS encryption keys

So far, so good — our exploration journey was thoughtful and convincing enough to proceed with the start of migration phase.

Preparation for Migration

  1. Evaluating existing workflows and dependencies : Our self-hosted Airflow has many DAGs, each handled by its respective owner, so each DAG’s own dependencies have to be considered.
  2. Planning and strategizing the migration process : We planned the migration to preserve stability: week by week, a few DAGs are moved from the current Airflow to MWAA and monitored through their initial runs.
  3. Ensuring minimal downtime and workflow continuity : Moving a few DAGs at a time helps us avoid downtime.
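The week-by-week batching above can be sketched as a small helper (hypothetical; the batch size was simply our judgment call, not a requirement):

```python
def weekly_batches(dag_ids, per_week=5):
    """Split DAG ids into weekly migration batches.

    Illustrative sketch of the phased plan: each week one batch moves to
    MWAA and is monitored before the next batch follows.
    """
    dag_ids = sorted(dag_ids)  # deterministic order for the migration plan
    return [dag_ids[i:i + per_week] for i in range(0, len(dag_ids), per_week)]
```

With ~30 DAGs and five per week, the whole migration spans roughly six weeks, which matched the pace we wanted for monitoring.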

Exporting Variables & Connections

Our Airflow had lots of variables and connections in it, so there had to be an easy way to copy all the variables to the new environment. Here’s how we did it: by exporting all of them through the UI.

For connections, there is no direct way to export all of them at once, and even if we found one, it wouldn’t be ideal to use it. Airflow ships with ~40 default connections, and those defaults differ between the older Airflow version (1.X.X) we used and the newly configured version (2.6.3) in our case. So importing a full export would overwrite the new environment’s defaults, which isn’t ideal.
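One way to work around this, sketched below, is to filter an exported connections JSON down to only custom entries before importing. The default-connection list here is partial and purely illustrative:

```python
import json

# Partial, illustrative list of conn_ids Airflow ships by default;
# the real set depends on the Airflow version and installed providers.
DEFAULT_CONN_IDS = {
    "aws_default", "http_default", "postgres_default",
    "mysql_default", "sqlite_default",
}

def custom_connections(exported_json: str) -> dict:
    """Keep only our own connections from an exported JSON dump,
    so importing them never touches the new environment's defaults."""
    connections = json.loads(exported_json)
    return {conn_id: cfg for conn_id, cfg in connections.items()
            if conn_id not in DEFAULT_CONN_IDS}
```

The filtered dict can then be recreated in MWAA one connection at a time, leaving the version-specific defaults untouched.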

Enhancements and Best Practices

1. Deploy DAGs to S3 with bitbucket pipeline

We’ve also created an automated workflow that uploads a DAG file to the target S3 location whenever we commit it to the Airflow repository in Bitbucket.

Here’s the YAML script we used for the workflow:

image: atlassian/default-image:2

pipelines:
  default:
    - step:
        name: Deploy to S3
        deployment: production
        script:
          - |
            # Install and configure the AWS CLI
            apt-get update
            apt-get install -y awscli
            aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID"
            aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY"
            aws configure set region "us-west-2"
            aws configure set output json

            # Determine the commit ID
            commit_id=$(git rev-parse HEAD)

            # Use Git to list the files changed in the current commit
            changed_files=$(git diff-tree --no-commit-id --name-only -r "$commit_id")
            echo "Changed files in this commit: $changed_files"

            # Iterate through modified files and upload to S3 if in 'dags/' directory
            for file in $changed_files; do
              if [[ "$file" == "dags/"* ]]; then
                echo "Uploading $file to S3"
                aws s3 cp "$file" s3://<bucket>/dags/
              fi
            done

2. Development Environment for new DAGs

Since we configured DAG updates so that they can be viewed in the Airflow UI only after a commit, it isn’t possible to test a pipeline or inspect its structure visually beforehand. So we added a test environment using the MWAA local runner, where everyone can add and test DAGs, using it as a development environment.

What is mwaa-local-runner ?

This repository provides a command line interface (CLI) utility that replicates an Amazon Managed Workflows for Apache Airflow (MWAA) environment locally. This helps us keep the production environment free from unused DAGs.
Here’s the repository link — GitHub

The setup is quick and easy, as described in the repository README: build the Docker image, then spin up the container to start the Airflow environment, and the Airflow UI becomes accessible locally through your browser.

Conclusion

With AWS MWAA, we can alleviate the burdens of maintenance, enjoy seamless scalability, and enhance security, all while benefiting from the robust managed services that AWS provides. It’s about embracing a future where workflow management is more streamlined, more resilient, and more aligned with the evolving needs of businesses.

🙌 That’s a wrap, Airflow enthusiasts!!! May the data pipelines be ever efficient and scalable.
