Load Balance Fivetran Syncs with Apache Airflow

Christina Taylor
4 min read · Nov 5, 2023


We use Fivetran to ELT a variety of customer data sources (think their sales, product usage, support ticketing, and call data) into our data lake, perform aggregation and enrichment, then serve the transformed data to customer-facing data applications. When we first onboarded 300 Fivetran connectors into production, we spent more money loading raw data (dark red) than processing it (dark blue). Airflow enabled us to load balance Fivetran connectors and reduce our ELT cost by three quarters.

What is Airflow?

Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb as a solution to manage the company’s increasingly complex workflows.

Why Airflow?

Airflow is not the only orchestration platform in the modern data stack. Overall, Airflow wins on:

- Strong open source Apache Airflow community
- Many managed versions to choose from
- Wide range of Operators
- Easy to extend with custom operators

Airflow's weaknesses are:

- Does not version control pipelines
- Does not solve data lineage
- Does not work for streaming jobs
- Difficult to test locally

A Unique Use Case

We want to use an orchestration tool for data connectors to better control replication sync schedules, load balance compute resources, and centralize error handling. This orchestration service has strong scalability, observability, and financial impact: we can prevent resource spikes and over-provisioning, as the classic SQL endpoints used to write data do not scale down well.

Problems to Solve

Take Control over Fivetran Sync

We use Fivetran to ELT customer Salesforce data into our data lake. There are over 500 connectors sharing a single SQL endpoint today. The out-of-the-box sync schedules can only be set to every x minutes/hours, with a Fivetran-managed offset. We have neither control nor insight into how effective this is. We want to set the effective schedule ourselves, as sketched below.

CloudWatch: Unpredictable spikes and poor utilization of clusters

https://fivetran.com/docs/getting-started/syncoverview#syncfrequencyandscheduling
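Here is a minimal sketch of taking over the schedule, assuming the community Fivetran provider (airflow-provider-fivetran), an Airflow connection named fivetran_default, and a placeholder connector id. The connector itself is switched to manual sync in Fivetran so it only runs when Airflow triggers it.

```python
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="fivetran_salesforce_sync",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 */2 * * *",  # we decide when syncs run, not Fivetran
    catchup=False,
) as dag:
    # Trigger the sync on our own schedule; the connector is set to manual
    # sync in Fivetran so it only runs when Airflow says so.
    trigger_sync = FivetranOperator(
        task_id="trigger_salesforce_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="salesforce_connector_id",  # hypothetical connector id
    )

    # Block downstream work until the sync actually finishes.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_salesforce_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="salesforce_connector_id",
        poke_interval=60,
    )

    trigger_sync >> wait_for_sync
```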

Scale Up/Down Compute Resources

When a large number of Fivetran connectors are onboarded, or when there is a large influx of incoming data, the endpoint needs to be scaled up. We can use the autoscaling feature of DB SQL, but it has limitations: since we could not control data spikes, the cluster did not have much opportunity to scale down. We want to group connectors and process data one batch at a time, resulting in well load-balanced compute resources.

We want flat lines, and fewer spikes.
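To get those flat lines, a DAG can chain groups of connectors so only one batch syncs at a time. The sketch below assumes the same Fivetran provider and connection as above; the group names and connector ids are made up.

```python
from datetime import datetime

from airflow import DAG
from airflow.utils.task_group import TaskGroup
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

# Hypothetical batches; in practice, group connectors by size or priority.
CONNECTOR_GROUPS = {
    "group_1": ["connector_a", "connector_b"],
    "group_2": ["connector_c", "connector_d"],
}

with DAG(
    dag_id="fivetran_load_balanced_sync",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    previous_group = None
    for group_name, connector_ids in CONNECTOR_GROUPS.items():
        with TaskGroup(group_id=group_name) as group:
            for connector_id in connector_ids:
                sync = FivetranOperator(
                    task_id=f"sync_{connector_id}",
                    fivetran_conn_id="fivetran_default",
                    connector_id=connector_id,
                )
                wait = FivetranSensor(
                    task_id=f"wait_{connector_id}",
                    fivetran_conn_id="fivetran_default",
                    connector_id=connector_id,
                    poke_interval=60,
                )
                sync >> wait
        # Chain the batches so group N+1 only starts after group N completes,
        # keeping the SQL endpoint's load flat instead of spiky.
        if previous_group:
            previous_group >> group
        previous_group = group
```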

Manage End to End Dependencies

Currently, we rely on historical data to determine when to move on to the next group of connectors, when all the Fivetran syncs complete, and when to run downstream Databricks jobs. We want to use Deferrable Operators and, on sync success, start the data processing pipelines. This will also give us more ability to answer questions such as how long it takes for specific data to appear in the customer-facing data application.
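As a sketch of the end-to-end dependency, the DAG below waits on the sync with a deferrable sensor and then kicks off a Databricks job. It assumes the async Fivetran provider (airflow-provider-fivetran-async) and the Databricks provider; the connector id and job id are placeholders, and the sensor class name varies by provider version.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from fivetran_provider_async.sensors import FivetranSensorAsync

with DAG(
    dag_id="fivetran_to_databricks",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered after the sync DAG completes
    catchup=False,
) as dag:
    # A deferrable sensor frees the worker slot while waiting for the sync.
    wait_for_sync = FivetranSensorAsync(
        task_id="wait_for_salesforce_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="salesforce_connector_id",  # hypothetical connector id
        poke_interval=60,
    )

    # Start the transformation job only once all raw data has landed.
    run_transformations = DatabricksRunNowOperator(
        task_id="run_enrichment_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # hypothetical Databricks job id
    )

    wait_for_sync >> run_transformations
```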

Automate the Restart of Connectors

API and temporary connectivity failures are common in connector development. Before Airflow, our connectors had to be manually started and stopped. We discovered we could use the /test endpoint to revive connectors, leaving only customer authentication issues that require human intervention.
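Below is a rough sketch of the restart helper, assuming Fivetran's REST API with basic auth (API key and secret); in practice we would wrap it in a PythonOperator or a failure callback rather than call it by hand.

```python
import requests

FIVETRAN_API = "https://api.fivetran.com/v1"


def revive_connector(connector_id: str, api_key: str, api_secret: str) -> None:
    """Re-run the connector's setup tests, which re-establishes connectivity
    after transient API or network failures."""
    response = requests.post(
        f"{FIVETRAN_API}/connectors/{connector_id}/test",
        auth=(api_key, api_secret),  # Fivetran uses basic auth with key/secret
        timeout=60,
    )
    response.raise_for_status()
```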

Airflow Components

Overall, Airflow’s architecture includes many components:

- Scheduler (main orchestration)
- Webserver (UI interface)
- Metadata database
- Worker nodes
- Executor (choice of celery vs k8s)
- Message broker + Celery + Flower if choosing Celery executor
- Container registry + Kubernetes Pods if choosing K8s Operator
- DAG/Log storage buckets on S3
- Shared volume such as EFS for sharing DAGs across workers and airflow.cfg
- Monitoring (to prevent your metadata database and worker nodes from exceeding their compute capacity and storage)
- Optionally registering a domain for the UI
- Auth layer for the UI and access to the metadata database

Airflow Choice — MWAA

There are four common deployment choices for Airflow. (And yes, I have used all of them!) We chose Amazon Managed Workflows for Apache Airflow (MWAA) because it offers a good balance between owning our infrastructure and outsourcing resource management. With MWAA, it is easier to manage the many Airflow components in production. At the same time, Airflow lives in our own cloud environment without a dependency on another SaaS/PaaS vendor, i.e. Astronomer. The pricing is fairly predictable and typically half of the Astronomer platform cost.

https://aws.amazon.com/managed-workflows-for-apache-airflow/pricing/

Risks

While Airflow is open source, MWAA is a managed version of Airflow, so not all features and functionality available in open source Airflow are available in MWAA.

Python dependencies prove to be tricky. Treat the requirements and constraints files carefully, and test them with a local runner before deploying.
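As an illustration, a requirements.txt for MWAA can pin providers against Airflow's published constraints file so that transitive dependencies stay compatible; the versions below are placeholders, not a recommendation.

```
# Constraints file matching the MWAA Airflow and Python versions (illustrative)
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"

# Provider pins are examples only; pin whatever your environment actually uses
airflow-provider-fivetran==1.1.4
apache-airflow-providers-databricks==4.3.0
```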

References

Nomenclature

What is a SQL endpoint?
Databricks SQL (DB SQL) is a compute environment that allows SQL developers to do ETL, analytics, and dashboarding work. It is commonly used to create an enterprise data warehouse experience. There are three types of SQL endpoints:

Classic and Pro: Use EC2 instances in our AWS environment.
Serverless: Uses Databricks serverless technology.

What is AWS CloudWatch?
Amazon CloudWatch is a monitoring and management service that provides insight into applications and infrastructure resources.

Airflow Deployment Choices

Open Source: https://airflow.apache.org/
Amazon Managed Workflow: https://aws.amazon.com/managed-workflows-for-apache-airflow/
Google Cloud Composer: https://cloud.google.com/composer
Astronomer: https://www.astronomer.io/


Christina Taylor

I am a seasoned data engineer passionate about open source, open format, distributed computing, scalable pipelines, and multi-cloud architecture.