Streamlining ETL Workflows with Apache Airflow and Airbyte

Chris Hronek | Data Consultant
Published in Apache Airflow · May 28, 2024

Discover the seamless integration of Apache Airflow and Airbyte for efficient ETL, enhancing scalability and maintainability in data engineering.

Airflow + Airbyte = ❤️

After creating, inheriting, and reviewing several Airflow projects as a consultant, I’ve seen various use cases in companies that leverage Apache Airflow. One of the most common is traditional Extract, Transform, and Load (ETL) pipelines. This blog post will outline what an ETL pipeline accomplishes, the hard/easy ways of ETL’ing, and some additional considerations when orchestrating ETL with Apache Airflow.

What is ETL?

If you’re not yet familiar with ETL/ELT pipelines: they are the processes that move data from various sources into a centralized database (commonly called a Data Warehouse).

Why is ETL Important?

ETL pipelines are usually particular to a business’s tooling. For example, a company may use Braze for email marketing campaigns. Braze has all kinds of metrics available within its platform (e.g., Campaign Email Open Rate, Campaign Email Click-Through Rate, Campaign Email Unsubscribe Rate), but what if there’s a need to combine customer data with this information (e.g., customer age, revenue generated by the customer, products purchased by the customer)? Many of these data points live in systems outside of Braze. For example, they may exist in the company’s CRM tool (e.g., Salesforce, HubSpot), or the company may have an operational database containing this data. This is why many companies perform ETL/ELT: to create a centralized repository on which they can perform data analytics. Once data from all the various sources has been collected in that repository, it can be used to improve operational efficiency, predict business outcomes, measure results with dashboards and visualizations, and so on.

ETL - The Hard Way 😮‍💨

Cheesy DALL-E 3 Generated Image 🧀

Now that we’ve discussed why companies perform ETL operations, let’s discuss how they typically do it with Apache Airflow. Continuing the Braze example above, let’s pretend we’re using a Snowflake Data Warehouse.

It is very common for Airflow users and data engineers to write their own Python code to perform these ETL jobs. Specifically, this means that the Python code will:

  1. Extract the data from the Braze API into a Data Lake (e.g., an object storage service like S3) or directly into the Data Warehouse
  2. Transform the extracted data so that it is normalized
  3. Load the data into our tidy Snowflake Data Warehouse

In this scenario, the data typically flows like this: Braze API → Object Storage (like S3) → Snowflake.

In Apache Airflow, an HTTP provider will help you initialize a connection to the Braze API for the extract step. However, it is up to the developer to write the code that calls the actual endpoints and paginates through the results. And when that code breaks, or Braze updates one of its API endpoints, someone has to update that code.
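To make that maintenance burden concrete, here is a rough sketch of what such a hand-rolled extract often looks like. It is only an illustration: the endpoint path, pagination scheme, bucket name, and helper function are simplified assumptions rather than Braze’s exact API contract, and in practice the API key would come from a secrets backend.

import json

import boto3
import requests

BRAZE_BASE_URL = "https://rest.iad-01.braze.com"  # region-specific; placeholder
BRAZE_API_KEY = "<braze-api-key>"                 # pull from a secrets backend in practice
BUCKET = "my-data-lake"                           # illustrative S3 bucket name


def extract_campaigns_to_s3(ds: str) -> None:
    """Page through a Braze list endpoint and land the raw JSON in S3."""
    s3 = boto3.client("s3")
    page = 0
    while True:
        resp = requests.get(
            f"{BRAZE_BASE_URL}/campaigns/list",
            headers={"Authorization": f"Bearer {BRAZE_API_KEY}"},
            params={"page": page},
            timeout=30,
        )
        resp.raise_for_status()
        campaigns = resp.json().get("campaigns", [])
        if not campaigns:
            break  # no more pages
        s3.put_object(
            Bucket=BUCKET,
            Key=f"braze/campaigns/{ds}/page_{page}.json",
            Body=json.dumps(campaigns),
        )
        page += 1

Multiply that by every endpoint you need, every schema change Braze ships, and the retry and backfill logic a production pipeline requires, and the maintenance cost adds up quickly.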

ETL - The Easy Way 😎

Another cheesy DALL-E 3 Generated Image 🧀

As you’ve probably inferred from the title of this blog post, we found an easier way to handle our Extracts, one that made our existing pipelines far easier to maintain and new ETL pipelines much faster to build. It involved moving our Extracts into a low-code tool designed explicitly for extraction and using Apache Airflow to orchestrate that tool.

This low-code tool (for us) is Airbyte. There are a few tools like Airbyte, so let’s discuss why we chose it. We went with Airbyte because it:

  • ✅ is open-source (I’m a fan of that)
  • ✅ has an Airflow integration
  • ✅ is reasonably priced (compared to its competition)

We started by creating an Airbyte Cloud account, but if you prefer to self-host the open-source version, Airbyte can also be installed on a Kubernetes cluster or a VM. No matter which route you take for hosting, you can use the Airbyte UI to connect Braze and Snowflake from there. It took us less than an hour to create an account, connect Braze, and load data into Snowflake.

Creating the Braze Source

When creating the Source, Airbyte’s UI will ask you for the credentials to connect to the Source (Braze) and the endpoints from which you’d like to extract data.

Screenshot: finding the Braze Source connector
Screenshot: entering the Braze credentials (Source parameters)
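If you would rather script this step than click through the UI (for example, to keep connector definitions in version control), Airbyte also exposes a public API for creating Sources. The sketch below is only a rough illustration against the Airbyte Cloud API: the Braze configuration field names, the workspace ID, and the way you obtain an access token are assumptions you should verify against the API reference for your Airbyte version.

import requests

AIRBYTE_API_URL = "https://api.airbyte.com/v1"  # Airbyte Cloud public API
AIRBYTE_TOKEN = "<airbyte-access-token>"        # how you obtain this depends on your setup

payload = {
    "name": "Braze",
    "workspaceId": "<your-workspace-id>",
    "configuration": {                           # assumed field names for the Braze connector
        "sourceType": "braze",
        "url": "https://rest.iad-01.braze.com",  # your Braze REST endpoint
        "api_key": "<braze-api-key>",
        "start_date": "2023-01-01",
    },
}

resp = requests.post(
    f"{AIRBYTE_API_URL}/sources",
    headers={"Authorization": f"Bearer {AIRBYTE_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["sourceId"])  # handy if you script the Destination and Connection too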

Creating the Snowflake Destination

When creating the Destination, Airbyte’s UI will ask you for the credentials to connect to the Destination (Snowflake).

Screenshot: setting up the Snowflake Destination and entering the Snowflake credentials

Creating the Braze → Snowflake Connection

When creating the Connection, Airbyte’s UI will show you the list of endpoints available for the Source (Braze), and you can pick and choose what data you’d like to load into your Destination (Snowflake). Some things that are noteworthy about the process:

  • Airbyte has incremental loading pre-configured for popular data sources to help you save on Data Warehousing costs.
  • You can set a cron schedule for how frequently you’d like an Airbyte Sync to run. We set our schedules to Manual because we use Apache Airflow to orchestrate them (a sketch of what that external trigger looks like follows the screenshot below).

Screenshot: the Braze → Snowflake Airbyte Connection, selecting the data to be loaded
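Because the Connection’s schedule is set to Manual, every sync has to be kicked off by something external. Under the hood, that external trigger is just a call to Airbyte’s jobs endpoint, which is essentially what the Airflow operator in the next section wraps for us. A minimal sketch, again assuming the Airbyte Cloud public API and a valid access token:

import requests

resp = requests.post(
    "https://api.airbyte.com/v1/jobs",
    headers={"Authorization": "Bearer <airbyte-access-token>"},
    json={
        "connectionId": "<connection-id-from-airbyte-connection-url>",
        "jobType": "sync",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["jobId"])  # Airbyte reports the job's status as it runs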

Orchestrating the Airbyte Sync via Airflow

Now that we had created our Airbyte Connection, it was time to add a task to our Airflow DAG to run the sync as part of our pipeline:

from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_example",
    start_date=datetime(2023, 6, 20),
    schedule="@hourly",
):
    AirbyteTriggerSyncOperator(
        task_id="trigger_airbyte_sync",
        connection_id="<connection-id-from-airbyte-connection-url>",
        deferrable=True,  # frees up a worker slot while the sync runs
        # api_type="cloud",  # uncomment if you are using Airbyte Cloud
    )

This code assumes that you’ve already:

  • pip installed apache-airflow-providers-airbyte into your Airflow environment
  • created an airbyte_default connection in your Airflow Connections (see the Airflow documentation for the Airbyte connection type)

That’s it! It made a lot more sense to maintain a few lines of code versus an extensive Python script that paginates through all of the endpoints and extracts them. If I were doing this ETL manually with Python, I’d have had to:

  • 🫠 Write logic that performs a complete reset of the data
  • 🫠 Write logic that incrementally loads the data
  • 🫠 Write unit tests for the Extracts
  • 🫠 Document my code for my team

Instead of doing that, I can create the Connection in the Airbyte UI and add a few lines of code to my Airflow deployment. 🦸

Additionally, I can now set dependencies between the Airbyte sync and the other tasks that Airflow is orchestrating (e.g., SQL transformations, Data Science, Data Quality Checks, Data Cataloging, etc.).
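For example, a downstream SQL transformation can be declared right after the sync task. The snippet below is a sketch: the DAG name, the SQL file, and the snowflake_default connection are illustrative, and it assumes the Snowflake and Common SQL providers are installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="braze_elt_example",
    start_date=datetime(2023, 6, 20),
    schedule="@hourly",
):
    extract = AirbyteTriggerSyncOperator(
        task_id="trigger_airbyte_sync",
        connection_id="<connection-id-from-airbyte-connection-url>",
        deferrable=True,
    )

    transform = SQLExecuteQueryOperator(
        task_id="transform_braze_data",
        conn_id="snowflake_default",              # assumes a Snowflake Airflow connection
        sql="sql/transform_braze_campaigns.sql",  # illustrative SQL file in the project
    )

    extract >> transform  # the transformation only runs after the Airbyte sync succeeds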

Considerations

The last thing I’ll say here is that there will always be another data source your team will need to extract into your company’s data warehouse; most companies adopt several new tools every year. Using a tool like Airbyte makes the turnaround time for ETL’ing that data much quicker than writing your own extraction scripts.

However, as you explore Airbyte’s integration library, you may find that Airbyte lacks the necessary integration(s). The good news here is that you still have some options:

  1. Use Airbyte’s Connector Builder to build your own connector
  2. Airbyte is open-source, so you could always contribute the connector you need to their project.
  3. Fall back to using Apache Airflow to build unsupported Extracts.

Conclusion

Integrating tools like Apache Airflow and Airbyte significantly streamlines ETL processes. Airbyte simplifies the creation and maintenance of ETL pipelines, making the process more efficient and manageable. This integration not only speeds up the incorporation of new data sources but also enhances the overall scalability of your data ecosystem.

Utilizing low-code solutions within an Airflow framework is a powerful pattern for your data engineering workflows, enabling your data team to operate with greater agility and efficiency.
