Airbyte — Getting Started with Building Extract & Load Pipelines

Introduction to no-code data pipelines with PostgreSQL, BigQuery, and the OpenWeather API

Ihor Lukianov
Towards Data Engineering
7 min read · May 27, 2024


As a data engineer, dealing with ETL processes can be daunting at the start of your career. It’s a key part of our job to create and manage pipelines for transferring data from one source to another. I’m sure you’d want to make this process as convenient as possible.

Data sources generally fall into categories such as databases, APIs, analytics systems like Google Analytics, and so on. People all over the world perform similar operations and keep rewriting the same code. Wouldn’t it be great to have a standardized tool with pre-built connectors that lets you adjust the workflow to your needs?

While there are many options on the market, I’d like to highlight an open-source solution, Airbyte. It’s a user-friendly tool that lets you work in a no-code way, and it comes with 350+ connectors, which is enough for most cases. Today, I want to focus on its core functionality and create some Extract and Load pipelines.

Setting up

As mentioned earlier, Airbyte is an open-source tool, so you can run it locally or deploy it on your own cloud infrastructure. For testing purposes, let’s run Airbyte locally with Docker.

Fortunately, the development team has already provided all the necessary resources. All we need to do is run the following commands to clone the repository and start Airbyte with Docker Compose:

# clone Airbyte from GitHub
git clone --depth=1 https://github.com/airbytehq/airbyte.git

# switch into Airbyte directory
cd airbyte

# start Airbyte
./run-ab-platform.sh

Once that’s done, you’ll find Airbyte running at http://localhost:8000/. By default, the username and password are airbyte and password.
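If you’d rather check the deployment from code than from the browser, a quick request against the local API does the trick. Below is a minimal sketch, assuming the default basic-auth credentials and the standard health endpoint of Airbyte’s configuration API:

# a minimal health check against the local Airbyte instance
# assumes the default basic-auth credentials (airbyte / password)
# and the standard /api/v1/health endpoint of the configuration API
import requests

resp = requests.get(
    "http://localhost:8000/api/v1/health",
    auth=("airbyte", "password"),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # should report that the server is available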

Our local Airbyte is ready to build connections

Building EL pipelines in Airbyte

Introduction to Airbyte concepts

Let’s begin by understanding some key Airbyte concepts before we dive into building pipelines. These concepts are essential to grasp for a better understanding of the processes that follow.

  • Source — Any data origin such as an API, database, file, etc.
  • Destination — The location where you want to store the data from the source, such as a data lake or data warehouse.
  • Connector — A pre-built component that pulls data from a source or pushes data to a destination.
  • Connection — An automated data pipeline used to replicate your data.
  • Record — A single entry of data, for example, a row in a SQL database, with fields serving as attributes of the record.
  • Stream — A group of related records, such as a table in a relational database or a resource in an API.

There are even more core concepts in Airbyte, but this is enough for us to start building our first connector.
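To make these terms a bit more concrete, here is a rough sketch of a single record as it travels through Airbyte, in the spirit of the Airbyte protocol (the stream and field names are purely illustrative):

# an illustrative record: it belongs to a stream ("users" here, e.g. a table in PostgreSQL),
# and its data is a JSON object whose keys are the record's fields
record_message = {
    "type": "RECORD",
    "record": {
        "stream": "users",                   # the stream this record belongs to
        "data": {"id": 1, "name": "Alice"},  # the record itself, one row of the table
        "emitted_at": 1716800000000,         # when the source emitted it (epoch millis)
    },
}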

Composing our first connector — PostgreSQL to BigQuery

Let’s start by setting up a simple connector between two databases: one as the source and the other as the destination. It’s typical to use PostgreSQL as the transactional data layer. As part of our analytics process, we’ll eventually need this data in our data warehouse, such as BigQuery. This will be our first example to work on.

Creating our connection

Airbyte offers a comprehensive guide for every step of the workflow. To turn our database into a source, we simply need to enter its details, and it will be ready to go. The PostgreSQL connector works great once you provide the relevant credentials.
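In practice, it’s a good idea to give Airbyte its own read-only database user instead of reusing an application account. Here is a rough sketch of what that could look like (the connection details and passwords are placeholders):

# a sketch: create a dedicated read-only PostgreSQL user for Airbyte
# (host, database name, and passwords below are placeholders)
import psycopg2

conn = psycopg2.connect(
    host="localhost", dbname="app_db", user="postgres", password="postgres"
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE USER airbyte WITH PASSWORD 'airbyte_password';")
    cur.execute("GRANT USAGE ON SCHEMA public TO airbyte;")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO airbyte;")
    cur.execute(
        "ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO airbyte;"
    )
conn.close()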

Setting up a BigQuery destination is a bit more complex, as it requires a service account key and the appropriate permissions. I won’t delve into the settings on the cloud side, but it’s important to prepare a separate Cloud Storage bucket, because Airbyte uses it as a staging area before loading data into BigQuery tables. It’s common for data warehouse systems to use object storage as a data ingestion layer.

To grant Airbyte access to Cloud Storage, we need an HMAC key, which can be created in the Cloud Storage settings (under the Interoperability tab).
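If you prefer to script this step, the google-cloud-storage client can create an HMAC key for a service account. A small sketch, with a placeholder project and service account email:

# a sketch: create an HMAC key for a service account with the google-cloud-storage client
# (project and service account email are placeholders)
from google.cloud import storage

client = storage.Client(project="my-gcp-project")
hmac_key, secret = client.create_hmac_key(
    service_account_email="airbyte-loader@my-gcp-project.iam.gserviceaccount.com"
)
print("Access ID:", hmac_key.access_id)  # goes into the BigQuery destination form in Airbyte
print("Secret:", secret)                 # shown only once, store it safely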

Use your new GCS Bucket and HMAC key for loading

You will also need a service account key in JSON format, which you create through Google Cloud’s IAM service. The documentation includes instructions on how to create a service account and download its key as a JSON file.

After successfully creating both the source and the destination, we can establish the connection between them. This is what our first stream looks like, with all fields synced to the BigQuery destination.

Stream in our first connection

We have multiple ways to write data to the destination, such as append and overwrite. I chose overwrite to test how Airbyte works, but you can always change this based on your specific use case. In Airbyte, these modes are referred to as sync modes, and the service provides four of them (a sketch of how a stream’s sync mode is configured follows the list):

  • Incremental Append + Deduped
  • Full Refresh Overwrite
  • Full Refresh Append
  • Incremental Append
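Under the hood, every stream in a connection carries its own sync settings. Roughly, a single stream’s configuration looks like the sketch below; the field names follow my understanding of Airbyte’s configuration API, and the stream name is illustrative:

# a rough sketch of one stream's sync configuration:
# syncMode controls how data is read from the source,
# destinationSyncMode controls how it is written to the destination
stream_config = {
    "stream": {"name": "users"},
    "config": {
        "syncMode": "full_refresh",          # or "incremental"
        "destinationSyncMode": "overwrite",  # or "append" / "append_dedup"
        "selected": True,                    # include this stream in the sync
    },
}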

Another great feature of Airbyte is its built-in scheduler. While you may need a separate scheduler like Airflow or Dagster for more complex dataflows, Airbyte’s scheduler may work just fine for smaller cases.

I’ve already triggered my first sync (you can do it with the Sync Now option) and obtained the initial results. As you can see, the job was created successfully, and data was transferred from the PostgreSQL database to BigQuery. In less than 10 minutes, you can have a ready-to-go Extract and Load workflow.
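The Sync Now button also has an API counterpart, so the same run can be triggered from code. A sketch against the local deployment, with a placeholder connection ID (you can copy yours from the connection’s URL in the UI); the endpoint and response shape follow my understanding of Airbyte’s configuration API:

# a sketch: trigger a manual sync through the local Airbyte API
# (the connection ID below is a placeholder -- copy yours from the UI)
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/connections/sync",
    auth=("airbyte", "password"),
    json={"connectionId": "00000000-0000-0000-0000-000000000000"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["job"]["status"])  # e.g. "running"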

Let’s review the BigQuery table to make sure all the data was loaded accurately. Airbyte not only loads the data but also adds some metadata columns, such as a record UUID and the extraction timestamp. Otherwise, my data was loaded without any alterations.
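You can peek at the result yourself with a quick query. Here is a sketch with the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and the exact metadata column names can differ between Airbyte versions:

# a sketch: preview the loaded table, including Airbyte's metadata columns
# (project/dataset/table names are placeholders; metadata column names vary by Airbyte version)
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
query = "SELECT * FROM `my-gcp-project.airbyte_dataset.users` LIMIT 5"
for row in client.query(query).result():
    print(dict(row.items()))  # includes columns such as the record UUID and extraction timestamp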

Testing with API as a data source

So far, we have successfully extracted and loaded data from one database to another. While a pre-built database connector already saves time, the real strength of Airbyte is its large set of API connectors, which can be set up just as quickly.

If you have experience extracting data from APIs, you know that the process is not the quickest. In many cases, the raw data simply needs to be loaded from the API without any transformation at this step, and Airbyte can save a lot of time here.

One option is the OpenWeather API, which allows data extraction for specific locations based on longitude and latitude. The only catch is that you need to obtain an API key from the service. The required pricing plan is One Call By Call, which provides 1,000 free API calls per day, but you will need to add a credit card. I can assure you that we won’t come close to this limit in our testing.
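For reference, this is roughly the request the connector makes on your behalf: the One Call endpoint takes a latitude, a longitude, and your API key. A sketch with placeholder coordinates and key:

# a sketch: a single One Call request, roughly what the Airbyte connector does for us
# (coordinates and API key are placeholders)
import requests

params = {
    "lat": 50.45,            # latitude of the location
    "lon": 30.52,            # longitude of the location
    "appid": "YOUR_API_KEY",
    "units": "metric",
}
resp = requests.get(
    "https://api.openweathermap.org/data/3.0/onecall", params=params, timeout=10
)
resp.raise_for_status()
print(resp.json()["current"])  # current weather for the given coordinates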

Connection to our source API is ready

And just like that, we quickly completed a simple API extraction. We have also created a new table in BigQuery through Airbyte, which contains updated weather forecast data. As you may have noticed, we defined only one location for our API call, so only one record was loaded to the BigQuery destination.

Connection performed correctly

Airbyte’s main drawback in API extraction workflows is that you are limited to the pre-built endpoints and parameters each connector exposes. Extracting data from complex sources may require numerous specific settings and additional parameters. While Airbyte is effective for standardized data pipelines, more intricate use cases still call for custom solutions.

Conclusions

In this article, we introduced an EL tool that enables the quick use of pre-built connectors and data pipelines. We discussed the major advantages and weaknesses of using such tools for automating data workflows. So, where can we go from here, and how can we improve our Airbyte experience?

  • The key is to combine Airbyte with other data engineering tools. Since Airbyte covers the EL part of the data pipeline, dbt might be a good choice for the transformation layer. If you’re interested in learning more about dbt, you can check out my article on it.
  • In this overview, we haven’t used any code, but you may want to recreate your connections later. It’s a good idea to combine Airbyte with Terraform for a complete code-based solution.
  • Additionally, when paired with an orchestration tool like Dagster, you can create really impressive automated solutions.

I hope this brief introduction helps you understand why you might need to use Airbyte or another automated EL tool and what you can build with it.

I can be found on LinkedIn and I am looking forward to connecting with you. Let’s engage in discussions about the intricate world of data science and data engineering.

