Streaming Data from Multiple Sources Using AirByte (Part 1)

Keren Finkelstein
Published in Israeli Tech Radar
4 min read · Feb 13, 2023

Streaming data from multiple sources can lead to data inconsistencies, duplicates, and errors due to differences in format, structure, and definitions of data elements. Additionally, integrating data from multiple sources requires significant technical resources, including data mapping, transformation, and cleaning, which can add complexity to the process and increase the risk of errors. Maintaining data quality and ensuring that data is up-to-date and secure in a rapidly changing environment can also be challenging.

Airbyte is a data integration platform that streamlines the process of collecting and integrating data from multiple sources. It provides an easy-to-use solution that automates complex tasks, saving time and resources compared to manual methods. With Airbyte, users can connect to multiple sources, extract and transform data, and load it into a target warehouse, with the help of a simple UI and flexible, scalable architecture. The platform offers robust error handling, monitoring, and reporting features for data quality and reliability.

In this part, I will describe how to create a BigQuery destination on a self-hosted Airbyte instance.

Prerequisite

Deploy Airbyte on a GCP Compute Engine instance.

  • SSH into your VM instance
  • Install Docker on your VM instance
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg2 software-properties-common
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian buster stable"
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER
# Log out and back in so the new group membership takes effect
  • Install the Docker Compose plugin on your VM instance
sudo apt-get -y install docker-compose-plugin
docker compose version
  • In your VM terminal, install Airbyte
mkdir airbyte && cd airbyte
curl -sOOO https://raw.githubusercontent.com/airbytehq/airbyte-platform/main/{.env,flags.yml,docker-compose.yaml}
docker compose up -d
  • In your local terminal, create an SSH tunnel that forwards local port 8000 to Airbyte on the GCP instance (set the path to your SSH private key, your user, and the VM IP)
ssh -i <path_to_ssh_private_key> <user>@<vm_instance_ip> -L 8000:localhost:8000 -N -f
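
Once the tunnel is up, it can help to confirm that Airbyte actually answers on the forwarded port before moving on. The following is a minimal sketch, not part of the Airbyte tooling: `wait_for_airbyte` is a hypothetical helper name, and the default URL assumes the tunnel forwards local port 8000 as in the command above. Any HTTP response (including a login prompt) counts as reachable.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Arguments (all optional): URL, number of attempts, delay in seconds.
wait_for_airbyte() {
  url="${1:-http://localhost:8000}"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl exits 0 on any HTTP response; it fails only if nothing answers.
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "Airbyte UI is reachable at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "No response from $url after $attempts attempts" >&2
  return 1
}

# Example call (commented out so sourcing this file has no side effects):
# wait_for_airbyte http://localhost:8000 30 5
```

The containers can take a few minutes to start after `docker compose up -d`, which is why the sketch retries instead of checking once.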

Set up GCP Access, Storage, BigQuery

  • In the Google Cloud console, select your organization from the ‘Select organization’ drop-down list at the top of the page, and create a new project.
  • On your project dashboard, navigate to IAM & Admin → Service Accounts to create a service account. Define your service account name and ID and select ‘Create’. In the roles step, select the ‘Storage Object Admin’ role, and click ‘Done’.
  • Go to IAM & Admin → IAM → Permissions and add your service account as a project owner. Granting the service account ownership of the project gives it the permissions it needs to create, manage, reconfigure, and delete resources within the project.
  • On your project dashboard, search for and select ‘Cloud Storage’. Click the ‘Create Bucket’ button to set up your storage bucket. Set a name and a region for your bucket, and select the standard storage class, default access control, and default data protection. Repeat this for each of the Airbyte destinations.
  • Create an HMAC access key and secret: on the Buckets dashboard, select Settings → Interoperability → ‘Create a key for a service account’. Save the created HMAC access key and secret in a secure place.
  • Navigate to BigQuery and, under your project ID, create a dataset for each of the Airbyte destinations.
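
The console steps above can also be sketched with the Google Cloud command-line tools (`gcloud`, `gsutil`, and `bq`). This is only a rough equivalent under stated assumptions: every value below (project ID, service account name, bucket, location, dataset, and the `key.json` output file) is a placeholder I made up for illustration, and the `run` and `setup_gcp` helpers are hypothetical. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Placeholder values (assumptions, replace with your own).
PROJECT_ID="${PROJECT_ID:-my-airbyte-project}"
SA_NAME="${SA_NAME:-airbyte-sa}"
BUCKET="${BUCKET:-my-airbyte-staging}"
LOCATION="${LOCATION:-US}"
DATASET="${DATASET:-airbyte_raw}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Print each command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_gcp() {
  # Service account with the Storage Object Admin role.
  run gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:${SA_EMAIL}" --role roles/storage.objectAdmin
  # Project owner binding, mirroring the IAM step in the list above.
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member "serviceAccount:${SA_EMAIL}" --role roles/owner
  # Staging bucket with the standard storage class.
  run gsutil mb -p "$PROJECT_ID" -l "$LOCATION" -c standard "gs://${BUCKET}"
  # HMAC access key and secret for the service account.
  run gsutil hmac create -p "$PROJECT_ID" "$SA_EMAIL"
  # BigQuery dataset that the destination will write to.
  run bq --location="$LOCATION" mk --dataset "${PROJECT_ID}:${DATASET}"
  # JSON key file for the service account.
  run gcloud iam service-accounts keys create key.json --iam-account "$SA_EMAIL"
}

setup_gcp
```

Reviewing the printed commands first is a cheap way to catch a wrong project ID or bucket name before anything is created. The HMAC secret is shown only once when the key is created, so store it right away.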

Set up an Airbyte BigQuery Destination

  • Set a name for the destination, and add the Google Cloud project ID, the dataset location, and the dataset ID.
  • Select ‘GCS Staging’ as the ‘Loading Method’ and enter the HMAC access key and secret you created previously.
  • Set the bucket name (as created on GCP) and the bucket path.
  • Paste the contents of the JSON file containing the service account key.
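
Before pasting the key, a quick sanity check of the file can save a failed connection test. The sketch below assumes the standard layout of a Google service account key file (a `type` of `service_account` plus `project_id`, `client_email`, and `private_key` fields); `check_sa_key` is a hypothetical helper, and it only looks for the field names, it does not validate the key itself.

```shell
# Hypothetical helper: check that a file looks like a service account key.
check_sa_key() {
  key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "Cannot read $key_file" >&2
    return 1
  fi
  # The key type must be "service_account", not a user credential.
  if ! grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file"; then
    echo "$key_file is not a service account key" >&2
    return 1
  fi
  # These fields are present in every service account key file.
  for field in project_id client_email private_key; do
    if ! grep -Eq "\"$field\"[[:space:]]*:" "$key_file"; then
      echo "$key_file is missing the $field field" >&2
      return 1
    fi
  done
  echo "$key_file looks like a service account key"
}

# Example call (the path is a placeholder):
# check_sa_key ./key.json
```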

What’s Next

In the next part, I will demonstrate how to establish a connection in Airbyte.
