Deploying Airbyte on GKE using Helm

Rodel van Rooijen
8 min read · May 31, 2024


This post is part of a series describing how to deploy open source tools to Google Kubernetes Engine (GKE). In this series we’ve already covered how to deploy Airflow to GKE using Helm; in this post we’ll dive into deploying Airbyte with a step-by-step tutorial.

In this tutorial I will walk you through how to successfully deploy Airbyte (tested with Airbyte version 0.59.0 and Helm chart 0.54.113) on GKE using the official Helm chart. In my own search I couldn’t find a detailed guide on how to do this, hence I’ve written this one to hopefully help people in similar situations.

The outcome of this tutorial should be:

  • A GKE cluster that can automatically scale its number of nodes.
  • Airbyte deployed on Kubernetes.
  • An external Postgres database that contains the Airbyte metadata.
  • An external location (bucket) that contains the Airbyte logs.

Note that logic similar to what is described in this tutorial should apply to deploying Airbyte on Amazon EKS and Azure AKS. However, the detailed steps are specific to Google Cloud Platform (GCP). If you want to read up on why to pick Kubernetes to host your open source tools, refer to the Airflow blog.

Airbyte vs Airflow

The first question you might ask is: why deploy Airbyte if you already have Airflow, and how is Airbyte different from Airflow?

Let’s start by describing the commonalities: both tools have the ability to orchestrate E(T)L pipelines. However, Airflow is more generic, while Airbyte has more depth when it comes to the “Extract” and “Load” parts. Namely, Airbyte has a broad open source community which builds connectors to extract source data and load it into a destination. To date Airbyte has 350+ connectors, which makes it a go-to tool when it comes to importing and exporting data from and to various sources.

While you could probably achieve the same thing with Airflow, Airbyte comes with a powerful API that abstracts away a lot of the complexity. In addition, with Airbyte a lot of the logic to pull and push data is already pre-built, whereas in Airflow you (at least partially) have to build it yourself. Moreover, the nice thing is that you can orchestrate Airbyte with Airflow, meaning that Airflow can remain the orchestrator while giving you the power of the pre-built connectors. This is the way.

Requirements

This tutorial requires you to have:

  • An existing GCP project.

In addition your GCP account should have:

  • GKE Admin permissions (roles/container.admin)
  • The ability to create service accounts (roles/iam.serviceAccountAdmin)
  • The ability to create a Postgres Cloud SQL instance (roles/cloudsql.admin)
  • Manage Google Cloud Storage (GCS) buckets (roles/storage.admin)

Before continuing you will also have to make sure that you have the Google Cloud CLI installed so that you can use gcloud commands. In addition you will need to install kubectl to interact with your GKE cluster. This tutorial leans heavily on the execution of various gcloud commands; however, you can also achieve most of the below through the Google Cloud Console UI instead.
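As a quick sanity check (a minimal sketch, assuming you use the Google Cloud SDK distribution), you can verify both tools from your terminal; kubectl can also be installed as a gcloud component:

## Verify the gcloud installation
gcloud version
## kubectl is available as an optional gcloud component
gcloud components install kubectl
kubectl version --client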

Setting things up

To select your relevant project you can run gcloud config set project <YOUR_PROJECT_ID> . From here on we will assume that your project ID can be accessed in bash using $PROJECT_ID . Throughout this tutorial we will use europe-west1 as the primary GCP location.
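For example (a small sketch; exporting $PROJECT_ID this way is just this tutorial’s convention):

gcloud config set project <YOUR_PROJECT_ID>
## Store the active project ID for the commands that follow
export PROJECT_ID=$(gcloud config get-value project)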

Service account creation

In this tutorial we will set up a service account that will be used by Airbyte in your Kubernetes cluster.

To create the service account you can use:

gcloud iam service-accounts create airbyte-kubernetes \
--description="User-managed service account for the Airbyte deployment" \
--display-name="Airbyte Kubernetes"

Then assign the following roles. Note that add-iam-policy-binding accepts a single --role flag per call, so we loop over the roles:

for role in roles/container.admin roles/iam.serviceAccountUser roles/storage.admin; do
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:airbyte-kubernetes@$PROJECT_ID.iam.gserviceaccount.com \
--role=$role
done
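If you want to verify that the bindings were applied (an optional check; the --flatten/--filter combination below is one standard way to list the roles of a single member):

gcloud projects get-iam-policy $PROJECT_ID \
--flatten="bindings[].members" \
--filter="bindings.members:airbyte-kubernetes@$PROJECT_ID.iam.gserviceaccount.com" \
--format="table(bindings.role)"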

In addition to creating the service account we’ll now create a service account JSON that will be used by Airbyte internally.

gcloud iam service-accounts keys create service_account.json \
--iam-account=airbyte-kubernetes@$PROJECT_ID.iam.gserviceaccount.com

This will save a service account JSON key (service_account.json) in your current working directory. Make sure that you keep this file at hand.

Cluster creation

If you already have an existing GKE cluster you’d like to re-use you can skip this step. Make sure you replace “gke-cluster” with the applicable cluster name.

The cluster we will create is contained in a single zone and will by default have a single node. We will use a standard cluster instead of an Autopilot cluster to have more control over the deployment; in addition, it is generally cheaper to run small Kubernetes clusters on standard than on Autopilot.

We will create our cluster using the following bash commands:

gcloud container clusters create gke-cluster \
--zone "europe-west1-b" \
--project $PROJECT_ID \
--machine-type n2-standard-4 \
--num-nodes 1 \
--scopes "cloud-platform" \
--autoscaling-profile "optimize-utilization" \
--enable-autoscaling --min-nodes=1 --max-nodes=3 \
--workload-pool $PROJECT_ID.svc.id.goog

Feel free to adjust the max-nodes argument to the maximum number of nodes the cluster should be able to auto-scale to. Notice that the machine-type is set to n2-standard-4; this is a small machine but it should be perfectly able to handle a small Airbyte deployment on its own. In addition the scope is set to "cloud-platform", meaning that the Kubernetes cluster will have access to all enabled GCP APIs, and the autoscaling profile is set to "optimize-utilization", which is a more aggressive down-scaling policy.

Namespace

After the cluster is successfully created we will create a namespace that will contain the Airbyte deployment. But first we need to fetch the cluster credentials so that kubectl can connect to the cluster. You can use this to do that:

gcloud container clusters get-credentials gke-cluster \
--zone "europe-west1-b" \
--project $PROJECT_ID
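To confirm that kubectl now points at the new cluster, you can for example list its nodes:

kubectl get nodes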

From here you can create a namespace called “airbyte”:

kubectl create ns "airbyte"

Load balancer

In this tutorial we will also make the webserver API available using an internal load balancer IP. This enables you to call the Airbyte API within your VPC, while not exposing it externally. In addition we will show you how to connect to the Airbyte UI locally using port-forwarding. This tutorial will not show you how to expose the Airbyte webserver through a public IP or URL.

To reserve a static IP for the Airbyte IP you can run:

gcloud compute addresses create airbyte-api-ip \
--region europe-west1 \
--subnet default
gcloud compute addresses describe airbyte-api-ip \
--region europe-west1

In case you use a VPC subnet other than default you will need to replace this. You will need to save the internal IP address that is created; in this tutorial we assume it is available as INTERNAL_IP = 10.132.0.2.
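For example, you can capture the address in a variable directly (a sketch; INTERNAL_IP is just the name this tutorial uses):

export INTERNAL_IP=$(gcloud compute addresses describe airbyte-api-ip \
--region europe-west1 \
--format="value(address)")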

Cloud SQL

It is recommended to use an external database (like Cloud SQL Postgres) to host your Airbyte metadata. This way you can easily back up the metadata in the database, and you make sure that even if something happens to your cluster the metadata is somewhere safe.

To create a non-publicly exposed instance you can run:

gcloud sql instances create postgres-airbyte \
--database-version=POSTGRES_13 \
--cpu=1 \
--memory=4GB \
--region=europe-west1 \
--no-assign-ip \
--enable-google-private-path \
--root-password=DB_ROOT_PASSWORD

Make sure you replace DB_ROOT_PASSWORD with a secure password that you save. In addition, make sure you save the internal IP address of the instance; in our case DB_INTERNAL_IP = 10.132.0.1. Note that Airbyte only supports Postgres version 13 at this moment.
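Optionally, you can create a dedicated database and user for Airbyte instead of using the postgres root user (a sketch; the names airbyte/airbyte are assumptions, adjust them and the corresponding Helm values to your own setup):

## Create a dedicated database for the Airbyte metadata
gcloud sql databases create airbyte --instance=postgres-airbyte
## Create a dedicated user; replace DB_PASSWORD with a secure password
gcloud sql users create airbyte --instance=postgres-airbyte --password=DB_PASSWORD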

Log bucket

The last setup step is to create a GCS bucket that will host your Airbyte logs. Make sure you save the name of the bucket.

gcloud storage buckets create gs://<YOUR BUCKET NAME> \
--location="europe-west1"

Deploying using Helm

To deploy Airbyte on our newly created Kubernetes cluster we use the official Helm chart. This requires you to have Helm installed; to install Helm please refer to the Helm docs.

To retrieve the latest Airbyte helm charts you can execute the following helm commands:

## Add this helm repository
helm repo add airbyte https://airbytehq.github.io/helm-charts
## Update your helm repo cache
helm repo update

Adding the service account json secret

For Airbyte to have the right permissions in GCP, we use the service account JSON we created earlier. Make sure you replace the path with the location of your service account JSON.

kubectl create secret generic service-account-json \
--from-file=gcp.json=./<PATH>/service_account.json \
--namespace=airbyte

Applying secrets

We’ll need to create one secret that contains the external database password:

apiVersion: v1
kind: Secret
metadata:
  name: db-secrets
type: Opaque
stringData:
  DATABASE_PASSWORD: <YOUR_DB_PASSWORD>
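Assuming you save this manifest as db-secrets.yaml, apply it to the airbyte namespace:

kubectl apply -f ./db-secrets.yaml --namespace airbyte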

Custom values

In our case we use the values.yaml that is provided in the official Helm repo. This is what we use as a base for our custom-values.yaml .

There’s a few adjustments to the base version of the values:

To the global section we change:

serviceAccountName: "airbyte-kubernetes"

And add:

# -- Environment variables
env_vars:
  STATE_STORAGE_GCS_APPLICATION_CREDENTIALS: "/etc/secrets/gcp.json"
  CONTAINER_ORCHESTRATOR_SECRET_NAME: "service-account-json"
  CONTAINER_ORCHESTRATOR_SECRET_MOUNT_PATH: "/etc/secrets/"
  CONTAINER_ORCHESTRATOR_ENABLED: false

# Database configuration override
database:
  # -- Secret name where database credentials are stored
  secretName: "db-secrets"
  # -- Secret value for database password
  secretValue: "DATABASE_PASSWORD"

Note that we set the container orchestrator to false, as it has been causing quite a few problems during set-up. If you get it working, please let me know!

In the storage section we change:

credentials: "/etc/secrets/gcp.json"

In the serviceAccount section we change:

serviceAccount:
  # -- Specifies whether a ServiceAccount should be created
  create: true
  # -- Annotations for service account. Evaluated as a template. Only used if `create` is `true`.
  annotations:
    iam.gke.io/gcp-service-account: "<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com"
  # -- Name of the service account to use. If not set and create is true, a name is generated using the fullname template.
  name: airbyte-kubernetes
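For the iam.gke.io/gcp-service-account annotation to take effect with Workload Identity, the Kubernetes service account usually also needs permission to impersonate the GCP service account (a sketch, assuming the airbyte namespace and the service account names used in this tutorial):

gcloud iam service-accounts add-iam-policy-binding \
airbyte-kubernetes@$PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:$PROJECT_ID.svc.id.goog[airbyte/airbyte-kubernetes]"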

In the webapp section we change:

  service:
    type: LoadBalancer
    port: 8000
    # loadBalancerIP: 10.132.0.31
    annotations:
      cloud.google.com/load-balancer-type: "Internal"
Unfortunately the Helm chart does not support setting a pre-defined load balancer IP in the values file yet; however, I expect this to be added soon.

Deploying Airbyte

Now to deploy Airbyte you can run:

helm install \
"airbyte" \
airbyte/airbyte \
--namespace "airbyte" \
--version "0.94.1" \
--values ./custom-values.yaml \
--set global.storage.bucket.log=<YOUR_BUCKET> \
--set global.storage.bucket.state=<YOUR_BUCKET> \
--set global.storage.bucket.workloadOutput=<YOUR_BUCKET> \
--set webapp.service.loadBalancerIP=<LOAD_BALANCER_IP> \
--set externalDatabase.host=<DATABASE_IP>

This should deploy Airbyte to your GKE cluster.
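To follow the rollout and to find the internal load balancer IP that was assigned to the webapp service, you can run for example:

## Check that the pods come up
kubectl get pods --namespace airbyte
## The EXTERNAL-IP column shows the (internal) load balancer IP
kubectl get svc airbyte-airbyte-webapp-svc --namespace airbyte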

Good to knows

Accessing the web server

You can access the web server locally by using port-forwarding, for example (note that 8000 is the service port we configured in the webapp section above):

kubectl port-forward svc/airbyte-airbyte-webapp-svc 8081:8000 --namespace airbyte >/dev/null 2>&1 &

This will make the Airbyte webserver available locally at localhost:8081.
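From there you can open the UI in your browser. Assuming the webapp proxies the Airbyte API (as in recent chart versions; treat the exact path as an assumption), you can also check the API health:

curl http://localhost:8081/api/v1/health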

Pod disruption budgets

The GKE cluster we set up will auto-scale horizontally (more nodes/VMs) when there are not enough resources. It will also scale down when resources are underutilized. However, in my experience the downscaling is sometimes hindered by kube-system pods. By default these pods cannot be evicted, hence the nodes they run on cannot be scaled down. Therefore, we’ll add pod disruption budgets for them.

kube-dns-pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns

metrics-server-pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: metrics-server

Note that a PDB on the metrics server can affect a node’s ability to scale vertically while the metrics-server pod is evicted.

Apply these by running:

kubectl apply -f ./metrics-server-pdb.yaml

kubectl apply -f ./kube-dns-pdb.yaml
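As an optional check, you can confirm that both budgets are in place:

kubectl get pdb --namespace kube-system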

As promised: Github repo

As promised in the last blog, I’ve created a GitHub repo that contains all the code necessary to deploy both Airflow and Airbyte to GKE.

I hope you’ve enjoyed this tutorial. Soon I’ll be writing up more content on how to deploy open source tools to Kubernetes. Next up will be Grafana, the go-to tool when it comes to (real-time) monitoring dashboards.
