Efficiently Streaming Data with AirByte: A Guide to Octavia CLI

Keren Finkelstein
Israeli Tech Radar
Published in
5 min readMar 1, 2023

In my previous posts (part 1 and part 2) I described how to ingest data from different sources into BigQuery on a self-hosted AirByte. This was accomplished using AirByte’s intuitive web-based interface, which simplifies the configuration and management of the data pipeline.

Octavia CLI, a sub-project of Airbyte, is an open-source tool that enables the creation and modification of source, destination, and connection configurations through YAML files.

So, why should we use Octavia CLI?

  • Programmatically create resources.
  • Managing Airbyte configurations in Git.
  • Copy resource configurations between Airbyte instances.

This article will illustrate how to use Octavia CLI to create a MySQL source, a BigQuery destination, and manage the connection.

Prerequisite

In this article, it is assumed that the reader has already reviewed my previous posts on AirByte and has followed the instructions provided for:

  • Deploy Airbyte on a GCP Compute Engine instance (describe here).
  • Set up GCP Access, Storage, and BigQuery (describe here).
  • Set up Google Sheets access (describe here)

For the Octavia CLI, you will need Docker and Docker Compose installed.

Install Octavia CLI

Install the latest Octavia version as a bash command. (you should install the same Octavia version as the targeted Airbyte instance to avoid API incompatibility issues)

> curl -s -o- https://raw.githubusercontent.com/airbytehq/airbyte/master/octavia-cli/install.sh | bash

This will pull the octavia-cli image, create an ‘octavia’ alias on your ~/.bashrc file and create a ~/.octavia file.

Airbyte version 0.40.16 or above comes with Basic Authentication for the web app, API, and CLI, so you need to add the AIRBYTE_USERNAME and AIRBYTE_PASSWORD environment variables in the ~/.octavia file.

AIRBYTE_USERNAME=airbyte
AIRBYTE_PASSWORD=password

Bootstrap an Octavia project on a new folder with ‘octavia init’

> mkdir airbyte-configuration && cd airbyte-configuration
> octavia init

The following directory structure will be created:

|-- api_http_headers.yaml
|-- connections
|-- destinations
`-- sources

Adding Connection: Google Sheets → BigQuery

Set up an Airbyte Google Sheets Source

First, you need to get the definition ID of the source.

> octavia list connectors sources | grep mysql
> MySQL airbyte/source-mysql 1.0.21 435bb9a5-7887-4809-aa58-28c27df0d7ad

Create the source:

> octavia generate source 435bb9a5-7887-4809-aa58-28c27df0d7ad my_google_sheet

The CLI creates a ‘my_google_sheet’ folder under sources with a configuration.yaml file

|-- api_http_headers.yaml
|-- connections
|-- destinations
`-- sources
`-- my_google_sheet
`-- configuration.yaml

The YAML file includes all the fields displayed in the web application, along with default values. To ensure proper configuration, it is necessary to provide values for all fields marked as REQUIRED. Any fields marked as OPTIONAL should be reviewed, and their lines edited, commented, or deleted as necessary.

# Configuration for airbyte/source-google-sheets
# Documentation about this connector can be found at https://docs.airbyte.com/integrations/sources/google-sheets
resource_name: "my_google_sheet"
definition_type: source
definition_id: 71607ba1-c0ac-4799-8049-7f4b90dd50f7
definition_image: airbyte/source-google-sheets
definition_version: 0.2.31

# EDIT THE CONFIGURATION BELOW!
configuration:
credentials:
## -------- Pick one valid structure among the examples below: --------
auth_type: "Client" # REQUIRED | string
client_id: ${CLIENT_ID} # SECRET (please store in environment variables) | REQUIRED | string | Enter your Google application's Client ID
client_secret: ${CLIENT_SECRET} # SECRET (please store in environment variables) | REQUIRED | string | Enter your Google application's Client Secret
refresh_token: ${REFRESH_TOKEN} # SECRET (please store in environment variables) | REQUIRED | string | Enter your Google application's refresh token
## -------- Another valid structure for credentials: --------
# auth_type: "Service" # REQUIRED | string
# service_account_info: ${SERVICE_ACCOUNT_INFO} # SECRET (please store in environment variables) | REQUIRED | string | Enter your Google Cloud <a href="https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys">service account key</a> in JSON format | Example: { "type": "service_account", "project_id": YOUR_PROJECT_ID, "private_key_id": YOUR_PRIVATE_KEY, ... }
row_batch_size: 200 # OPTIONAL | integer | Number of rows fetched when making a Google Sheet API call. Defaults to 200.
spreadsheet_id: # REQUIRED | string | Enter the link to the Google spreadsheet you want to sync | Example: https://docs.google.com/spreadsheets/d/1hLd9Qqti3UyLXZB2aFfUWDT7BG-arw2xy4HR3D-dwUb/edit

For the credentials, delete the auth_type: ‘client’ section and uncomment the auth_type: ‘service’. Set the value of service_account_info to ${SERVICE_ACCOUT_INFO} and the value of spreadsheet_id to ${SPREATSHEET_ID}.

Once you have finished editing the configuration, it should look the following:

resource_name: "my_google_sheet"
definition_type: source
definition_id: 71607ba1-c0ac-4799-8049-7f4b90dd50f7
definition_image: airbyte/source-google-sheets
definition_version: 0.2.31

configuration:
credentials:
auth_type: "Service"
service_account_info: ${SERVICE_ACCOUNT_INFO}
row_batch_size: 200
spreadsheet_id: ${SPREADSHEET_ID}

Octavia offers secret management through environment variables expansion on configuration files. Let’s add our secrets and sensitive data to the ~/.octavia file.

SERVICE_ACCOUT_INFO should contain the downloaded service account key JSON file content (as in the previous article) and SPREADSHEET_ID should contain the spreadsheet link you want as a data source.

OCTAVIA_ENABLE_TELEMETRY=True
AIRBYTE_USERNAME=airbyte
AIRBYTE_PASSWORD=password

SERVICE_ACCOUT_INFO:********
SPREADSHEET_ID:*********

Set up an Airbyte BigQuery Destination

Same as before, let's get the BigQuery destination definition ID.

> octavia list connectors destinations | grep bigquery
> BigQuery airbyte/destination-bigquery 1.2.13 22f6c74f-5699-40ff-833c-4a879ea40133

And create the destination:

> octavia generate destination 22f6c74f-5699-40ff-833c-4a879ea40133 bigquery_google_sheet

The generated file can be found here:

|-- destinations
| |-- bigquery_google_sheet
| | `-- configuration.yaml

For the credentials, delete the method: ‘Standard’ section and uncomment the method: ‘GCS Staging’ and set values as described here:

resource_name: "bigquery_google_sheet"
definition_type: destination
definition_id: 22f6c74f-5699-40ff-833c-4a879ea40133
definition_image: airbyte/destination-bigquery
definition_version: 1.2.13

configuration:
dataset_id: ${DATASET_ID}
project_id: ${PROJECT_ID}
loading_method:
method: "GCS Staging"
credential:
credential_type: "HMAC_KEY"
hmac_key_secret: ${HMAC_KEY_SECRET}
hmac_key_access_id: ${HMAC_KEY_ACCESS_ID}
gcs_bucket_name: ${GCS_BUCKET}
gcs_bucket_path: ${GCS_BUCKET}/data
keep_files_in_gcs-bucket: Delete all tmp files from GCS
credentials_json: ${SERVICE_ACCOUT_INFO}
dataset_location: EU
transformation_priority: interactive
big_query_client_buffer_size_mb: 15

Add the following configurations to the ~/.octavia file: add the google project ID, dataset ID, GCP bucket name, and the HMAC key and secret you created (see prerequisite section).

DATASET_ID:********
PROJECT_ID:*********
HMAC_KEY_SECRET:*********
HMAC_KEY_ACCESS_ID:*********
GCS_BUCKET:*********

Set up an Airbyte Connection

All we have to do now is create the connection in order to link our MySQL source and our BigQuery destination

> octavia generate connection 
--source sources/my_google_sheet/configuration.yaml
--destination destinations/bigquery_google_sheet/configuration.yaml
my_google_sheet+connection

Create the source, destination, and connection on your AirByte instance. Octavia will validate the configuration against the JSON schemas and fail to apply the changes if any configuration error is found.

> octavia apply

Great, we have successfully established a code-based connection on Airbyte!

Conclusion

You have two options to manage Airbyte resources: the web application or YAML files. By utilizing Configuration as Code, you can guarantee that the configuration is versioned, secrets are secured, and the same configuration can be applied to multiple instances.

It’s important to note that the project is currently in its alpha version.

--

--