Airbyte — A Promising Data Integration Tool

Introductory post on Airbyte

The first step of any data pipeline is the extraction of data. Once data is extracted, it needs to be loaded and transformed (ELT).

Airbyte is an open-source data integration platform that aims to standardize and simplify the process of extraction and loading. Airbyte operates on the principle of ELT: it simply extracts raw data and loads it to destinations, and optionally lets us run transformations, which are decoupled from the EL phase. It simplifies this process by building connectors between data sources and destinations. It is a plugin-based system where you can quickly build your own customized connector using the Airbyte CDK.
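
For instance, the Airbyte monorepo ships a generator that scaffolds a new connector from the CDK templates. A minimal sketch, assuming a local checkout of the repo (the path and script reflect the repo layout at the time of writing and may have moved since):

```
# From the root of a local airbyte checkout, scaffold a new connector
# (interactive; it prompts for a template and a connector name)
cd airbyte-integrations/connector-templates/generator
./generate.sh
```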

Installation

Launch a new EC2 instance with a public or private IP, and edit its security group to allow TCP connections on port 8000. Then install Docker and Docker Compose on the instance.
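
A minimal sketch of the Docker setup, assuming an Amazon Linux 2 instance (package names and commands differ on other distros):

```
# Install and start Docker
sudo yum update -y
sudo yum install -y docker
sudo service docker start

# Allow the current user to run docker without sudo
sudo usermod -aG docker ec2-user

# Install Docker Compose (pin whichever release you prefer)
sudo curl -L "https://github.com/docker/compose/releases/download/v2.17.2/docker-compose-$(uname -s)-$(uname -m)" \
  -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```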

Log out and log in again so that the docker group membership takes effect.
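
Then pull and start Airbyte. A sketch based on the standard Docker Compose deployment (the repo layout and startup flow can differ across Airbyte versions):

```
# Clone the Airbyte repository and start the platform
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up -d
```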

Connect to the host with an ssh tunnel:
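
A sketch, assuming an Amazon Linux instance and a key pair named my-key.pem (the key path, user, and address are placeholders; adjust for your setup):

```
# Forward local port 8000 to the Airbyte web app on the instance
ssh -i ~/.ssh/my-key.pem -N -L 8000:localhost:8000 ec2-user@<instance-ip>
```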

Now open http://localhost:8000 in your browser.

Logical constructs

Source

It is the origin of the data to copy.

Destination

It is the place where Airbyte will copy the ingested data.

Connection

A connection defines the ingestion process between a source and a destination. Apart from the source and destination, a connection has the following attributes (these can also be managed as code; see the Octavia sketch after this list):

  • Sync Schedule
  • Destination Namespace — the schema in the destination database
  • Sync Modes
  • Optional Transformation — custom transformations can be added with dbt
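
A sketch of the configuration-as-code workflow with the Octavia CLI, assuming Octavia is installed and pointed at your Airbyte instance (the definition IDs and resource names below are placeholders):

```
# Scaffold a local project with YAML templates
octavia init

# Generate editable YAML for a source and a destination
# (<definition-id>s are placeholders; look them up with
#  `octavia list connectors sources` / `octavia list connectors destinations`)
octavia generate source <source-definition-id> my_postgres_source
octavia generate destination <destination-definition-id> my_s3_destination

# Generate a connection tying the two together; its YAML holds the sync
# schedule, destination namespace, sync modes, and optional dbt transformation
octavia generate connection \
  --source sources/my_postgres_source/configuration.yaml \
  --destination destinations/my_s3_destination/configuration.yaml \
  my_connection

# Push the configuration to the Airbyte instance
octavia apply
```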

Infrastructure components

  • WebApp — the UI that users interact with
  • Server — powers all the APIs and also handles requests from the Octavia CLI
  • Temporal — acts as the scheduler
  • Worker — picks up work from the Temporal queue and controls job parallelism
  • Minio — local log storage; S3 is supported as well
  • Job (Source Pod, Destination Pod) — the actual sync job
  • Postgres DB (can be an external DB like RDS) — stores configurations
  • Pod Sweeper — cleans up completed pods
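
On the Docker Compose deployment, these components show up as individual containers. A quick way to see them (the service names in the comment are typical but vary by version):

```
# List the running Airbyte services
# (typical names: airbyte-webapp, airbyte-server, airbyte-temporal,
#  airbyte-worker, airbyte-db)
docker-compose ps
```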

Airbyte Job steps

  • Getting specs — loads the connector's specification
  • Checking connections — verifies that connectivity to the source and destination is okay
  • Discovering schemas — fetches the schema of the source
  • Performing syncs — syncs data between the source and destination. For this, Airbyte launches two pods, one for the source and another for the destination. Note that the source and destination pods don't talk to each other directly; they communicate via the worker, and data is exchanged through Socat containers (named pipes).
  • Normalization — if enabled, Airbyte automatically transforms the raw JSON blob (the output of the sync) to match the format of your destination.

Note that each of the above steps runs in a separate pod.
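
On a Kubernetes deployment you can watch these short-lived job pods come and go during a sync. A small sketch, assuming Airbyte is installed in a namespace called airbyte (the namespace is an assumption; adjust for your install):

```
# Watch the job pods Airbyte launches for the spec/check/discover/sync steps
# ("airbyte" namespace is an assumption; pod names vary by connector)
kubectl get pods -n airbyte --watch
```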

Sync modes

Airbyte supports the following sync modes:

  • Full Refresh — Overwrite
  • Full Refresh — Append
  • Incremental — Append
  • Incremental — Deduped History

Features of Airbyte

  • Open source
  • 170+ source connectors and 25+ destinations
  • Supports custom connectors
  • Built-in scheduler that allows varied sync frequencies
  • Integrations with Airflow and dbt
  • Can be deployed on Kubernetes (K8s)
  • Octavia CLI with YAML templates for configuration-as-code deployments
  • Support for near real-time CDC

Limitations

  • No stable release yet; still in alpha
  • Lack of IAM role-based auth for AWS services; it currently asks for access keys
  • Lack of native Prometheus support; OpenTelemetry support was added only recently
  • Lack of support for user access management
  • Not battle-tested in production; can get slow beyond ~2k concurrent jobs
  • No support for replaying a specific execution instance of a job

Thanks for reading!!!
