Airbyte — A Promising Data Integration Tool
Introductory post on Airbyte
The key component of any data pipeline is the extraction of data. Once extracted, data needs to be loaded and transformed (ELT).
Airbyte is an open-source data integration platform that aims to standardize and simplify the Extract and Load process. Airbyte operates on the ELT principle: it simply extracts raw data and loads it into destinations, and optionally lets us run transformations. Transformations are decoupled from the EL phase. Airbyte simplifies the process by providing connectors between data sources and destinations. It is a plugin-based system where you can quickly build your own customized connector using the Airbyte CDK.
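As a quick illustration, here is roughly how a new connector is scaffolded with the generator script that ships in the Airbyte repo (the path and prompts are from the Airbyte docs of the time and may differ between versions):
git clone https://github.com/airbytehq/airbyte.git
cd airbyte/airbyte-integrations/connector-templates/generator
./generate.sh
# The script asks for a template (e.g. a Python CDK source) and a connector name,
# then scaffolds the new connector under airbyte-integrations/connectors/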
Installation
Launch a new EC2 instance with a public or private IP, then edit its security group to allow TCP connections on port 8000.
sudo yum update -y
sudo yum install -y docker
sudo service docker start
sudo usermod -a -G docker $USER
sudo systemctl enable docker
Log out and log in again so the docker group membership takes effect.
sudo wget https://github.com/docker/compose/releases/download/1.26.2/docker-compose-$(uname -s)-$(uname -m) -O /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
mkdir airbyte && cd airbyte
wget https://raw.githubusercontent.com/airbytehq/airbyte/master/{.env,docker-compose.yaml}
docker-compose up -d
Connect to the host with an ssh tunnel:
ssh -i $SSH_KEY -L 8000:localhost:8000 -N -f ec2-user@$INSTANCE_IP
Now in your browser open http://localhost:8000
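To sanity-check the deployment before moving on, something like the following should work (the health endpoint path is from Airbyte's Config API; adjust if your version differs):
# On the EC2 host, from the airbyte directory: all containers should be "Up"
docker-compose ps
# From your local machine, through the tunnel: should report the server as available
curl -s http://localhost:8000/api/v1/health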
Logical constructs
Source
It is the origin of the data to copy.
Destination
It is the place where Airbyte will copy the ingested data.
Connection
A connection defines the ingestion process. Apart from a source and a destination, a connection has the following attributes (see the API sketch after this list):
- Sync Schedule
- Destination Namespace — schema in destination DBs
- Sync Modes
- Optional Transformation — Can add custom transformation with dbt
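These attributes map directly onto the Config API payload. Below is a minimal sketch of creating a connection over the API, assuming the /api/v1/connections/create endpoint; the UUIDs are placeholders:
SOURCE_ID="<source-uuid>"              # placeholder; take it from the UI or API
DESTINATION_ID="<destination-uuid>"    # placeholder
curl -s -X POST http://localhost:8000/api/v1/connections/create \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "sourceId": "${SOURCE_ID}",
  "destinationId": "${DESTINATION_ID}",
  "namespaceDefinition": "destination",
  "schedule": { "units": 24, "timeUnit": "hours" },
  "status": "active",
  "syncCatalog": { "streams": [] }
}
EOF
# syncCatalog is left empty here for brevity; in practice it is the
# (possibly edited) catalog returned by the schema-discovery step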
Infrastructure components
- WebApp — the UI that users interact with
- Server — powers all the APIs and also serves the Octavia CLI
- Temporal — acts as the scheduler
- Worker — picks work from the Temporal queue and controls job parallelism
- Minio — local log storage; S3 is supported as well
- Job (Source Pod, Destination Pod) — the actual sync job
- Postgres DB (can be an external DB like RDS) — stores configurations
- Pod Sweeper — cleans up completed pods
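On a Kubernetes deployment these components show up as individual pods. A quick way to see them (pod names vary with the chart version, so treat this as indicative):
kubectl -n airbyte get pods
# Expect entries for webapp, server, temporal, worker, minio, the config
# database, and the pod sweeper, plus short-lived source/destination job
# pods while a sync is running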
Airbyte Job steps
- Getting specs — loads the connector's specification
- Checking connections — verifies that connectivity is okay
- Discovering schemas — gets the schema of the source
- Performing syncs — syncs data between the source and destination. For this, Airbyte launches two pods, one for the source and another for the destination. One thing to note here is that the source and destination pods don't talk to each other directly; they do so via the worker. Data is exchanged using Socat containers (named pipes).
- Normalization — if enabled, Airbyte automatically transforms the raw JSON blob (the output of the sync) to match the format of your destination.
Note that each of the above steps runs in a separate pod.
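A sync (and with it the job steps above) can also be triggered manually through the API. A minimal sketch, assuming the /api/v1/connections/sync and /api/v1/jobs/get endpoints; the connection ID and job ID are placeholders:
CONNECTION_ID="<connection-uuid>"   # placeholder; visible in the connection's URL in the UI
curl -s -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d "{\"connectionId\": \"${CONNECTION_ID}\"}"
# The response includes a job id, which can then be polled:
curl -s -X POST http://localhost:8000/api/v1/jobs/get \
  -H "Content-Type: application/json" \
  -d '{"id": 123}'   # 123 is a placeholder job id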
Supported sync modes
- Full refresh — Overwrite
- Full refresh — Append
- Incremental — Append
- Incremental — Dedupe History
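The mode is selected per stream in the connection's syncCatalog through the syncMode and destinationSyncMode fields. An illustrative fragment is below; the stream name "orders" and the fields "updated_at" and "id" are made up, and the field names follow the Config API of the time:
{
  "stream": { "name": "orders" },
  "config": {
    "syncMode": "incremental",
    "destinationSyncMode": "append_dedup",
    "cursorField": ["updated_at"],
    "primaryKey": [["id"]]
  }
}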
Features of Airbyte
- Open source
- 170+ connectors and 25+ destinations
- Supports custom connectors
- Built-in scheduler that allows varied sync frequencies
- Integration with Airflow and dbt
- Can be deployed on Kubernetes (K8s)
- Octavia CLI with YAML templates for deployments (see the sketch after this list)
- Support for near real-time CDC
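A rough sketch of the Octavia workflow (command shapes follow the octavia-cli docs of the time; the definition ID and source name are placeholders):
octavia init                        # scaffolds sources/, destinations/ and connections/ directories
octavia list connectors sources     # find the definition id of the source you want
octavia generate source <definition-id> my_source   # generates a YAML template to fill in
octavia apply                       # pushes the local YAML configuration to the Airbyte instance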
Limitations
- No stable release yet; still in Alpha
- Lacks IAM role-based auth for AWS services; it currently asks for access keys
- No native Prometheus support; OpenTelemetry support was added only recently
- No support for user access management
- Not battle-tested in production; it can get slow beyond 2k concurrent jobs
- No support for replaying a specific execution instance of a job
Thanks for reading!!!