Data Ingestion (Change Data Capture)

Oladayo
2 min read · Mar 19, 2023


Hi Everyone,

Welcome to the third part of the four-part series on data ingestion. You can read the first part here and the second part here.

In this post, I will be going over the change data capture approach to data ingestion.

In both the full data ingestion and the incremental data ingestion approaches:

  1. I had to write Python scripts to perform the data ingestion.
  2. In production, the Python scripts would be deployed to a Function as a Service (FaaS) offering (such as Google Cloud Functions) with a cloud scheduler triggering the function, or one could use Airflow to orchestrate the workflow. This makes both full data ingestion and incremental data ingestion forms of batch data ingestion (data ingestion happens at specified intervals).

What if:

  1. I didn’t have to write a Python script to carry out the data ingestion?
  2. the data ingestion happened in real time?

That’s where the change data capture approach comes in.

Change Data Capture

As the name suggests, the change data capture (CDC) approach to data ingestion captures change events (such as inserting new rows, updating existing rows and deleting existing rows) in a source system database (such as MySQL). It then uses a messaging system such as Kafka to stream these changes in real time.

The source system (such as MySQL) produces these change events to a Kafka topic.

The sink system (such as BigQuery) consumes the change events from the Kafka topic to complete the data ingestion process.

A simple architecture of data ingestion using change data capture
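
To make the flow concrete, here is a sketch of what a single change event might look like as it travels through the Kafka topic. The envelope (`before`, `after`, `op`) follows the Debezium style; the table name and values are made up for illustration.

```python
# A sketch of a CDC change event for an UPDATE to a hypothetical
# "customers" table, using a Debezium-style envelope.
change_event = {
    "before": {"id": 42, "email": "ada@old.example.com"},  # row before the change
    "after": {"id": 42, "email": "ada@new.example.com"},   # row after the change
    "op": "u",               # "c" = insert (create), "u" = update, "d" = delete
    "ts_ms": 1679227200000,  # when the change was captured
    "source": {"connector": "mysql", "db": "shop", "table": "customers"},
}

# An insert carries before = None; a delete carries after = None.
# The source system produces events like this to a Kafka topic,
# and the sink system consumes them to replay the changes.
```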

There are two types of change data capture:

Log-based CDC

In log-based CDC, change events in the source system are captured from the database’s transaction log. MySQL has the binary log (binlog) and PostgreSQL has the write-ahead log (WAL), from which change events in the database are captured.
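
To illustrate what “capturing change events from the log” means, here is a minimal sketch that tails a MySQL binlog with the open-source python-mysql-replication library. It assumes the server runs with binlog_format=ROW, and the connection settings are placeholders.

```python
# Minimal sketch: streaming row-level change events from the MySQL binlog.
# Requires the mysql-replication package and binlog_format=ROW on the server.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent,
    UpdateRowsEvent,
    DeleteRowsEvent,
)

MYSQL_SETTINGS = {"host": "localhost", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,  # must be unique among the server's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,  # wait for new events instead of exiting; runs until interrupted
)

for event in stream:
    for row in event.rows:
        # Inserts/deletes expose row["values"]; updates expose
        # row["before_values"] and row["after_values"].
        print(type(event).__name__, row)
```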

Debezium is a popular open-source distributed platform for log-based CDC. It’s built on top of Apache Kafka and its ecosystem (a minimal connector-registration sketch follows the list):

  1. Apache Kafka: a distributed event streaming platform.
  2. Apache ZooKeeper: manages the distributed environment and handles configuration across the services.
  3. Kafka Connect: a framework for connecting Kafka with other systems so data can be easily streamed in and out of Kafka.
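
With Debezium, you don’t write ingestion code at all; you register a connector configuration with Kafka Connect’s REST API (served on port 8083 by default) and it starts streaming. Below is a sketch of registering a Debezium MySQL source connector using Python’s requests library; the hostnames, credentials and table names are placeholders, and the exact property names (e.g. topic.prefix) can vary between Debezium versions.

```python
# Sketch: registering a Debezium MySQL source connector with Kafka Connect.
# All hostnames, credentials and table names below are placeholders.
import requests

connector = {
    "name": "shop-mysql-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",          # unique id, like a replica's
        "topic.prefix": "shop",                  # topics become shop.<db>.<table>
        "table.include.list": "shop.customers",  # which tables to capture
    },
}

# Kafka Connect exposes a REST API for creating and managing connectors.
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```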

Query-based CDC

In query-based CDC, change events in the source system database are captured by querying the database itself. Queries (typically filtering on a last-updated timestamp or an incrementing ID column) are executed against tables in the database at intervals, and the rows returned are treated as change events.
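
As a sketch of what such a polling query looks like, the loop below repeatedly selects rows whose updated_at timestamp is newer than a high-water mark. The table and column names are illustrative, and note that deleted rows never show up in the query results, which is a known limitation of query-based CDC.

```python
# Sketch of query-based CDC: poll for rows changed since the last extraction.
# Table and column names ("customers", "updated_at") are illustrative.
import time
import mysql.connector  # the MySQL Connector/Python package

conn = mysql.connector.connect(
    host="localhost", user="reader", password="secret", database="shop"
)

last_extracted = "1970-01-01 00:00:00"  # high-water mark; persisted in practice

while True:
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        "SELECT * FROM customers WHERE updated_at > %s ORDER BY updated_at",
        (last_extracted,),
    )
    for row in cursor.fetchall():
        print(row)  # in a real pipeline, publish to Kafka or load into the sink
        last_extracted = str(row["updated_at"])  # advance the high-water mark
    cursor.close()
    time.sleep(60)  # polling interval; deletes are invisible to this approach
```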

In the last part of this series, I will demonstrate the log-based change data capture approach to data ingestion.

Thank you for reading.

References

Data Pipeline Pocket Reference by James Densmore (O’Reilly)

