Change Data Capture and Kafka

Sougata Bagchi
Jul 26, 2020

Apache Kafka is an open source streaming platform that delivers high-performance, idempotent messaging. There are many distributions of Kafka available; for the present discussion we will consider Confluent Platform. I shall try to cover CDC with Red Hat AMQ sometime in the future.

What is CDC?

Change Data Capture (CDC) is a process for capturing, tracking, and delivering changes from source datasets to a target (sink) system or a streaming platform. Traditional data integration pipelines focus on Extract, Transform, and Load (ETL), or ELT in modern Big Data architectures, whereas change data capture focuses on extracting only the changed data from the source system and delivering it to the target.

Where is Kafka in this game?

Kafka is used to ingest and transport data in an ETL/ELT pipeline. Data is read from and written to various databases and other endpoints using Kafka producers and consumers. Low latency, high availability, scalability, and persistence enable Kafka to serve as a highly efficient pipeline for moving data in real-time integration scenarios.

Integration Patterns

There are many data integration architectural patterns, such as microservice, data lake, NoSQL-based, and streaming-first approaches. Streaming-first is a pattern where data is ingested directly into a streaming platform using CDC and Kafka. Kafka retains the data until it can be consumed by downstream systems. The core of this concept is to put the data into the streaming platform first, instead of loading it into a traditional database or data warehouse.

CDC is certainly seen as a good enabler of data integration for high-performance messaging systems. There are two popular ways to bring change data into Kafka: pairing CDC with custom Kafka producers, and pairing CDC with Kafka Connect. We will look into both of them.

There are many CDC processors available in the market: IBM InfoSphere Data Replication (IIDR), Debezium, Attunity, and Oracle GoldenGate, to name a few.

Capturing data with CDC

The CDC process consists of two functions: capturing the changed data and enabling replication of the changed data. In our discussion, Kafka becomes the persistence and transit layer in which CDC changes are replicated and delivered.

There are three ways we can capture data with CDC.

1. Source System Triggers:

Database triggers capture create, update, and delete (CUD) changes into a separate change-capture table, which is then replicated to Kafka by a CDC processor.

2. Query-based:

A CDC processor queries the source database, identifies changed rows using timestamp, version number, or status columns, and copies the new data to Kafka (a minimal sketch of this approach appears below).

3. Log-based:

The CDC processor reads the transaction logs of the source database (redo logs, binlogs, or write-ahead logs, depending on the database) and identifies changes to be replicated to Kafka.

Log-based data capture is considered to have minimal impact on the source system and better efficiency compared to the other two approaches. The other methods are typically explored where issues with access to the transaction logs exist.
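To make the query-based approach concrete, here is a minimal sketch of a poller written in Java. The orders table, its updated_at column, the connection details, and the topic name are all illustrative assumptions rather than part of any particular CDC product; a real processor would also persist the high-water mark and handle deletes, which this approach cannot see.

import java.sql.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QueryBasedCdcPoller {

    public static void main(String[] args) throws Exception {
        // Plain string key/value serialization, kept simple for the sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/shop", "cdc_user", "secret")) {

            // High-water mark: only rows changed after this timestamp are picked up.
            Timestamp lastSeen = Timestamp.valueOf("1970-01-01 00:00:00");

            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, status, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at");

            while (true) {
                stmt.setTimestamp(1, lastSeen);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        String key = rs.getString("id");
                        String value = rs.getString("status");
                        lastSeen = rs.getTimestamp("updated_at");
                        // Each changed row becomes one Kafka record on the change topic.
                        producer.send(new ProducerRecord<>("orders.changes", key, value));
                    }
                }
                producer.flush();
                Thread.sleep(5_000); // poll interval; deletes are invisible to this method
            }
        }
    }
}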

Getting into the Business

In Figure 1 below, records are extracted by a CDC method and written into Kafka, and multiple sinks and targets consume the streaming data. In this simplistic approach, the CDC replication process is shared between the CDC system and Kafka producers; the assumption is that the captured data will be serialized and written to a Kafka topic by the producers. Producers can serialize records with Apache Avro, one option for a schema-aware serializer, and use Schema Registry to ensure high throughput and consistency in the data stream pipeline.

Figure 1: CDC process with Kafka broker
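To illustrate the producer side of Figure 1, here is a minimal sketch of a Java producer that serializes change records with Apache Avro using Confluent's KafkaAvroSerializer and Schema Registry. The schema fields, topic name, and addresses are illustrative assumptions.

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroCdcProducer {

    // Illustrative Avro schema for a captured change event.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"OrderChange\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\"},"
          + "{\"name\":\"changed_at\",\"type\":\"long\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the schema in Schema Registry on first use.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord change = new GenericData.Record(schema);
            change.put("id", "42");
            change.put("status", "SHIPPED");
            change.put("changed_at", System.currentTimeMillis());

            // Keying by the source row's primary key keeps all changes for that
            // row in the same partition, preserving their order for consumers.
            producer.send(new ProducerRecord<>("orders.changes", "42", change));
            producer.flush();
        }
    }
}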

The Next Level

Now that we have seen how the basic arrangement works, we will place the CDC processors behind the Kafka Connect API (shipped with Confluent Platform) to remove the need for custom producer code, configuration scripting, and monitoring, and to simplify management of the replication process. Figure 2 shows how this works, and it can be achieved with little effort and adjustment.

The complete replication is now managed by a CDC processor such as IIDR, GoldenGate, Debezium (open source), or the likes of Attunity. Other CDC connectors are also available. Simple configuration of Kafka Connect for these CDC processors removes the need for the additional producer components.
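As an example of how little configuration this involves, below is a minimal Debezium MySQL source connector definition of the kind submitted to the Kafka Connect REST API. The hostnames, credentials, and database names are placeholders, and exact property names vary between Debezium versions, so treat this as a sketch rather than a copy-paste configuration.

{
  "name": "inventory-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.com",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.server.name": "inventory-server",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

Once this definition is POSTed to the Connect REST endpoint (by default on port 8083, path /connectors), Kafka Connect streams row-level changes from the MySQL binlog into Kafka topics without any custom producer code.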

On the consumer side, there are ample options for delivering the streaming data to various target systems (see the sketch below).

Figure 2: CDC with Kafka Connect
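As one such option, a downstream service can read the change stream with a plain Kafka consumer and apply each record to its own target system. Here is a minimal Java sketch, assuming the string-keyed orders.changes topic from earlier; an Avro consumer would use Confluent's KafkaAvroDeserializer instead.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CdcChangeConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-sink");        // consumer group for this target system
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");  // replay the full change history on first run

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders.changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Apply the change to the target system (database, search index, cache, ...).
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}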

In the end

Kafka itself is just a streaming platform that transfers byte streams. Therefore, the source metadata captured by CDC is critical to a fully optimized transaction streaming pipeline. The key concerns and considerations here are the data definitions (schemas) and the serialization mechanism chosen for an effective, reliable transaction pipeline.

There is another important design issue to consider: how topic schemas and partitions should be designed to meet long-standing performance and latency requirements. We shall try to look into this in the next blog in the near future.

Disclaimer: Views presented here are personal.
