Simple CDC Service using Debezium and Apache Kafka

Muhammad Ilham H · Published in EDTS
May 3, 2024 · 5 min read

Background

In many cases, data from applications or interfaces is stored in an OLTP database right after it is captured. For analytical needs, we sometimes have to move that data into an analytics-oriented OLAP database. The problem is that the queries used to fetch the data consume too much memory and slow the application down.

This figure shows the data transaction trend within the last 30 days (in thousands).

The solution explained in this article is to stream every data change in the OLTP database to the OLAP database using Change Data Capture (CDC).

What Is This Article's Goal?

This article explains how to use a simple Change Data Capture (CDC) setup with Debezium to capture data changes from a MySQL OLTP database, and how to use the resulting change stream to synchronize the data with an OLAP database.

Architecture

This figure shows the complete architecture of the solution.

The services used in this solution are: a MySQL database (source), a Debezium connector, an Apache Kafka broker, a Kafka consumer service, and a ClickHouse database (target).

All of the services run inside Docker containers, predefined in the docker-compose.yaml file.

Build

  • Clone the source code from GitHub Repository.
git clone https://github.com/ilhamhanif/simple-cdc-service.git
cd simple-cdc-service
  • Set up and activate a Python virtual environment.
python -m venv .
. bin/activate
  • Install the required Python modules.

The required Python modules are listed in `requirements.txt`.

pip install -r requirements.txt
  • Build and start all services.

Make sure Docker is already installed, using the command below.

docker --version

If Docker is not installed, follow this documentation for the installation process.

To build and start all services, use the command below.

docker compose build && docker compose up -d

Note:

  • When the -d flag is used, the services run in detached mode, so no build output or logs are printed to the terminal. To view the logs of a specific service, use docker container logs [service name]

Demonstration

The demonstration follows the order below.

  1. Inspecting the source database (MySQL) and the target database (ClickHouse).
  2. Creating the Debezium connector configuration.
  3. Running the data sequencer.
  4. Testing the data consistency between the two databases.
  5. Cleaning up.

1. Inspecting the Source Database MySQL and the Target Database ClickHouse

Before we start, we check the source database (MySQL) and the target database (ClickHouse) to make sure the initialization script for each database ran successfully.

MySQL Database

A database dev and a table invoice are already created by the script in ./mysql/init-script/script.sql. The database is accessible through localhost:3306 with username root and password root, as defined in the docker-compose.yaml file. I checked it using DBeaver.
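
The same check can also be done from code instead of a GUI. Below is a minimal sketch; the pymysql driver is an assumption here, and any MySQL client works just as well.

import pymysql

# Connect with the credentials from docker-compose.yaml (root/root)
# and list the tables in the `dev` database.
conn = pymysql.connect(host="localhost", port=3306, user="root", password="root")
with conn.cursor() as cur:
    cur.execute("SHOW TABLES FROM dev")
    print(cur.fetchall())  # expect: (('invoice',),)
conn.close()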

ClickHouse Database

A database dev and the tables invoice and kafka_message_log are already created by the script in ./clickhouse/init-script/db-table-init.sql. The database is accessible through localhost:8123 with username default and no password, as defined in the docker-compose.yaml file. I checked it using the same tool, DBeaver.
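
For ClickHouse, the same sanity check can go through its HTTP interface on port 8123, for example:

import requests

# ClickHouse answers plain-text queries over HTTP; user `default` with an
# empty password, as defined in docker-compose.yaml.
resp = requests.get(
    "http://localhost:8123",
    params={"query": "SHOW TABLES FROM dev"},
    auth=("default", ""),
)
print(resp.text)  # expect: invoice and kafka_message_log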

2. Creating a Debezium Connector Configuration

Debezium needs a configuration that contains the information listed below.

  1. Database location
  2. Which data-change operations to include
  3. Kafka location

To simplify the process, I have made a simple bash script, debezium-config-push.sh. This script basically does an HTTP POST request using curl to the connector registration endpoint of the Kafka Connect REST API (by default on localhost:8083), with the connector configuration as the request body. A Python sketch of the same request appears after the explanation list below.

Run the script with command below.

bash debezium-config-push.sh

Make sure you get the same response as shown here.

Explanation:

  • Connector name: mysql-schema-dev-connector
  • Source database location: mysql:3306 with login username root and password root
  • Database name: dev
  • Kafka location: kafka:9092
  • Kafka topic name prefix: source. By default, Debezium creates a topic named source.[database-name].[table-name] once the configuration is pushed; therefore, each table in each database gets its own topic.
  • Operations included: all. By default, the TRUNCATE operation is ignored by Debezium.
  • The other configuration options follow this documentation.
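
For illustration, here is a minimal Python sketch of the request that debezium-config-push.sh performs. The property names follow the Debezium MySQL connector documentation; the exact JSON in the repository may differ, and the server id and schema-history topic name below are assumptions.

import json
import requests

# Register the connector via the Kafka Connect REST API. Values mirror the
# explanation above.
config = {
    "name": "mysql-schema-dev-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "root",
        "database.password": "root",
        "database.server.id": "1",        # must be unique per MySQL cluster
        "database.include.list": "dev",
        "topic.prefix": "source",         # topics become source.dev.[table]
        "skipped.operations": "none",     # include TRUNCATE as well
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.dev",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
print(resp.status_code, resp.text)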

3. Running the Data Sequencer

The sequencer automatically performs 1800 random DML statements (INSERT/UPDATE/DELETE of records) against the MySQL database. It is written in Python as sequencer.py. Run the script using the command below.

python sequencer.py
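
For reference, a rough sketch of what such a sequencer can look like is below. The column name amount is hypothetical; the real schema lives in ./mysql/init-script/script.sql, and the actual sequencer.py may work differently.

import random
import pymysql

# Fire 1800 random INSERT/UPDATE/DELETE statements at dev.invoice.
conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="root", database="dev", autocommit=True)
with conn.cursor() as cur:
    for _ in range(1800):
        op = random.choice(["insert", "update", "delete"])
        if op == "insert":
            cur.execute("INSERT INTO invoice (amount) VALUES (%s)",
                        (random.randint(1, 1000),))
        elif op == "update":
            cur.execute("UPDATE invoice SET amount = %s ORDER BY RAND() LIMIT 1",
                        (random.randint(1, 1000),))
        else:
            cur.execute("DELETE FROM invoice ORDER BY RAND() LIMIT 1")
conn.close()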

Any database record change is captured by Debezium and forwarded to Kafka. A Kafka consumer service fetches the data forwarded to Kafka via its subscriber and performs the actions corresponding to each message it captures:

  1. Insert the original message from Kafka into the table dev.kafka_message_log.
  2. Using the information in the message, perform the corresponding INSERT/UPDATE/DELETE on the table dev.invoice.

Example Kafka messages for each method are stored in ./kafka-consumer-service/example/.
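
To illustrate the consumer side, below is a minimal sketch of such a loop using the kafka-python client (an assumption; the repository's consumer service may be written differently). Debezium's change-event envelope carries an op field ('c' create, 'u' update, 'd' delete, 'r' snapshot read) plus before/after row images.

import json
from kafka import KafkaConsumer

# Read Debezium change events for dev.invoice and route them by operation.
consumer = KafkaConsumer(
    "source.dev.invoice",
    bootstrap_servers="kafka:9092",  # as seen from inside the Compose network
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:  # tombstone records carry no payload
        continue
    payload = message.value["payload"]
    op, before, after = payload["op"], payload["before"], payload["after"]
    # 1. insert the raw message into dev.kafka_message_log (omitted here)
    # 2. apply the change to dev.invoice
    if op in ("c", "r"):
        print("INSERT", after)
    elif op == "u":
        print("UPDATE", before, "->", after)
    elif op == "d":
        print("DELETE", before)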

4. Testing Data Consistency Between the Two Databases

After the sequencer finishes, the ClickHouse and MySQL databases will hold EXACTLY the SAME records in the corresponding table.

I've made a simple Python test script in /tests/test_consistency.py using pytest. This test script has two tests, sketched after the list below.

  1. Check the row count between the two tables. It must be EXACTLY the SAME.
  2. Check each row between the two tables. It must be EXACTLY the SAME.
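
A condensed sketch of what those two tests can look like follows. The ORDER BY id column is an assumption about the schema; the real assertions in tests/test_consistency.py may differ.

import pymysql
import requests

def fetch_mysql_rows():
    # Rows from the source table, ordered for a stable comparison.
    conn = pymysql.connect(host="localhost", port=3306, user="root",
                           password="root", database="dev")
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM invoice ORDER BY id")
        rows = cur.fetchall()
    conn.close()
    return rows

def fetch_clickhouse_rows():
    # Rows from the target table via the ClickHouse HTTP interface.
    resp = requests.get(
        "http://localhost:8123",
        params={"query": "SELECT * FROM dev.invoice ORDER BY id FORMAT TSV"},
        auth=("default", ""),
    )
    return [tuple(line.split("\t")) for line in resp.text.splitlines()]

def test_row_count_is_exactly_same():
    assert len(fetch_mysql_rows()) == len(fetch_clickhouse_rows())

def test_each_row_is_exactly_same():
    # Normalize every value to a string before comparing row by row.
    mysql_rows = [tuple(str(v) for v in row) for row in fetch_mysql_rows()]
    assert mysql_rows == fetch_clickhouse_rows()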

Test command:

pytest -vv tests/

A successful test run is shown in this image.

5. Cleaning Up

Use the command below to clean up everything.

docker compose down

Use the command below to deactivate the virtual environment.

deactivate
