Simple CDC Service using Debezium and Apache Kafka
Background Story
In many cases, data from applications or interfaces is stored in an OLTP database right after it is captured. Because of analytical needs, we sometimes have to move that data into an analytics-oriented OLAP database. The problem is that the query used to fetch the data consumes too much memory and slows the application down.
This figure shows the data transaction trend over the last 30 days (in thousands).
The solution explained in this article is to stream every data change in the OLTP database to the OLAP database using Change Data Capture (CDC).
What Is This Article’s Goal?
This article explains how to use Debezium, a simple Change Data Capture (CDC) tool, to capture data changes from a MySQL OLTP database, and how to use the resulting change stream to synchronize the data with an OLAP database.
Architecture
This figure shows the complete architecture of the solution.
The services used in this solution are listed below.
- MySQL as a sample OLTP database.
- Debezium as the CDC tool.
- Apache Kafka as the message broker.
- A Python-based Kafka consumer.
- ClickHouse as a sample OLAP database.
All of the services run inside Docker containers, as predefined in the docker-compose.yaml script.
Build
- Clone the source code from the GitHub repository.
git clone https://github.com/ilhamhanif/simple-cdc-service.git
cd simple-cdc-service
- Set up and activate a Python virtual environment.
python -m venv .
. ./bin/activate
- Install the required Python modules.
The important Python modules are listed in `requirements.txt`.
pip install -r requirements.txt
- Build and start all services.
Make sure Docker is already installed by running the command below.
docker --version
If Docker is not installed, follow this documentation for the installation process.
To build and start all services, use the command below.
docker compose build && docker compose up -d
Note:
- When the -d (detached) flag is used, no build output or logs are printed to the terminal. To access the logs of a specific service, use docker container logs [service name].
Demonstration
Below is the order of the demonstration steps.
- Inspecting the source database (MySQL) and the target database (ClickHouse).
- Creating the Debezium connector configuration.
- Running the data sequencer.
- Testing data consistency between the two databases.
- Cleaning up.
1. Inspecting Source Database MySQL and Target Database ClickHouse
Before we start, we have to check the source database (MySQL) and the target database (ClickHouse) to make sure the initial script for each database ran successfully.
MySQL Database
A database dev and a table invoice are preinstalled by the script in ./mysql/init-script/script.sql. The database is accessible through localhost:3306 with username root and password root, as defined in the docker-compose.yaml file. I checked it using DBeaver.
ClickHouse Database
A database dev and the tables invoice and kafka_message_log are preinstalled by the script in ./clickhouse/init-script/db-table-init.sql. The database is accessible through localhost:8123 with username default and no password, as defined in the docker-compose.yaml file. I checked it using the same tool, DBeaver.
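The same kind of scripted check works for ClickHouse through its HTTP interface; a minimal sketch, assuming the requests package:

```python
# Minimal connectivity check for the ClickHouse container over HTTP.
# Assumes the requests package: pip install requests
import requests

# /ping returns "Ok." when the server is up
print(requests.get("http://localhost:8123/ping").text.strip())

# Run a query as the default user (no password)
resp = requests.post("http://localhost:8123/",
                     data="SELECT count() FROM dev.invoice")
print("rows in dev.invoice:", resp.text.strip())
```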
2. Creating a Debezium Connector Configuration
Debezium needs a configuration containing the information listed below.
- Database location
- Which data-change operations to include
- Kafka location
To simplify the process, I have made a simple bash script, debezium-config-push.sh. This script basically sends an HTTP POST request using curl to the Debezium connector endpoint (Kafka Connect's REST API, which listens on localhost:8083 by default) with the connector configuration as the request body.
Run the script with the command below.
bash debezium-config-push.sh
Make sure you get the same response as shown here.
Explanation:
- Connector name: mysql-schema-dev-connector
- Source database location: mysql:3306, with username root and password root
- Database name: dev
- Kafka location: kafka:9092
- Kafka topic name prefix: source. After the configuration is pushed, Debezium creates topics with the format source.[database-name].[table-name], so each table in each database gets its own topic.
- Included operations: all. By default, the TRUNCATE operation is ignored by Debezium.
- The other configuration options follow this documentation.
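For illustration, here is a Python sketch of what such a registration request looks like. The property names follow Debezium 2.x conventions and are my assumption; the repo's debezium-config-push.sh is the authoritative version.

```python
# A sketch of the connector registration request (assumed Debezium 2.x
# property names; the repo's bash script is authoritative).
import requests

config = {
    "name": "mysql-schema-dev-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",      # service name inside Docker
        "database.port": "3306",
        "database.user": "root",
        "database.password": "root",
        "database.server.id": "1",
        "database.include.list": "dev",
        "topic.prefix": "source",          # topics: source.<db>.<table>
        "skipped.operations": "none",      # include TRUNCATE as well
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.dev",
    },
}

# Kafka Connect's REST API accepts new connectors on POST /connectors
resp = requests.post("http://localhost:8083/connectors", json=config)
print(resp.status_code, resp.json())
```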
3. Running the Data Sequencer
Sequencer was made to automatically did multiple random DML (1800 DML) statement of INSERT/UPDATE/DELETE record to MySql database. The sequencer had written using Python as sequencer.py
. Run the script using command below.
python sequencer.py
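For a sense of what the sequencer does, here is a minimal sketch, assuming PyMySQL and a hypothetical amount column; the repo's sequencer.py is the authoritative version.

```python
# A minimal sketch of a random DML sequencer (hypothetical column
# names; the real sequencer.py may differ). Assumes PyMySQL.
import random
import pymysql

conn = pymysql.connect(host="localhost", port=3306,
                       user="root", password="root", database="dev")
with conn.cursor() as cur:
    for _ in range(1800):
        op = random.choice(["insert", "update", "delete"])
        if op == "insert":
            cur.execute("INSERT INTO invoice (amount) VALUES (%s)",
                        (random.randint(1, 1000),))
        elif op == "update":
            cur.execute("UPDATE invoice SET amount = %s "
                        "ORDER BY RAND() LIMIT 1",
                        (random.randint(1, 1000),))
        else:
            cur.execute("DELETE FROM invoice ORDER BY RAND() LIMIT 1")
        conn.commit()
conn.close()
```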
Any database record change is captured by Debezium and forwarded to Kafka. A Kafka consumer service fetches the forwarded data via its subscription and, for every message it receives, performs two actions:
- Insert the original Kafka message into the table dev.kafka_message_log.
- Using the information in the message, perform the corresponding INSERT/UPDATE/DELETE on the table dev.invoice.
An example Kafka message for each method is stored in ./kafka-consumer-service/example/. The core of such a consumer is sketched below.
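A minimal sketch, assuming the kafka-python package and Kafka reachable from the host on localhost:9092 (the real kafka-consumer-service may differ):

```python
# A minimal sketch of the consumer loop (hypothetical; the repo's
# kafka-consumer-service is authoritative). Assumes kafka-python.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "source.dev.invoice",                  # <topic.prefix>.<database>.<table>
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")) if m else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    if message.value is None:              # tombstone after a DELETE
        continue
    event = message.value.get("payload", message.value)
    op = event["op"]                       # "c"=insert, "u"=update, "d"=delete
    row = event["after"] if op in ("c", "u") else event["before"]
    # 1) store the raw message in dev.kafka_message_log
    # 2) replay the change as INSERT/UPDATE/DELETE on dev.invoice
    print(op, row)
```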
4. Testing Data Consistency between the Two Databases
After the sequencer finishes, the ClickHouse and MySQL databases will have EXACTLY the SAME records in the corresponding tables.
I’ve made a simple Python test script in /tests/test_consistency.py using pytest. The script runs 2 tests:
- Check the row count between the two tables. It must be EXACTLY the SAME.
- Check every row between the two tables. They must be EXACTLY the SAME.
Test command:
pytest -vv tests/
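For illustration, the row-count check might look like the sketch below (hypothetical; the repo's tests/test_consistency.py is authoritative). It assumes PyMySQL and requests.

```python
# A sketch of a row-count consistency test (hypothetical; see the
# repo's tests/test_consistency.py for the real version).
import pymysql
import requests

def test_row_count_matches():
    # Count rows on the MySQL side
    conn = pymysql.connect(host="localhost", port=3306,
                           user="root", password="root", database="dev")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM invoice")
        mysql_count = cur.fetchone()[0]
    conn.close()

    # Count rows on the ClickHouse side via the HTTP interface
    resp = requests.post("http://localhost:8123/",
                         data="SELECT count() FROM dev.invoice")
    clickhouse_count = int(resp.text.strip())

    assert mysql_count == clickhouse_count
```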
A successful test run is shown in this image.
5. Cleaning Up
Use the command below to clean up everything.
docker compose down
Use the command below to deactivate the virtual environment.
deactivate
References
All references (but not limited to) used to build this article.