Near Real Time Data Replication using Debezium
Part 1: Introduction and Implementation
Part 2: A detailed guide to build a data replication pipeline from GCP CloudSQL Postgres to BigQuery
What is Debezium?
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event. Debezium continuously monitors your databases and lets any of your applications stream every row-level change in the same order they were committed to the database.
What does it do?
Once configured, Debezium connects to the source database and takes a consistent snapshot of all tables, or a selected subset of them. After the snapshot is complete, the connector continuously captures the row-level inserts, updates, and deletes committed to the source database. The connector generates data change event records and streams them to Kafka topics. By default, all events generated for a table are streamed to a separate Kafka topic for that table, and applications and services can consume the change event records from that topic.
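Each change event record is an envelope that carries the row state before and after the change, metadata about its origin, and an operation code. A simplified sketch of what a Debezium-style update event might look like (the table, column, and timestamp values are illustrative, not from a real capture):

```shell
# Write a simplified Debezium-style change event envelope to a file.
# Structure follows Debezium's before/after/source/op layout; values are made up.
cat > sample-event.json <<'EOF'
{
  "before": { "id": 42, "email": "old@example.com" },
  "after":  { "id": 42, "email": "new@example.com" },
  "source": { "connector": "postgresql", "db": "inventory", "table": "customers" },
  "op": "u",
  "ts_ms": 1700000000000
}
EOF

# Sanity-check that the envelope is valid JSON
python3 -m json.tool sample-event.json > /dev/null && echo "valid event envelope"
```

The `op` field distinguishes creates (`c`), updates (`u`), deletes (`d`), and snapshot reads (`r`), which is what lets downstream consumers replay changes in order.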
Benefits:
Cost effective
Support for replicating DDL statements*
Selective replication
Data replication among heterogeneous databases
Use Cases:
AWS RDS to GCP CloudSQL continuous replication
Azure [MySQL Or Postgres] to GCP CloudSQL [MySQL Or Postgres] Replication
Cross Cloud DR Setup: Replication from RDS Postgres in the Mumbai region to CloudSQL Postgres in the Delhi region
Architecture:
Hardware Requirements
4 GCE Instances to host:
Zookeeper
Kafka
Confluent Platform with source and sink Debezium connectors
Monitoring Setup with Grafana + Prometheus
Port Configuration
The following ports must be open between the nodes (these are the defaults used in this guide; adjust if you have customized them):
2181 : ZooKeeper client connections
2888 & 3888 : ZooKeeper peer communication and leader election
9092 : Kafka brokers
8081 : Schema Registry
8083 : Kafka Connect REST API
9090 & 3000 : Prometheus and Grafana (monitoring node)
Installation:
Install JRE
sudo apt-get -y install default-jre
Install Kafka & ZooKeeper using the Confluent Platform
Install the Confluent public key. This key is used to sign the packages in the APT repository.
wget -qO - https://packages.confluent.io/deb/4.0/archive.key | sudo apt-key add -
Add the repository to your /etc/apt/sources.list by running this command:
sudo apt install software-properties-common
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/4.0 stable main"
Update apt-get and install Confluent Platform.
sudo apt-get update && sudo apt-get install confluent-platform
Alternatively, to install Confluent Platform using only Confluent Community components:
sudo apt-get update && sudo apt-get install confluent-platform-oss-2.11
sudo apt-get update && sudo apt-get install confluent-hub-client confluent-common confluent-kafka-2.11
Configuration
Configure Zookeeper:
It is recommended to run ZooKeeper in replicated mode. A minimum of three servers is required for replicated mode, and you must have an odd number of servers for failover. For more information, see the ZooKeeper documentation.
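The odd-number recommendation comes from ZooKeeper's majority quorum: an ensemble of N servers stays available only while a majority (floor(N/2) + 1) of them are up, so 3 servers tolerate 1 failure and 5 tolerate 2, while 4 servers still tolerate only 1. A quick sketch of the arithmetic:

```shell
# Majority quorum size for an N-node ZooKeeper ensemble: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 3 4 5; do
  echo "ensemble=$n quorum=$(quorum $n) tolerates=$(( n - $(quorum $n) )) failures"
done
```

This is why a 4-node ensemble buys nothing over a 3-node one in terms of fault tolerance.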
Navigate to the ZooKeeper properties file (/etc/kafka/zookeeper.properties) and modify it as shown.
Node 1: 10.128.0.14
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
Node 2: 10.128.15.236
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
Node 3: 10.128.0.13
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
This configuration is for a three-node ensemble. The configuration file should be identical across all nodes in the ensemble.
Navigate to the ZooKeeper log directory (e.g., /var/lib/zookeeper/) and create a file named myid. The myid file consists of a single line that contains the machine ID in the format <machine-id>. When the ZooKeeper server starts up, it knows which server it is by referencing the myid file. For example, server 1 will have a myid value of 1.
Node 1: 10.128.0.14
echo "1" > /var/lib/zookeeper/myid
Node 2: 10.128.15.236
echo "2" > /var/lib/zookeeper/myid
Node 3: 10.128.0.13
echo "3" > /var/lib/zookeeper/myid
Configure Kafka
Navigate to the Apache Kafka properties file (/etc/kafka/server.properties) and customize the following:
Node 1: 10.128.0.14
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Node 2: 10.128.15.236
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Node 3: 10.128.0.13
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Startup Confluent Platform
Enable and Start ZooKeeper
sudo systemctl enable confluent-zookeeper
sudo systemctl start confluent-zookeeper
sudo systemctl status confluent-zookeeper
Enable and Start Kafka
sudo systemctl start confluent-kafka
sudo systemctl enable confluent-kafka
Enable and Start Schema Registry
sudo systemctl start confluent-schema-registry
sudo systemctl enable confluent-schema-registry
Validate the services are up and running
sudo systemctl status confluent-zookeeper
sudo systemctl status confluent-kafka
sudo systemctl status confluent-schema-registry
Configure the Debezium service in distributed mode
Navigate to /etc/kafka/connect-distributed.properties and customize the following:
On all three nodes:
bootstrap.servers=10.128.0.14:9092,10.128.15.236:9092,10.128.0.13:9092
group.id=connect-cluster
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
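Besides the three settings above, a distributed Connect worker also needs converters and its internal storage topics defined. The Confluent package ships working defaults in the same file, but the relevant keys look like this (topic names and replication factors below are assumptions — adjust them to your cluster):

```
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
```

Workers that share the same group.id and storage topics automatically form one Connect cluster and rebalance connector tasks among themselves.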
Manually start the Debezium service:
/usr/bin/connect-distributed /etc/kafka/connect-distributed.properties &
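With the workers running, connectors are created through Kafka Connect's REST API (port 8083 by default). Below is a hedged sketch of registering the Debezium Postgres source connector against CloudSQL — every connection value (host, database, user, password, table list) is a placeholder to replace with your own:

```shell
# Connector definition for the Debezium Postgres source (placeholder values)
cat > register-postgres-source.json <<'EOF'
{
  "name": "cloudsql-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "CLOUDSQL_PRIVATE_IP",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "CHANGE_ME",
    "database.dbname": "inventory",
    "database.server.name": "cloudsql",
    "plugin.name": "pgoutput",
    "table.include.list": "public.customers"
  }
}
EOF

# Validate the JSON before posting it
python3 -m json.tool register-postgres-source.json > /dev/null && echo "config OK"

# Then register it against any worker in the cluster:
# curl -s -X POST -H "Content-Type: application/json" \
#      --data @register-postgres-source.json http://10.128.0.14:8083/connectors
```

Because the workers form one cluster, posting to any node is sufficient; the cluster distributes the connector's tasks across all three workers.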
We have successfully installed Debezium in distributed mode on 3 nodes.
Part 2: A detailed guide to build a data replication pipeline from GCP CloudSQL Postgres to BigQuery
Till next time, Happy Learning!
Saurabh