Near Real Time Data Replication using Debezium
Part 1: Introduction and Implementation
Part 2: A detailed guide to build a data replication pipeline from GCP CloudSQL Postgres to BigQuery
What is Debezium?
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event. Debezium continuously monitors your databases and lets any of your applications stream every row-level change in the same order they were committed to the database.
What does it do?
Once configured, Debezium connects to the source database and takes a consistent snapshot of all tables, or a selected subset of them. After the snapshot is complete, the connector continuously captures the row-level inserts, updates, and deletes committed to the source database. The connector generates data change event records and streams them to Kafka topics. By default, all events generated for a table are streamed to a separate Kafka topic for that table, and applications and services can consume the change event records from that topic.
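Each change event record is an envelope that carries the row state before and after the change, metadata about its origin, and an operation code. A simplified sketch of what a Debezium-style update event might look like (the table, column, and timestamp values are illustrative, not from a real capture):

```shell
# Write a simplified Debezium-style change event envelope to a file.
# Structure follows Debezium's before/after/source/op layout; values are made up.
cat > sample-event.json <<'EOF'
{
  "before": { "id": 42, "email": "old@example.com" },
  "after":  { "id": 42, "email": "new@example.com" },
  "source": { "connector": "postgresql", "db": "inventory", "table": "customers" },
  "op": "u",
  "ts_ms": 1700000000000
}
EOF

# Sanity-check that the envelope is valid JSON
python3 -m json.tool sample-event.json > /dev/null && echo "valid event envelope"
```

The `op` field distinguishes creates (`c`), updates (`u`), deletes (`d`), and snapshot reads (`r`), which is what lets downstream consumers replay changes in order.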
Benefits:
Cost effective
Support for replicating DDL statements*
Selective replication
Data replication among heterogeneous databases
Use Cases:
AWS RDS to GCP CloudSQL continuous replication
Azure [MySQL Or Postgres] to GCP CloudSQL [MySQL Or Postgres] Replication
Cross Cloud DR Setup: Replication from RDS Postgres in the Mumbai region to CloudSQL Postgres in the Delhi region
Architecture:
Hardware Requirements
4 GCE Instances to host:
Zookeeper
Kafka
Confluent Platform with source and sink Debezium connectors
Monitoring Setup with Grafana + Prometheus
Port Configuration
The following ports must be open between the nodes (these are the defaults used in this guide; adjust if you have customized them):
2181 : ZooKeeper client connections
2888 & 3888 : ZooKeeper peer communication and leader election
9092 : Kafka brokers
8081 : Schema Registry
8083 : Kafka Connect REST API
9090 & 3000 : Prometheus and Grafana (monitoring node)
Installation:
Install JRE
sudo apt-get -y install default-jre
Install Kafka & ZooKeeper using the Confluent Platform
Install the Confluent public key. This key is used to sign the packages in the APT repository.
wget -qO - https://packages.confluent.io/deb/4.0/archive.key | sudo apt-key add -
Add the repository to your /etc/apt/sources.list by running this command:
sudo apt install software-properties-common
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/4.0 stable main"
Update apt-get and install Confluent Platform.
sudo apt-get update && sudo apt-get install confluent-platform
Alternatively, to install Confluent Platform using only Confluent Community components:
sudo apt-get update && sudo apt-get install confluent-platform-oss-2.11
sudo apt-get update && sudo apt-get install confluent-hub-client confluent-common confluent-kafka-2.11
Configuration
Configure Zookeeper:
It is recommended to run ZooKeeper in replicated mode. A minimum of three servers is required for replicated mode, and you must have an odd number of servers for failover. For more information, see the ZooKeeper documentation.
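The odd-number recommendation comes from ZooKeeper's majority quorum: an ensemble of N servers stays available only while a majority (floor(N/2) + 1) of them are up, so 3 servers tolerate 1 failure and 5 tolerate 2, while 4 servers still tolerate only 1. A quick sketch of the arithmetic:

```shell
# Majority quorum size for an N-node ZooKeeper ensemble: floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 3 4 5; do
  echo "ensemble=$n quorum=$(quorum $n) tolerates=$(( n - $(quorum $n) )) failures"
done
```

This is why a 4-node ensemble buys nothing over a 3-node one in terms of fault tolerance.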
Navigate to the ZooKeeper properties file (/etc/kafka/zookeeper.properties) and modify it as shown.
Node 1: 10.128.0.14
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
Node 2: 10.128.15.236
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
Node 3: 10.128.0.13
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=10.128.0.14:2888:3888
server.2=10.128.15.236:2888:3888
server.3=10.128.0.13:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
This configuration is for a three-node ensemble. The configuration file should be identical across all nodes in the ensemble.
Navigate to the ZooKeeper log directory (e.g., /var/lib/zookeeper/) and create a file named myid. The myid file consists of a single line that contains the machine ID in the format <machine-id>. When the ZooKeeper server starts up, it knows which server it is by referencing the myid file. For example, server 1 will have a myid value of 1.
Node 1: 10.128.0.14
echo "1" > /var/lib/zookeeper/myid
Node 2: 10.128.15.236
echo "2" > /var/lib/zookeeper/myid
Node 3: 10.128.0.13
echo "3" > /var/lib/zookeeper/myid
Configure Kafka
Navigate to the Apache Kafka properties file (/etc/kafka/server.properties) and customize the following:
Node 1: 10.128.0.14
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Node 2: 10.128.15.236
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Node 3: 10.128.0.13
broker.id.generation.enable=true
delete.topic.enable=true
listeners=PLAINTEXT://:9092
zookeeper.connect=10.128.0.14:2181,10.128.15.236:2181,10.128.0.13:2181
log.dirs=/var/lib/kafka
log.retention.hours=168
num.partitions=1
Startup Confluent Platform
Enable and Start ZooKeeper
sudo systemctl enable confluent-zookeeper
sudo systemctl start confluent-zookeeper
sudo systemctl status confluent-zookeeper
Enable and Start Kafka
sudo systemctl start confluent-kafka
sudo systemctl enable confluent-kafka
Enable and Start Schema Registry
sudo systemctl start confluent-schema-registry
sudo systemctl enable confluent-schema-registry
Validate the services are up and running
sudo systemctl status confluent-zookeeper
sudo systemctl status confluent-kafka
sudo systemctl status confluent-schema-registry
Configure the Debezium service in distributed mode
Navigate to /etc/kafka/connect-distributed.properties and customize the following:
On all three nodes:
bootstrap.servers=10.128.0.14:9092,10.128.15.236:9092,10.128.0.13:9092
group.id=connect-cluster
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
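Besides the three settings above, a distributed Connect worker also needs converters and its internal storage topics defined. The Confluent package ships working defaults in the same file, but the relevant keys look like this (topic names and replication factors below are assumptions — adjust them to your cluster):

```
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
```

Workers that share the same group.id and storage topics automatically form one Connect cluster and rebalance connector tasks among themselves.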
Manually start the Debezium service:
/usr/bin/connect-distributed /etc/kafka/connect-distributed.properties &
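With the workers running, connectors are created through Kafka Connect's REST API (port 8083 by default). Below is a hedged sketch of registering the Debezium Postgres source connector against CloudSQL — every connection value (host, database, user, password, table list) is a placeholder to replace with your own:

```shell
# Connector definition for the Debezium Postgres source (placeholder values)
cat > register-postgres-source.json <<'EOF'
{
  "name": "cloudsql-postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "CLOUDSQL_PRIVATE_IP",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "CHANGE_ME",
    "database.dbname": "inventory",
    "database.server.name": "cloudsql",
    "plugin.name": "pgoutput",
    "table.include.list": "public.customers"
  }
}
EOF

# Validate the JSON before posting it
python3 -m json.tool register-postgres-source.json > /dev/null && echo "config OK"

# Then register it against any worker in the cluster:
# curl -s -X POST -H "Content-Type: application/json" \
#      --data @register-postgres-source.json http://10.128.0.14:8083/connectors
```

Because the workers form one cluster, posting to any node is sufficient; the cluster distributes the connector's tasks across all three workers.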
We have successfully installed Debezium in distributed mode on 3 nodes.
Part 2: A detailed guide to build a data replication pipeline from GCP CloudSQL Postgres to BigQuery
Till next time, Happy Learning!
Saurabh