A Quick Start Guide to Change Data Capture with Debezium

Shekhar Parnerkar
Hashmap, an NTT DATA Company
May 3, 2021 · 9 min read

Change Data Capture (CDC) refers to sourcing change events from a source database. Typical data ingestion solutions, such as Sqoop, rely on a check column (such as "CREATE_DATE" or "LAST_MODIFIED_DATE") being present in every table. Unfortunately, this is not always the case. Moreover, such tools cannot detect delete events. Therefore, they cannot be used in situations where you want to mirror a logically related set of source tables in a data platform.

This is where CDC fits in. While there are commercial solutions on the market, Debezium is an open-source option. In this blog post, I am going to show you how to install the Debezium MySQL Connector on Ubuntu machines using AWS EC2 instances. So let's get started!

Step 1 — Getting Ready

To execute the steps outlined in this blog post, you will need an AWS account and some familiarity with creating and connecting to EC2 instances on AWS. On Windows desktops, you will need PuTTY and WinSCP; macOS and Linux users can use their terminal. I also assume that you have created a key pair for your AWS account to access your instances.

We will create two Linux machines on EC2: one will act as the MySQL server and the other will host Debezium. Please follow the steps below:

  1. Log in to your AWS Console
  2. From EC2 Menu, select Launch Instances
  3. From the Quick Start Menu, select Ubuntu Server 18.04 LTS (HVM), SSD Volume Type 64-bit (x86)
  4. Choose Instance Type t2.micro (default) from the next menu
  5. Click Review and Launch
  6. Go to Edit Security Group -> Add Rule
  7. Select Type — MySQL/Aurora and Source — Anywhere
  8. Launch Instance and go to Instances panel in EC2 Menu
  9. Wait for the instance to launch completely, until State is Running and Status Checks show 2/2
  10. Select the instance and edit its name to 'MySQL Server'. Note its Public IPv4 DNS from the pane below; this machine will serve as our MySQL server.

Create another instance following the same steps as above, but choose instance type 't2.small' (note that this type is not free-tier eligible) and name the instance 'Debezium Server'. We will use it to install Debezium, Kafka, and Zookeeper.

Step 2 — Installing and Configuring MySQL Server

Connect to your MySQL Server instance using PuTTY or ssh and follow the steps below to create a source database in MySQL:

  1. Update Repo:

$ sudo apt-get update

2. Install MySQL Server:

$ sudo apt-get install mysql-server

3. Login to MySQL:

$ sudo mysql -u root -p

4. Specify the root password as ‘root’

5. Create user ‘debezium’ with password ‘dbz’ for our demo purpose:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'debezium'@'%' IDENTIFIED BY 'dbz';

mysql> FLUSH PRIVILEGES;
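Note that the IDENTIFIED BY clause of GRANT was removed in MySQL 8. Ubuntu 18.04 ships MySQL 5.7, where the command above works as written; on MySQL 8 you would create the user first and then grant privileges:

mysql> CREATE USER 'debezium'@'%' IDENTIFIED BY 'dbz';

mysql> GRANT ALL PRIVILEGES ON *.* TO 'debezium'@'%';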

6. Check Binary Logging Status with the following commands:

mysql> set @@global.show_compatibility_56=ON;

mysql> SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"

-> FROM information_schema.global_variables WHERE variable_name='log_bin';

7. By default, binary logging is set to OFF, but Debezium needs it to be ON. Log out of the mysql shell and edit the MySQL configuration file:

$ sudo vi /etc/mysql/mysql.conf.d/mysqld.cnf

8. Then search for the line that contains bind-address and comment it out, which will allow our MySQL service to be accessed from anywhere. In the same file, enable binary logging, then save the file.
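A minimal sketch of the relevant edits in the [mysqld] section (the server-id value is arbitrary but must be unique among your MySQL servers; the ROW format and FULL row image are what Debezium expects):

#bind-address = 127.0.0.1
server-id = 223344
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 10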

9. Now, we need to restart MySQL service for our changes to take effect:

$ sudo service mysql restart

10. Log in to the MySQL shell again and use the following commands to verify that BINARY LOGGING has been set to ON:

mysql> set @@global.show_compatibility_56=ON;

mysql> SELECT variable_value as "BINARY LOGGING STATUS (log-bin) ::"

-> FROM information_schema.global_variables WHERE variable_name='log_bin';

Our MySQL server is now set up for Change Data Capture. When we first connect to it via the Debezium connector, the entire database will be treated as change events and loaded into Kafka. Therefore, I suggest you start with a small set of test data: one database, 2–3 tables, and 50–100 rows in each table.
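For instance, a hypothetical test database could be set up like this (all names and values here are placeholders):

mysql> CREATE DATABASE inventory;

mysql> CREATE TABLE inventory.customers (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(100));

mysql> INSERT INTO inventory.customers VALUES (1, 'Alice', 'alice@example.com'), (2, 'Bob', 'bob@example.com');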

Step 3 — Installing Zookeeper

Connect to your Debezium Server instance and follow the steps below:

  1. Install Java-8:

$ sudo apt-get install openjdk-8-jdk

2. Download Zookeeper from the Apache Zookeeper site:

$ wget https://mirrors.estointernet.in/apache/zookeeper/zookeeper-3.6.2/apache-zookeeper-3.6.2-bin.tar.gz

3. Extract the Zookeeper tarball:

$ tar -xvzf apache-zookeeper-3.6.2-bin.tar.gz

4. Rename the extracted Zookeeper directory for ease of use:

$ mv apache-zookeeper-3.6.2-bin zookeeper

5. We need a configuration file to start the Zookeeper service. Let's use the provided sample file by copying it to zoo.cfg:

$ cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg

We now move on to installing Kafka on the same machine.

Step 4 — Installing Kafka

  1. Download Kafka from the Apache Kafka site. We will use version 2.5.0, built with Scala 2.12:

$ wget https://mirrors.estointernet.in/apache/kafka/2.5.0/kafka_2.12-2.5.0.tgz

2. Extract Kafka Tarball:

$ tar -xvzf kafka_2.12-2.5.0.tgz

3. Rename the Kafka directory for ease of use:

$ mv kafka_2.12-2.5.0 kafka

Step 5 — Setting Up the Debezium MySQL Connector

  1. Download the Debezium connector jar files from Maven Central. We will use stable version 1.3.1.Final for this exercise:

$ wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/1.3.1.Final/debezium-connector-mysql-1.3.1.Final-plugin.tar.gz

2. The Debezium connector runs as a plugin inside Kafka Connect, and we will keep connector plugins in a dedicated directory, /kafka/connect. Extract the downloaded archive, create the plugin directory, and move the connector there:

$ tar -xvzf debezium-connector-mysql-1.3.1.Final-plugin.tar.gz

$ sudo mkdir /kafka

$ sudo mkdir /kafka/connect

$ sudo mv debezium-connector-mysql /kafka/connect

3. Next, we make the required configuration on the Kafka side. First, we let Kafka know that its Connect service should load our Debezium MySQL connector from /kafka/connect. For this, let's edit the connect-standalone.properties file in the kafka/config directory, removing the # from the last line and setting the plugin path:

$ vi kafka/config/connect-standalone.properties
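After the edit, the last line of the file should read:

plugin.path=/kafka/connect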

4. Now we configure the Debezium MySQL connector. Create a new properties file named connect-debezium-mysql.properties, paste the contents below into it, change the values in angle brackets to suit your sample data, and save the file in the kafka/config directory:

$ vi kafka/config/connect-debezium-mysql.properties

name=test-connector
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=<Private IPv4 DNS of your MySQL EC2 instance>
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=1
database.server.name=mysql
database.include.list=<test database names from your MySQL server>
table.include.list=<table names to include, as database_name.table_name, separated by commas>
database.history.kafka.bootstrap.servers=localhost:9092
database.history.kafka.topic=dbhistory.test
include.schema.changes=true
tombstones.on.delete=false

Step 6 — Operation

We are now ready to test the Debezium MySQL connector. For the best learning experience, I suggest you open multiple terminal windows (say, five) and connect to your Debezium Server EC2 instance from each of them:

Window 1 — Start Zookeeper Service

$ kafka/bin/zookeeper-server-start.sh kafka/config/zookeeper.properties

You should see log output indicating that the Zookeeper service is running.
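Note that this command runs the Zookeeper scripts and configuration bundled with Kafka. If you would rather use the standalone Zookeeper installed in Step 3, it can be started with its own script instead:

$ zookeeper/bin/zkServer.sh start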

Window 2 — Start Kafka Service:

$ kafka/bin/kafka-server-start.sh kafka/config/server.properties

Window 3 — Monitor Topics and Messages

Run the following command to see which Kafka topics have been created:

$ kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181

Use the following command to watch the messages published to a topic:

$ kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic-name> --from-beginning

You can press CTRL-C at any time to exit the console consumer. Omit the --from-beginning option if you want to see only the latest messages. Initially, you will not see any topics, but as we start the Debezium MySQL connector in the next step, you will see one topic per table and messages flowing into them.
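To give an idea of what to expect, with the configuration above (database.server.name=mysql) and the hypothetical inventory.customers table, the topic list would eventually include entries like:

mysql
mysql.inventory.customers
dbhistory.test

Debezium writes each table's change events to a topic named <server.name>.<database>.<table>, and schema change events to the topic named after the server itself.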

Window 4 — Start Kafka Connect Service with Debezium-MySQL-Connector

$ kafka/bin/connect-standalone.sh kafka/config/connect-standalone.properties kafka/config/connect-debezium-mysql.properties

You should see log output showing the connector starting up and taking an initial snapshot of the included tables.

You can now go back to Window 3 and check that new topics have been created. Watch the messages in a topic to confirm that your data has been loaded into Kafka. Your specific changes appear under the 'payload' field of each message.

The following is a sample of the message generated by a single delete event. The 'op' field set to 'd', together with a null 'after' field, indicates that the row shown in 'before' has been deleted.
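A trimmed sketch of such a message, again assuming the hypothetical inventory.customers table (a real message also carries a 'schema' section and a 'source' block with binlog coordinates):

{
  "payload": {
    "before": { "id": 2, "name": "Bob", "email": "bob@example.com" },
    "after": null,
    "op": "d",
    "ts_ms": 1620000000000
  }
}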

Window 5 — Log in to MySQL to Make Database Changes
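From this window, connect to the MySQL server and make a few changes for Debezium to capture. A sketch using the hypothetical customers table from earlier (you may need to install the mysql-client package first):

$ mysql -h <Private IPv4 DNS of your MySQL EC2 instance> -u debezium -pdbz

mysql> INSERT INTO inventory.customers VALUES (3, 'Carol', 'carol@example.com');

mysql> DELETE FROM inventory.customers WHERE id = 2;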

In the example above, you will notice that a lot of metadata is sent with each message. This is the default behavior of the Kafka Connect service. If we wish to strip the messages of this metadata, we can make the following two changes in the Kafka Connect properties file:

$ vi kafka/config/connect-standalone.properties

key.converter.schemas.enable=false

value.converter.schemas.enable=false

Also modify the Debezium connector properties to include another table, so that the connector snapshots some fresh rows:

$ vi kafka/config/connect-debezium-mysql.properties

Go to Window 4 and press CTRL-C to terminate the Kafka Connect Service and then restart it with the same command.

You can now see an insert event, with the 'before' field null and the 'after' field holding the new values. Now modify some value in the same record in MySQL, and you will see that the 'before' and 'after' fields hold the old and new values respectively.
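Under the same assumptions, an update event now arrives without the schema wrapper and looks something like:

{
  "before": { "id": 3, "name": "Carol", "email": "carol@example.com" },
  "after": { "id": 3, "name": "Carol", "email": "carol@newmail.com" },
  "op": "u",
  "ts_ms": 1620000100000
}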

Final Thoughts

This was a quick-start guide to give you an idea of how Debezium works. A production setup of Debezium and Kafka, however, would be significantly different from this lab setup in the following ways:

  1. Use of Avro instead of JSON, significantly reducing the amount of metadata wrapped around each message.
  2. Use of the REST API to post connector configurations to the Kafka Connect server instead of using a properties file. REST makes it possible to configure connectors at run time without having to shut down the service (a sketch follows this list).
  3. Implementation of consumers for the terminal sink. Since Kafka is only temporary storage for messages, we need to consume them and apply the changes to materialized tables in a terminal sink, such as Snowflake or BigQuery.
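As a sketch of the REST approach, the equivalent of our properties file could be posted to a Kafka Connect worker running in distributed mode (its REST API listens on port 8083 by default):

$ curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "test-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "<Private IPv4 DNS of your MySQL EC2 instance>",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "1",
    "database.server.name": "mysql",
    "database.include.list": "<test database names>",
    "table.include.list": "<database_name.table_name list>",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "dbhistory.test"
  }
}'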

I will be covering these topics in subsequent blog posts so please follow me or Hashmap on Medium to catch them!

Ready to Accelerate Your Digital Transformation?

At Hashmap, we work with our clients to build better, together.

If you’re considering moving data and analytics products and applications to the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, please contact us.

Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud service offerings. We would be glad to work through your specific requirements. Reach out to us here.


Shekhar Parnerkar is a Solution Architect and Delivery Manager at Hashmap, an NTT DATA Company. He specializes in designing, architecting, and building modern cloud data platforms for Hashmap’s global clients.
