Streaming Data Changes with Debezium to Build Near Real-time Data Analytics
Introduction
Data has become essential to driving business growth. Many teams rely on it for decision making, analysis, and reporting, so providing data as soon as possible is a crucial part of this process.
Generally, we work with various data sources that generate the data driving business growth. One of them is transactional data stored in a relational database, for example PostgreSQL. Suppose we use batch processing to load all the data from PostgreSQL into Google BigQuery, orchestrated by Apache Airflow on a daily basis.
With this approach, the data arrives with a delay of one day. Moreover, the performance of the process degrades as the data volume increases. Hence, a new approach is needed to provide the data with minimal delay.
Traditionally, we can achieve near real-time data analytics with a scheduled mini-batch, in which we query only the updated data in the production database every x minutes. But what if the data has a state and we need to know the exact time the state of each row was updated? How can we capture every change made to the database? Moreover, the scheduled queries also increase the load on the production database.
To tackle those challenges, we use a change data capture (CDC) tool: Debezium.
Debezium
What is Debezium
Debezium is an open source change data capture (CDC) tool that captures insert, update, and delete changes in the database in real time and delivers them to downstream systems. Debezium is a set of source connectors for Apache Kafka Connect, with connectors for PostgreSQL, MySQL, MongoDB, Cassandra, and more. Since we work with PostgreSQL, we will only cover the PostgreSQL connector in this article.
How it works
To read and process database changes, Debezium utilizes the logical decoding feature in PostgreSQL, introduced in version 9.4. Logical decoding extracts the changes committed to the transaction log (WAL) and converts them into a readable format with the help of an output plug-in.
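To get a feel for what logical decoding exposes, you can create a replication slot and peek at the change stream directly in psql. This is a minimal sketch for illustration only: the slot name is arbitrary, and we use the built-in `test_decoding` plug-in here; Debezium creates and manages its own slot.

```sql
-- Create a logical replication slot using the built-in test_decoding plug-in
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Peek at committed changes without consuming them from the slot
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Drop the slot when done; unused slots prevent WAL from being reclaimed
SELECT pg_drop_replication_slot('demo_slot');
```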
The data changes are then sent to messaging tools such as Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub, Apache Pulsar or Redis.
How to set up
PostgreSQL
The PostgreSQL connector needs an output plug-in to extract the changes. For older PostgreSQL versions you might need to install one, but for PostgreSQL 10+ `pgoutput` is used by default. It is the standard logical decoding output plug-in maintained by the PostgreSQL community.
We also need to set the wal_level parameter to logical. On an already running PostgreSQL instance, a restart is required for the wal_level change to take effect. Since we work with AWS RDS, we set rds.logical_replication to 1 and verify that wal_level is logical by running the query SHOW wal_level.
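The check itself is a one-liner. Note that on RDS, rds.logical_replication is set in the instance's parameter group rather than via SQL:

```sql
SHOW wal_level;
-- Expected result once logical replication is enabled:
--  wal_level
-- -----------
--  logical
```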
Deploy on Kubernetes
Debezium can be easily deployed on Kubernetes by utilizing the Strimzi project. Strimzi simplifies the deployment of Apache Kafka on Kubernetes in various configurations by using custom resources. The Strimzi operator can be deployed on Kubernetes using Helm. Note: set the watchAnyNamespace parameter to true if Debezium will be deployed in a different namespace than Strimzi. Below is an example command to deploy the Strimzi operator.
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --set watchAnyNamespace=true -n strimzi
Once the Strimzi operator is deployed, we can easily deploy the Debezium PostgreSQL connector by using custom resources provided by Strimzi. Prepare the deployment manifests to:
- Create a database Secret
- Create a Role and RoleBinding that allow reading the database Secret
- Create a Kafka Connect cluster with its configuration
- Create a Kafka Connect connector with its configuration. Note: set the plugin.name parameter to the output plug-in we use, in this case pgoutput.
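The two Strimzi custom resources might look like the sketch below. Resource names, the bootstrap address, topic names, and database connection details are all placeholders, and `topic.prefix` applies to recent Debezium versions (older releases use `database.server.name` instead):

```yaml
# Sketch only: names, addresses, and credentials are placeholders.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: debezium-connect
  annotations:
    # Let Strimzi manage connectors through KafkaConnector resources
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: my-cluster-kafka-bootstrap:9092
  config:
    group.id: debezium-connect-cluster
    offset.storage.topic: debezium-connect-offsets
    config.storage.topic: debezium-connect-configs
    status.storage.topic: debezium-connect-status
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: postgres-connector
  labels:
    # Must match the KafkaConnect cluster name above
    strimzi.io/cluster: debezium-connect
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    plugin.name: pgoutput
    database.hostname: my-postgres-host   # placeholder
    database.port: 5432
    database.dbname: mydb                 # placeholder
    database.user: replicator             # in practice, reference the database Secret
    database.password: change-me          # in practice, reference the database Secret
    topic.prefix: mydb
```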
After the Debezium PostgreSQL connector is running, we can see the data changes coming into Kafka in JSON format.
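A trimmed, illustrative change event for an insert might look like the following; the field values are made up, and the actual envelope also carries a schema section and more source metadata:

```json
{
  "payload": {
    "before": null,
    "after": { "id": 1, "status": "PAID" },
    "source": { "connector": "postgresql", "db": "mydb", "table": "orders" },
    "op": "c",
    "ts_ms": 1700000000123
  }
}
```

Here `op` identifies the operation (create, update, or delete), while `before` and `after` capture the row state around the change, which is what lets us track exactly when each row's state was updated.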
Conclusion
Change Data Capture (CDC) helps us achieve near real-time data analytics by streaming every data change in our production database. This reduces the discrepancy between transactional and analytical data, which has been a frequent problem for us. On top of that, it allows us to build historical data that enables deeper analysis.