An exciting example that shows the depths of CDC technology

Zsombor Chikán
Published in Cloudera · Feb 23, 2022

Thank you to my co-author Tamas Kiss for working with me on this post.

Context behind the necessity of CDC

To understand the state of different business operations, it is necessary to know the business transactions stored in relational databases. At the same time, organisations need a way to analyse data easily as it is produced, because the value of data declines rapidly over time. Therefore, for analysis and to avoid disrupting the operational databases, companies often replicate data into data warehouses, traditionally using a batch-based approach that moves the data once or a few times a day.

However, another and arguably better solution for moving data in near real time is CDC (Change Data Capture), which allows you to move data from relational databases (such as SQL Server or Oracle) to data warehouses, data lakes or other databases.

Let us take a step back and introduce Change Data Capture (CDC)

Change Data Capture (CDC) is a set of technologies that allow you to identify and capture data that has changed in the database.

What is Change Data Capture?

From another perspective, CDC is a software process that supports tracking and identifying changes to data in a database.

An important aspect is that CDC is well-suited for achieving low-latency, reliable, and scalable data replication in high-velocity data environments, where time-sensitive decisions are made.

One of the biggest advantages of this approach is that you are continuously streaming data from your database to your data warehouse, so the data in your warehouse stays up to date. This allows you to create real-time insights and gives you a leg up on your competitors by letting you make business decisions on more recent data.

Push vs Pull

From the data flow direction perspective, we can distinguish two main ways in relation to CDC systems. Either the source system pushes changes to the target, or the target periodically polls the source and pulls the changed data.

Why do push-based systems often require more work from the source system? The main reason is that the source needs to implement a solution that detects when changes are made and sends those changes in a way that the target can receive and act on, which means continuously watching for changes and keeping track of what has already been captured. A possible downside is that if the target system is down or not listening for changes for whatever reason, it can miss changes.

As long as changes are logged on the source (which usually means updating a column in the table), pull-based systems require much less work from the source system. The disadvantage of the pull approach, however, is that it often increases latency: because the target has to poll the source for updates rather than being told when a change has been made, the data is typically pulled in batches, either a few large batches per day or many small batches within a given period of time.

Since our use case (which will be introduced below) requires low latency, we intended to build a real-time data processing system, so the push approach is a better fit for our simple demo.

Now that we have had a small taste of CDC and a high-level understanding of the technology, let us move on to a simple example and get familiar with one possible way to implement CDC.

How can we use CDC?

It sounds great to achieve real-time business insights, but how complicated is it to make that happen? It does not have to be a challenge with the proper tool in hand, and for this purpose we can consider SQL Stream Builder (SSB). As part of Cloudera Streaming Analytics, it enables users to easily write, run and manage real-time SQL queries on streams with an exceptionally smooth user experience, while aiming to expose the full power of the Apache Flink SQL API.

In this post we give an example and discuss how to leverage CDC-related opportunities when you connect your legacy databases to Elasticsearch via SSB.

Let’s look at a real example with simple fraud detection logic

To demonstrate the power of CDC we created a streaming pipeline covering several of the advantages mentioned above. The purpose of this pipeline is to detect suspicious events coming from online bank transactions, which are enriched with data from a legacy database. The alert events are then sent to a Kafka topic, which can be consumed by downstream applications, for example for monitoring or indexing. The application logic is written in Flink SQL and executed via SSB. Our choice for monitoring was the industry-standard Elasticsearch and Kibana stack, which can be easily integrated with Flink via connectors. The design of the pipeline is as follows:

Our input data consists of streaming transaction data coming from a PostgreSQL database via CDC and has the following schema:

CREATE TABLE `transactions_source` (
`transaction_id` INT NOT NULL,
`customer_id` INT,
`location` INT,
`device_id` INT,
`trans_ts` TIMESTAMP(3),
`is_successfull` BOOLEAN
)
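The schema above is what we registered in SSB; the connector properties are configured separately. As a rough sketch only (the host, credentials and database/table names below are placeholders, and the extra proc_time column is our own assumption to support the lookup joins used later), a self-contained Flink SQL definition using the postgres-cdc connector might look like this:

CREATE TABLE `transactions_source` (
  `transaction_id` INT NOT NULL,
  `customer_id` INT,
  `location` INT,
  `device_id` INT,
  `trans_ts` TIMESTAMP(3),
  `is_successfull` BOOLEAN,
  `proc_time` AS PROCTIME()              -- processing-time attribute, needed for the lookup joins below
) WITH (
  'connector' = 'postgres-cdc',          -- Debezium-based PostgreSQL CDC source
  'hostname' = 'postgres.example.com',   -- placeholder host
  'port' = '5432',
  'username' = 'cdc_user',               -- placeholder credentials
  'password' = '***',
  'database-name' = 'bank',              -- placeholder database, schema and table names
  'schema-name' = 'public',
  'table-name' = 'transactions',
  'decoding.plugin.name' = 'pgoutput'    -- logical decoding plugin configured on the Postgres side
);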

The input data is randomly generated by a few “datagen” connectors, where the change of the location or device id for a given customer can be controlled by parameters.
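We do not reproduce these generators here, but as an illustration of the idea, a minimal Flink datagen table might look like the following (the rate and value ranges are assumptions, and steering how often a customer's location or device changes would need extra generation logic on top of this):

CREATE TABLE `transactions_generator` (
  `transaction_id` INT,
  `customer_id` INT,
  `location` INT,
  `device_id` INT,
  `trans_ts` AS LOCALTIMESTAMP,
  `is_successfull` BOOLEAN
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',            -- illustrative event rate
  'fields.customer_id.min' = '1',      -- illustrative value ranges
  'fields.customer_id.max' = '100',
  'fields.location.min' = '1',
  'fields.location.max' = '20',
  'fields.device_id.min' = '1',
  'fields.device_id.max' = '50'
);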

Our alert detection logic is built on rapid changes in the locations or devices used in a particular user's transaction attempts. To be able to detect these changes, we maintain additional legacy tables that contain the known locations and device ids for each user; we use that information in the alert logic and keep it updated in the meantime. We consider a location or device known if the corresponding transaction attempt was successful. Based on these assumptions we came up with the following design for our alert detection logic:

Zooming in on the SQL queries that make up the pipeline

Flink job #1 — Update the known_devices table

In our simple demo, the prerequisite for a record to be saved into the legacy known_devices table is that the device used for a particular transaction is unknown while the transaction itself is successful. We achieved this with a left join on the known_devices table, keeping only the rows whose device id is null on the known_devices side, i.e. the devices that are not yet known. Also note that we used a lookup join rather than a temporal join, so that the transactions are joined against the latest version of the known_devices table.
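A rough Flink SQL sketch of this first job, our own reconstruction rather than the exact query from the demo, could look as follows (assuming `known_devices` is a JDBC-backed lookup table with `customer_id` and `device_id` columns, and that `transactions_source` exposes a processing-time attribute `proc_time`):

INSERT INTO `known_devices`
SELECT t.`customer_id`, t.`device_id`
FROM `transactions_source` AS t
LEFT JOIN `known_devices` FOR SYSTEM_TIME AS OF t.`proc_time` AS d
  ON t.`customer_id` = d.`customer_id` AND t.`device_id` = d.`device_id`
WHERE d.`device_id` IS NULL        -- the device is not yet known for this customer
  AND t.`is_successfull` = TRUE;   -- only successful transactions register a new device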

Flink job #2 — Update the known_locations table

The known_locations legacy table is updated with the same approach as known_devices, using the same kind of query.
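Under the same assumptions as above, a sketch of the location variant should only differ in the joined table and column:

INSERT INTO `known_locations`
SELECT t.`customer_id`, t.`location`
FROM `transactions_source` AS t
LEFT JOIN `known_locations` FOR SYSTEM_TIME AS OF t.`proc_time` AS l
  ON t.`customer_id` = l.`customer_id` AND t.`location` = l.`location`
WHERE l.`location` IS NULL         -- the location is not yet known for this customer
  AND t.`is_successfull` = TRUE;   -- only successful transactions register a new location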

Flink job #3 — Our simple fraud detection logic

Our detection logic (kept simple for demo purposes) is based on two main prerequisites: (1) both the device and the location must be unknown for (2) a particular unsuccessful transaction. So we joined our CDC transaction source table with known_devices and known_locations using the previously described lookup left-join approach, filtered for the matching rows, and used the resulting data stream as input for a predefined Kafka table.
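Sketching it with the same assumptions as before (the Kafka sink table name `fraud_alerts` and its columns are placeholders of ours), the detection query could combine the two lookup left joins with the filters like this:

INSERT INTO `fraud_alerts`
SELECT t.`transaction_id`, t.`customer_id`, t.`location`, t.`device_id`, t.`trans_ts`
FROM `transactions_source` AS t
LEFT JOIN `known_devices` FOR SYSTEM_TIME AS OF t.`proc_time` AS d
  ON t.`customer_id` = d.`customer_id` AND t.`device_id` = d.`device_id`
LEFT JOIN `known_locations` FOR SYSTEM_TIME AS OF t.`proc_time` AS l
  ON t.`customer_id` = l.`customer_id` AND t.`location` = l.`location`
WHERE d.`device_id` IS NULL         -- unknown device
  AND l.`location` IS NULL          -- unknown location
  AND t.`is_successfull` = FALSE;   -- and the transaction attempt failed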

Data visualisation

Our downstream pipeline consists of a simple Flink job which moves data from Kafka into Elasticsearch. Once the data is there, we can leverage the full power of Kibana and create different charts (or build dashboards) which can be updated in real time and give meaningful insights into the suspicious events.
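As an illustration (the topic, index and host names are placeholders, and the alert schema mirrors the sketch above), this downstream job can be as small as a Kafka source table, an Elasticsearch sink table and one INSERT statement:

CREATE TABLE `alerts_kafka` (
  `transaction_id` INT,
  `customer_id` INT,
  `location` INT,
  `device_id` INT,
  `trans_ts` TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'fraud_alerts',                       -- placeholder topic name
  'properties.bootstrap.servers' = 'kafka:9092',  -- placeholder broker address
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

CREATE TABLE `alerts_es` (
  `transaction_id` INT,
  `customer_id` INT,
  `location` INT,
  `device_id` INT,
  `trans_ts` TIMESTAMP(3)
) WITH (
  'connector' = 'elasticsearch-7',                -- Flink Elasticsearch 7 sink
  'hosts' = 'http://elasticsearch:9200',          -- placeholder host
  'index' = 'fraud_alerts'                        -- placeholder index name
);

INSERT INTO `alerts_es` SELECT * FROM `alerts_kafka`;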

Chart example #1 — Highlighting the unsuccessful transactions and possible frauds on a timeline

As an example, we can build a stacked bar chart which shows how many of all the transactions in a certain period were unsuccessful and how many were suspected of fraud. This can be really helpful for decision makers: it lets them track business events in real time and gives them the opportunity to act at the moment a possible fraud is detected.

Chart example #2 — Distribution of successful and unsuccessful transactions, and of possible frauds, within all the events

It can also be important information to know how big the “red part” is within our transactions in a given period. With the stream we generated, Kibana can show us the distribution of the possible fraud events in real time.

Conclusions

There is no single right or wrong method for capturing changes in our data. In this post we discussed one way of using CDC via a simple demo with a plausible real use case (fraud detection). We intended to show how possible, and how easy, it is to leverage CDC capabilities with a tool like SSB (by Cloudera), which has its own built-in CDC connector. The three main takeaways are:

  • Depending on its own use case, every organisation or business has the opportunity to gain a competitive edge by using CDC technology
  • Today we have numerous tools (e.g. SSB) to create the most suitable CDC pipeline for us without any further difficult integrations
  • Until now you may only have been able to present key business insights to managers slowly and often based on outdated data, but with CDC and the appropriate pipeline, real-time, constantly updated data visualisation becomes available to every business (our simple demo is one example)

Finally, designing our own solution is often difficult, but in the long run it is definitely worthwhile to adopt fresh technologies that allow the fastest and most accurate data analysis possible.
