8 Practical Use Cases of Change Data Capture

How to apply Change Data Capture to reliably move data from operational databases to other systems that serve other purposes.

Dunith Danushka
Tributary Data
Jun 20, 2021

Operational data accumulated in OLTP databases often needs to be moved elsewhere to serve purposes other than transaction processing, such as loading data warehouses, updating caches, and feeding dashboards.

Change Data Capture (CDC) is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to downstream systems for other purposes.

In this article, I discuss some real-world use cases where CDC is applicable. Most of the examples are related to Debezium, an open-source CDC engine built on top of Kafka.

If you are new to CDC and Debezium, you can refer to my previous articles for a quick primer.

A Gentle Introduction to Event-driven Change Data Capture

A Visual Introduction to Debezium

1. Invalidate a cache

Key-value stores like Redis and Memcached are used as caches to speed up reads. Usually, they keep pre-computed results of SQL queries in memory so that subsequent trips to the database can be eliminated.

The biggest challenge of caching is keeping cached data consistent with the source data. For example, how soon is a record update in the source database propagated to the cache?

Often, cache entries are simply given a TTL and invalidated periodically. But with the help of CDC, cache entries can be invalidated in an event-driven manner.

Suppose we have a DynamoDB table that captures the votes cast for candidates. Whenever a new vote is recorded, it is captured as an event and published to DynamoDB Streams, to be processed by a Lambda function. The function's logic aggregates the vote count and writes the result into ElastiCache. A dashboard can then read the cached result directly, eliminating the need for DynamoDB to run the aggregation every time.
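
To make this concrete, here is a minimal Python sketch of such a Lambda function. The votes table schema, the candidate_id attribute, and the Redis-backed ElastiCache endpoint are illustrative assumptions, not details from a real deployment.

```python
# A minimal sketch of the Lambda function described above. The table schema,
# attribute names, and Redis endpoint are illustrative assumptions.
import os
import redis  # redis-py client, bundled with the deployment package

# ElastiCache (Redis) endpoint supplied through an environment variable
cache = redis.Redis(host=os.environ["CACHE_HOST"], port=6379)

def handler(event, context):
    """Triggered by DynamoDB Streams; aggregates votes per candidate."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        # DynamoDB Streams delivers items in the attribute-value format
        new_image = record["dynamodb"]["NewImage"]
        candidate = new_image["candidate_id"]["S"]
        # Increment the pre-computed vote count that the dashboard reads
        cache.incr(f"votes:{candidate}")
```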

The following talk by Gunnar Morling walks you through the implementation of a low-latency data pipeline for cache updates based on Debezium, Apache Kafka, and Infinispan.

2. Update a search index such as Elasticsearch

We often use OLTP databases as our operational systems of record. But they are not well suited to specialized operations such as full-text search. Although databases can be equipped with relevant extensions, people often offload full-text search to specialized systems like Elasticsearch.

For example, we can maintain a highly denormalized aggregate view in Elasticsearch to cater to queries such as orders by customer. In a relational database, this query would be served by joining multiple tables, which does not scale when there is a large number of rows to join and the query is frequent. Elasticsearch, on the other hand, can capture this information in a single document and let consumers search orders by product name.

Elasticsearch keeps denormalised views of aggregations

The challenge here is how to propagate source system changes to Elasticsearch reliably and keep the two systems consistent at all times.

A CDC system like Debezium helps you at this point by detecting source database changes and propagating them to Elasticsearch in a scalable and reliable manner.

In this pipeline, Debezium picks up row-level changes from MySQL and writes them to a Kafka topic. Then, an Elasticsearch sink connector deployed to Kafka Connect reads the changes and propagates them to the relevant Elasticsearch index. You can read more about that here.
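
For illustration, here is a rough sketch of wiring such a pipeline together through the Kafka Connect REST API. The hostnames, credentials, database, and topic names are placeholder assumptions, and the exact connector properties vary with the Debezium and sink connector versions you run.

```python
# A rough sketch of registering a Debezium MySQL source and an Elasticsearch
# sink through the Kafka Connect REST API. All connection details are placeholders.
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect endpoint

# Debezium MySQL source: streams row-level changes into Kafka topics
mysql_source = {
    "name": "shop-orders-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "5400",
        "database.server.name": "shop",            # used as the topic prefix
        "table.include.list": "shop.orders",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.shop",
    },
}

# Elasticsearch sink: reads the change topic and updates the search index
es_sink = {
    "name": "orders-es-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "connection.url": "http://elasticsearch:9200",
        "topics": "shop.shop.orders",
        # Unwrap the Debezium envelope and use the record key as the document id
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "key.ignore": "false",
    },
}

for connector in (mysql_source, es_sink):
    requests.post(CONNECT_URL, json=connector).raise_for_status()
```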

3. Real-time data loading into a Data Warehouse

Operational databases are not a good choice for running heavy analytical workloads, as those workloads can hinder the performance of regular OLTP operations. Therefore, operational data must be moved to a specialized system such as a data warehouse to run the analytical queries that power BI and reporting needs.

There are two approaches. ETL is the traditional method, where you batch operational data and load it into the warehouse periodically; its downside is latency. With CDC, on the other hand, you can capture source system changes as they happen and deliver them to the warehouse in real time.

Today, many data warehouses allow loading data in a streaming manner. Amazon Redshift, Google BigQuery, Apache Druid, and Apache Pinot are a few examples.

You can capture the changes of a DynamoDB table and write them into a Kinesis stream, then use Kinesis Data Firehose to load them into Redshift. Similarly, you can use Debezium to move operational data into a Kafka topic, which Apache Pinot can consume in a streaming manner.

Moving Microservices data into a data warehouse with Debezium — Source
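
As a rough illustration of the DynamoDB route mentioned above, the following Python Lambda sketch forwards DynamoDB Streams records into a Kinesis data stream, from which Kinesis Data Firehose can load them into Redshift. The stream name and the record layout are assumptions made for the example.

```python
# A minimal sketch of the DynamoDB-to-Kinesis leg of the pipeline; the
# Kinesis stream name and change-record layout are illustrative assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "orders-changes"  # hypothetical stream that Firehose reads from

def handler(event, context):
    """Triggered by DynamoDB Streams; forwards each change into Kinesis."""
    for record in event["Records"]:
        change = {
            "event": record["eventName"],          # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            "new_image": record["dynamodb"].get("NewImage"),
        }
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(change).encode("utf-8"),
            PartitionKey=json.dumps(record["dynamodb"]["Keys"]),
        )
```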

4. Synchronize on-premises data to the cloud

It is often required to move operational data produced at edge locations, such as on-premises systems (e.g., POS systems, COTS applications), to a central database located in the cloud. The main goal is to utilize the scalable and durable storage options provided by cloud vendors.

For example, on-premises transactional data can be sent to a cloud data warehouse where rich analytical operations can be performed without provisioning expensive infrastructure on-premises. Another use case would be to migrate on-premises data to a new application provisioned in the cloud.

You can utilize CDC to efficiently capture on-premises database changes and propagate them to the cloud. Most often, this is done as continuous data replication between an on-premises database and one in the cloud.

5. Event-driven update of materialized views for Microservices

Microservices architecture promotes having a database per service to keep data that is private to the service. Although that improves autonomy and scalability across services, it can lead to some complications.

For example, when a service needs data owned by another service, the only way to access it is through an API call. But this model does not scale when cascading API calls are involved, and the consumer has to perform inefficient in-memory joins to stitch together the responses from multiple services.

API composition is the only way to perform this kind of orchestration — Source

As a remedy, Microservices rely on the Command Query Responsibility Segregation (CQRS) pattern to decouple state mutations from queries. Instead of querying other services in real time, a service can listen to the domain events they publish and update internal materialized views so that queries can be answered locally. That improves read performance, overall system availability, and autonomy.

Consider the Orders View, which keeps a local view of outside data — Source
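
Below is a simplified Python sketch of this idea: a consumer inside a hypothetical customer-facing service tails a Debezium change topic for orders and maintains a local materialized view. The topic name, field names, and the in-memory storage are assumptions made for illustration; a real service would persist the view in its own database.

```python
# A simplified sketch of an order-history view maintained inside one service,
# built from change events published by a hypothetical "orders" service.
import json
from kafka import KafkaConsumer  # kafka-python client

# Local materialized view: customer_id -> list of order rows (in-memory for brevity)
orders_by_customer = {}

consumer = KafkaConsumer(
    "orders.public.orders",                 # assumed Debezium change topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:               # tombstone record, ignore here
        continue
    payload = message.value.get("payload", message.value)
    row = payload.get("after")
    if row is None:                         # delete events carry no "after" image
        continue
    orders_by_customer.setdefault(row["customer_id"], []).append(row)
    # Queries such as "orders by customer" can now be answered locally,
    # without a synchronous call to the orders service.
```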

6. Reliable Microservices Data Exchange With the Outbox Pattern

Another problem common to Microservices is reliably updating data across multiple service boundaries. For instance, consider a microservice that manages purchase orders. When a new order is placed, information about it may have to be relayed to a shipment service and a customer service.

The Saga pattern helps to solve this problem to some extent. But implementing a saga is often complicated and brittle due to its synchronous nature.

CDC comes into the picture by capturing changes made to an Outbox table and propagating them to downstream Microservices in a reliable and eventually consistent manner.

The Debezium blog has an excellent article about implementing this pattern with Debezium.

Writes destined for the Shipment and Customer services are captured in the outbox table first, then propagated to the downstream services with CDC — Source
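
For a feel of the write side of the pattern, here is a bare-bones Python sketch that inserts an order and its outbox event in a single database transaction, assuming a PostgreSQL database accessed with psycopg2. Table and column names are illustrative; the linked Debezium article covers the full event-routing setup.

```python
# A bare-bones sketch of the write side of the outbox pattern; the schema
# and connection details are illustrative assumptions.
import json
import uuid
import psycopg2

conn = psycopg2.connect("dbname=orders user=app password=secret host=localhost")

def place_order(order: dict) -> None:
    """Insert the order and its outbox event in the same transaction."""
    with conn:                          # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO purchase_orders (id, customer_id, total) VALUES (%s, %s, %s)",
                (order["id"], order["customer_id"], order["total"]),
            )
            # The outbox row is what CDC captures and relays to Kafka, so
            # downstream services see the event only if the order committed.
            cur.execute(
                "INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload) "
                "VALUES (%s, %s, %s, %s, %s)",
                (str(uuid.uuid4()), "order", order["id"], "OrderPlaced", json.dumps(order)),
            )
```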

7. Real-time information dissemination

CDC allows you to capture changes from source databases as a stream of events. Those events can then be processed with stream processing engines such as Apache Flink to apply transformations or run aggregations in real time. Cleansed events are more meaningful than their raw versions, so they can be used for human consumption as follows.

Update real-time dashboards

Dashboards built with tools such as Microsoft Power BI can be updated in real time by reading from a streaming dataset.

Publish to an asynchronous API

Processed change events can be written to a WebSocket so that subscribers can take appropriate actions.
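
As a sketch of the WebSocket route, the following Python snippet (using the aiokafka and websockets libraries) reads processed change events from a Kafka topic and pushes them to every connected subscriber. The topic name and port are assumptions made for the example.

```python
# A condensed sketch of pushing processed change events to WebSocket
# subscribers; topic name, broker address, and port are illustrative.
import asyncio
import websockets
from aiokafka import AIOKafkaConsumer

clients = set()   # currently connected WebSocket subscribers

async def register(websocket):
    """Track each subscriber for the lifetime of its connection."""
    clients.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        clients.remove(websocket)

async def relay():
    """Read processed change events from Kafka and fan them out."""
    consumer = AIOKafkaConsumer("orders-enriched", bootstrap_servers="kafka:9092")
    await consumer.start()
    try:
        async for message in consumer:
            for ws in list(clients):
                await ws.send(message.value.decode("utf-8"))
    finally:
        await consumer.stop()

async def main():
    async with websockets.serve(register, "0.0.0.0", 8765):
        await relay()

asyncio.run(main())
```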

8. Build an audit log

It is a common requirement for business applications to maintain an audit log, i.e., a persistent trail of all the changes to the application’s data.

In its natural form, a CDC system like Debezium captures, stores, and propagates changes in the order in which they were applied to the source system. Thus, a target system that receives this data can reconstruct the chronological order in which changes were applied to the source and trace back to any specific point in time.

Changes can be enriched with information coming from external systems to provide a holistic view that aids forensic analysis of incidents. For example, by replaying the audit log from the beginning, one can find out who performed which action at what time.
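
A minimal sketch of such an audit log in Python might look like the following: it tails a Debezium change topic and appends each event, along with its source metadata, to an append-only table. The topic, table, and column names are assumptions; a real implementation would also add the enrichment described above.

```python
# A small sketch of materializing Debezium change events into an append-only
# audit table; topic, database, and column names are illustrative assumptions.
import json
import sqlite3
from kafka import KafkaConsumer  # kafka-python client

db = sqlite3.connect("audit.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS audit_log (
           source_ts_ms INTEGER,   -- when the change was applied at the source
           op TEXT,                -- c = create, u = update, d = delete
           table_name TEXT,
           before TEXT,
           after TEXT
       )"""
)

consumer = KafkaConsumer(
    "shop.shop.orders",                      # assumed Debezium change topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    if message.value is None:                # skip tombstones
        continue
    payload = message.value.get("payload", message.value)
    db.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?)",
        (
            payload["source"]["ts_ms"],
            payload["op"],
            payload["source"]["table"],
            json.dumps(payload.get("before")),
            json.dumps(payload.get("after")),
        ),
    )
    db.commit()
```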

The Debezium blog has an in-depth article on the subject; you may read it for more information.

All writes are kept in order — Source
