Real-time data processing using Change Data Capture and event-driven architecture

Engineers at Macquarie
Macquarie Engineering Blog
8 min read · Jan 11, 2024


By Ranjit Singh, Principal Engineer, Wealth Core at Macquarie Bank

Overview

In Banking and Financial Services (BFS) at Macquarie Group, we’re implementing event-streaming and event-driven architecture at scale with an aim to achieve 100% real-time event processing.

To that end, we are increasingly building our platform designs around the principles of microservices that communicate using events. Each microservice publishes its events to a stream, in addition to storing certain data elements in its own data store where needed.

A common challenge when ingesting data from existing applications is that the transformation of these monolithic applications into microservices is still further down the road, or the application is an off-the-shelf product that cannot be modified. In either case, we have no practical way to integrate these existing applications with new microservices at the application layer, whether synchronously or asynchronously.

As we are building our microservices to be responsive, elastic, resilient and message-driven, we also needed to send real-time data updates to these downstream services instead of batches of events.

To overcome this challenge, our engineering team in the BFS Wealth division implemented an event-sourcing pattern using what’s known as Change Data Capture (CDC), a modern technique to stream database updates as events that can be consumed by downstream services.

Path to a Minimum Viable Product and design decisions

CDC is the process of identifying and capturing changes made to data in a database and delivering those changes in real time to a downstream process or system. It is a feature offered by database servers that allows data update operations to be captured from the database transaction log. This transaction log can be used to publish an event stream for consuming applications via two different mechanisms: Debezium and AWS DMS.
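On SQL Server, for example, CDC is switched on per database and then per table using system stored procedures. A minimal sketch (the database and table names below are illustrative; on AWS RDS for SQL Server the database-level step uses an RDS-specific procedure instead):

```sql
-- Enable CDC at the database level.
-- (On AWS RDS for SQL Server: EXEC msdb.dbo.rds_cdc_enable_db 'wealth';)
USE wealth;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for a specific table; SQL Server then populates a capture
-- table from the transaction log, which connectors read from.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'accounts',
    @role_name     = NULL;   -- NULL = no gating role required to read changes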

Diagram showing CDC Event Streaming process

To implement an event-sourcing pattern, it was essential for us to evaluate these technology choices through the lens of our present technology ecosystem and understand the best way to achieve a Minimum Viable Product (MVP).

Some of the key decisions our engineering team made to achieve the MVP are as follows:

  • Database CDC compatibility
    While we ensured our database and its specific version supported CDC, we also explored any impacts on CDC from our production maintenance procedures. In doing so, we found that on MS SQL Server 2019 with CU5, once CDC is disabled for any administration task, it cannot be re-enabled. The proposed solutions were to upgrade the database version or to apply specific patches that require elevated access. As our database was supporting an off-the-shelf product, upgrading the version would have entailed extensive testing and a longer time to market. We therefore analysed the impact and went with the safer option of applying the patch to the CDC stored procedures. It is worth highlighting that when running on AWS RDS, this patch can only be applied by AWS Support.
  • CDC Event Streaming technology
    Both AWS DMS and Debezium come with inherent advantages. One of the unique capabilities of AWS DMS is its ability to switch from database replication to CDC at a particular point in time. While in this case we were not looking for a one-time sync between two data sources before switching to CDC, AWS DMS has proved valuable for us in other such use-cases. AWS DMS can target an event stream to AWS Kinesis or Amazon MSK (or self-managed Apache Kafka). An AWS-managed service can give significant reliability, provided your downstream services are in the AWS cloud and can readily consume from it. We found that provisioning these services was a significant amount of work in our setup. Debezium, on the other hand, is built around the concept of open-source connectors. Various event streaming and messaging platforms embed the database-specific open-source Debezium implementation in their own versions of the connector. While some of these connectors (such as Solace or Google Pub/Sub) can be deployed and run independently as containers, others such as Kafka run on top of a Kafka Connect cluster (more on this soon). For this project we chose Debezium, primarily to keep our stack agnostic to the underlying cloud technology.
  • Database and Connector compatibility
    There is a broad choice of event streaming platforms, from AWS Kinesis, Amazon MSK and Apache Kafka for AWS DMS to Solace, Apache Kafka and Google Pub/Sub for Debezium. However, fewer of these options may be available for Debezium, depending on the database you want to stream events from. For instance, the only Debezium connectors available for MS SQL at the time of writing are Solace and Kafka (a Pub/Sub connector is soon to be available). We opted to set up the Debezium connector, as it was relatively straightforward with our existing technology stack.
  • Wrangler or Post-Processor
    As CDC events mirror the database schema, additional processing can be done in the event streaming platform to translate them into more readable business events. Kafka can be a great platform for adding such post-processors. In our case, though, we chose to model our CDC events as a readily consumable data product, avoiding the need for a post-processor.

Building for resilience and the right level of abstraction

As Debezium is open source, there were various ways for us to implement it in Macquarie’s enterprise setup. It was possible to run it as a standalone application (server) on a compute engine of our choice, but using connectors built by platforms such as Kafka or Solace was our preferred option.
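For the Kafka Connect route, a Debezium SQL Server source connector is registered with a JSON configuration along roughly these lines. The connector name, hosts and table below are hypothetical, and exact property keys vary slightly between Debezium versions; the keys shown follow the Debezium 2.x SQL Server connector:

```json
{
  "name": "wealth-accounts-cdc",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "mssql.internal.example",
    "database.port": "1433",
    "database.user": "cdc_reader",
    "database.password": "<secret>",
    "database.names": "wealth",
    "table.include.list": "dbo.accounts",
    "topic.prefix": "wealth",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "wealth.schema-history"
  }
}
```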

Below is a quick comparison of the Solace and Kafka wrappers for Debezium that are distributed as connectors.

* It is possible to publish to messaging brokers other than Solace/Kafka when running the Debezium open-source server, which can be wrapped in a Docker container.

As highlighted above for the Solace connector, it is possible to run the connector as a container on a Kubernetes platform of choice. As existing users of the Kubernetes platform (GKE), this was fairly straightforward for us. However, this solution also requires persistent storage on the cluster for the offset file. That presented a challenge, as persistent storage is not a strategic choice on our GKE platform. While there are options to back up the offset file to object storage (GCS/S3) in the event of container restarts, this was less than optimal for an enterprise app. Pivoting to the Kafka Debezium connector was therefore the more resilient choice for us.

Most messaging and streaming platforms provide SDKs for integration. These SDKs can have inherent advantages, especially if you are doing something very specific that requires deeper control. However, switching between two different message brokers can be daunting when using a proprietary SDK. For most event publishing and consumption scenarios, there is a much better abstraction implemented by the Spring Cloud team. Thanks to the capabilities of the current Spring Cloud Stream framework, which is supported by most mainstream messaging platforms, we were able to pivot our implementation from Solace to Kafka within a day.
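With Spring Cloud Stream's functional model, the broker swap is largely a dependency and configuration change: application code stays a plain `java.util.function.Consumer` bean, and the binder properties select the broker. A sketch of the Kafka-side binding configuration (function, destination and group names are hypothetical):

```yaml
# application.yml — Spring Cloud Stream binding for a CDC event consumer.
# Moving from Solace to Kafka mainly means swapping the binder dependency
# and these binder-specific properties; the consumer function is unchanged.
spring:
  cloud:
    function:
      definition: accountEvents          # bean name of a java.util.function.Consumer
    stream:
      bindings:
        accountEvents-in-0:
          destination: wealth.dbo.accounts   # topic populated by the CDC connector
          group: account-projection
      kafka:
        binder:
          brokers: kafka:9092
```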

Overall design matters — measure, benchmark and optimise

Going back to the rationale for using CDC as a mechanism for real-time event processing, it was essential for us to ensure the overall system provided optimal throughput when processing peak traffic. In a domain of average complexity, you could easily have two to three microservices and potentially a few databases traversed in an end-to-end call. Therefore, it is important not only to measure, benchmark and optimise against the current peak load, but also to factor in load increases in the near future. The diagram below shows a simplified version of the interactions in our system.

Diagram showing event-driven system using CDC for event sourcing

A few choices that have helped to make the above design performant include:

• Scalability of the compute — Running our containers on Kubernetes allows us to scale the compute.
• Stateless microservices — Each request is independent of the previous one, which helps with horizontal scaling.
• Database optimised for Read/Write — While still using a relational database, we have stored data in a denormalised form where possible, in JSON data columns. We still use index-based retrieval, but avoid writing to multiple tables.
• Messaging middleware — In BFS we have been increasingly using Google Cloud for running our Kubernetes workloads. Given Google’s proficiency in efficient data handling, services such as Google Pub/Sub are purpose-built to outperform other message streaming platforms at increased load. Cloud Pub/Sub has been designed with high message volumes in mind, which comes down to how it balances pull requests between listeners. This also means latency improves significantly as throughput increases; on the flip side, it can struggle when message throughput is too low. Given that aspect of Pub/Sub, some companies like Spotify have even built their own async client to benefit from ultra-low latencies at higher throughput. In our case we found our throughput was high enough not to require such additional measures, and we have been able to improve some of our batch processes, which now run 10 times faster.
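The denormalised JSON storage mentioned above can be sketched in SQL Server terms roughly as follows (table and column names are illustrative). A persisted computed column over `JSON_VALUE` keeps a frequently queried attribute index-based while the full document stays in one column:

```sql
-- Illustrative: one row per aggregate, full state as a JSON document.
CREATE TABLE dbo.account_state (
    account_id    BIGINT        NOT NULL PRIMARY KEY,
    state         NVARCHAR(MAX) NOT NULL
        CONSTRAINT chk_account_state_json CHECK (ISJSON(state) = 1),
    -- Promote a frequently queried attribute to a computed column
    -- so lookups stay index-based despite the denormalised document.
    portfolio_code AS JSON_VALUE(state, '$.portfolioCode') PERSISTED
);

CREATE INDEX ix_account_state_portfolio
    ON dbo.account_state (portfolio_code);
```

A single-row write then replaces what would otherwise be inserts across several normalised tables.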

It is worth highlighting that we measured and adjusted the above parameters by running performance tests against our upstream APIs. We used a tool called Gatling, which is extremely powerful for testing a variety of scenarios through simple code configuration.

Observability

When you operate in production with so many moving parts, setting up observability becomes non-negotiable, especially for developer productivity. There are a multitude of ways in which we have set up observability to ensure we have deeper insight into what an application is doing in production.

These include:

  • Monitoring the impact of CDC on database performance. Any performance degradation fires alerts.
  • Monitoring Kafka performance, with both infrastructure and message delivery alerts.
  • Messaging middleware monitoring dashboards and alerting. There are a variety of alerts available for Pub/Sub.
  • OpenTelemetry tracing of end-to-end requests.

Exploring further

It’s important to highlight that CDC requires consistent fine-tuning as we incrementally enable it on more tables. There are alternatives, such as time-based polling, that are simpler to implement but only provide near-real-time processing.

In our space, CDC has enabled us to overcome the need for batch processing where possible and accelerate the shift towards straight-through processing. At the same time, it is also serving our future use-cases of data as a product, engendering higher utilisation of ‘trusted data’ by making it discoverable by various consumers. In addition, a common design in event-driven architecture is for a microservice to write to its local database and publish relevant events for downstream systems. The Transactional Outbox pattern is a preferred solution in such scenarios, especially to address the atomicity of those two operations. To that end, we are also trialling CDC to implement the transactional outbox in our upcoming solutions and will have more to share soon.
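A sketch of how CDC underpins that pattern: the service writes the business row and an outbox row in one local transaction, and the CDC connector streams only the outbox table, so the state change and the published event cannot diverge. Table and column names below are illustrative:

```sql
-- Illustrative transactional outbox: both writes commit atomically, and
-- CDC on dbo.outbox turns the second insert into a published event.
BEGIN TRANSACTION;

UPDATE dbo.accounts
SET balance = balance + 250.00
WHERE account_id = 42;

INSERT INTO dbo.outbox (aggregate_id, event_type, payload)
VALUES (42, 'AccountUpdated',
        N'{"accountId": 42, "balanceDelta": 250.00}');

COMMIT TRANSACTION;
```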
