Building a CDC — Lessons learned — Part 3

Sigal Shaharabani · Published in Israeli Tech Radar · Aug 1, 2021 · 4 min read

In databases, change data capture (CDC) is a set of software design patterns used to determine and track changes in data so that action can be taken using the changed data.

In March 2020 I was given a mission: build a system that synchronizes data from local databases to a centralized database. In this series of three posts I detail the lessons my colleagues and I learned on the journey to production.

In the first post I focused on getting started and the data.

In the second post I focused on the source database: Postgres, or AWS Aurora RDS based on Postgres. Read more about the first steps I took building a CDC on AWS Aurora here.

In this final post I will focus on the development practices we adopted to improve performance and our ability to monitor and troubleshoot the system.

Photo by AltumCode on Unsplash

Consuming data from the message bus

Early on we decided to use Kafka as our message bus because it was commonly used in the company and we considered it reliable and familiar to us.

However, consuming data from topics (each corresponding to a table) with a high rate of changes presented some bottlenecks.

The following diagram depicts the flow of data from the source to the targets:

CDC flow of data: from Postgres to Kafka; from Kafka, two alternative routes enrich the data and push it to either Redis or a second Kafka cluster; consumers of the second Kafka cluster write the data into an AWS Aurora RDS database.

Streaming, batch or micro-batch

In most cases, we chose to stream data and consume messages one by one.

However, we have scenarios that require enriching tuples with data that can be found in the database.

For these specific scenarios we moved the relevant topics to separate processes that consume in small batches, so that all the tuples in a batch can be enriched at once (see the sketch below).

This shortened the enrichment of 5,000 tuples from minutes to seconds.
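For illustration, here is a minimal sketch of that micro-batch enrichment, assuming a confluent-kafka consumer and a psycopg2 connection. The topic name, table, columns, and the publish step are hypothetical placeholders, not the project's actual names.

```python
# Minimal sketch of micro-batch enrichment, assuming confluent-kafka and psycopg2.
# Topic, table, columns and publish() are illustrative placeholders.
import json

import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enrichment-batch",
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders_changes"])           # hypothetical topic
conn = psycopg2.connect("dbname=app")            # hypothetical DSN


def publish(batch):                              # hypothetical downstream publish
    pass


while True:
    # consume() returns up to N messages within the timeout: the "small batch".
    msgs = consumer.consume(num_messages=500, timeout=1.0)
    if not msgs:
        continue
    tuples = [json.loads(m.value()) for m in msgs if m.error() is None]
    ids = [t["entity_id"] for t in tuples]

    # One round trip enriches the whole batch instead of one query per tuple.
    with conn.cursor() as cur:
        cur.execute("SELECT id, region, owner FROM entities WHERE id = ANY(%s)", (ids,))
        extra = {row[0]: {"region": row[1], "owner": row[2]} for row in cur.fetchall()}

    publish([{**t, **extra.get(t["entity_id"], {})} for t in tuples])
    consumer.commit(asynchronous=False)
```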

Parallelism

Tables with many entities that change frequently were mapped to topics with multiple partitions, and were configured (at runtime) to be consumed by multiple processes, allowing parallel processing of the data.

To maintain per-entity ordering, the partition key of every topic in the system is the entity ID.
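As a sketch of how that works, the hypothetical producer below (using confluent-kafka) keys every message by entity ID, so changes to the same entity always hash to the same partition; the topic name and payload are illustrative. On the consuming side, running several processes in the same consumer group splits the partitions between them.

```python
# Sketch of keeping per-entity ordering while scaling out, assuming confluent-kafka.
# The topic name and payload fields are illustrative placeholders.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def publish_change(change: dict) -> None:
    # Keying by entity ID means every change of the same entity lands on the
    # same partition, so its order is preserved even with many partitions and
    # many consumer processes.
    producer.produce(
        topic="orders_changes",                  # hypothetical topic
        key=str(change["entity_id"]),
        value=json.dumps(change),
    )


publish_change({"entity_id": "a1", "status": "shipped"})
producer.flush()
```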

Database inserts

At the time of writing (July 2021), all the data is directed to a centralized AWS Aurora RDS database.

Our monitors showed that when the system became active, the volume of changes committed to the database (upserts and soft deletes) pushed us to the maximum IOPS allowed by the RDS instance, and data started reaching the database with delays.

By introducing a de-duplication mechanism for topics with a high rate of changes, we reduced the number of writes to the database by 90%.
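The de-duplication itself can be as simple as collapsing each consumed batch to the latest change per entity before writing. The sketch below shows the idea; the field names (entity_id, committed_at) are hypothetical, not the project's actual schema.

```python
# Sketch of batch de-duplication: keep only the latest change per entity so
# each entity is written to the database at most once per batch.
# Field names (entity_id, committed_at) are illustrative placeholders.
from typing import Iterable


def dedupe_latest(changes: Iterable[dict]) -> list[dict]:
    latest: dict[str, dict] = {}
    for change in changes:
        key = change["entity_id"]
        current = latest.get(key)
        if current is None or change["committed_at"] > current["committed_at"]:
            latest[key] = change
    return list(latest.values())


# Ten rapid-fire updates to the same entity collapse into a single upsert.
batch = [{"entity_id": "a1", "committed_at": i, "status": f"v{i}"} for i in range(10)]
assert dedupe_latest(batch) == [batch[-1]]
```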

Lessons learned

  1. Measure the journey time of your entities and consider different strategies for resolving your bottlenecks
  2. Monitor your target data stores

Tests

At every step of the way between the source database and the target data stores, we used testing techniques to guard our quality and backward compatibility.

Unit tests were used to test decision making, error handling, and internal formats.

Component tests were used to test output formats, which proved very useful when we changed the replication slot format.

Component tests were also used to test the integration of the data in business applications.
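For illustration, a component test over the output format could look like the pytest-style sketch below; build_output_message and its module are hypothetical stand-ins for the real code under test, and the required fields are illustrative.

```python
# Pytest-style sketch that pins the published output format so a breaking
# schema change fails the build. build_output_message and cdc.publisher are
# hypothetical stand-ins for the real code under test.
import json

from cdc.publisher import build_output_message  # hypothetical module under test

REQUIRED_FIELDS = {"entity_id", "table", "operation", "committed_at", "payload"}


def test_output_message_keeps_backward_compatible_schema():
    change = {
        "entity_id": "a1",
        "table": "orders",
        "operation": "update",
        "committed_at": "2021-07-01T12:00:00Z",
        "payload": {"status": "shipped"},
    }
    decoded = json.loads(build_output_message(change))
    # Downstream consumers rely on these fields; renaming or dropping one is a
    # breaking change and should fail here, not in production.
    assert REQUIRED_FIELDS <= decoded.keys()
    assert decoded["operation"] in {"insert", "update", "delete"}
```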

Lesson learned

Write component tests as soon as you have an agreed format.

Monitoring

CDC data that flows from a source database via multiple Kafka clusters into multiple data stores is similar to a funnel of data.

For every incoming change we increased a counter tagged with the machine name and table name; whenever we published the change we increased another counter with the same tags; ignored tuples were counted as well.

All these counters were sent to a Datadog server and presented in a new dashboard that we created. The dashboard allowed us to filter data by machine, environment (e.g. staging), table, and consuming micro-service. On top of these counters, we created monitors in Datadog to alert on suspicious values.
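As a sketch of that instrumentation, assuming the DogStatsD client from the official datadog Python package and a local agent; the metric and tag names are illustrative, not the actual ones we used.

```python
# Sketch of the funnel counters, assuming the DogStatsD client from the
# "datadog" Python package and a local DogStatsD agent. Metric and tag
# names are illustrative placeholders.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)


def record_received(machine: str, table: str) -> None:
    statsd.increment("cdc.changes.received", tags=[f"machine:{machine}", f"table:{table}"])


def record_published(machine: str, table: str) -> None:
    statsd.increment("cdc.changes.published", tags=[f"machine:{machine}", f"table:{table}"])


def record_ignored(machine: str, table: str) -> None:
    statsd.increment("cdc.changes.ignored", tags=[f"machine:{machine}", f"table:{table}"])

# Comparing received vs. published per machine/table on the dashboard shows
# where the funnel narrows or stalls.
```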

In addition, we added monitors on the source and target databases, both the AWS Aurora RDS instances and the local Postgres instances, to track disk size, replication slot size, CPU, and IOPS.

To read more about monitoring AWS Aurora RDS in Datadog see here.

This allows us, upon receiving an alert or a complaint, to open the dashboard and understand the state of the system; it also helped us create a playbook for resolving these issues.
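One of those source-side checks, replication slot lag, can be read straight from Postgres (version 10 or later) and shipped as a gauge; the sketch below assumes psycopg2 and the datadog package, with an illustrative DSN and metric name.

```python
# Sketch of a source-side check: how far behind each replication slot is, in
# bytes of WAL. Assumes Postgres 10+, psycopg2, and a local DogStatsD agent;
# the DSN and metric name are illustrative placeholders.
import psycopg2
from datadog import statsd

LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(LAG_QUERY)
    for slot_name, lag_bytes in cur.fetchall():
        statsd.gauge("cdc.replication_slot.lag_bytes", lag_bytes,
                     tags=[f"slot:{slot_name}"])
```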

Lesson learned

Plan how to monitor the flow of data and the functioning of single components.

In conclusion

To summarize the experience and the lessons learned: I recommend studying the behaviors of your database and modeling the flow of data through the CDC, so that you can replicate it and measure it on your own system.

I believe monitoring will allow you to analyze these behaviors and their results.

Thanks and credits

I would like to use this opportunity to thank my colleagues Nir Barel and Yaron Edry, who joined hands and minds with me on the CDC project and also reviewed this series of posts. Without them, this project could not have succeeded.
