Building a CDC — Lessons learned — Part 1
In databases, change data capture (CDC) is a set of software design patterns used to determine and track changes in data so that action can be taken using the changed data.
In March 2020 I was given a mission: Build a system that will synchronize data between local databases to a centralized database. In the following 3 posts I will detail the lessons learned by myself and my colleagues during this journey until we reach production.
This post will focus on getting started and the data.
The second post will focus on the source database: Postgres or AWS Aurora RDS based on Postgres. Read more about the first steps I took building a CDC on the AWS Aurora here.
The third post will focus on development practices we’ve made to improve the performance and our ability to monitor and troubleshoot the system.
Learn from the predecessors
We started the journey by exploring our options.We realized very early that Debezium, unfortunately, doesn’t fit our mold, but that doesn’t mean we cannot learn from this wonderful project. We read its documentation, blog posts on the website, user experiences, and even looked at the code.
Once we chose to write the CDC in Go, we looked at projects such as https://github.com/chobostar/pg_listener.
We also read the documentation for the format, in case it is wal2json the readme proved to be extremely useful.
Check your alternatives, resources, and user experiences.
When we started out we wrote the CDC to use wal2json version 1, we didn’t think too much about it: version 1 was the default and everything seemed to work.
However, when we were testing the system on big databases we noticed that no data was sent by the CDC. But, the memory size of the process was growing slowly until the process died with an out of memory error or the connection to the database timed-out. We were in an endless loop because on every restart the CDC started from the same point.
The issue was that wal2json version 1 sends the entire transaction at once and we had huge transactions.
The largest I recorded was over 40K entities in one transaction!
The CDC receives the JSON and tries to unmarshal it into a Go struct, but unmarshalling the JSON takes a lot of time and memory due to its size, resulting in an out-of-memory crash or a timeout.
We solved this by moving to wal2json version 2, which streams every change as one message, regardless of the transaction.
Though this solved the CDC issue it posed a new problem, our AWS Aurora RDS Postgres did not support wal2json version 2 and we had to upgrade the entire fleet to a minor version that supports wal2json version 2. For plugin compatibility in AWS Aurora RDS Postgres see here.
Try your CDC on a database that behaves like production to see if you have issues with transactions, as early as possible.
Following the realization of the actual size of our transactions, we were wondering how much of these transactions are real changes and not some kind of a keep-alive mechanism we knew was active for some of our table entities.
To analyze the changed columns one needs to compare the old and new values for each column. However, Postgres by default only identifies the old record with the primary key.
But that could be changed, since Postgres 9, Postgres allows altering the replica identity for each table. This post is the best explanation I could find for it.
The default as I noted is identifying the old record by its primary key, this saves space on the replication slot thus is the desired configuration in most cases. However, for research purposes we used this feature to analyze tables with heavy update rate:
- Added an option to report the changed columns to the CDC
- Altered specific tables with high change rate to replica identity FULL, meaning getting the entire “old record”
Following the results of the analysis we realized we had a lot of keep-alive updates that we could throw away.
We considered staying with replica identity FULL, which would allow us to throw away these changes, however, at the time of writing this post (July 2021), we chose not to follow this route in order for us to keep the replication slot at minimal size.
Though we didn’t take action on the CDC side following the analysis, it did expose possible problems in the system which resulted in further R&D work, such as bug fixing.
- Analyze columns changed in heavy changing tables
- Explore different routes to resolve these issues
Thanks and credits
I would like to use this opportunity to thank my colleagues who have joined hands and minds with me to work on the CDC project and also reviewed this series of posts: Nir Barel and Yaron Edry. Without whom this project could not succeed.