Building a CDC — Lessons learned — Part 2

In databases, change data capture (CDC) is a set of software design patterns used to determine and track changes in data so that action can be taken using the changed data.

In March 2020 I was given a mission: Build a system that will synchronize data between local databases to a centralized database. In the following 3 posts I will detail the lessons learned by myself and my colleagues during this journey until we reach production.

In the first post I focused on getting started and the data.

In this I post will focus on the source database: Postgres or AWS Aurora RDS based on Postgres. Read more about the first steps I took building a CDC on the AWS Aurora here.

The third post will focus on development practices we’ve made to improve the performance and our ability to monitor and troubleshoot the system.

Photo by ConvertKit on Unsplash

Replication slot keep-alive

When a message is received from the replication slot it includes a Log Sequence Number (aka LSN), it is the duty of the CDC to acknowledge to the replication slot that the handling of the message is done by sending a standby status update message with the LSN of the message.

This flow is very critical because it is an indicator that specific messages have been consumed and thus can be deleted from the replication slot.

But consider the following, when consuming data from the replication slots you have the option to filter the tables you are interested in, if you are choosing a table(s) with very few updates you are filtering out most of the data changes and thus cannot acknowledge that an LSN has been consumed.

This may seem like a non-issue, but remember that when your CDC restarts you will most likely want to start from the last acknowledged LSN, a stale LSN may force you to repeat changes you’ve already processed.

The recommendation, in this case, is to create a mock-table in your database with a periodic update to it and have the CDC request the table when consuming from the replication slot.

Lesson learned

Monitor the rate of LSN commits, if low add a mock table.

Replication slot formats

As noted in the transactions section or the first post, we started out with wal2json version 1 and then changed to version 2.

However, when running the huge transactions on AWS Aurora RDS, we encountered delays between the end of the transaction and the time the data was published to our message bus.

At that time we also started monitoring, on our Datadog dashboard, the AWS metric: aws.rds.oldest_replication_slot_lag, which according to the documentation indicates: Replication delay on cross-region read replica with highest lag (MB).

We could definitely see a correlation between the two: the reported slot lag was Peta Bytes, and a long journey time to the message. We didn’t fully understand it, but we were worried.

We consulted with AWS support and decided to give the pgoutput format a try.

To highlight what pgoutput is: It is a binary format maintained by Postgres itself and can be used for data replication. We thought that being binary, it may be faster than a format based on JSON (i.e. wal2json).

To test it:

  1. Created an isolated environment
  2. The automation team filled its database with ~80K entities
  3. A script updated the database tables at an agreed rate, measured how long it took the data from the moment the transaction started until the data was found in the message bus
  4. The measuring ran on two branches: wal2json version 2, pgoutput

Unfortunately, the results were inconclusive, roughly 55 seconds for 2000 records transaction in both cases, and we put the pgoutput format back to a possible technical debt.

Lesson learned

Measure speed for different formats.

Replication slot status

As I was exploring the extensive monitoring we added to the CDC I noticed that one of our systems stopped sending data roughly 24 hours earlier.

I checked the machine’s health, the logs, the database health, and everything seemed to be alright, until I ran the following query:

select * from pg_replication_slots

The active column was ‘f’, meaning false. If you check the documentation you’ll realize this means that the replication slot is not being used.

But why?

As I was running a “tail -f” on the log file, the docker container was restarted, the replication slot returned to status active = ‘t’, but I then saw the following error in the log file:

Unknown message type &pgproto3.ErrorResponse{Severity:\”ERROR\”, Code:\”XX000\”, Message:\”could not open relation with OID 0\”, Detail:\”\”, Hint:\”\”, Position:0, InternalPosition:0, InternalQuery:\”\”, Where:\”\”, SchemaName:\”\”, TableName:\”\”, ColumnName:\”\”, DataTypeName:\”\”, ConstraintName:\”\”, File:\”reorderbuffer.c\”, Line:3354, Routine:\”ReorderBufferToastReplace\”, UnknownFields:map[uint8]string{0x56:\”ERROR\”}}

When I checked the replication slot it was back to active=’f’. We repeated this test and got the same result.

The message was just a result of the CDC printing to the log anything it didn’t know how to handle.

The cause for the error and the disconnection is unknown to us, but solving it was relatively simple: Whenever ErrorResponse was received we reconnected to the replication slot and added monitoring for the disconnections.

Lessons learned-

  1. Print to the log any unknown message
  2. Monitor active status of the replication slot
  3. Plan your CDC to bounce back from disconnections and errors


Thanks and credits

I would like to use this opportunity to thank my colleagues who have joined hands and minds with me to work on the CDC project and also reviewed this series of posts: Nir Barel and Yaron Edry. Without whom this project could not succeed.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store