Change data capture

Sreedhar Joshi
4 min read · Aug 18, 2020


Use a tool and don’t code it. It’s harder than we think

Photo by Tyler Lastovich on Unsplash

We are seeing increased uptake of microservices-based architecture, and one implication of that choice is a large number of coordinating services, each with its own data store, ideal for its individual use case. When an enterprise embarks on this journey, the first decision it has to make is how to reach that eventual architecture starting from a legacy monolith with a single large data store.

One common pattern that we see is to implement Change Data Capture (CDC) based architectures. The bird’s-eye explanation of this architecture is: we capture the changes “as they happen” at the source system, stream them across using data streams, and have each subsystem subscribe to the stream and implement its domain logic. This architecture has a lot of takers because:

  1. There is no disruption to the services offered by the legacy app
  2. The consumers of the service have the luxury of moving over to the newer services, as they have a longer runway
  3. It gives comfort to the business, as it can always fall back to the proven system
  4. It provides a good implementation of the strangler pattern, and the application can slowly migrate different capabilities into individual subsystems

Typically, when such a large-scale migration initiative is started, a lot of energy, focus, enthusiasm and, more importantly, funding is spent on the new services being built, and the often-neglected aspect, the one that takes the back seat, is the “Change Data Capture” component. Let us examine the options. If you are using a data store like Postgres, there are options like Debezium, and there is a lot of buzz in this area. However, in most enterprises the common data store is Oracle, and the solutions in that space are usually paid offerings like Oracle GoldenGate, which are by no stretch of the imagination cheap. The dilemma the team is caught in is:

  1. We are potentially migrating away from Oracle; is it worth investing money in solutions around Oracle?
  2. How complex can it be to poll individual tables and write to a data streaming platform like Kafka?

In a lot of cases, where the team needs to use its budget prudently, it chooses a homegrown CDC solution.
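To make the second question concrete, here is a minimal sketch of what such a homegrown poller might look like, assuming the cx_Oracle and kafka-python client libraries; the table, columns, topic and connection details are all hypothetical. It looks deceptively simple, which is exactly the trap:

```python
import json
import time

import cx_Oracle                 # Oracle driver
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)
conn = cx_Oracle.connect("app_user/secret@db-host/ORCLPDB1")
cur = conn.cursor()

last_seen = "1970-01-01 00:00:00"  # the watermark; persist it in a real system

while True:
    cur.execute(
        """SELECT id, payload, updated_at
             FROM orders
            WHERE updated_at > TO_TIMESTAMP(:ts, 'YYYY-MM-DD HH24:MI:SS')
            ORDER BY updated_at""",
        ts=last_seen,
    )
    for row_id, payload, updated_at in cur:
        producer.send("orders-changes", {"id": row_id, "payload": payload})
        # second-level precision only; see requirement 1 below
        last_seen = updated_at.strftime("%Y-%m-%d %H:%M:%S")
    producer.flush()
    time.sleep(5)  # a fixed interval; see requirement 4 below
```

Everything that follows is about what this loop gets wrong.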

In this section I would like to outline the factors and challenges associated with building such a system. The following are the requirements for developing one:

  1. At least have one column that will indicate a change
  2. The data should be modified and not deleted
  3. Triggers can be enabled
  4. Choose the right frequency
  5. Auditability

1. At least have one column that will indicate a change: Usually the column chosen for this is a date-time field. We store the timestamp of our last poll and pick up the changes made after it. The issues with this approach are:
  • Right precision — how frequently does the data change, and can the field’s precision keep up with it?
  • Integrity — how many systems update that column, and do they always update it in write order, never with an older timestamp? And can you guarantee no dirty reads against it?
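Both issues can be mitigated, though never fully eliminated, by polling with an overlap window and de-duplicating what was already published. A hedged sketch against the same hypothetical orders table:

```python
from datetime import timedelta

OVERLAP = timedelta(seconds=10)  # must exceed your longest in-flight transaction
published = {}                   # primary key -> last published updated_at

def rows_to_publish(cur, watermark):
    # Re-read a little history so rows that committed "late", with an
    # in-between timestamp, are not skipped forever.
    cur.execute(
        """SELECT id, payload, updated_at
             FROM orders
            WHERE updated_at > :ts
            ORDER BY updated_at""",
        ts=watermark - OVERLAP,
    )
    for pk, payload, updated_at in cur:
        if published.get(pk) == updated_at:  # this version already went out
            continue
        published[pk] = updated_at
        yield pk, payload, updated_at
```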

2. The data should be modified and not deleted: In most systems this requirement is already met, mostly for audit reasons. However, you will be surprised to see how many “maintenance” jobs delete, correct or prune the data. Do your analysis thoroughly before you check this box off.
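To illustrate with a hypothetical pruning job: the hard delete leaves nothing behind for a timestamp-based poller to find, while the soft delete is just another update and flows through CDC:

```python
# Invisible to a poller: the rows are simply gone.
HARD_DELETE = """
DELETE FROM orders
 WHERE created_at < ADD_MONTHS(SYSDATE, -12)
"""

# Capturable: the change shows up as a normal update.
SOFT_DELETE = """
UPDATE orders
   SET deleted = 'Y',
       updated_at = SYSTIMESTAMP
 WHERE created_at < ADD_MONTHS(SYSDATE, -12)
"""
```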

3. Triggers can be enabled: This requirement is often not even considered, as triggers are deemed unacceptable for a large-scale enterprise data store and the DBAs will ask you to walk over their dead bodies before you enable them. But spend enough effort to make it possible; this will prove to be one of the most critical features you will long for, and the effort is well spent.
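If you do win that argument, the usual shape is an append-only changelog table populated by a trigger, which the poller can then read by sequence number instead of guessing from timestamps. A hypothetical sketch (all table, sequence and column names are made up; apply it through your normal migration tooling):

```python
# Hypothetical changelog trigger, kept as a migration script. Every write to
# orders is mirrored into an append-only change table.
CREATE_CHANGELOG_TRIGGER = """
CREATE OR REPLACE TRIGGER orders_cdc_trg
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_changelog (change_id, order_id, op, changed_at)
  VALUES (orders_changelog_seq.NEXTVAL,
          COALESCE(:NEW.id, :OLD.id),
          CASE WHEN INSERTING THEN 'I'
               WHEN UPDATING THEN 'U'
               ELSE 'D' END,
          SYSTIMESTAMP);
END;
"""
```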

4. Choose the right frequency: This sounds like the easiest question of all, but it is of great significance, because the workload will have very different characteristics at different points in time.
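One hedged way to accommodate varying load is to make the polling interval adaptive rather than fixed; the bounds below are purely illustrative:

```python
MIN_SLEEP, MAX_SLEEP = 1, 60  # seconds; tune to your workload

def next_interval(current, rows_fetched):
    # Busy period: tighten the loop. Quiet period: back off exponentially.
    if rows_fetched > 0:
        return MIN_SLEEP
    return min(current * 2, MAX_SLEEP)
```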

5. Auditability: This, for me, is the most important and the most difficult requirement. How do you know that you have not missed any data changes? And, more importantly, how can you prove that? This is the area where a lot of thought, experience and the engineering capability of building a distributed system are needed.
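One first-order check, sketched here against the same hypothetical schema, is a periodic reconciliation job: bucket the source rows by hour and compare them against what was actually published, so a gap surfaces as a report rather than as a production incident. Rows that are updated again later will move buckets, so treat mismatches as a signal to investigate:

```python
SOURCE_COUNTS = """
SELECT TRUNC(updated_at, 'HH24') AS bucket, COUNT(*) AS n
  FROM orders
 WHERE updated_at >= :since
 GROUP BY TRUNC(updated_at, 'HH24')
"""

def audit(cur, published_counts, since):
    """published_counts: hour bucket -> number of events we emitted."""
    cur.execute(SOURCE_COUNTS, since=since)
    mismatches = []
    for bucket, n in cur:
        emitted = published_counts.get(bucket, 0)
        if emitted != n:
            # each entry here is a potential lost data point you can now
            # see before it becomes a production incident
            mismatches.append((bucket, n, emitted))
    return mismatches
```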

Overall, building your own CDC system can look like a compelling option, but choose that path carefully. A lot of thought, experience and engineering capability are needed to build a production-grade CDC system. A lost data point may appear trivial, but “predictability” becomes an essential requirement when you are investigating a production issue, which is most likely tied to that lost data point.

A poorly built CDC system has the potential to undo all the great work done on the modernization and may cause all the great features that were built to never see the light of day. I strongly encourage using a tool for the purpose and not getting sucked into the temptation of building one; it is time and money worth spending.
