Log-based Change Data Capture — lessons learnt

Overview, benefits, use cases

Andreas Buckenhofer
Mercedes-Benz Tech Innovation
5 min read · Sep 3, 2020


CDC architecture with source DB and target systems

This article summarizes experiences from various projects with log-based change data capture (CDC). There are many use cases for which CDC is beneficial, and some databases even have CDC functionality built in without requiring a separate tool.

Change detection

Detecting changes in transactional source systems often turns out to be slow and not scalable. Lightweight, high-performance change detection is required. Well-known approaches are change columns, triggers, and log-based CDC.

  • Change columns
    Tables in the source system are created with an additional column, such as a timestamp, which identifies new or changed data records (a minimal polling sketch follows this list).
    However, this solution does not detect deleted records. Another disadvantage is that the application must be changed to maintain the column on every write.
  • Triggers
    Database triggers can protocol any data changes to a log table. A process transfers the data from the log table to the target.
    This approach recognizes deletes. Additionally, the application does not need to be changed.
    However, triggers put a strain on the source database from a performance point of view. If a transaction commits, it must wait until the trigger finished, too.
  • Log-based CDC
Databases write changes into their transaction log, which they need for backup and recovery anyway. Additionally, sequential writes into a transaction log are much faster than random writes into data files.
    Open-source tools like Debezium or commercial tools like Oracle GoldenGate and IBM IIDR read changes from the log and replicate them to the target system.
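
As an illustration of the change-column approach, here is a minimal polling sketch in Python. The orders table, the last_modified column, and the connection details are hypothetical, and the psycopg2 driver is assumed; note that deleted rows never show up in the result.

    # Minimal sketch of the change-column approach (PostgreSQL via psycopg2).
    # Table, column, and connection details are hypothetical.
    import psycopg2

    def poll_changes(conn, last_seen):
        """Fetch rows inserted or updated since the last poll."""
        # Deletes are invisible here, and the application must reliably
        # maintain last_modified on every write.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, payload, last_modified FROM orders"
                " WHERE last_modified > %s ORDER BY last_modified",
                (last_seen,),
            )
            return cur.fetchall()

    conn = psycopg2.connect("dbname=source user=cdc")
    rows = poll_changes(conn, last_seen="2020-09-01 00:00:00")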

Pros and Cons of log-based CDC

Log-based CDC has fundamental advantages over the other change detection methods listed:

  • The additional performance impact on the source system is low.
  • CDC enables near real-time architectures, allowing faster and more accurate decisions based on the target systems.
  • No significant changes to the application in the source system are necessary. Changes are limited to, for example, setting database parameters to generate extended transaction log entries, or installing the CDC software.
  • CDC reduces the amount of data transmitted over the network compared to the other detection methods described above.
  • The coupling between source and target is asynchronous. A slow consumer does not affect the source database.

Log-based CDC also has some disadvantages you must be aware of:

  • Many commercial tools require an additional license.
  • Avoid no-logging transactions, as these bypass the transaction log. However, most databases allow force logging through suitable parameter settings.
  • A separate tool has to be operated and requires additional know-how.
  • Primary or unique keys are required by many log-based CDC tools, but a good database design guarantees this anyway.
  • If the target system is down, transaction logs must be kept until the target has absorbed the changes.

How does log-based CDC work?

The diagram above shows several uses of log-based CDC. Users or applications change data in the source database, e.g. insert, update, or delete data. The database writes all changes into the transaction log (or write-ahead log) before integrating them into the data files.

The RDBMS example shows a user inserting data and a second user updating the same data in the next transaction. The transaction log contains the change protocol until a retention interval clears the entries. The data file contains just the current state.
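
Purely as an illustration, the same scenario in Python notation; the key, the values, and the log sequence numbers are made up:

    # Illustrative only: an insert followed by an update of the same row.
    # The transaction log keeps both change records (until retention
    # clears them); the data file holds just the current state.
    transaction_log = [
        {"lsn": 101, "op": "INSERT", "id": 42, "name": "Alice"},
        {"lsn": 102, "op": "UPDATE", "id": 42, "name": "Alicia"},
    ]
    data_file = {42: {"name": "Alicia"}}  # current state only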

Databases use transaction logs primarily for backup and recovery, but the data can also be used to replicate changes into a target system. CDC tools use the database API to read from the transaction logs (or from memory if the data is still in the DB cache) and transfer the data to a target system.

The example shows a logical view of a CDC tool reading data from the transaction log of the source system and submitting the changes to target databases or a target data lake. The CDC software itself does not persist the data stream. Kafka could be used as a log-centric approach to retain the changes long-term and to distribute them to further target systems.
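
For instance, with Debezium a connector is registered by posting a JSON configuration to the Kafka Connect REST API. A minimal sketch in Python follows; the host names, credentials, and table list are placeholders, and some property names differ between Debezium versions.

    # Registering a Debezium PostgreSQL connector via the Kafka Connect
    # REST API. Host names, credentials, and table list are placeholders.
    import requests

    connector = {
        "name": "source-db-connector",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "source-db",
            "database.port": "5432",
            "database.user": "cdc",
            "database.password": "secret",
            "database.dbname": "appdb",
            "database.server.name": "sourcedb",     # prefix for Kafka topic names
            "table.include.list": "public.orders",  # replicate a subset only
        },
    }
    requests.post("http://connect:8083/connectors", json=connector).raise_for_status()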

The diagram shows two cases of data propagation:

  • Replication-mode (“Target Mirror DB”)
    Replication creates a copy of the source: updates change the data in the target, and deletes remove data from the target.
    The target system does not necessarily have the same structure as the source system. It is possible to choose only a subset of tables, to replicate selected columns within a table, or to filter rows.
    The example shows the target DB, which contains the current state of the changes.
  • Auditing-mode (“Target DWH DB” and “Target Data Lake”)
    Auditing-mode keeps the whole data history. The CDC tool converts updates and deletes into inserts in the “Target DWH DB” or “Target Data Lake”. A flag indicates the kind of operation (insert, update, delete), and a timestamp field records when the event happened.
    The example shows the “Target DWH DB” and “Target Data Lake”, which contain the complete protocol of changes, including operation and timestamp (a minimal apply sketch for both modes follows this list).
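
The difference between the two modes can be sketched in a few lines of Python; the event shape and the in-memory "tables" below are illustrative, not specific to any CDC tool.

    # Sketch: applying one CDC event in the two propagation modes.
    from datetime import datetime, timezone

    mirror = {}        # replication mode: current state, keyed by primary key
    audit_trail = []   # auditing mode: append-only history of all changes

    def apply_event(event):
        key, op = event["id"], event["op"]
        if op in ("INSERT", "UPDATE"):
            mirror[key] = event["data"]   # upsert into the mirror
        elif op == "DELETE":
            mirror.pop(key, None)         # physically remove the row
        # Auditing mode turns every operation into an insert with an
        # operation flag and a timestamp.
        audit_trail.append({**event, "ts": datetime.now(timezone.utc).isoformat()})

    apply_event({"id": 42, "op": "INSERT", "data": {"name": "Alice"}})
    apply_event({"id": 42, "op": "UPDATE", "data": {"name": "Alicia"}})
    apply_event({"id": 42, "op": "DELETE", "data": None})
    # mirror is now empty; audit_trail holds all three change records.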

Use Cases

There are a variety of use cases for CDC:

  • Replicate changes into a DWH or data lake.
  • Replicate changes into Kafka in a microservices architecture.
  • Upgrade a database to a higher or the latest version with minimal downtime, e.g. upgrade PostgreSQL 11.9 to PostgreSQL 12.4.
  • Migrate data from database X to database Y with minimal downtime, e.g. migrate MariaDB to PostgreSQL or vice versa.
  • Transfer data into the cloud over a secure connection.

Tooling

Commercial tools like Oracle GoldenGate, IBM IIDR, and many others have reached a high level of maturity, and their performance is excellent even during peak loads. These tools can replicate data between heterogeneous database systems.

Additionally, there are some other interesting products:

  • Debezium is an open-source product built on top of Kafka. Debezium has connectors that pull a change stream from DBs like PostgreSQL, MySQL, MongoDB, Cassandra, etc. and send the data to Kafka. There is a lot of exciting activity going on in the open-source community.
  • DynamoDB Streams can be enabled for the NoSQL DB DynamoDB. DynamoDB Streams captures table activity and provides a JSON stream of item-level changes (a minimal reader sketch follows this list).
  • Tables in VoltDB can be declared as export targets, which means that VoltDB sends a stream of changes into Kafka for such tables.
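
As a sketch of the DynamoDB case, the change records can be read with the AWS SDK for Python (boto3). The stream ARN and region below are placeholders, and a production reader would also follow NextShardIterator to keep consuming.

    # Reading change records from DynamoDB Streams with boto3.
    # The stream ARN is a placeholder; the stream must be enabled on the table.
    import boto3

    streams = boto3.client("dynamodbstreams", region_name="eu-central-1")
    stream_arn = "arn:aws:dynamodb:eu-central-1:123456789012:table/orders/stream/2020-09-03"

    description = streams.describe_stream(StreamArn=stream_arn)
    for shard in description["StreamDescription"]["Shards"]:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # start at the oldest record
        )["ShardIterator"]
        for record in streams.get_records(ShardIterator=iterator)["Records"]:
            # eventName is INSERT, MODIFY, or REMOVE; the images hold item data
            print(record["eventName"], record["dynamodb"].get("NewImage"))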

Conclusion

CDC is an essential component in modern architectures for transferring transactional data from systems into a data stream.

CDC enables the provisioning of transactional data in near real-time without causing a significant load on the source system, requires no changes to the source application, and reduces the amount of transferred data to a minimum.

Keeping data in only one system is not always possible. Be cautious with personal data: you have to know every place where personal data appears, for example to fulfill deletion requirements under GDPR. A data catalog can help to keep track of replicated sensitive data.


Andreas Buckenhofer
Mercedes-Benz Tech Innovation

Principal Vehicle Data Architecture. Years of experience in data-driven solutions and end-to-end data products. Lecturer on data topics at DHBW University.