How Change Data Capture Works: Understanding the Impact on Databases

John Kutay
Published in Striim · May 11, 2024

Change Data Capture (CDC) is a method used to extract database transactions in real time or near real time. Many data engineers and analytics engineers adopt CDC platforms like Striim, HVR (Fivetran), GoldenGate, or Qlik Replicate to replicate operational data from databases to their data warehouses or data lakes such as BigQuery, Databricks, Snowflake, Fabric, and others. Change Data Capture enables businesses to react swiftly to real-time data without impacting production operational systems, which is crucial for event-driven architectures and analytics. This post examines the impact of CDC on operational databases, from minimal to significant, analyzing techniques ranging from redo log mining to the maintenance of shadow tables.

Based on some in-person discussions I had in the data community, I observed a perception that Change Data Capture is exclusively a requirement for real-time analytics.

While CDC certainly enables real-time analytics if downstream consumers are not a bottleneck, the main value proposition of Change Data Capture is providing the most performant, least impactful way of replicating data from an operational database, regardless of the downstream latency requirements for analytics.

In practical terms, the start of a database replication project should not alarm DBAs and software engineers with potentially disruptive requests like polling the production database or adding triggers. Instead, data engineers need to come across as knowledgeable, considerate of potential concerns, and proactive in collaborating to provide insight into the operational state reflected by the database.

Why we care about database performance

Databases are critical for both business operations and customer-facing applications, where performance directly influences user experience, operational efficiency, and ultimately, business outcomes. High-performance databases ensure that applications run smoothly, data retrieval is swift, and transactions are processed quickly, which is vital for maintaining customer satisfaction and competitive advantage. For instance, in e-commerce, faster database responses can lead to quicker page loads and transaction processing, directly affecting sales conversions and customer retention. This is why we don’t want internal analytics workloads competing for database resources.

Implementing Change Data Capture using direct queries or maintaining shadow tables can significantly influence the management of the database’s cache, particularly the Least Recently Used (LRU) cache objects. The database query cache is designed to keep frequently accessed data in memory, optimizing performance for repetitive queries. However, the operations involved in CDC, like frequent querying or updating shadow tables, can disrupt this optimization.

These CDC methods might populate the cache with data that is less frequently accessed by operational applications, potentially evicting more critical data from the cache and degrading overall application performance. This scenario underscores why minimal impact CDC methods, such as log mining, are often preferable from a performance optimization perspective. These methods are less likely to interfere with the normal operation of the database’s query cache, maintaining a better balance between data capture and operational efficiency.

Long story short, we want the database to be tuned for application purposes, not for the convenience of internal analytics workloads.

Varying Degrees of Impact on Databases

I ran a poll on both LinkedIn and Twitter to gauge the data community’s assumptions about the impact of Log-Based Change Data Capture, and the results were polarizing to say the least!

To answer my respective polls on Twitter and LinkedIn, the correct answer is YES. There is always some memory overhead when performing Change Data Capture on an operational database. However, the impact can vary from minimal (such as the memory for a thread that tails a log file on disk) to extreme (such as periodically running queries against your database). That can be the difference between minor memory tuning on your database to support Change Data Capture and doubling the size and cost of your database to support batch queries.

To help visualize this, I created a matrix of performance impact from various methods of Change Data Capture.

Here we show how Striim performs Change Data Capture from Oracle. You can see that some of the memory usage on the database is ‘house money’, so to speak: Oracle’s System Global Area maintains a redo log buffer regardless of whether any external Change Data Capture consumer is reading changes for downstream analytics. Striim simply subscribes to the changes published in the online redo log and buffers them in memory and on disk. We’ve done extensive work to support long-running transactions off-heap in Striim’s system layer, which also offloads processing from the operational database.

Minimal Impact: Redo or Write-Ahead Log Mining

At the minimal end of the spectrum is log mining. Database systems like Oracle use the redo log files for log mining, which is a relatively low-impact method of implementing CDC. The Oracle database maintains a redo log to record all changes made to the database. This log is essential for data recovery and is also used in log mining to track changes.

The actual memory impact here is primarily on the System Global Area (SGA) in Oracle. The SGA is a shared memory region that contains data and control information for an Oracle database instance; it is used to process SQL statements and to manage data as it moves through the system. In the context of CDC, when Oracle’s log mining feature (LogMiner) is used, it reads from the redo logs and uses database server memory to store the session’s private SQL area for processing the mined data. Note that Oracle’s native log mining can still cause out-of-memory errors if that memory is not sized appropriately.

The SGA size can vary based on the workload and the specific configuration of the Oracle instance. However, since log mining processes only read the redo logs and do not require maintaining a separate physical structure for the changes, the memory overhead is relatively controlled. The key to minimizing impact in such configurations is to ensure that SGA memory is sized adequately to handle the peak workload without causing significant performance degradation.

Oracle also supports archived redo logs (archive logs), which serve as a backup of the redo log for long-term storage and recovery purposes. You can also mine data from the archive logs with minimal impact, given that the work of generating the redo records was already done by the database and you are just spinning up a few threads to tail a file on the database server’s operating system.
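
To make the mechanics concrete, here is a minimal Python sketch of a LogMiner session using the python-oracledb driver. The redo log path, credentials, and the 'APP' schema are placeholders for illustration only, and the session must run with the LOGMINING privilege; a production pipeline (or a tool like Striim) would also handle log switches, checkpointing, and restarts.

```python
import oracledb

# Hypothetical connection details for a user granted the LOGMINING privilege.
conn = oracledb.connect(user="cdc_miner", password="change_me",
                        dsn="db-host:1521/ORCL")
cur = conn.cursor()

# Register one redo (or archived) log file and start a LogMiner session
# that uses the online catalog as its dictionary.
cur.execute("""
    BEGIN
      DBMS_LOGMNR.ADD_LOGFILE(
        LOGFILENAME => '/u01/app/oracle/oradata/ORCL/redo01.log',
        OPTIONS     => DBMS_LOGMNR.NEW);
      DBMS_LOGMNR.START_LOGMNR(
        OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
    END;""")

# Read the decoded change records for one application schema.
cur.execute("""
    SELECT scn, operation, table_name, sql_redo
    FROM v$logmnr_contents
    WHERE seg_owner = 'APP'""")
for scn, operation, table_name, sql_redo in cur:
    print(scn, operation, table_name, sql_redo)

# End the session so the mining memory is released.
cur.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")
```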

Low Impact: Binary Log CDC in MySQL and Logical Replication in PostgreSQL

MySQL BinLog

Moving to a low-to-moderate level of impact, consider MySQL’s binary log-based CDC. MySQL’s binary log records all changes to the database, both data and structure, as a series of events. This method is similar to Oracle’s log mining but has some operational differences that can influence memory usage.

The binary log itself is a file-based log, not directly impacting the database’s memory under typical operations. However, reading these logs for CDC purposes, especially when using external tools or custom scripts, can increase memory usage depending on how the changes are processed and staged before they are consumed or replicated.
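
As an illustration, the open-source python-mysql-replication library can register itself as a replication client and stream row events from the binlog. The connection settings, server_id, and table names below are hypothetical, and the server must be configured with binlog_format=ROW for row events to appear.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

# Hypothetical connection settings for a replication user.
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306,
                  "user": "repl_user", "passwd": "change_me"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=4379,                 # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,                  # keep tailing the binlog as new events arrive
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Each row carries the changed values (and, for updates, the before image).
        print(event.schema, event.table, type(event).__name__, row)

stream.close()
```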

PostgreSQL’s logical replication offers a balance between performance impact and real-time data synchronization capabilities. Logical replication in PostgreSQL involves streaming changes at the logical level, rather than copying physical data blocks, allowing for more selective replication and lower overhead compared to physical replication methods. This method captures changes to the database schema and data in the form of logical change records, which are then transmitted to subscriber databases.

The impact on memory and overall database performance with logical replication is generally low-to-moderate. Unlike methods that require frequent direct queries or the maintenance of shadow tables, logical replication leverages a publish-subscribe model, which minimizes the disruption to the main database operations. This approach allows PostgreSQL to maintain high performance by not overly taxing the database’s cache or CPU resources, making it particularly suitable for applications that require real-time or near-real-time data updates without a significant performance trade-off. Logical replication is highly configurable and can be tuned to replicate entire databases, specific tables, or even specific rows, offering flexibility that is valuable for maintaining efficient database operations and ensuring data consistency across distributed systems.
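
As a minimal sketch of the publish-subscribe mechanics, the built-in SQL functions below create a logical replication slot and pull its pending change records through the test_decoding output plugin (a real pipeline would more likely use pgoutput or wal2json). The slot name, database, and user are placeholders, and this assumes the server is running with wal_level = logical.

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=replicator")  # hypothetical
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot using the built-in test_decoding plugin.
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot(%s, %s);",
    ("cdc_demo_slot", "test_decoding"),
)

# ... inserts/updates/deletes happen on the source tables ...

# Consume the pending change records; this also advances the slot so the
# server can recycle the WAL it was retaining for this consumer.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL);",
    ("cdc_demo_slot",),
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)
```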

PostgreSQL Write-Ahead Log and Logical Replication

Mining the PostgreSQL Write-Ahead Log (WAL) can be an intensive operation, especially when done frequently or on large databases. Here are some of the disk-related issues that can occur from this process:

Increased Disk I/O: The WAL records every change made to the database, so mining it means reading through these records. This can lead to increased disk I/O, which might strain the storage system, especially if it’s not equipped with high-performance drives like SSDs.

Disk Space Consumption: The WAL can grow significantly in size, especially in a busy database system with lots of transactions. If the WAL files are not managed properly (e.g., archived or cleaned up regularly), they can consume a substantial amount of disk space, potentially filling up the disk.

Performance Degradation: As the disk begins to fill up, or as I/O operations increase, you might notice a degradation in overall system performance. This can affect not just database operations but other applications that rely on the same disk resources.

Fragmentation: Over time, continuous writing and deleting of WAL files can lead to disk fragmentation. This can degrade the performance of the disk as it requires more time to read scattered pieces of data.

Risk of Data Loss: In extreme cases, such as when the disk is full or nearly full, new transactions might fail or the system might not be able to write new WAL entries. This could lead to transaction failures or, in worst cases, data corruption if the system behaves unexpectedly due to disk space issues.

To mitigate these issues, it’s important to:

  • Regularly monitor disk space and I/O metrics (a monitoring sketch follows this list).
  • Implement proper WAL archiving and cleanup strategies.
  • Use high-performance disks for databases that require intensive I/O operations.
  • Consider scaling out your storage or using a dedicated storage system for WAL files to isolate the impact from other operations.
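
As a sketch of the monitoring point above, the query below reports how much WAL each replication slot is still holding back. A stalled or abandoned slot pins restart_lsn, and the retained WAL grows until it threatens the disk. Connection details here are placeholders.

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=monitor")  # hypothetical
cur = conn.cursor()

# Bytes of WAL each replication slot is still retaining on disk.
cur.execute("""
    SELECT slot_name,
           active,
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
    FROM pg_replication_slots;
""")
for slot_name, active, retained_wal_bytes in cur.fetchall():
    print(slot_name, active, retained_wal_bytes)
```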

Moderate Impact: Maintaining Shadow Tables and Change Stream Implementations

SQL Server Change Data Capture with Shadow Tables

On the more significant end of the spectrum is the use of shadow tables for CDC, such as that implemented by SQL Server’s Change Data Capture feature. This method involves creating and maintaining additional tables (shadow tables) that mirror the structure of the tracked tables and hold all changes made to the data. Each insert, update, or delete operation on the source table is reflected in the shadow table, capturing the old and new values of the affected rows.
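
For illustration, here is a rough sketch of enabling SQL Server CDC on a hypothetical dbo.Orders table and reading from the resulting change table via pyodbc. The connection string, table, and credentials are assumptions; enabling CDC requires elevated privileges, and SQL Server Agent must be running for the capture job to populate the shadow table.

```python
import pyodbc

# Hypothetical connection string and dbo.Orders table.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=AppDb;UID=cdc_admin;PWD=change_me;TrustServerCertificate=yes"
)
cur = conn.cursor()

# Enable CDC on the database and on one table; SQL Server then creates and
# maintains a change (shadow) table, cdc.dbo_Orders_CT, behind the scenes.
cur.execute("EXEC sys.sp_cdc_enable_db;")
cur.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Orders',
         @role_name     = NULL;
""")
conn.commit()

# Read all captured changes between the lowest and highest available LSNs.
cur.execute("""
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
    SELECT __$operation, __$start_lsn, *
    FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
""")
for row in cur.fetchall():
    print(row)
```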

This approach has a more pronounced impact on memory and storage usage for several reasons:

  1. Increased Storage Requirements: Shadow tables increase the storage requirement as they duplicate a significant amount of data.
  2. Increased I/O Operations: Manipulating shadow tables requires additional read and write operations, which can lead to increased memory usage as data pages are loaded into and evicted from the cache.
  3. Overhead of Trigger-Based Tracking: In some implementations, triggers are used to populate shadow tables. Triggers themselves consume memory and CPU resources, further adding to the overhead.

MongoDB Change Streams are another method of implementing Change Data Capture (CDC) that generally has a minimal to moderate impact on database performance, depending on the scale and configuration of the deployment. Change Streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog (operations log) directly. They provide a more streamlined and scalable approach to reacting to database changes, making them particularly useful for applications that need to trigger actions or update external systems in response to data modifications within MongoDB.
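
For example, with pymongo a change stream is simply a watch() cursor on a collection. The aggregation pipeline below filters the stream down to inserts and updates, which is the “selective listening” discussed in the next section. The connection string, database, and collection names are hypothetical, and change streams require a replica set or sharded cluster rather than a standalone mongod.

```python
from pymongo import MongoClient

# Hypothetical connection string for a replica set deployment.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["shop"]["orders"]

# Filter the stream down to inserts and updates on this one collection,
# so only relevant events are decoded and shipped to the client.
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]

with orders.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"],
              change["documentKey"],
              change.get("fullDocument"))
```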

Performance Impact of MongoDB Change Streams

  • Low Overhead: Change Streams utilize MongoDB’s built-in replication capabilities. They operate by listening to changes in the oplog, a special capped collection that logs all operations that modify the data stored in databases. Since the oplog is already an integral part of MongoDB’s replication infrastructure, tapping into this system adds minimal overhead.
  • Scalability: MongoDB’s horizontal scalability through sharding means that Change Streams can also scale by distributing the load across multiple servers. This scalability helps in maintaining performance even as the volume of changes increases.
  • Selective Listening: One of the key features of Change Streams is their ability to filter and only listen to specific changes, which can significantly reduce the amount of data that needs to be processed and transmitted. This selective approach minimizes the memory and network bandwidth used, thereby mitigating the impact on the overall database performance.

Optimizing Performance with Change Streams

To optimize the performance when using MongoDB Change Streams, it’s crucial to:

  • Filter Changes: Apply filters to only subscribe to the relevant changes needed by the application, reducing unnecessary data processing.
  • Monitor Load: Keep an eye on the replication window and the impact of Change Streams on the primary database operations, especially in high-throughput environments.
  • Adjust Oplog Size: Ensure the oplog is appropriately sized to handle the volume of changes without frequent rollovers, which could lead to missed changes or higher latency in Change Streams.

Highest Impact: Periodic Batch Jobs with Query-Based CDC

The highest memory impact in CDC operations often occurs when CDC is implemented through periodic queries against the database. This approach involves running full or incremental queries at regular intervals to detect changes in the data. This method can be highly resource-intensive, particularly for large databases or databases with high transaction volumes.
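
A simplified Python sketch of this pattern is shown below: a watermark column (here a hypothetical updated_at timestamp on an orders table) is polled on a schedule, and every poll runs against the production database. Note that this style of CDC also cannot observe hard deletes unless rows are soft-deleted.

```python
import time
import psycopg2

POLL_INTERVAL_SECONDS = 300           # e.g. a batch job every five minutes

conn = psycopg2.connect("dbname=appdb user=etl")  # hypothetical
conn.autocommit = True
cur = conn.cursor()

last_seen = "1970-01-01 00:00:00+00"  # high-water mark for the updated_at column

while True:
    # Every poll scans the table (or at best an index on updated_at) on the
    # production database; this is exactly the load that log-based CDC avoids.
    cur.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > %s ORDER BY updated_at",
        (last_seen,),
    )
    rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][2]
        # ... hand the changed rows to the downstream warehouse loader ...
    time.sleep(POLL_INTERVAL_SECONDS)
```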

The main challenges with batch query-based CDC include:

  • High Resource Utilization: Running complex queries to detect changes can consume significant CPU and memory resources, as these queries often involve scanning large portions of tables or joining multiple tables.
  • Impact on Database Performance: Frequent and resource-intensive queries can degrade the performance of the primary database operations, potentially leading to slower response times for other applications using the same database.
  • Data Freshness Issues: Since this method captures changes at intervals, there is a latency in data capture, which may not be suitable for scenarios requiring real-time or near-real-time data syncing.

How we designed Striim

Change Data Capture is a powerful technique for enabling real-time data processing and synchronization. The choice of CDC method should consider not only the operational requirements but also the impact on database performance, especially memory usage. From minimal impact techniques like Oracle’s log mining to more intensive methods like SQL Server’s shadow tables, and the highest impact method of periodic query-based CDC, each approach has its trade-offs that need to be managed to maintain overall system efficiency and performance.

At Striim, we’ve optimized CDC for both high performance and ease of use, supporting log-based CDC as well as CDC from change tracking tables. This simplifies data movement and reduces total cost of ownership, as highlighted by customers like American Airlines: “Striim is a fully managed service that reduces our total cost of ownership while providing a simple drag and drop UI. There’s no maintenance overhead for American Airlines to maintain the infrastructure.”

By optimizing Change Data Capture for low impact data capture and low latency delivery to consumers, enterprises like UPS are able to build real-time AI applications to battle porch pirates.

You can learn more about Striim’s Change Data Capture performance in our partner benchmark blogs:

Striim Oracle to BigQuery Benchmark

Striim Oracle to Snowflake Benchmark

If you want to start developing real-time, low impact Change Data Capture pipelines you can try Striim’s free trial and free community version.

John Kutay is Director of Product & Engineering at Striim.