Unlock the Power of Your Data with Change Data Capture

Julia Rolemberg
Blog Técnico QuintoAndar

How often have you and your team struggled with inefficient batch data ingestions that overwhelm your source databases, limit data replication to once a day, and offer only a high-level view of your data? We’ve been there too, but Change Data Capture (CDC) has revolutionized our approach to data ingestion and analysis.

Change Data Capture (CDC) is the process of tracking all events in a data source, so they can be captured and sent to a destination. This ingestion strategy has several advantages compared with traditional batch ingestion, such as higher data granularity, real-time data replication, and less impact on source databases, since data is replicated as it occurs during the day, not all at once.

Imagine that you work in a retail company where the Data Platform team maintains a batch ETL that runs every day at midnight, extracting data from a source database into the data lake in S3. Your team cannot keep track of all the updates the orders underwent during the day, because by the time the data ingestion runs, it captures only each order's final status, which is "Completed". In this scenario, you do not see the intermediate transactions that happened during the day, just the last one.

Let's go through a diagram. The company's source database contains a table called “orders” comprising all the purchase orders made on the website. This table has the following structure:

Note: The status column is updated during the order's lifetime.

Assume that this order has its status updated three times a day, as shown in the picture above. At the end of the day, when the ETL is scheduled to extract data from the source database, the order's status is "Completed", and that is the only value ingested into the data lake. In short, none of the other status updates made during the day appear in the destination table, only the final one, so the end user cannot see any of the statuses that preceded the one reflected there:
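The contrast between the two approaches can be sketched in a few lines of Python. The data below is hypothetical, just mirroring the "orders" example above: a daily batch snapshot sees only the row's final state, while a CDC stream preserves every change event.

```python
from datetime import datetime

# Hypothetical status changes for a single order during one day.
order_events = [
    {"order_id": 1, "status": "Created",   "updated_at": datetime(2024, 5, 1, 9, 0)},
    {"order_id": 1, "status": "Paid",      "updated_at": datetime(2024, 5, 1, 9, 45)},
    {"order_id": 1, "status": "Completed", "updated_at": datetime(2024, 5, 1, 18, 30)},
]

# Daily batch ETL: only the row's state at extraction time is visible.
batch_snapshot = order_events[-1]

# CDC: every change event is captured, preserving the full order history.
cdc_stream = order_events

print(batch_snapshot["status"])           # Completed
print([e["status"] for e in cdc_stream])  # ['Created', 'Paid', 'Completed']
```

The batch snapshot tells you *where* the order ended up; the CDC stream also tells you *how* it got there, which is exactly the granularity the business teams were missing.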

The lack of visibility into the order status changes throughout the day can lead to superficial analysis, as the business teams may struggle to fully understand the consumer journey and extract valuable insights. For example, they might not know if a user took 1 minute or 10 hours to complete the payment, which is crucial information for identifying customer behavior patterns. This understanding is essential for developing strategies to ensure order completion and enhance customer retention.

At QuintoAndar, we encountered challenges similar to those previously described. Recognizing the relevance of these issues, we identified the need to replicate data in a more granular manner to gain deeper insights into user behavior, in our case landlords and tenants for example. Besides that, it was important to enable data analysis at a frequency higher than once a day and reduce the impact on source databases during our ingestions. This enhancement would improve the decision-making process, unlock numerous strategic opportunities for our business teams, and improve resource usage on source databases. This led us to develop our data ingestion strategy using Change Data Capture (CDC). In this article, you will learn about the CDC concept, the available CDC tools, and how it can be implemented to unlock the power of your data.

What is CDC and how can it improve your data pipelines?

CDC, or change data capture, is the process of identifying changes made to data in source systems and capturing and replicating them in real time or near real time. This eliminates the need for batch loads and allows data to be replicated continuously and incrementally.

There are three methods most commonly used by CDC tools to detect changes in data:

  1. Timestamp-based: in the timestamp-based approach, each table in the database includes a column, such as LAST_MODIFIED, which reflects the time of the most recent change. The CDC tool queries this column to identify and retrieve updated records. It's important to note that this strategy cannot capture DELETE operations: a hard delete (permanent removal) leaves no trace in the table, so only soft deletes (rows flagged as deleted via an UPDATE) can be detected.
  2. Trigger-based: for the trigger-based method, most databases come equipped with predefined actions that automatically trigger in response to INSERT, UPDATE, or DELETE operations. These events are stored in separate tables within the same database, often called event or shadow tables. The CDC tool then reads these shadow tables to capture and track data changes.
  3. Log-based: in the log-based approach, transactional databases generate logs containing information about all committed changes (INSERT, UPDATE, or DELETE) in the database. While these logs are typically created for backup and disaster recovery purposes, the CDC tool can leverage them to capture and replicate all database events, ensuring a comprehensive record of changes.
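The first method above is simple enough to sketch directly. The snippet below is a minimal illustration of timestamp-based polling using an in-memory SQLite table; the table and column names (`orders`, `last_modified`) are hypothetical, and a real pipeline would persist the watermark between polls rather than pass it in by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT,
        last_modified TEXT  -- updated by the application on every change
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'Created',   '2024-05-01T09:00:00')")
conn.execute("INSERT INTO orders VALUES (2, 'Completed', '2024-05-01T18:30:00')")

def poll_changes(conn, last_sync):
    """Fetch rows modified since the previous sync (timestamp-based CDC)."""
    return conn.execute(
        "SELECT id, status, last_modified FROM orders WHERE last_modified > ?",
        (last_sync,),
    ).fetchall()

# Only rows changed after the last sync watermark are picked up.
changed = poll_changes(conn, "2024-05-01T12:00:00")
print(changed)  # [(2, 'Completed', '2024-05-01T18:30:00')]
```

Note how this approach only ever sees the *current* row state: if a row is updated twice between polls, the intermediate state is lost, which is one reason log-based tools are often preferred.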

By using a CDC tool, every time the table is updated, all the changes are captured and sent to the destination. Unlike traditional batch ETLs, the changes are replicated the moment they occur, and the final table contains every order status that occurred, not only the one registered last. This gives us a much more granular view for insights, enables trustworthy incremental ingestions, and puts less load on source databases than full-load batch ingestions. Interesting, right?

To enable CDC in your data pipelines, you must first select the appropriate CDC tool for your environment. This decision can be challenging, as it requires careful consideration of various factors to ensure the tool aligns with your business strategy, the structure of the source database, and your team’s technical skills. The following section is designed to guide you through this discovery process and help you overcome the initial challenges of choosing the right CDC tool for your business.

What is the best CDC tool?

Spoiler: there is no perfect CDC tool that suits all companies, but you can certainly find the best tool for your business requirements.

Choosing the right CDC tool for your business requirements can be a challenging task because there are several tools available, for example Fivetran, Qlik Replicate, Talend, Oracle GoldenGate, IBM InfoSphere, and Debezium.

Here at QuintoAndar, we chose to implement Debezium, which is a group of services that work together to track changes in databases. These changes are then organized into a stream of events, which can be sent to various storage systems. For most cases, Debezium is deployed using Kafka Connect, which is a framework and runtime for implementing and operating source and sink connectors. The image below shows an example of the architecture of a change data capture pipeline using Debezium.

Debezium Architecture. Source: Debezium.io, https://debezium.io/documentation/reference/stable/architecture.html
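To make the Kafka Connect deployment concrete, here is a rough sketch of what registering a Debezium source connector can look like. Everything in it is illustrative: the connector name, hostnames, and credentials are hypothetical, and while the property names follow Debezium's documented PostgreSQL connector configuration, the exact set varies by version (for instance, Debezium 2.x uses `topic.prefix` where 1.x used `database.server.name`), so always check the documentation for the connector and version you run. A JSON body like this would be POSTed to the Kafka Connect REST API.

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "shop",
    "table.include.list": "public.orders",
    "topic.prefix": "shop"
  }
}
```

Once registered, Debezium reads the database's transaction log and publishes one change event per committed INSERT, UPDATE, or DELETE to Kafka topics derived from the table names.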

Debezium suits our business better than the other options, considering our technology stack, use cases, and business requirements, but it may not be the best option for other scenarios, so choose carefully and go for what is best for you, based on your own requirements.

What should I take into account when choosing my CDC tool?

There are several factors to consider when choosing the ideal CDC tool for your data ecosystem. Here are some key aspects to take into account:

  1. Source Database Compatibility: make sure that the CDC tool is compatible with your source databases, as different tools may offer varying support for different database systems;
  2. Target System Compatibility: check if the CDC tool supports the systems where you intend to replicate the captured data;
  3. Performance and Scalability: evaluate its performance and scalability to handle growing data volumes, and consider the impact on your source databases' performance during data capture and replication;
  4. Ease of Integration: look for a CDC tool that seamlessly integrates with your existing infrastructure;
  5. Schema Evolution Support: check the tool's ability to handle schema evolution, because a good CDC tool should accept schema changes without requiring manual intervention;
  6. Monitoring: find a tool that allows you to track the health and performance of the CDC process effectively;
  7. Cost and Licensing: consider the cost of implementing and maintaining the CDC tool, including licensing fees, and scaling-related costs.

By weighing these factors, you can make an informed decision and choose a CDC tool that aligns with your specific needs.

Extra: improve the communication between microservices using the Outbox Pattern and CDC

In distributed systems, microservices need to remain independent of each other, each communicating only with its own database instance. Because of that isolation, communication between the services has to be asynchronous, usually through a message broker such as Kafka, and changes spanning multiple services can suffer from consistency and reliability problems. The Outbox Pattern solves the data synchronization issues that may occur during these asynchronous transactions.

Outbox Pattern example. Source: https://microservices.io/patterns/data/transactional-outbox.html

The Outbox Pattern decouples the service from the message broker by adding a component called the "outbox table", a table in the same database that receives a new record each time a transaction occurs on the main table. A separate process then reads the records from the outbox table and publishes the corresponding messages to the message broker. Once a message is successfully published, the process marks the record in the outbox table as processed, guaranteeing idempotency and preventing data synchronization issues. If you want to understand the Outbox Pattern in more depth, check out the article "Pattern: Transactional Outbox" by Chris Richardson.
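The essential trick is that the business write and the outbox write share one database transaction, so the event exists if and only if the order does. Below is a minimal sketch using SQLite; the table layout and the `OrderCreated` event name are hypothetical, chosen only to match the orders example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_id TEXT,
        event_type TEXT,
        payload TEXT,
        processed INTEGER DEFAULT 0
    );
""")

def create_order(conn, order_id, status):
    """Write the business record and the outbox event in ONE transaction,
    so the event is guaranteed to exist iff the order does."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload)"
            " VALUES (?, ?, ?)",
            (str(order_id), "OrderCreated",
             json.dumps({"id": order_id, "status": status})),
        )

create_order(conn, 1, "Created")

# A separate relay process (or a CDC tool, as the next section shows)
# reads unprocessed outbox rows and publishes them to the broker.
pending = conn.execute(
    "SELECT event_type FROM outbox WHERE processed = 0"
).fetchall()
print(pending)  # [('OrderCreated',)]
```

If the transaction rolls back, neither the order nor the event is written, which is precisely the dual-write problem the pattern eliminates.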

How can CDC enhance this process?

Usage of the CDC tool Debezium with the Outbox Pattern. Source: Adapted from https://microservices.io/patterns/data/transactional-outbox.html

By adding a CDC tool as the mechanism that polls data from the outbox table and propagates it to the message broker, the process improves in many aspects, such as:

  • Real-time change propagation;
  • Easy integration with Message Brokers, due to the inherent compatibility of most CDC tools with frameworks like Kafka;
  • Schema evolution support, given that CDC tools often handle schema changes gracefully, typically requiring only the configuration of a parameter in the connector to allow schema changes;
  • Automation of the outbox table population, by detecting changes and creating the corresponding records.
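For Debezium specifically, this integration is supported through its outbox event router, a single message transform (SMT) applied in the connector configuration. The fragment below is a hedged sketch of the relevant configuration keys; the SMT accepts further options for mapping outbox columns to event fields, so consult the Debezium documentation for your version before relying on it.

```json
{
  "transforms": "outbox",
  "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
}
```

With the SMT in place, Debezium reads the outbox table's change events from the transaction log and routes each one to the appropriate Kafka topic, so no custom relay process is needed.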

Conclusion

The implementation of the Change Data Capture (CDC) strategy in our data pipelines at QuintoAndar has opened up numerous possibilities for our data analysis, enabling us to extract more valuable insights to better understand the journey of a user when renting, buying, or selling a property. This has significantly enhanced our understanding of the business, improved our decision-making process, and unlocked many other real-time analysis opportunities.

Additionally, the shift in our data ingestion approach, from a resource-intensive process that extracted all data at once to a more efficient incremental one, has allowed our teams to reduce the size of database instances and deactivate read replicas (read-only copies of a database instance used to reduce the load on the primary database). The cost reduction has been substantial, not only because of the decreased impact on databases and the deactivation of replicas, but also because CDC enables reliable incremental ingestions, eliminating the daily full-load batch ingestions that demanded heavy compute resources.

I hope this article has helped you better understand how CDC can improve the quality of your solutions and offer new possibilities for your data pipelines. Whether used with traditional ingestion pipelines or integrated with the Outbox Pattern, Change Data Capture is a key strategy to unlock the full potential of your data!
