Designing Cost-Efficient Change Data Capture on AWS Serverless

Jamil Najafov
Insider Engineering
6 min read · Aug 14, 2023
Capturing changed data bits.

What is CDC?

One of the definitions of CDC is “In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed (the “deltas”) so that action can be taken using the changed data.”

We can deliver value to our partners by taking action on product data changes. While there are pre-existing solutions to track changes, since we fully own our data pipeline, we can build a precise solution that tracks only necessary data to reduce costs and improve performance.

Why do we need to track product data changes?

The value of change tracking can be realized in many ways. Common examples of value-delivering changes are the price-drop and back-in-stock scenarios used by Journey Builder. The system can match other information with changes in price and stock level, and send notifications to users. For example, a campaign management system can notify e-commerce users when the price of a product in their cart or wishlist drops, or when the product comes back in stock.

Product data tends to have different update frequencies for different fields. There may be different streams updating the same product. Since data comes from multiple sources, it is essential to build a pipeline that can integrate them all.

Besides, other teams may need frequently updated fields to be reflected in their data sources and calculations. Instant updates of price and stock status are needed to display the correct products in search and recommendations.

Application-level CDC

The system is designed as a serverless CDC on AWS Lambda and DynamoDB services.

Three computational components work in tandem to achieve this.

  • Change Subscription API manages subscriptions that define the criteria for changes and the corresponding hook URL where the matching changes will be sent. Subscriptions are stored in a database.
  • Change Detection tracks product updates and detects changes in the tracked fields, then sends the changes to the Kinesis stream.
  • Change Delivery evaluates changes and subscription criteria, and sends the triggering changes to the subscription’s hook URL.
The Change Detection pipeline consumes product catalog updates.
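
To make the serverless flow concrete, here is a minimal skeleton of a Change Detection Lambda handler triggered by the product update Kinesis stream. It is only a sketch under assumptions (field names, logging), and the steps it hints at are described in the sections below.

import base64
import json

def lambda_handler(event, context):
    # Lambda delivers Kinesis records base64-encoded in event["Records"].
    for record in event["Records"]:
        update = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # In the real pipeline the update would now be filtered, validated,
        # deduplicated, compared with the stored tracked item, and any
        # detected change written to the downstream Kinesis stream.
        print("received product update", update.get("product_id"))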

Implementing application-level Change Detection, Change Subscription API, and Change Delivery system on AWS Serverless Lambda and DynamoDB allowed for a greater level of flexibility and reduced costs.

  • Serverless CDC. No need to run a compute instance for continuous CDC.
  • Subscription-aware CDC. We can narrow down change tracking to a subset of data that has subscriptions. This enables us to fully leverage the pay-as-you-go model to reduce operational costs.
  • DB-agnostic CDC. No DB or proprietary connector dependency, everything is handled using simple queues/streams and compute logic.
  • Polished data storage and data access patterns. Can easily migrate to Kubernetes / HBase / Kafka or any other compute-storage-stream trio that meets scalability requirements.

A custom subscription language was developed so that consumers can subscribe only to the necessary subset of changes. Technologies had to be chosen and challenges had to be tackled to guarantee the durable and correct delivery of those changes. The whole lifecycle of each change was logged to enable precise tracking of its detection and delivery. The following sections describe these matters in more detail.

{
  "partner_id": ...,
  "change_condition": {
    "logic": "and",
    "groups": [
      {
        "logic": "or",
        "expressions": [
          {
            "field": "price.EUR",
            ...
            "function": "percentage_down"
          }
        ],
        ...
      }
    ]
  },
  "hook": "www.example-hook.com/listener"
}

(A sample subscription with a change condition for the price-drop scenario. The combination of multiple expression groups with AND/OR enables constructing complex expressions that cover many cases.)
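
As an illustration only, a change condition like the one above could be evaluated roughly as follows. The percentage_down semantics, the value threshold field, and the flattened dotted field keys are assumptions, not the exact production logic.

def evaluate_expression(expr, old, new):
    # Evaluate a single expression; fields are assumed to be flattened
    # into dotted keys such as "price.EUR".
    field = expr["field"]
    old_val, new_val = old.get(field), new.get(field)
    if old_val is None or new_val is None:
        return False
    if expr["function"] == "percentage_down":
        threshold = expr.get("value", 0)  # e.g. 10 means "dropped by at least 10%"
        return new_val < old_val and (old_val - new_val) / old_val * 100 >= threshold
    return False  # unknown functions never match

def evaluate_condition(condition, old, new):
    # Combine expression groups with the top-level AND/OR logic.
    def eval_group(group):
        results = (evaluate_expression(e, old, new) for e in group["expressions"])
        return all(results) if group["logic"] == "and" else any(results)

    group_results = (eval_group(g) for g in condition["groups"])
    return all(group_results) if condition["logic"] == "and" else any(group_results)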

Choosing the right storage

Durable, consistent, and fast storage is needed for efficient change tracking.

Since tracked changes do not contain relational data and do not require complex queries, RDS was off the table. Change data can be on the heavier side, so in-memory databases like Redis/ElastiCache were not suitable either. Considering latency, speed, and scalability, DynamoDB with the DynamoDB Accelerator (DAX) option had the upper hand over MongoDB. DynamoDB transactions also provide the strong consistency needed in a change tracking system, and DynamoDB autoscaling keeps capacity in line with load.

Redis was used for the deduplication of incoming records; it was chosen for its low latency and high throughput on small keys. Deduplication is covered in more detail in the exactly-once semantics section.

Challenges: Exactly-once semantics and 3 levels of deduplication

The change notifications can have a direct impact on users, so it is important to ensure that no duplicate campaign-related notifications are sent to a single user. To ensure that each change is delivered exactly once, deduplication measures have been implemented at multiple layers.

First, the Kinesis partition key is chosen so that all records related to the same product land on the same shard. This prevents duplicates of the same product from being processed in parallel, reducing the risk of race conditions.
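
A minimal sketch of how such partitioning might look with boto3; the stream name and the product_id field are assumptions for illustration.

import json
import boto3

kinesis = boto3.client("kinesis")

def publish_product_update(update):
    # Keying by product id keeps all updates of the same product on the
    # same shard, so they are processed sequentially rather than in parallel.
    kinesis.put_record(
        StreamName="product-catalog-updates",  # hypothetical stream name
        Data=json.dumps(update).encode("utf-8"),
        PartitionKey=str(update["product_id"]),
    )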

Duplicates within the same batch are always removed, and the remaining records are subjected to conditional deduplication.

Change Detection must always be aware of the last state of an item, so it implements conditional deduplication by checking the modification timestamp of the incoming record and rejecting it if the timestamp is not newer than the item's last processed state.
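
Combined with the Redis store mentioned earlier, this check could look roughly like the sketch below; the key layout and connection details are assumptions. A plain GET/SET is sufficient here only because same-product records are never processed in parallel.

import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical connection details

def is_newer_and_mark(product_id, modified_at):
    # Accept the record only if its modification timestamp is newer than the
    # last one processed for this product; otherwise reject it as a duplicate
    # or out-of-order record.
    key = f"last_modified:{product_id}"  # hypothetical key layout
    last = r.get(key)
    if last is not None and modified_at <= int(last):
        return False
    r.set(key, modified_at)
    return True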

Additionally, the listener APIs may also implement deduplication measures.

Challenges: Field-level modification time tracking

The product modification time is stored in the change data, as well as the modification time of each field. This design lets consumers identify the last changed field while still providing the last known values of the other fields. On one hand, it accommodates situations where multiple fields change at the same instant; on the other hand, including the other tracked fields in the change payload, even if they are not modified, saves consumers follow-up lookups and helps minimize the overall system load.
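
A hypothetical change payload illustrating the idea; all field and key names here are assumptions, not the actual schema.

change_item = {
    "product_id": "P-123",
    "modified_at": 1692000000,        # when the product record last changed
    "changed_fields": ["price.EUR"],  # the fields that triggered this change
    "fields": {
        "price.EUR": {"old": 49.9, "new": 39.9, "modified_at": 1692000000},
        "stock_status": {"old": "in_stock", "new": "in_stock", "modified_at": 1691500000},
    },
}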

Challenges: Change lifecycle visibility

Change lifecycle visibility ensures that updates made to the Product Catalog are delivered to the subscribers. Every update record must be trackable at any point of the pipeline. Here are the major steps that affect a record's delivery in the Product Catalog Change pipeline.

It starts when a product update is written to the Kinesis stream. Change Detection reads the stream, then filters, validates, deduplicates, and converts the input to the tracked item format. It compares the new tracked item with its previously recorded state to identify changes in the tracked fields. When a change is detected, a ChangeItem, containing the original record as well as the old and new values, is written to the downstream Kinesis stream.
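
The comparison step can be sketched as a simple field-by-field diff. This is an illustration under assumed flat field keys, not the production implementation.

def detect_changes(previous, incoming, tracked_fields):
    # Return the old/new values for every tracked field whose value changed
    # between the previously recorded state and the incoming tracked item.
    changes = {}
    for field in tracked_fields:
        old_val, new_val = previous.get(field), incoming.get(field)
        if old_val != new_val:
            changes[field] = {"old": old_val, "new": new_val}
    return changes

# Example: a EUR price drop is detected, stock status is unchanged.
# detect_changes({"price.EUR": 49.9, "stock_status": "in_stock"},
#                {"price.EUR": 39.9, "stock_status": "in_stock"},
#                ["price.EUR", "stock_status"])
# -> {"price.EUR": {"old": 49.9, "new": 39.9}}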

Change Delivery deduplicates incoming change items, completes missing fields, and evaluates the completed records against the active subscriptions in the Subscription API. This way it triggers subscription hooks for matching changes.
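
Triggering a hook ultimately boils down to an HTTP POST of the change payload to the subscription's URL; the payload shape, headers, and timeout below are assumptions for illustration.

import json
import urllib.request

def trigger_hook(change_item, hook_url):
    # POST a change that already matched a subscription's criteria to the
    # subscription's hook URL; failures raise and would typically be retried.
    request = urllib.request.Request(
        url=hook_url,
        data=json.dumps(change_item).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status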

Throughout the process, detailed logs of all the steps above are written to the Kinesis stream and are queryable through Athena.
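
Tracing a single change through the pipeline can then be a matter of one Athena query over those logs; the database, table, and column names below are purely hypothetical.

import boto3

athena = boto3.client("athena")

def trace_change(change_id):
    # Kick off an Athena query that lists every logged step for one change.
    query = (
        "SELECT step, status, logged_at "
        "FROM change_lifecycle_logs "  # hypothetical table name
        f"WHERE change_id = '{change_id}' "
        "ORDER BY logged_at"
    )
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "cdc_logs"},  # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    return response["QueryExecutionId"]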

Hourly Changes Triggered Chart. Delivering over 100 thousand changes an hour, and ready to scale.

Summary

Navigating the complexities of data change can be difficult, but with the correct tools and approach, it becomes a task that is not only manageable but also very efficient. To handle the complicated process of Change Data Capture, our solution leverages the power and flexibility of AWS Serverless, along with novel custom-built capabilities. This powerful system ensures accurate real-time tracking and distribution of data changes, allowing for the smooth integration of updates from diverse sources.

The process included careful storage option selection, the creation of a unique subscription language, and the development of effective mechanisms to achieve exactly-once semantics. These processes enabled us to create a scalable, efficient, and cost-effective data pipeline that benefits our partners and users.

Follow us on the Insider Engineering Blog to read more about our AWS solutions at scale and engineering stories.
