AWS DynamoDB Streams — Change Data Capture for DynamoDB Tables

How do you enable a change stream for a DynamoDB table? What do the change events contain? How do you read and process a DynamoDB stream?

Dunith Danushka
Tributary Data
7 min read · Jun 6, 2021


DynamoDB provides two Change Data Capture (CDC) approaches to capture changes from a table. In this post, I’ll introduce you to DynamoDB streams as a reliable CDC mechanism, how it works, and what ways you can consume change events from it.

The problem

Imagine your application inserts an entry (a user object) into a DynamoDB table. A Lambda function is interested in processing that record to send the new user a welcome email.

How would you implement that?

There are two options:

  1. The application makes a dual write: it writes to DynamoDB and also notifies the Lambda function, perhaps by publishing an event to an SNS topic.
  2. The Lambda function polls the DynamoDB table periodically.

Both approaches are complicated or inefficient: a dual write can fail halfway and leave the two systems inconsistent, while polling adds latency and wastes read capacity.

What if the DynamoDB table streams the item-level changes being made?

The solution

Fortunately, DynamoDB provides a Change Data Capture (CDC) mechanism for each table.

That means, if someone creates a new entry or modifies an item in a table, DynamoDB immediately emits an event containing the information about the change. You can build applications that consume these events and take action based on the contents.

Row-level changes are streamed

DynamoDB keeps track of every modification to data items in a table and streams them as real-time events. Currently, there are two streaming models.

  1. Change events are written to a Kinesis Data Stream of your choice.
  2. Change events are written to a DynamoDB Stream with a unique ARN.

You can enable both streaming models on the same DynamoDB table.
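For the Kinesis model, you point the table at an existing Kinesis Data Stream. Here is a minimal sketch using boto3 (the table name and Kinesis stream ARN are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-west-2")

# Route the table's change events to an existing Kinesis Data Stream.
dynamodb.enable_kinesis_streaming_destination(
    TableName="TestTable",
    StreamArn="arn:aws:kinesis:us-west-2:111122223333:stream/my-table-changes",  # placeholder ARN
)
```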


In both models, change events are written to the stream asynchronously, without affecting the performance of the table. Events in both streams are also encrypted at rest.

Although the two streaming models have some similarities, they differ in event retention time, scaling options, ordering of events, and pricing. You can find a detailed comparison between the two streaming models in this article.

In this article, I’ll talk only about capturing changes into a unique DynamoDB stream.

Change Data Capture for DynamoDB Streams

Essentially, a DynamoDB stream is a time-ordered flow of information about item-level modifications in a DynamoDB table. When you enable a stream on a table, DynamoDB records every modification to items in the table and appends them to a log.

Every stream-enabled DynamoDB table gets a dedicated log that keeps the change information for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real-time.

For each change stream coming out of a table, DynamoDB ensures the following:

  • Each stream record appears exactly once in the stream.
  • For each item modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.

Anatomy of a DynamoDB stream

Contents of a stream record

Whenever items are created, updated, or deleted in a table, DynamoDB Streams writes a stream record with the primary key attributes of the modified items.

A stream record contains information about a data modification to a single item in a DynamoDB table. Also, each record has a sequence number, reflecting the order in which the record was published to the stream.

When a change record is written to the stream, you can control which information the record should contain. There are four options to choose from:

  • Keys only — Only the key attributes of the modified item.
  • New image — The entire item, as it appears after it was modified.
  • Old image — The entire item, as it appeared before it was modified.
  • New and old images — Both the new and the old images of the item.

For example, when the New and old images option is selected, an item update in your source table produces a MODIFY event containing the Keys, NewImage, and OldImage sections.
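Here is a sketch of what such a record looks like, based on the documented stream record format (the event ID, key schema, and attribute values are illustrative):

```json
{
  "eventID": "7de3041dd709b024af6f29e4fa13d34c",
  "eventName": "MODIFY",
  "eventVersion": "1.1",
  "eventSource": "aws:dynamodb",
  "awsRegion": "us-west-2",
  "dynamodb": {
    "ApproximateCreationDateTime": 1622980800,
    "Keys": {
      "UserId": { "S": "u-1001" }
    },
    "NewImage": {
      "UserId": { "S": "u-1001" },
      "Status": { "S": "ACTIVE" }
    },
    "OldImage": {
      "UserId": { "S": "u-1001" },
      "Status": { "S": "PENDING" }
    },
    "SequenceNumber": "13021600000000001596893679",
    "SizeBytes": 59,
    "StreamViewType": "NEW_AND_OLD_IMAGES"
  }
}
```

Notice how the NewImage and OldImage sections let a consumer see the item both before and after the update.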

Shards

Stream records are organized into groups, or shards. Each shard acts as a container for multiple stream records and contains information required for accessing and iterating through these records. The stream records within a shard are removed automatically after 24 hours.

Shards allow multiple consumers to read stream records in parallel to increase the processing throughput. Default limits allow up to 2 simultaneous consumers per shard.

The following diagram shows the relationship between a stream, shards in the stream, and stream records in the shards.

Courtesy of AWS DynamoDB documentation

Enabling/disabling a stream

You can enable a stream on a new or existing table, disable a stream, or change a stream's settings in any of the following ways:

  1. AWS Management Console — The most straightforward approach.
  2. AWS CLI — For automation and scripting.
  3. AWS SDK — For programmatic control.

AWS documentation provides more information on this.
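As a minimal sketch of the SDK route using boto3 (assuming the TestTable example used later in this post):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-west-2")

# Turn the stream on, capturing both old and new item images.
dynamodb.update_table(
    TableName="TestTable",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)

# Disabling it later only requires StreamSpecification={"StreamEnabled": False}.
```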

Each stream gets a unique ARN

DynamoDB creates a new stream with a unique stream descriptor when a stream is enabled on a table. If you disable and re-enable a stream on the table, a new stream is created with a different stream descriptor.

Every stream is uniquely identified by an Amazon Resource Name (ARN). The following is an example ARN for a stream on a DynamoDB table named TestTable.

arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291
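If you need to look the ARN up programmatically, it is part of the table description; a quick boto3 sketch:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-west-2")

# LatestStreamArn is present once a stream has been enabled on the table.
table = dynamodb.describe_table(TableName="TestTable")["Table"]
print(table["LatestStreamArn"])
```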

Processing a DynamoDB stream

Before processing a DynamoDB stream, the consumer must locate it.

Locating a DynamoDB stream

The naming convention for a DynamoDB stream endpoint is streams.dynamodb.<region>.amazonaws.com. If your DynamoDB table is located in the us-west-2 region, you must use the same region to access its DynamoDB stream.

That results in streams.dynamodb.us-west-2.amazonaws.com.

To access a stream and process the stream records within, you must do the following:

  • Find out the unique ARN of the stream.
  • Determine which shards in the stream contain the stream records that you are interested in.
  • Access the shards and retrieve the stream records that you want.
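To make those three steps concrete, here is a minimal sketch against the low-level DynamoDB Streams API using boto3, reusing the example stream ARN from above (a production consumer would also walk the shard lineage and keep polling open shards):

```python
import boto3

STREAM_ARN = (
    "arn:aws:dynamodb:us-west-2:111122223333:"
    "table/TestTable/stream/2015-05-11T21:21:33.291"
)

# Note the dedicated streams client, separate from the table client.
streams = boto3.client("dynamodbstreams", region_name="us-west-2")

# Steps 1 and 2: describe the stream to discover its shards.
shards = streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]["Shards"]

for shard in shards:
    # Step 3: get an iterator at the oldest available record in the shard.
    iterator = streams.get_shard_iterator(
        StreamArn=STREAM_ARN,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while iterator:
        response = streams.get_records(ShardIterator=iterator)
        for record in response["Records"]:
            print(record["eventName"], record["dynamodb"]["Keys"])
        if not response["Records"]:
            break  # open shards keep returning iterators; stop once drained
        iterator = response.get("NextShardIterator")
```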

To handle those steps for you, DynamoDB provides two higher-level options.

Using DynamoDB Streams Kinesis Adapter

The DynamoDB documentation says:

Using the Amazon Kinesis Adapter is the recommended way to consume streams from Amazon DynamoDB. The DynamoDB Streams API is intentionally similar to that of Kinesis Data Streams, a service for real-time processing of streaming data at massive scale. In both services, data streams are composed of shards, which are containers for stream records. Both services’ APIs contain ListStreams, DescribeStream, GetShards, and GetShardIterator operations.

As a DynamoDB user, you can reuse the semantics of Kinesis data streams to process a DynamoDB stream.

The AWS SDK provides a low-level API for accessing a Kinesis data stream. On top of that, you can use the Kinesis Client Library (KCL) to develop consumer applications. The KCL is more convenient and supports multiple languages, including Java, Node.js, .NET, Python, and Ruby.

You can write a DynamoDB consumer in the same way you access a Kinesis stream. To do this, you use the DynamoDB Streams Kinesis Adapter. The Kinesis Adapter implements the Kinesis Data Streams interface so that the KCL can be used for consuming and processing records from DynamoDB Streams.

The following diagram shows how these libraries interact with one another.

Courtesy of AWS DynamoDB documentation

To find out more on this, refer to this documentation.

Using AWS Lambda triggers

Another option is to use AWS Lambda functions to read and process change events from a DynamoDB stream.

You can associate the ARN of the DynamoDB stream with the Lambda function you write. Immediately after an item in the table is modified, a new record appears in the table’s stream. AWS Lambda polls the stream and invokes your Lambda function synchronously when it detects new stream records.

A full treatment of how Lambda works with DynamoDB streams is beyond the scope of this post. You can refer to this tutorial to find out more information.
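Still, to give a flavour of it, here is a minimal sketch of a handler for the welcome-email scenario from the beginning of this post. It assumes the stream's view type includes new images, and send_welcome_email is a hypothetical helper:

```python
def send_welcome_email(address):
    # Hypothetical helper; in practice this might call SES or another email service.
    print(f"Welcome email sent to {address}")

def handler(event, context):
    # Lambda delivers stream records in batches under event["Records"].
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            new_image = record["dynamodb"]["NewImage"]
            # Stream images use the DynamoDB wire format, e.g. {"S": "..."}.
            send_welcome_email(new_image["Email"]["S"])
```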

Practical use cases of DynamoDB streams

The following are some examples of what you can build with DynamoDB streams:

  1. Instantly invalidate a cache.
  2. Update a materialized view to reflect the mutations done to the write side.
  3. Update a search index in real-time.
  4. Update a dashboard in real-time.
  5. Perform aggregations on the fly and save them back to another table.
  6. Implement the Transactional Outbox pattern in microservices.

Summary and where to next?

Here, I have barely scratched the surface of what DynamoDB streams can do. In a production setting, there are a few factors you should consider. For example, you must determine how to scale your KCL workers or Lambda functions according to the write and read throughput of your DynamoDB tables. This post provides an excellent overview of scaling up DynamoDB streams.

Also, when consuming a stream, you must put in extra effort to deal with errors. There can be partial failures while processing a batch of stream records, and you can't simply tell DynamoDB that you failed to process three records out of a hundred; there's nothing it can do about it.
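One common mitigation, sketched below under the assumption that you have a dead-letter queue to park bad records in (the queue URL is a placeholder), is to catch per-record failures yourself so one poison record doesn't force the whole batch to be retried:

```python
import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-west-2.amazonaws.com/111122223333/stream-dlq"  # placeholder

def process(record):
    ...  # your business logic goes here

def handler(event, context):
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # Park the failing record and move on instead of failing the batch.
            sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(record))
```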

So I suggest you read this excellent article by Anahit Pogosova on dealing with batch operations and partial failures.
