DynamoDb Incremental Backups

Abhaya Chauhan · Published in pageup-tech · 9 min read · Mar 14, 2017

DynamoDb is a fully managed AWS NoSQL service that provides a fast and predictable data store. We’ve been using it for several microservices over the past 18 months.

However, one feature is sorely missed: incremental backups.

AWS provides an option to take snapshots of your table using a service called Data Pipeline.

At a high level, what this does is:

1. Create an EMR (Elastic Map-Reduce) cluster
2. Perform a parallel full scan of the table in question (while consuming read units) into JSON data
3. Upload this JSON data to S3 or similar storage

DynamoDb to S3 Template in Data Pipeline Architect

The issue I have with this?

The backup is not a “point in time” snapshot: it essentially scans the table (which can take hours) while the table is still live, so the state it captures is arbitrary.

Our requirement for RPO (Recovery Point Objective) is 30 minutes, which basically means that if shit hits the fan we can lose at most 30 minutes of data (in the worst case). This is our contractual agreement with our clients.

Investigating ways to solve this problem

This led us to create incremental backups for DynamoDb, stored in a versioned S3 bucket.

DynamoDb Incremental Backups to S3

I’m not going to delve into DynamoDb itself too much. If you’re reading this blog post, I’ll assume you know about DynamoDb, are looking to use it, or are already using it.

DynamoDb Streams

Let’s delve into the DynamoDb Stream. DynamoDb Streams allow you to capture mutations on the data within the table. In other words, they capture item changes at the point in time when they occurred.

DynamoDB Streams — High Level

This feature enables a plethora of possibilities such as data analysis, replication, triggers, and backups. It’s very simple to enable (as simple as flipping a switch), and it gives you an ordered list of table events for a 24-hour window.

When you enable the stream, you’ll have four options:

  • Keys only — only the key attributes of the modified item.
  • New image — the entire item, as it appears after it was modified.
  • Old image — the entire item, as it appeared before it was modified.
  • New and old images — both the new and the old images of the item.

For our use, we’ll need to enable the New and Old images.
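For reference, here’s a minimal sketch of flipping that switch programmatically with the AWS SDK for JavaScript; the table name and region are placeholders.

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Enable the stream with both new and old images on an existing table.
dynamodb.updateTable({
  TableName: 'my-table', // placeholder table name
  StreamSpecification: {
    StreamEnabled: true,
    StreamViewType: 'NEW_AND_OLD_IMAGES'
  }
}, function (err, data) {
  if (err) return console.error(err);
  console.log('Stream ARN:', data.TableDescription.LatestStreamArn);
});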

Let’s walk through an example of the sort of data you will see. Use case:

1. INSERT record
2. UPDATE record
3. DELETE record

The DynamoDb Stream will contain these events:

https://github.com/PageUpPeopleOrg/dynamodb-replicator/blob/master/test/fixtures/events/insert-modify-delete.json

As you can see, for an INSERT event, we rely on the NEW image (there is no old image).

For a DELETE event, we rely on the OLD image (there is no new image).

Keep in mind, these events are only guaranteed to be available in the stream for 24 hours. After that, they can be cleaned out at any time.

To access the stream, there is a separate DynamoDB Streams API available. Under the covers, it works very much like Kinesis Streams. We’re not going to delve into this as it is quite involved, but we may revisit it in a later blog post.

If you’re interested, feel free to check out: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html#Streams.Processing.
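To give a flavour of what that API looks like, here’s a rough sketch of reading records directly from a stream. The stream ARN is a placeholder, and error handling and shard iteration are omitted for brevity.

var AWS = require('aws-sdk');
var streams = new AWS.DynamoDBStreams({ region: 'us-east-1' });
var streamArn = 'arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/label'; // placeholder

// Describe the stream to discover its shards, then read from the oldest record.
streams.describeStream({ StreamArn: streamArn }, function (err, data) {
  var shardId = data.StreamDescription.Shards[0].ShardId;
  streams.getShardIterator({
    StreamArn: streamArn,
    ShardId: shardId,
    ShardIteratorType: 'TRIM_HORIZON'
  }, function (err, iter) {
    streams.getRecords({ ShardIterator: iter.ShardIterator }, function (err, result) {
      console.log(result.Records); // the INSERT / MODIFY / REMOVE events
    });
  });
});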

Lambda

AWS Lambda is an exciting new service that promises to change the way the cloud is perceived, through the evolution of serverless architecture.

AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. You can use AWS Lambda to extend other AWS services with custom logic, or create your own back-end services that operate at AWS scale, performance, and security. AWS Lambda can automatically run code in response to multiple events, such as modifications to objects in Amazon S3 buckets or table updates in Amazon DynamoDB.

The interesting thing Lambda enables is in that last line: the ability to turn on “distributed database triggers” for your DynamoDb tables. I shuddered when I realised what this could do, and feared the pain it could unleash on the world… but with great power comes great responsibility.

What does the Lambda function do?

Essentially, a “Lambda function” is code that we provide to Lambda, which can be triggered based on table updates in DynamoDB.

Tying this back to DynamoDb Streams, we can associate our Lambda function with a DynamoDb table (under the covers, Lambda simply polls the table’s DynamoDb Stream).
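That association can be made in the console, or with something like the following sketch; the stream ARN and function name are placeholders.

var AWS = require('aws-sdk');
var lambda = new AWS.Lambda({ region: 'us-east-1' });

// Point the Lambda function at the table's stream; Lambda then polls the
// stream and invokes the function with batches of records.
lambda.createEventSourceMapping({
  EventSourceArn: 'arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/label', // placeholder
  FunctionName: 'dynamodb-incremental-backup', // placeholder
  StartingPosition: 'TRIM_HORIZON',
  BatchSize: 100
}, function (err, data) {
  if (err) return console.error(err);
  console.log('Event source mapping created:', data.UUID);
});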

Lambda currently only supports Python, Node.js and Java, with more languages on the horizon.

Code example

When an event is available, it is passed to the Lambda function to execute.

exports.myHandler = function(event, context, callback) {
    // Process the incoming DynamoDb Stream records.
    console.log('Received event:', JSON.stringify(event));

    // Use callback() and return information to the caller.
    callback(null, 'Processed ' + event.Records.length + ' records.');
};

In the syntax, note the following:

  • event – AWS Lambda uses this parameter to pass in event data to the handler.
  • context – AWS Lambda uses this parameter to provide your handler the runtime information of the Lambda function that is executing. For more information, see The Context Object (Node.js).
  • callback – You can use the optional callback to return information to the caller; otherwise the return value is null. For more information, see Using the Callback Parameter.

For our solution, the Lambda function doesn’t need to do much, and really shouldn’t. This needs to be reliable, and self-healing.

All we want to do is take the event that is passed from the DynamoDb Stream and push it to the appropriate location in an S3 Bucket.

Check out the source code for the Lambda function here: GitHub Repository

Let’s step through an example of what it will do with this event:

{
  "Records": [
    {
      "eventName": "INSERT",
      "eventVersion": "1.0",
      "eventSource": "aws:dynamodb",
      "dynamodb": {
        "NewImage": {
          "range": { "N": "1" },
          "id": { "S": "record-1" },
          "val": { "B": "aGVsbG8=" },
          "map": {
            "M": {
              "prop": { "B": "aGVsbG8=" }
            }
          },
          "list": {
            "L": [
              { "S": "string" },
              { "B": "aGVsbG8=" }
            ]
          },
          "bufferSet": {
            "BS": [ "aGVsbG8=" ]
          }
        },
        "SizeBytes": 26,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
        "SequenceNumber": "111",
        "Keys": {
          "id": { "S": "record-1" }
        }
      },
      "eventID": "1",
      "eventSourceARN": "arn:aws:dynamodb:us-east-1:123456789012:table/fake",
      "awsRegion": "us-east-1"
    }
  ]
}

For each event that was passed to it:

  1. Calculate the key (aka filename) to be used for this event. If PlainTextKeyAsFilename is enabled, it will use the following format: “HASH | RANGE”, otherwise it will calculate an MD5 hash of the keys. MD5 is used to minimise hotspots in S3.
  2. Find the table name of the source event.
  3. If MultiTenancyColumn is enabled, find the value of that attribute and use it as part of the prefix in S3 (i.e. it can be used to separate client data in S3).
  4. Figure out what sort of event this is (PUT/REMOVE).
  5. Build up a request for S3 using the above information and send it over. The body of the request will be the NewImage property inside the event if it is a PUT; otherwise it will be empty.
  6. The prefix used in S3 will be: [process.env.BackupPrefix]/[TableName from 2]/[MultiTenancyId if enabled from 3]/[Key from 1]. A simplified sketch follows the parameter list below.

The following parameters can be passed in to the process:

  1. BackupPrefix — Prefix used for all backups
  2. BackupBucket* — Location of the versioned S3 bucket
  3. PlainTextKeyAsFilename — Whether to use plain text keys as Filename (beware of hotspots created)
  4. MultiTenancyColumn — The attribute name which holds the multitenancy identifier

* required
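To make steps 1 to 6 concrete, here is a simplified sketch of a handler along these lines. It is not the actual code from the repository, and it skips the PlainTextKeyAsFilename and MultiTenancyColumn options for brevity; the environment variable names follow the parameters above.

var AWS = require('aws-sdk');
var crypto = require('crypto');
var s3 = new AWS.S3();

exports.handler = function (event, context, callback) {
  var requests = event.Records.map(function (record) {
    // Step 2: table name from the source ARN (arn:aws:dynamodb:...:table/NAME/stream/...)
    var table = record.eventSourceARN.split('/')[1];

    // Step 1: MD5 of the keys, to minimise hotspots in S3
    var filename = crypto.createHash('md5')
      .update(JSON.stringify(record.dynamodb.Keys))
      .digest('hex');

    // Step 6: BackupPrefix/TableName/Key
    var key = [process.env.BackupPrefix, table, filename].join('/');

    // Steps 4 & 5: REMOVE events become S3 DELETEs (the versioned bucket keeps
    // a delete marker); INSERTs and MODIFYs store the new image as-is.
    if (record.eventName === 'REMOVE') {
      return s3.deleteObject({ Bucket: process.env.BackupBucket, Key: key }).promise();
    }
    return s3.putObject({
      Bucket: process.env.BackupBucket,
      Key: key,
      Body: JSON.stringify(record.dynamodb.NewImage)
    }).promise();
  });

  Promise.all(requests)
    .then(function () { callback(null, 'Backed up ' + requests.length + ' records'); })
    .catch(callback);
};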

What about the unhappy path?

In the case that something goes wrong at the Lambda Function, here’s what happens:

Let’s say, for example, a PUT request to S3 times out.

Lambda is smart enough to continue retrying the same event in the DynamoDb stream until it passes.

This happens at one-minute intervals. CloudWatch monitors the function by default, and when an error is detected it will log the error, which allows you to trigger alarms / further actions using SNS (i.e. email notifications to the team).
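As an example, an alarm along these lines would notify the team whenever the function reports errors; the alarm name, function name and SNS topic are placeholders.

var AWS = require('aws-sdk');
var cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

// Alarm on the built-in AWS/Lambda Errors metric and notify an SNS topic.
cloudwatch.putMetricAlarm({
  AlarmName: 'dynamodb-backup-lambda-errors', // placeholder
  Namespace: 'AWS/Lambda',
  MetricName: 'Errors',
  Dimensions: [{ Name: 'FunctionName', Value: 'dynamodb-incremental-backup' }], // placeholder
  Statistic: 'Sum',
  Period: 300,
  EvaluationPeriods: 1,
  Threshold: 1,
  ComparisonOperator: 'GreaterThanOrEqualToThreshold',
  AlarmActions: ['arn:aws:sns:us-east-1:123456789012:backup-alerts'] // placeholder SNS topic
}, function (err) {
  if (err) console.error(err);
});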

DynamoDb Streams: In a little more detail

Events in a DynamoDb Stream are distributed between shards.

Make up of a DynamoDB Stream — Shards

Similar to partitions for DynamoDB tables, data is sharded. Shards can become quite complex because they can split and merge, but let’s not go into that.

Within a shard, events have explicit sequence numbers (in other words, events are ordered within a shard), so if an event times out, processing cannot proceed to the next event. Lambda will retry that event until it succeeds: it retries immediately, then backs off to retrying every minute.

If the Lambda Function is configured correctly, all events should be stored in the S3 bucket successfully.

The only misconfigurations you can have are:

  1. Incorrect Bucket
  2. Permissions
  3. An incorrect multitenancy column specified (if it is not part of the key, this can be dangerous as the attribute isn’t required)

In our test runs, we have only seen S3 PUT timeouts on one occasion, and that fixed itself with a retry.

S3

As mentioned above, the S3 bucket should have versioning enabled if you wish to create incremental backups. This allows you to roll back to any point in time.
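Enabling versioning is a one-off call (or a checkbox in the console); here’s a sketch with a placeholder bucket name.

var AWS = require('aws-sdk');
var s3 = new AWS.S3();

// Once versioning is on, every PUT creates a new version and every DELETE
// leaves a delete marker, so nothing is ever truly lost.
s3.putBucketVersioning({
  Bucket: 'my-dynamodb-backups', // placeholder bucket name
  VersioningConfiguration: { Status: 'Enabled' }
}, function (err) {
  if (err) console.error(err);
});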

PUT operation on S3 Versioned Bucket

DELETE on S3 Versioned Bucket

What we’ve created here is basically version control for our DynamoDb data. Essentially an immutable store in S3, which allows us to store all data that was ever in DynamoDb — very cost effectively.

NOTE: We found this to be cost effective, but I strongly advise everyone to double check for their scenario.

In this post, we will walk through the restore step, and I'll be the first to admit this can be taken much further than I have. I haven't had the time or need to take it as far as I would have liked, but don't let this stop you! I would love to hear from you if you have done something interesting with this, e.g. automating your DR / backup restore testing.

For our DynamoDb incremental backups solution, we have incremental backups stored in S3. The data is stored in the native DynamoDb format, which is very handy: it allows us to push it back into DynamoDb with no transformation.

Each key (or file) stored in an S3 versioned bucket is a snapshot of a row at a point in time. This allows us to be selective in what we restore. It also provides a human-readable audit log!

S3 has an API available which allows us to scan the list of backups that are available:

Get Bucket Object Versions

Leveraging this, we can build a list of data that we would like to restore. This could range from a single row at a point in time to an entire table at a point in time.
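For illustration, here’s a rough sketch of building that list for a table prefix at a given point in time. The bucket, prefix and timestamp are placeholders, and pagination and delete markers are ignored for brevity.

var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var pointInTime = new Date('2017-03-01T10:00:00Z'); // the moment to restore to (placeholder)

s3.listObjectVersions({
  Bucket: 'my-dynamodb-backups', // placeholder
  Prefix: 'backups/my-table/'    // placeholder: BackupPrefix/TableName/
}, function (err, data) {
  if (err) return console.error(err);

  // For each key, keep the newest version written at or before the point in time.
  var latest = {};
  data.Versions
    .filter(function (v) { return v.LastModified <= pointInTime; })
    .forEach(function (v) {
      if (!latest[v.Key] || v.LastModified > latest[v.Key].LastModified) latest[v.Key] = v;
    });

  console.log(Object.keys(latest).length + ' item versions selected for restore');
});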

There are a range of tools which allow us to restore directly from these backups in S3:

Dynamo Incremental Restore
The first option allows you to specify a point in time for a given prefix (folder location) in S3. The workflow is:

1. Scan all the available data using the version list in S3.
2. Build a list of the data that needs to be updated.
3. Download the file(s) required from #2, and push them to DynamoDb.

Dynamo Migrator

DynamoDb Replicator
A snapshot script that scans an S3 folder where incremental backups have been made, and writes the aggregate to a file on S3, providing a snapshot of the backup's state.
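As a final illustration of why the native format is so handy, restoring a single backed-up item is little more than an S3 GET followed by a PutItem. The bucket, key, version id and table name below are placeholders.

var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Fetch a specific version of the backed-up item and write it straight back:
// the S3 object body is already in native DynamoDb format.
s3.getObject({
  Bucket: 'my-dynamodb-backups',            // placeholder
  Key: 'backups/my-table/9f86d081884c7d65', // placeholder
  VersionId: 'example-version-id'           // placeholder
}, function (err, data) {
  if (err) return console.error(err);
  dynamodb.putItem({
    TableName: 'my-table', // placeholder target table
    Item: JSON.parse(data.Body.toString())
  }, function (err) {
    if (err) console.error(err);
  });
});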

We haven't had any issues with our incremental backups, but the next step would be to automate the DR restore at a regular interval to ensure it provides the protection you are looking for.

We’ve also started discussing creating a synthetic transaction periodically to ensure everything remains in working order. Stay tuned for this.
