Dynamodbdump: easy DynamoDB backups using Docker and Go

Introduction

If you’re like us, you’re not a big fan of losing data. Do you store data in DynamoDB? Then you surely want to back it up.

And sometimes you even want to restore that data, either into a test environment or into the original environment after an “Oops, I slipped” or “Yet another service where things magically disappear” kind of situation.

We didn’t feel anything openly available was a great fit for our needs (more on why below), so we rolled our own. In this post, we explain our thinking and announce the availability of our open-sourced solution.

How to back up DynamoDB? What’s already out there

AWS Data Pipeline

AWS Data Pipeline has been the prescribed, canonical, AWS One True Way (TM) to back up DynamoDB tables since the product was introduced.

Documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb.html

How does it work?

The pipeline spins up an EMR cluster, reads the table, pushes the records to S3 files in 10MB chunks, then shuts the EMR cluster down. Restoring a backup is also done through a Data Pipeline and also uses an EMR cluster.

Pros

  • Built into AWS
  • Simply outputs a set of files in S3 that are easy to serialize and reuse

Cons

  • When you do daily backups of small tables, it is very costly: the EMR overhead is very high compared to the actual backup time
  • Has not evolved much over the years
  • You have to know the original table definition, and the table you restore into has to already exist
  • Not open-source

What you pay for

  • Data Pipelines
  • EMR clusters run time
  • S3 storage
  • DynamoDB IO

AWS on-demand backups

This brand-new functionality was just announced at re:Invent 2017 and is not generally available at the time of writing.

Documentation: https://aws.amazon.com/blogs/aws/new-for-amazon-dynamodb-global-tables-and-on-demand-backup/

How does it work?

A snapshot of the storage is taken in the background. When restoring, it’s a basic snapshot restore.

Pros

  • Built into AWS
  • Backup time is almost instantaneous
  • Backup of table structure is included

Cons

  • You can’t reuse the backup for other operations, for example to sample it or to push the data into a staging database while obfuscating the records
  • Not GA at the time of writing and not announced for every region
  • Not open-source

What you pay for

  • Snapshot storage

Dynamodump

This is an old Python 2 application that has been around for quite a while but doesn’t appear to be actively maintained.

Documentation: https://github.com/bchew/dynamodump

How does it work?

It scans the entire table and writes the records to S3, and does the reverse operation for the restore.

Pros

  • Backup of table structure is included
  • It’s just a set of files in S3 that are easy to serialize and reuse
  • Can sync tables
  • Can compress the backups
  • Open-source

Cons

  • Not compatible with Python 3
  • No unit tests
  • Not actively maintained

What you pay for

  • S3 storage
  • Instance on which the script is running

Using DynamoDB streams

There are a few open-source tools, mostly written in Node.js, that let you back up your DynamoDB tables incrementally, such as dynamodb-replicator.

Documentation: https://github.com/mapbox/dynamodb-replicator

How does it work?

It either scans the entire table when executing a full backup, or just reads the records from DynamoDB Streams when executing an incremental backup.

Pros

  • Can do incremental backups if you use streams and lambdas
  • Open-source

Cons

  • Not actively maintained
  • No Docker integration
  • Written mostly for lambda (which didn’t fit our needs)
  • Overly complex for our needs

What you pay for

  • S3 storage
  • Instance or lambda on which the script is running
  • DynamoDB streams

Dynamodbdump

Dynamodbdump is a tool written in Go that acts as a drop-in replacement for AWS Data Pipeline backups. The backups it produces are fully compatible with those done by AWS Data Pipeline. It is also capable of restoring backups into the DynamoDB table of your choice.

Documentation: https://github.com/VEVO/dynamodbdump

How does it work?

Dynamodbdump scans the entire table when executing a full backup.

Pros

  • Backs up and restores tables easily to and from S3
  • Docker images are available, and the tool has been running in our production environments for several months
  • Resource-efficient
  • Fully compatible with Data Pipeline backups
  • Open-source

Cons

  • Some features like compression are not yet implemented, but that’s the fun side of open-source projects!

What you pay for

  • S3 storage
  • If you don’t have a Kubernetes cluster, you’ll probably run the backups in ECS or EKS, or directly from a remote machine, which might incur some cost

Why we created Dynamodbdump

Most of the DynamoDB tables we back up are very small (between 10 and 5,000 records). The biggest ones are only a few gigabytes, so when we were using Data Pipelines, the cost of the EMR cluster was much higher than we were comfortable with, even after AWS switched to per-second billing. Most of the EMR clusters ran for twenty minutes with only a few seconds of actual backup; the rest was just cluster setup and teardown.

We looked for a solution that could be easily integrated as a CronJob resource in our Kubernetes cluster instead, but we figured it would be quicker to create our own so that it would fit our needs exactly (with the added bonus of having fun with Go!). We also had ideas for functionality we wanted in the near future (per-table auto-expiration of backups in S3, compression, etc.: https://github.com/VEVO/dynamodbdump/blob/master/TODO.md).

So the main requirements were quite simple:

  • it’s 2017: if you are going to build tools for remote execution, they should be containerizable
  • resource-efficient on smaller tables (which are more typical of our workloads)
  • not require schema awareness

After some time in production, we decided to open-source the code (https://github.com/VEVO/dynamodbdump) and the Docker image (https://hub.docker.com/r/vevo/dynamodbdump/), as it might be useful to others looking for an easy, pre-packaged way to save money on their backups.

How to use it?

Dynamodbdump can be used from the command line as well as from our provided Docker image via your current orchestration method (such as Kubernetes, EKS or ECS).

Here is an example of a local clone, compilation and launch of a backup of the table “mytable” into “s3://my-dynamo-backup-bucket/backups/mytable/${TIMESTAMP}”, where “${TIMESTAMP}” is the date at which the backup starts, in the format “YYYY-mm-dd-HH24-MI-SS”. The table is read in batches of 1000 records, waiting 2000ms between each batch (twice that if a “ProvisionedThroughputExceededException” is encountered):

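Roughly, it could look like the sketch below. The build steps are standard Go tooling, but the command-line flag names are only illustrative stand-ins, not the tool’s documented options; the project README lists the real ones.

    # Fetch and build the tool (GOPATH-style layout)
    go get -d github.com/VEVO/dynamodbdump
    cd "${GOPATH}/src/github.com/VEVO/dynamodbdump"
    go build

    # Launch the backup. Flag names below are illustrative placeholders --
    # check the README for the actual options.
    TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
    ./dynamodbdump \
      -action backup \
      -table mytable \
      -batch-size 1000 \
      -wait-ms 2000 \
      -s3-bucket my-dynamo-backup-bucket \
      -s3-folder "backups/mytable/${TIMESTAMP}"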

Internals of the backup:

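In essence, the backup loop is a paginated Scan: read a batch of records, buffer them, flush a chunk to S3 once the buffer is big enough, wait, repeat. Here is a simplified Go sketch of that idea using the AWS SDK; it is not the project’s actual code, and it writes plain JSON lines instead of the Data Pipeline-compatible format:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    func main() {
        sess := session.Must(session.NewSession())
        ddb := dynamodb.New(sess)
        s3c := s3.New(sess)

        timestamp := time.Now().Format("2006-01-02-15-04-05")
        var buf bytes.Buffer
        var startKey map[string]*dynamodb.AttributeValue
        part := 0

        for {
            // Read the table in batches of 1000 records.
            out, err := ddb.Scan(&dynamodb.ScanInput{
                TableName:         aws.String("mytable"),
                Limit:             aws.Int64(1000),
                ExclusiveStartKey: startKey,
            })
            if err != nil {
                log.Fatal(err)
            }

            // The real tool writes the Data Pipeline-compatible format;
            // plain JSON lines are used here only to keep the sketch short.
            for _, item := range out.Items {
                line, _ := json.Marshal(item)
                buf.Write(line)
                buf.WriteByte('\n')
            }

            lastPage := len(out.LastEvaluatedKey) == 0

            // Flush a chunk to S3 once the buffer is big enough or we are done.
            if buf.Len() >= 10*1024*1024 || lastPage {
                _, err := s3c.PutObject(&s3.PutObjectInput{
                    Bucket: aws.String("my-dynamo-backup-bucket"),
                    Key:    aws.String(fmt.Sprintf("backups/mytable/%s/%05d", timestamp, part)),
                    Body:   bytes.NewReader(buf.Bytes()),
                })
                if err != nil {
                    log.Fatal(err)
                }
                part++
                buf.Reset()
            }

            if lastPage {
                break
            }
            startKey = out.LastEvaluatedKey
            // Throttle between batches (the real tool doubles the wait when a
            // ProvisionedThroughputExceededException is returned).
            time.Sleep(2000 * time.Millisecond)
        }
    }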

And here’s how I would restore today’s backup into my new table named “mynewtable” (writing batches of 100 records at a time, every 800ms), only appending the records to the table:

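With the same caveat that the flag names are illustrative placeholders (see the README for the real options), a restore could look like this:

    # Folder of the backup taken earlier today (YYYY-mm-dd-HH24-MI-SS)
    BACKUP_TIMESTAMP=2017-12-01-02-00-00   # example value
    ./dynamodbdump \
      -action restore \
      -table mynewtable \
      -batch-size 100 \
      -wait-ms 800 \
      -s3-bucket my-dynamo-backup-bucket \
      -s3-folder "backups/mytable/${BACKUP_TIMESTAMP}" \
      -append   # only append the records, don't wipe the table first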

Internals of appending the records:
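Conceptually, appending boils down to reading the backed-up records and writing them back to the table in batches, waiting between batches. A simplified Go sketch of that loop, using BatchWriteItem (which DynamoDB caps at 25 items per call); again, this is not the project’s actual code:

    package main

    import (
        "bufio"
        "encoding/json"
        "log"
        "os"
        "time"

        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    // flush pushes a batch of put requests, splitting it into the 25-item
    // chunks that BatchWriteItem accepts per call.
    func flush(ddb *dynamodb.DynamoDB, table string, reqs []*dynamodb.WriteRequest) {
        for start := 0; start < len(reqs); start += 25 {
            end := start + 25
            if end > len(reqs) {
                end = len(reqs)
            }
            _, err := ddb.BatchWriteItem(&dynamodb.BatchWriteItemInput{
                RequestItems: map[string][]*dynamodb.WriteRequest{table: reqs[start:end]},
            })
            if err != nil {
                log.Fatal(err) // the real tool retries unprocessed items instead
            }
        }
    }

    func main() {
        sess := session.Must(session.NewSession())
        ddb := dynamodb.New(sess)

        // For brevity the backup is read from stdin (one JSON-encoded item per
        // line, as produced by the backup sketch above) instead of from S3.
        scanner := bufio.NewScanner(os.Stdin)
        var pending []*dynamodb.WriteRequest

        for scanner.Scan() {
            item := map[string]*dynamodb.AttributeValue{}
            if err := json.Unmarshal(scanner.Bytes(), &item); err != nil {
                log.Fatal(err)
            }
            pending = append(pending, &dynamodb.WriteRequest{
                PutRequest: &dynamodb.PutRequest{Item: item},
            })

            // Append 100 records at a time, then wait before the next batch.
            if len(pending) == 100 {
                flush(ddb, "mynewtable", pending)
                pending = pending[:0]
                time.Sleep(800 * time.Millisecond)
            }
        }
        if len(pending) > 0 {
            flush(ddb, "mynewtable", pending)
        }
    }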

Now the more interesting part is using it in Docker or in an orchestrator. For example, if I want a Kubernetes CronJob that backs up my table every night at 2AM, I would use the following configuration file:

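Here is a sketch of what such a manifest could look like, using the batch/v1beta1 CronJob API available at the time of writing; the container args reuse the same illustrative flag names as the earlier examples, so adapt them to the options documented in the README:

    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: dynamodbdump-mytable
    spec:
      schedule: "0 2 * * *"          # every night at 2AM
      concurrencyPolicy: Forbid      # never run two backups of the same table at once
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: dynamodbdump
                  image: vevo/dynamodbdump:latest
                  # Illustrative flags again -- align them with the README.
                  args:
                    - "-action=backup"
                    - "-table=mytable"
                    - "-batch-size=1000"
                    - "-wait-ms=2000"
                    - "-s3-bucket=my-dynamo-backup-bucket"
                    - "-s3-folder=backups/mytable"
                  env:
                    - name: AWS_REGION
                      value: us-east-1
                  # AWS credentials come from the node's IAM role (or kube2iam,
                  # or a secret), not from the manifest itself.

Apply it with kubectl apply -f and the cluster spawns a short-lived backup container every night; the concurrencyPolicy keeps two backups of the same table from overlapping.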

How to get more information on it?

Head over to our GitHub repository to check out the documentation and examples, and create issues there; we’ll be happy to answer your questions. You can also leave us a comment below!