DynamoDB Backup with EMR
Backups. Everyone needs them, everyone hopes they never need them! We use AWS DynamoDB quite a bit at Signiant with close to 200 tables in production. Dumping a single table is pretty straightforward using AWS’ built-in solution that relies on data pipeline. We’ve gone through several iterations of how we backup this data in a reasonable time and have a solution that we’ve been using for a while now that seems to work
In the Beginning
When we first started using DynamoDB, we didn’t have a lot of tables or a lot of data. So we had a pretty simple solution of a .NET tool we had created that dumped table data to XML. This was just scheduled in the Windows scheduler as a batch file. It worked but we didn’t handle failures well and because all the tables were backed up sequentially, it started to get slow.
Our first cut at improving things was to wrap the .NET tool in a script which handled failures, retried operations, etc. This drastically improved our backup success rate but it was still getting slower as we got more data and added more tables.
Next, we added some parallelism to the table dumps using some threads. This decreased the backup time by a huge factor but it was cumbersome to manage new tables being added and it all ran on a single EC2 instance writing to local storage which quickly became the bottleneck.
In true “this is all crap, I need to re-write everything” fashion so often found in the software development world, it was time for a re-think of the whole process. We took a look at using the DynamoDB/DataPipeline solution to dump the table data but that used a single Elastic Map Reduce (EMR) cluster per table. With over 200 tables to dump, that was going to get expensive. However, we were able to reverse-engineer what Data Pipeline was doing to dump the table and that led to the current solution. Which is…
- Using a small python/boto script, we grab a list of all the tables to be dumped and generate an EMR ‘step’ for dumping this table. The list of tables can be filtered by a prefix (ie. all tables beginning with PROD). Each EMR step uses the AWS provided library to access DynamoDB and write the dumped data to S3.
- Within each step, we also have the option to “spike” the read throughput before the dump takes place and lower it when it’s complete. This saves a huge amount of time while at the same time, keeps the DynamoDB costs as low as possible for provisioned throughput
- We then create a new EMR cluster and pass in the steps and the other EMR configuration data needed. The EMR cluster can be configured with more or less workers as needed and essentially does all the work at this stage. Clusters that fail to start are retried.
- A watchdog monitors the state of the cluster and all it’s steps and reports the status at the end.
In our case, we’re able to sub-divide our tables into 6 different “prefixes” so we can backup 6 sets of tables in parallel using 6 EMR clusters. All of the clusters are spun up on demand (using spot instances) and torn down when the dumps complete. Restores are just the reverse process.
We’ve been using this process for around 2 years now and it’s been incredibly reliable. Since switching to spot instances late in 2016, we’ve reduced the costs by around 90% while still keeping the same level of performance.
The scripts and ‘moving parts’ for this solution are available in a github repo: signiant/dynamodb-emr-exporter
As well as making sure we have a backup of all our DynamoDB content, we also replicate it using a replication solution we developed. You can read all about that in All of your replication servers are belong to us