Backing up an Amazon Web Services DynamoDB

by Annette Wilson

Sketch by Annette Wilson

At Skyscanner we make use of many of Amazon’s Web Services (AWS). I work in the Travel Content Platform Squad, who are responsible for making sure that re-usable content, like photographs of travel destinations, descriptive snippets of text and whole articles can be used throughout Skyscanner. That might be on the Skyscanner website, in the mobile app, in our newsletter or in whatever cool ideas our colleagues might come up with. Recently we’ve been evaluating Amazon’s DynamoDB to see if it might be appropriate as the primary means of storing data for a new service. If we use DynamoDB as a primary store, and not just a cache for something else, we’ll need to keep backups. But it wasn’t clear how best to take these.

After investigating the options and trying them out I wrote a summary for my colleagues in an internal blog. This is a lightly adapted version of that summary. I’ll warn you now, it’s quite long!

Backups with Data Pipeline

DynamoDB doesn’t really support backups as a first-class operation. Amazon suggests using AWS Data Pipeline. You can fairly quickly and easily create a new data pipeline to perform backups through the AWS web console. Unfortunately, this masks a lot of complexity, which can lead to some surprises when you want to customise it to your needs.

So, what is Data Pipeline? It’s a system to automate and schedule the construction, operation and teardown of Amazon Elastic MapReduce (EMR) clusters. What is EMR? It’s a system to provision Hadoop and related big data applications onto managed Elastic Compute Cloud (EC2) instances.

Data Pipeline overview

Basically, when the backup data pipeline runs, it creates some big virtual machines in the Amazon cloud, installs and configures Hadoop on them, then runs a script on this cluster to download every item in the Dynamo table and write it to an Amazon Simple Storage Service (S3) bucket, with the capability to spread the work across many workers in parallel. Finally, it tears all this down and sends a notification email to say if it was successful.

Unfortunately, the data pipeline produced by the web console doesn’t quite meet our requirements:

  • By policy we do not use the web console for deployments in production, and strongly prefer to use CloudFormation
  • We want to put our backups into Skyscanner’s isolated ‘data’ AWS account
  • By policy we do not grant AWS services broad permissions that would allow them to interfere with the private resources of other services running in the same AWS account

The first of these problems has been solved already by some of our colleagues in the Traveller Communication Squad, who have transformed the data pipeline into a CloudFormation script. This is in itself a considerable feat — the transformation between the two representations is not entirely obvious, and results in something that’s pretty hard to read. We took their CloudFormation template as a starting point.

The next problem is writing backups to (and reading backups from) Skyscanner’s AWS data account. The CloudFormation template sends the backups to a bucket that it creates in the same AWS account.

Data pipeline copies data from the DynamoDB to an S3 bucket in the same account

This is okay, but it doesn’t make our backups as safe as we’d like. In particular, if the production AWS account is catastrophically compromised, an attacker could delete both the database and the backups. Similarly, it’s possible to conceive of runaway script accidents that delete both the database and the backup at the same time.

Instead, we prefer to send the backups to a separate data account, where we ensure that no code running in the production account ever has permission to delete the backups or the backup bucket. It can add data, but it cannot delete or change existing data.

Cross-account S3 bucket access

Writing to S3 across accounts like this presents a number of problems. First, we have to find a way to grant the data pipeline permission to write to the bucket in the data account. After that, we’ll worry about making sure the data can be read back again.

In AWS, there are three kinds of Identity and Access Management (IAM) entities that can be granted permissions to perform actions: users, groups and roles. Our CloudFormation template creates a role for the data pipeline to assume. The role grants it permission, among many other things, to read from the DynamoDB table and write to the S3 bucket. This is fine when the bucket is in the same account, but in the cross-account scenario it is insufficient. It would be a pretty useless system if anybody could sign up for an AWS account and then create IAM users in that account with permission to manipulate resources in other AWS accounts. Instead we need cooperation between the two accounts. A bucket policy in the data account can grant permissions to the root user in the production account, and then a policy on the data pipeline role in the production account can delegate that permission to the role.
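
As a rough illustration of that two-sided grant, the sketch below builds both policy documents as Python dictionaries. The account ID, the bucket name and the single s3:PutObject action are placeholders invented for the example, not our real configuration; a working pipeline needs more actions than this.

```python
import json

# Hypothetical identifiers, for illustration only.
PRODUCTION_ACCOUNT_ID = "111111111111"
BACKUP_BUCKET = "example-dynamodb-backups"


def bucket_policy(production_account_id: str, bucket: str) -> dict:
    """Resource policy attached to the bucket in the data account.

    It trusts the production account as a whole (its root principal).
    On its own this lets no role in that account write anything.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowProductionAccountToWriteBackups",
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{production_account_id}:root"
                },
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def role_policy(bucket: str) -> dict:
    """Identity policy attached to the pipeline role in the production account.

    It delegates the permission granted above to one specific role.
    Identity policies carry no Principal element.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(bucket_policy(PRODUCTION_ACCOUNT_ID, BACKUP_BUCKET), indent=2))
    print(json.dumps(role_policy(BACKUP_BUCKET), indent=2))
```

Both halves are needed: the write only succeeds when the bucket policy and the role policy each allow it.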

Sadly, our permission struggles did not end here. There are a number of fine-grained actions related to S3 buckets, permission for which may be granted or denied. For example, the PutObject action writes data into the bucket, GetObject reads it back out. ListBucket lists the contents of the bucket. DeleteObject, surprisingly enough, deletes an object from the bucket. We want our data pipeline to write objects, but not to delete objects. (It’s safe enough for it to overwrite an object, because we have object versioning enabled.) Various other operations allow multi-part uploads, versioning and access control. However, when we actually tried to run the pipeline after granting it the obvious permissions, it failed with an uninformative S3 permission exception. It didn’t tell us what action it was trying to perform or what resource it was trying to act upon.

To figure out what was going on, we created a bucket in the production account and enabled S3 access logging. Well, that’s not quite true. First we scrabbled around trying to find more informative logs buried away in the data pipeline interface, but there were none to be found. We looked for ways to increase the logging detail in Hadoop, but couldn’t find anything useful. Eventually we hit on the idea of turning on S3 access logging.

S3 access logging will record (almost) all S3 actions, including the type of action invoked, the resources involved and the entity invoking it. We ran the data pipeline backing up to this same-account bucket and reviewed the logs. It became clear that the data pipeline creates and deletes temporary files to represent empty folders. (S3 itself has no concept of folders or directories.) Thankfully these files have a distinctive suffix, and we were able to grant a very limited DeleteObject permission that only applies to these files and nothing else in the bucket.
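
To show the shape of such a narrow grant, here is a sketch of the policy statement together with a rough check of which object keys it would cover. It assumes the marker suffix is `_$folder$`, which is what Hadoop’s S3 filesystem typically uses; confirm the suffix in your own access logs. The bucket name and keys are hypothetical, and Python’s `fnmatch` is only an approximation of how IAM matches `*` in a resource ARN.

```python
from fnmatch import fnmatchcase

BACKUP_BUCKET = "example-dynamodb-backups"  # hypothetical bucket name
FOLDER_MARKER_SUFFIX = "_$folder$"          # assumption: check your access logs


def delete_marker_statement(bucket: str, suffix: str = FOLDER_MARKER_SUFFIX) -> dict:
    """Allow DeleteObject only on keys that end with the folder-marker suffix."""
    return {
        "Effect": "Allow",
        "Action": ["s3:DeleteObject"],
        "Resource": f"arn:aws:s3:::{bucket}/*{suffix}",
    }


def statement_covers(statement: dict, bucket: str, key: str) -> bool:
    """Approximate IAM wildcard matching of an object ARN against the statement."""
    object_arn = f"arn:aws:s3:::{bucket}/{key}"
    return fnmatchcase(object_arn, statement["Resource"])


if __name__ == "__main__":
    statement = delete_marker_statement(BACKUP_BUCKET)
    for key in ("backups/2016-05-01_$folder$", "backups/2016-05-01/table.json"):
        print(key, "->", statement_covers(statement, BACKUP_BUCKET, key))
```

The marker file is deletable under this statement while the backup data itself is not.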

Cross-account S3 object ownership

That’s all great, and is enough to save and even restore backups, but it suffers from a subtle but critical flaw. The backups are only readable by the account that wrote them. This is merely frustrating when it stops us from restoring a production backup into a test account to run destructive tests using production data. Far worse, it could render the backups useless if something bad happens to the production account, which is exactly why we’re writing them into another account in the first place. If some catastrophe prevents us from using the production account anymore, we want to be able to recreate everything in a fresh account.

The reason that the backups cannot be read from another account is that every S3 object is owned by the account that created it — or rather, by the account that contains the user or role that created it. By default, only the owner of the object can manipulate it. Even the bucket owner cannot do anything to the object without the object owner’s permission!

S3 layers of permissions

The solution is for the object owner to apply an ACL. The bucket-owner-full-control ‘canned’ ACL can be applied when the object is created to give the bucket owner (the data account) all the same permissions as the owner. Once the bucket owner has these permissions, the bucket policy can effectively delegate them to other accounts.
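
When you control the uploading code yourself, applying the canned ACL is a single request parameter. The sketch below assumes a boto3 S3 client, whose `put_object` call accepts `Bucket`, `Key`, `Body` and `ACL`; the helper names are invented for the example, and the client is passed in so the snippet has no dependency of its own.

```python
CANNED_ACL = "bucket-owner-full-control"


def put_object_args(bucket: str, key: str, body: bytes) -> dict:
    """Build the arguments for an S3 PutObject call that hands the
    bucket owner full control over the newly created object."""
    return {"Bucket": bucket, "Key": key, "Body": body, "ACL": CANNED_ACL}


def upload_backup(s3_client, bucket: str, key: str, body: bytes):
    """Upload one backup object.

    `s3_client` is expected to behave like boto3.client("s3"); anything
    with a compatible put_object(**kwargs) method will do.
    """
    return s3_client.put_object(**put_object_args(bucket, key, body))
```

The ACL has to be set by the writer at upload time, because the bucket owner has no rights over the object until it is applied.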

The only difficulty is figuring out how to instruct the data pipeline to use canned ACLs. In fact, this is a configuration option for Hadoop, fs.s3.canned.acl, but in order to apply it we need to understand how to tell data pipeline to pass configuration options to Hadoop, and we have to understand how to transform this part of the data pipeline specification into a CloudFormation template.

While this sounds straightforward, consider first the layers upon layers of systems we have involved.

Every layer is a leaky abstraction. It introduces its own concepts and methods of configuration, without removing the need to understand those of the layer below. So when we really just want to put this in Hadoop’s core-site.xml:

    <property>
      <name>fs.s3.canned.acl</name>
      <value>BucketOwnerFullControl</value>
    </property>

We need to understand that EMR expects this:

    {
      "classification": "core-site",
      "properties": {
        "fs.s3.canned.acl": "BucketOwnerFullControl"
      }
    }

And Data Pipeline expects this:

    …
    {
      "name": "EmrClusterForBackup",
      "id": "EmrClusterForBackup",
      …
      "configuration": { "ref": "EmrClusterConfigurationForBackup" }
    },
    {
      "name": "EmrClusterConfigurationForBackup",
      "id": "EmrClusterConfigurationForBackup",
      "type": "EmrConfiguration",
      "classification": "core-site",
      "property": [{ "ref": "FsS3CannedAcl" }]
    },
    {
      "name": "FsS3CannedAcl",
      "id": "FsS3CannedAcl",
      "type": "Property",
      "key": "fs.s3.canned.acl",
      "value": "BucketOwnerFullControl"
    }

And finally CloudFormation expects us to render it like this:

    {
      "Id": "EmrClusterForBackup",
      "Name": "EmrClusterForBackup",
      "Fields": [
        …
        { "Key": "configuration", "RefValue": "EmrClusterConfigurationForBackup" }
      ]
    },
    {
      "Id": "EmrClusterConfigurationForBackup",
      "Name": "EmrClusterConfigurationForBackup",
      "Fields": [
        { "Key": "type", "StringValue": "EmrConfiguration" },
        { "Key": "classification", "StringValue": "core-site" },
        { "Key": "property", "RefValue": "FsS3CannedAcl" }
      ]
    },
    {
      "Id": "FsS3CannedAcl",
      "Name": "FsS3CannedAcl",
      "Fields": [
        { "Key": "type", "StringValue": "Property" },
        { "Key": "key", "StringValue": "fs.s3.canned.acl" },
        { "Key": "value", "StringValue": "BucketOwnerFullControl" }
      ]
    }
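
The mapping from the Data Pipeline form to the CloudFormation form is mechanical once you have seen it, so it can be sketched in a few lines of Python. This is an illustration of the pattern visible in the two listings above, not a complete converter: the function name is invented, and it only handles plain values, references and lists of either.

```python
def to_cloudformation(pipeline_object: dict) -> dict:
    """Convert one Data Pipeline definition object into the
    Id / Name / Fields shape that CloudFormation expects."""
    fields = []
    for key, value in pipeline_object.items():
        if key in ("id", "name"):
            continue  # these become top-level Id and Name, not Fields
        # A list becomes several Fields entries that share the same Key.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict) and "ref" in item:
                fields.append({"Key": key, "RefValue": item["ref"]})
            else:
                fields.append({"Key": key, "StringValue": str(item)})
    return {
        "Id": pipeline_object["id"],
        "Name": pipeline_object["name"],
        "Fields": fields,
    }


if __name__ == "__main__":
    canned_acl = {
        "name": "FsS3CannedAcl",
        "id": "FsS3CannedAcl",
        "type": "Property",
        "key": "fs.s3.canned.acl",
        "value": "BucketOwnerFullControl",
    }
    print(to_cloudformation(canned_acl))
```

Applied to the FsS3CannedAcl object, this yields the same three Fields entries as the hand-written CloudFormation version.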

Extracting this information from the documentation wasn’t exactly straightforward, and it’s still pretty hard to understand at a glance. At this point we decided to stop and take stock. We had something that demonstrably worked, but still required a set of permissions too broad to deploy in the real production account. We’d need to whittle these down to remove all the wildcard resources. By now, the cons of Data Pipeline definitely felt like they were outweighing the pros:

Pros

  • Schedules and executes backups
  • Easy notifications
  • Should scale to massive table sizes

Cons

  • Very slow to start up: each attempt to run a backup takes a minimum of 15 minutes, and by default it retries three times, so a failure can take 45 minutes
  • (Relatively) expensive — charges a minimum of one hour of usage of quite a large EC2 instance for every backup (EMR doesn’t let us use a smaller instance size)
  • Complex and difficult to understand
  • Still more work required to minimize permissions in production

I sketched this cartoon (go easy on me, I’m a programmer, not an artist) while waiting for Data Pipeline to run and fail for the umpteenth time, in order to vent some of my frustration. It really does feel like Data Pipeline is overkill unless you have a really big table.

Sketch by Annette Wilson

Backups without Data Pipeline

We decided to investigate other options for backup. Fundamentally, the backup operation is not complex. It needs to perform a scan operation on the DynamoDB table, reading records and writing them to S3. It should be aware of the table’s provisioned throughput to avoid saturating it and preventing anyone else from reading the table.
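
The core loop is small enough to sketch. The version below is an illustration under stated assumptions, not the script we ended up using: `scan_page` is a stand-in for a call such as boto3’s `scan` with `ReturnConsumedCapacity="TOTAL"`, whose response carries `Items`, `ConsumedCapacity` and, while more pages remain, `LastEvaluatedKey`. The function names and the `max_share` parameter are invented for the example, and the S3 upload is left out; items are written to any file-like object.

```python
import json
import time
from typing import Callable, Iterator, Optional, TextIO


def throttled_scan(
    scan_page: Callable[[Optional[dict]], dict],
    read_capacity_units: float,
    max_share: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every item in the table, pausing between pages so that the
    scan uses at most `max_share` of the provisioned read capacity."""
    budget_per_second = read_capacity_units * max_share
    if budget_per_second <= 0:
        raise ValueError("read capacity budget must be positive")
    start_key = None
    while True:
        page = scan_page(start_key)
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            return  # no more pages
        consumed = page.get("ConsumedCapacity", {}).get("CapacityUnits", 0.0)
        if consumed > 0:
            # Wait long enough for this page's cost to average out to the budget.
            sleep(consumed / budget_per_second)


def write_backup(items: Iterator[dict], out: TextIO) -> int:
    """Write one JSON document per line and return the number of items written."""
    count = 0
    for item in items:
        out.write(json.dumps(item) + "\n")
        count += 1
    return count
```

Passing `sleep` in as a parameter keeps the throttling testable without real waiting.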

Without Data Pipeline, we also need a means to schedule backups. We considered running an AWS Lambda function on a schedule, but we weren’t sure that the backup process would complete within Lambda’s time limit. In the end, we decided the simplest thing was to run a scheduler in a Docker container in our existing EC2 Container Service (ECS) cluster.

We ruled out Cron as a scheduler, since it doesn’t play nicely with Docker — you need to jump through hoops to stop it from running in the background, changing user and throwing away all the environment variables. We settled on APScheduler for scheduling and we found a simple script for doing the backup, dynamo-backup-to-s3.

We still needed to address the same permission problems we saw with Data Pipeline:

  • The script needs to run as a role with permission to write to the bucket
  • The bucket policy must also allow the operations
  • We need to set an appropriate ACL on any objects stored into the bucket

Thankfully, this time it’s much easier to see what permissions the script will need. The ACL is slightly tricky, since the script doesn’t set one, but it turns out to be an easy fix. The quick hack below was good enough to prove the concept; a more general-purpose solution would be to expose this as a configuration option.

      var upload = new Uploader({
        accessKey: self.awsAccessKey,
        secretKey: self.awsSecretKey,
        region: self.awsRegion,
        bucket: self.bucket,
        objectName: path.join(backupPath, tableName + '.json'),
        stream: stream,
    -   debug: self.debug
    +   debug: self.debug,
    +   objectParams: {
    +     ACL: 'bucket-owner-full-control'
    +   }
      });

We like this solution because:

  • We can see and understand all the parts involved
  • It uses Docker, for which we already have great internal tools to manage deployments
  • It took less time and was less confusing to set up than Data Pipeline
  • It runs on the ECS cluster we already have set up

It does have some drawbacks:

  • It is harder to scale if the table gets really big
  • We need to do a bit more work if we want email notifications
  • We need to take care that our ECS cluster has enough capacity

So there you have two different ways to back up a DynamoDB table. We find AWS really powerful in general, and it’s usually easy to find the documentation and examples you need, but once you stray from the most well-trodden paths — with requirements like operating across multiple accounts or using CloudFormation for all infrastructure — it can become difficult to find anyone who’s done quite the same thing before.
