How Rewind Manages Data Using an Army of Fargate Spot Workers

Dave North
BigCommerce Developer Blog
10 min read · Dec 23, 2019

At Rewind, we back up SaaS application data. Quite a lot of it. We take a lot of care in how we secure and manage the data we store, because we're merely its custodians for a short time on behalf of our customers. In this article, I'd like to talk about how we manage data expiration in a scalable way by taking advantage of AWS Fargate container scaling and a "creative" use of S3 lifecycle rules. First though, a quick introduction to the types and amount of data we are storing.

All of the Data

Currently, we're using AWS S3 as the main data store for backups. S3 is some of the most cost-effective storage on the planet and comes with some good features to help manage the data stored. While we don't disclose the exact amount of data we have under backup, it's in the hundreds of TB and growing at a pretty quick rate as we onboard more customers.

The challenge in managing this data is its shape. The files (objects) are typically very small, and because we are a backup application that keeps multiple versions of each object, there are a lot of them. Tens of billions as I write this, to be exact. So how do we manage this?

Long Term Management (aka the easy part)

As part of our service agreement, we commit to maintaining backed-up versions of all objects for 1 year. This means that as a Rewind customer, you can rewind back to any version of any object for 1 year. Typically, most rewind operations deal with data from the past 7 days, but in the off-chance you want to restore data from 9 months ago, we have you covered. Managing this data expiration pattern is incredibly easy using S3 lifecycle rules. Lifecycle rules let you manage data according to time-based criteria: for example, deleting previous versions of an object 365 days after they have been superseded by a newer version.

We manage all of our infrastructure using Terraform. Here's a small extract from one of our bucket definitions:

resource "aws_s3_bucket" "platform_regional_bucket" {
  # Specific name or our default
  bucket = "${lookup(local.platform_regional_bucket, local.platform, lookup(local.platform_regional_bucket, "default"))}"
  acl    = "private"

  versioning {
    enabled = true
  }

  lifecycle_rule {
    id      = "Move to Infrequent Access, Permanently delete after 365 days"
    enabled = true
    abort_incomplete_multipart_upload_days = 7

    noncurrent_version_expiration {
      days = 365
    }

    noncurrent_version_transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
  }
}

There's more to this that I didn't include, but the important piece here is the lifecycle_rule definition. We expire the older versions after 365 days (but always keep the latest version) and also take advantage of tiering the data to the S3 Infrequent Access (STANDARD_IA) storage class. IA storage is more cost-effective for data that will not be accessed frequently, which is perfect for backup data older than 30 days. There is a cost to transitioning to IA storage (and it is not insignificant for our use case, because it is based on the number of objects, not the amount of data stored), but we generally see a payback in around 3 months by moving data to the IA tier.

So, that’s it for longer term data management. Pretty simple — just let the lifecycle rule do the heavy lifting and it’s hands off. But we have another, more interesting, case that involves a creative solution…

Short Term Management (aka the trickier part)

What happens when we need to remove data that has more than a time-based component to it? For example, if a customer uninstalls Rewind, we’d like to remove that data within a reasonable timeframe. Or if a customer submits a GDPR right to erasure request, how do we accomplish this?

Initially, I figured this would be easy — a small script to get information about the data we need to remove from our database followed by removing it from the S3 bucket using the AWS CLI. It wasn’t quite that simple. Recently, I had to remove a bucket with a large number of objects and found that using the CLI or even deleting via the AWS console would fail every time. I ruled out this approach.
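For reference, the naive approach was essentially a recursive delete against the account's prefix. Something like the following sketch (the bucket and prefix variables here are illustrative, not our real names):

# Naive approach: recursively delete everything under an account's prefix.
# Fine for small prefixes, but it issues one delete per key and, on a
# versioned bucket, only adds delete markers rather than removing old versions.
aws s3 rm "s3://${backup_bucket}/${account_prefix}/" --recursive

At our object counts, this just isn't practical.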

I had a look at S3 Batch, which looked like it would be perfect: submit a manifest and an operation to perform, and S3 Batch would do the heavy lifting. But alas, Batch does not support deletes. While it supports running a Lambda function per object, with billions of objects this would eat into our Lambda concurrency and add unnecessary cost. It also adds more API calls to S3, which are chargeable actions (lifecycle rules incur no cost). I ruled out Batch for now (until it supports deletes, which I understand from chatting to a re:Invent presenter is a hot request!).

Finally, while I was riding the train home one night, I had an idea. What if we could manipulate lifecycle rules on a bucket "on the fly": add them, leave them in place while they remove the data, and then remove them afterwards? It would need to be scalable, as there could be bursts of data to remove for many accounts concurrently, and I don't want to spend a lot of dollars on this.

Winston Wolfe

Pulp Fiction. A classic. Harvey Keitel played this memorable character who was a cleaner. I needed a cleaner. So, I created the Wolf. The Wolf uses a few different AWS services operating together to manage the data we need to remove.

The core service used to do all of this is AWS Fargate. Fargate is (IMHO) the easiest way to orchestrate containers on AWS. As a long-time AWS EC2 Container Service (ECS) user, I'm always amazed at how simple Fargate makes managing containers and how quickly it can deploy large numbers of them. I was able to use the new Fargate Spot capability to make running all of this super cost-effective.

Let's drill into some details about how this all works. Alas, the solution contains some proprietary code, so it lives in a private repo, but I'll include relevant code snippets where appropriate. The whole thing is essentially a couple of bash scripts using the AWS CLI, packaged into a single Docker container (Alpine Linux — lightweight!), with the entire deployment in a CloudFormation template.

Here's the diagram. There are a few moving parts, so let me explain them.

The two basic components are the dispatcher and the purger. These are packaged into a single Docker container that simply has two different entry points. Two ECS Fargate task definitions are created to describe how to launch either the dispatcher or the purger(s).
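As a rough sketch (these are illustrative fragments and script paths, not our real task definitions), the dispatcher task definition might contain

"entryPoint": ["/scripts/dispatcher.sh"]

while the purger task definition points at the same image but with

"entryPoint": ["/scripts/purger.sh"]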

  1. In step 1, the dispatcher starts up as an ECS scheduled task and connects to our main account database for a list of accounts that should have their data removed.
  2. The dispatcher then starts up a new Fargate (Spot) task to purge this data (the purger). There can be many purgers running concurrently depending on how many accounts we need to process. The dispatcher accepts a parameter for the maximum number of purgers that can be in flight at any one time, due to the limit on lifecycle rules per S3 bucket (a rough sketch of this throttle follows below).
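Enforcing that cap is simple: before dispatching the next account, the dispatcher counts how many purger tasks are still running and waits for a free slot. A minimal sketch, assuming the cap lives in a max_purgers variable (that variable name is an assumption):

# Wait until the number of running purgers drops below the cap
# (max_purgers is the dispatcher parameter mentioned above)
while true; do
running=$(aws ecs list-tasks \
--cluster "${cluster_name}" \
--family winston-wolfe-purger \
--desired-status RUNNING \
--region "${account_region}" \
--query 'length(taskArns)' \
--output text)

[ "${running}" -lt "${max_purgers}" ] && break
sleep 30
done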

A quick note on how Rewind data is segmented: we store data in the same region as the source we are backing up. For example, if a customer has a store in the EU, we have all of the infrastructure to store and process it in the eu-west-1 (Ireland) AWS region. As such, when we start the Fargate purger tasks, we make sure to start them in the region the data is stored in. This is the beauty of Fargate. With "ECS v1", we'd have needed a cluster of instances running to host these containers. Most of the time, they'd be idle or highly underutilized. But with Fargate being totally serverless (and Fargate Spot being incredibly cheap), we need no capacity on the bench.

Here’s how the dispatcher starts a purger container for a specific account using the Fargate Spot capacity provider:

purger_task_arn=$(aws ecs run-task \
--count 1 \
--capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=1,base=100 \
--cluster "${cluster_name}" \
--task-definition winston-wolfe-purger \
--propagate-tags TASK_DEFINITION \
--enable-ecs-managed-tags \
--overrides file:///tmp/purger_overrides.json \
--network-configuration "awsvpcConfiguration={subnets=[${subnet_id}],securityGroups=[${winston_sg_id},${db_sg_id}]}" \
--query 'tasks[*].taskArn' \
--region "${account_region}" \
--output text)

We pass the account ID and some other data into the task by making use of overrides to the task. Essentially, this allows us to specify different values for environment variables which are passed to Fargate containers.
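The overrides file is just the standard ECS run-task overrides structure. Ours passes the account details in as environment variables, roughly along these lines (the container name and the values here are illustrative):

{
  "containerOverrides": [
    {
      "name": "purger",
      "environment": [
        { "name": "ACCOUNT_ID", "value": "12345" },
        { "name": "PLATFORM_ID", "value": "platform/12345" }
      ]
    }
  ]
}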

I make no secret of the fact that I'm a huge fan of the AWS CLI and find that, with its --query option, writing scripts around it to perform powerful actions is a snap.

3. The purge. So we have a Fargate container running; what does it do? Primarily, it removes data from S3, but there are a few other data stores we clean at the same time. The S3 removal is accomplished by dynamically manipulating lifecycle rules on the bucket.

The lifecycle manipulation is a 2-pass solution:

  • Pass 1 adds a lifecycle rule for the account prefix to purge (data is organized in S3 by prefixes mapped to the account ID). It does not mark the account as purged.
  • Pass 2 re-visits the same account ID and checks whether the data purge is complete. If it is, the rule is removed and the account is marked as purged (a sketch of that check follows below).
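Checking whether the data is actually gone is a single list call against the account's prefix, remembering that the bucket is versioned, so we care about old versions, not just current objects. A minimal sketch, assuming the prefix lives in an account_prefix variable (that name is an assumption):

# Any surviving object version under the prefix means the lifecycle rule
# hasn't finished yet (delete markers may linger, but the object data is gone)
remaining=$(aws s3api list-object-versions \
--bucket "${s3_bucket}" \
--prefix "${account_prefix}" \
--max-keys 1 \
--query 'length(Versions || `[]`)' \
--output text)

if [ "${remaining}" -eq 0 ]; then
echo "Purge complete for account ${ACCOUNT_ID}"
fi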

Here’s where I ran into a bit of a snag. See, in the AWS console, the lifecycle rules management looks like this:

It looks like you can add and remove individual lifecycle rules, whereas in fact the calls to get/set lifecycle rules get/set the entire set! This is a problem because we have multiple containers running in parallel, operating on the same bucket lifecycle rules. Some kind of lock is required to prevent concurrent operation on a given bucket's lifecycle rules.

DynamoDB — The Wonder Service

A solution I had used before is to use a row in a DynamoDB table as a lock by way of a conditional update. Conditional updates fail in DynamoDB if the attribute being updated does not match the condition specified. So the Winston CloudFormation template creates a DynamoDB table with just a single row per S3 bucket we need to manage in the region. Obtaining a lock for the bucket then just looks like this:

aws dynamodb update-item \
--table-name "${ddb_lock_table}" \
--key "{\"bucket\": {\"S\": \"${s3_bucket}\"}}" \
--update-expression "SET lock_state = :new_state" \
--condition-expression "lock_state = :unlocked_state OR attribute_not_exists(lock_state)" \
--expression-attribute-values '{ ":new_state": { "S": "locked" }, ":unlocked_state": { "S": "unlocked" } }'

This is just run in a loop, checking the return code. If we were able to perform the update, we have “obtained the lock” and can carry on. If the update fails, we retry for some time until we can perform the update. Unlocking is just the reverse condition.
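A minimal sketch of what that looks like in the script (the retry interval is arbitrary, and obtain_bucket_lock is a hypothetical wrapper around the update-item call above):

# Spin until the conditional update succeeds and we hold the lock
# (obtain_bucket_lock: hypothetical wrapper around the update-item call above)
until obtain_bucket_lock "${s3_bucket}"; do
sleep 15
done

# ... manipulate the lifecycle rules ...

# Release the lock: the same update with the condition reversed
aws dynamodb update-item \
--table-name "${ddb_lock_table}" \
--key "{\"bucket\": {\"S\": \"${s3_bucket}\"}}" \
--update-expression "SET lock_state = :new_state" \
--condition-expression "lock_state = :locked_state" \
--expression-attribute-values '{ ":new_state": { "S": "unlocked" }, ":locked_state": { "S": "locked" } }'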

Manipulating Lifecycle Rules

So, how do we then go about manipulating the rules? Using the lock, it's fairly straightforward with the CLI:

aws s3api put-bucket-lifecycle-configuration \
--bucket "${s3_bucket}" \
--region "${AWS_REGION}" \
--lifecycle-configuration "file://${lifecycle_config_json_file}"
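Getting the current rule set into that file in the first place is just the mirror-image call, roughly:

# Fetch the full set of current rules for the bucket into a local file
# (note: this errors if the bucket has no lifecycle configuration at all)
aws s3api get-bucket-lifecycle-configuration \
--bucket "${s3_bucket}" \
--region "${AWS_REGION}" \
> "${bucket_lifecycle_rules_file}"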

The lifecycle config JSON file always contains the full set of rules. I was able to manipulate this using jq. Removing a rule from the file is done as follows:

jq --arg aid "${ACCOUNT_ID}" \
'del(.Rules[] | select(.ID == $aid))' "${bucket_lifecycle_rules_file}" \
> "${new_bucket_lifecycle_rules_file}"

And adding a rule is done using:

jq \
--arg aid "${ACCOUNT_ID}" \
--arg pid "${PLATFORM_ID}" \
--arg days "${purge_after_days}" \
'.Rules += [{"Filter":{"Prefix":$pid},"Status":"Enabled","NoncurrentVersionExpiration":{"NoncurrentDays":$days|tonumber},"Expiration":{"Days":$days|tonumber},"ID":$aid}]' \
"${bucket_lifecycle_rules_file}" > "${new_bucket_lifecycle_rules_file}"

Putting this all together, the flow when the purger starts is:

  • Lock the lifecycle rules for the bucket (using our DynamoDB lock)
  • List the lifecycle rules on the bucket
  • If a rule already exists for the account ID, check if the content is gone from S3. If it is, mark this account ID as purged. If not, leave the rule in place and exit.
  • If a rule does not exist for this account ID, add one and purge the data from RDS/Elastic. Do NOT mark the account as purged. (A rough sketch of this flow, stitched together, is below.)
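Stitched together, the core of the purger script ends up looking roughly like this. The helper function names are hypothetical stand-ins for the snippets shown earlier:

# High-level purger flow (sketch; helper functions are hypothetical wrappers
# around the CLI calls shown above)
lock_bucket "${s3_bucket}"                      # DynamoDB conditional update
fetch_lifecycle_rules "${s3_bucket}"            # get-bucket-lifecycle-configuration

if rule_exists_for "${ACCOUNT_ID}"; then
  if prefix_is_empty "${account_prefix}"; then  # list-object-versions check
    remove_rule_for "${ACCOUNT_ID}"             # jq del(...) then put-bucket-lifecycle-configuration
    mark_account_purged "${ACCOUNT_ID}"
  fi
else
  add_rule_for "${ACCOUNT_ID}"                  # jq '.Rules += [...]' then put-bucket-lifecycle-configuration
  purge_other_datastores "${ACCOUNT_ID}"        # RDS/Elastic cleanup
fi

unlock_bucket "${s3_bucket}"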

Results

That’s about it! We’ve got a scalable solution that manages shorter-lived customer data. And we’re not paying a significant amount on our AWS bill to do it.

While manipulating lifecycle rules works in this case for us, you do need to be careful, because while the rule is in place it's removing ALL of the data matching the filter. With an expiry of 1 day, any new data arriving under the prefix will be removed as well. If and when S3 Batch supports deletions, we will move the solution over to use it, as it's much more predictable to have a known set of data to operate on. Until then, the "creative" manipulation of lifecycle rules works well.

Dave North
Cloud Architect at Rewind; automating all the things in the cloud