Set it and…Forget it

Dave North
Signiant Engineering
Sep 21, 2017

DynamoDB autoscaling in action

We’ve been using AWS DynamoDB at Signiant for a while as the main data store for our SaaS applications. In fact, we were on the original beta program for it many years ago now. DynamoDB does come with a bunch of challenges, but there’s no doubt in my mind that it reduces the ops load of running a database by about 90%. It just works, requires none of the traditional maintenance a relational database demands (Postgres vacuum, anyone?!), and performance is great. There is one problem though….

Spikes

Spikes. Read and write spikes. DynamoDB tables are set up not by how much storage capacity you need but by how much read and write capacity you need. This is expressed in terms of provisioned read and write throughput. You can burst over your provisioned limit for a short period, but in general, if you hit the provisioned rate, you’ll be told to go away and slow down your read or write rate. Take a look at this table:
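When you hit the provisioned rate, the SDK surfaces it as a throttling exception (ProvisionedThroughputExceededException in the AWS SDKs), and the usual client-side mitigation is exponential backoff with jitter. Here’s a minimal sketch of that idea; the exception class and delay values are illustrative stand-ins, not our production code:

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for the SDK's ProvisionedThroughputExceededException."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a throttled DynamoDB call with exponential backoff plus jitter.

    `operation` is any zero-argument callable (e.g. a wrapped GetItem or
    PutItem) that raises ThrottledError when the table's provisioned
    rate is exceeded.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries -- let the caller deal with it
            # Sleep 0.1s, 0.2s, 0.4s, ... plus a little random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The `sleep` parameter is injected only so the behaviour is easy to test; real callers just take the default.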

Here, the write pattern to the table is very uneven and “spiky”. This can cause havoc if your application cannot handle it gracefully. With a single table, it’s manageable to manually increase provisioned throughput, but in our case we have close to 150 DynamoDB tables, so that clearly wasn’t going to work (unless we had a room full of monkeys with keyboards!)

Autoscale All the Things

Back in 2013, we solved this problem by writing our own DynamoDB autoscaler. It worked by waking up every 5 minutes, getting a list of all tables, checking the current read and write throughput, and automatically increasing the provisioned throughput if the current usage was high enough. This worked fantastically well and solved a good number of throughput-exception issues we were continually hitting in various parts of our applications (ref: the number of tables we have). Earlier this year, though, AWS introduced autoscaling functionality into DynamoDB using application autoscaling. After doing some testing with it, it was clearly superior to our old (but well-used!) tool in that it is configured per table and it can scale throughput down. We did not build scale-down into our original tool because, while you can scale a table up an unlimited number of times per day, you can only scale down a limited number of times per day. In 2013, that was 2 times. It’s currently 8 times per day.
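The per-cycle decision our old tool made each time it woke up can be sketched as a small pure function. The 80% threshold, doubling factor, and capacity ceiling below are illustrative values, not necessarily what our tool actually used:

```python
def next_capacity(consumed, provisioned, threshold=0.8, factor=2.0, ceiling=40000):
    """One polling cycle of a 2013-style DynamoDB autoscaler.

    Returns the new provisioned capacity to set, or None if no change
    is needed. Note there is deliberately no scale-down branch: scale-
    downs were rationed per day, so the old tool only ever scaled up.
    """
    if provisioned <= 0:
        return None
    utilization = consumed / provisioned
    if utilization < threshold:
        return None  # comfortably under the provisioned rate
    # Double the capacity, but never past the account/table ceiling.
    return min(int(provisioned * factor), ceiling)
```

A driver would call this every 5 minutes per table with the consumed/provisioned numbers from CloudWatch and apply any non-None result via UpdateTable.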

Migrating to Native DynamoDB Autoscaling

As with most things, moving to a new way of doing things wasn’t quite a matter of checking a box. While the AWS console is literally a checkbox on each table to enable autoscaling, we have 150 tables. And our dev teams regularly add new tables, so we want autoscaling automatically enabled for new tables. Oh, and we’d like to know when tables are scaling.

It turns out that while it’s a single checkbox in the AWS console, like a lot of AWS features, there’s quite a bit going on behind the curtain. DynamoDB autoscaling relies on a feature called application autoscaling (this is also used for ECS autoscaling), which requires configuring a few things to make it work. Enabling this on 150+ tables was going to be some work, so we created a small utility to enable/disable autoscaling on groups of tables. This was a great learning exercise, as it allowed us to see exactly how autoscaling works behind the curtain.
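Behind that checkbox are essentially two Application Auto Scaling API calls per table dimension: register a scalable target, then attach a target-tracking policy. Here’s a hedged sketch for the read side; the client is injected (in real use it would be `boto3.client("application-autoscaling")`), and the capacity bounds and target utilization are illustrative, not our production values:

```python
def enable_read_autoscaling(aas_client, table_name,
                            min_capacity=5, max_capacity=1000,
                            target_utilization=70.0):
    """Enable read autoscaling on one table via Application Auto Scaling.

    `aas_client` is expected to behave like
    boto3.client("application-autoscaling").
    """
    resource_id = "table/" + table_name
    dimension = "dynamodb:table:ReadCapacityUnits"

    # 1. Tell Application Auto Scaling this table's read capacity is scalable.
    aas_client.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )

    # 2. Attach a target-tracking policy chasing ~70% consumed capacity.
    aas_client.put_scaling_policy(
        PolicyName=table_name + "-read-scaling",
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization",
            },
        },
    )
```

Writes are the same pair of calls with the `WriteCapacityUnits` dimension, and each global secondary index needs its own pair — which is exactly why doing this across 150+ tables called for a utility.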

Here’s what a table looks like now that we have autoscaling enabled for reads:

Always Autoscale

After enabling autoscaling on the existing tables and making sure it worked as expected, we wanted to make sure that all new tables have autoscaling enabled on them. We currently create tables using CloudFormation as part of our deployment pipeline, so the first thought was to have dev teams add the autoscaling options to the templates. However, the amount of extra bits and pieces to add per table (targets, scaling policies) is pretty ornery (especially if your table has indexes). So we created a Lambda function hooked up to a CloudWatch event. This is triggered whenever a new table is added and checks whether autoscaling is enabled for the table. If not, it turns it on.

The interesting learning experience here was how granular you can get with CloudWatch API events. Take a look at the event pattern:

{
  "source": [
    "aws.dynamodb"
  ],
  "detail": {
    "eventSource": [
      "dynamodb.amazonaws.com"
    ],
    "eventName": [
      "CreateTable"
    ]
  }
}

Here, we’re able to have the event fire a target only if it’s coming from DynamoDB and only for the CreateTable API action. When this condition is met, the Lambda function is executed, which does the legwork to ensure that autoscaling is enabled on “this” table and/or its indexes.
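The core of that Lambda can be sketched roughly as below. The event shape follows the CloudTrail record that CloudWatch Events delivers (the new table’s name sits under `detail.requestParameters.tableName`); the client and the `enable_autoscaling` callback are injected here for illustration, where a real handler would wire in `boto3.client("application-autoscaling")` and whatever enable routine you use:

```python
def handle_create_table(event, aas_client, enable_autoscaling):
    """Sketch of the CreateTable-triggered Lambda (names are illustrative).

    Checks whether the new table already has a scalable target registered
    and, if not, turns autoscaling on for it.
    """
    table_name = event["detail"]["requestParameters"]["tableName"]

    # Ask Application Auto Scaling whether this table is already registered.
    targets = aas_client.describe_scalable_targets(
        ServiceNamespace="dynamodb",
        ResourceIds=["table/" + table_name],
    )
    if not targets.get("ScalableTargets"):
        # No autoscaling configured yet -- turn it on.
        enable_autoscaling(aas_client, table_name)
    return table_name
```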

See the Scaling

Our legacy table autoscaling tool used to send us an email whenever it manipulated throughput. This was handy really just as an “is this really working?” notification. The native DynamoDB autoscaling documentation talked about an SNS topic we could subscribe to, but after talking to support, it seems this doesn’t actually exist. And who wants lots of email these days?

We use Slack. A lot. So what if we could get those notifications into a Slack channel? It turns out we can use a strategy similar to the one for capturing new-table events with CloudWatch events. In this case, we want to know when autoscaling has triggered an update to a table. AWS support gave us a pointer to what the event rule would look like, which ended up being:

{
  "source": [
    "aws.dynamodb"
  ],
  "detail": {
    "eventName": [
      "UpdateTable"
    ],
    "userAgent": [
      "application-autoscaling.amazonaws.com"
    ]
  }
}

Here, we’re able to have the event fire a target only if it’s coming from DynamoDB, only for the UpdateTable API action, and only if the service triggering the event is application-autoscaling. When this condition is met, a Lambda function is executed that formats a message to post to Slack. The Slack output looks like this:
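The formatting side of that Lambda is a small amount of code. The field paths below assume the CloudTrail UpdateTable record (camel-cased `requestParameters.provisionedThroughput`), and the message wording and webhook-based posting are just one way to do it, not necessarily ours:

```python
import json
import urllib.request

def format_scaling_message(event):
    """Turn an autoscaling UpdateTable CloudTrail event into Slack text."""
    params = event["detail"]["requestParameters"]
    throughput = params.get("provisionedThroughput", {})
    return "autoscaling updated *{}* to {} RCU / {} WCU".format(
        params["tableName"],
        throughput.get("readCapacityUnits", "?"),
        throughput.get("writeCapacityUnits", "?"),
    )

def post_to_slack(webhook_url, text):
    """POST the message to a Slack incoming webhook (URL is yours to supply)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```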

We’ve found that with a large number of tables, DynamoDB autoscaling scales a lot. So we have a dedicated Slack channel for this bot to post to, keeping our main channels free of autoscaling goodness.

There is…Just One More Thing

We’ve written before about how we back up a large number of DynamoDB tables using EMR. With this process, we actually “spike” the provisioned read throughput of a table before we dump its contents so we can back up the table in a reasonable time. Enabling autoscaling on such a table just caused the throughput to be lowered again right away. So we needed a way to pause autoscaling before we back up the table. This would be a great feature in DynamoDB, but it’s not there today. What we ended up having to do was modify the scaling policy for each table before we back it up to set the minimum throughput to our spiked value. You can see this change with this commit. It would be great if there were some simple way to pause autoscaling on a table, though.
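The workaround boils down to re-registering the scalable target with the minimum capacity pinned at the spiked value, so the target-tracking policy can’t scale the table back down while the dump runs. A hedged sketch (the client stands in for `boto3.client("application-autoscaling")`, and the ceiling is illustrative):

```python
def pin_min_read_capacity(aas_client, table_name, spiked_capacity,
                          max_capacity=40000):
    """Stop autoscaling from undoing a backup's read spike.

    Re-registering the scalable target with MinCapacity set to the
    spiked value keeps the table at that throughput for the duration of
    the backup; afterwards, you re-register with the normal floor.
    """
    aas_client.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/" + table_name,
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=spiked_capacity,
        # Make sure the ceiling never falls below the pinned floor.
        MaxCapacity=max(spiked_capacity, max_capacity),
    )
```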
