Migrating to DynamoDB, Part 1: Lessons in Schema Design

Garrett Heel
5 min read · Jan 21, 2016


This post originally appeared on the Runscope blog and is the first in a two-part series by Runscope Engineer Garrett Heel (see Part 2).

At Runscope, an API performance monitoring and testing company, we have a small but mighty DevOps team of three, so we’re constantly looking for better ways to manage and support our ever-growing infrastructure requirements. We rely on several AWS products to achieve this, and we recently finished a large migration over to DynamoDB. During this process we made a few missteps and learned a bunch of useful lessons that we hope will help you and others in a similar position.

Outgrowing Our Old Database

Our customers use Runscope to run a wide variety of API tests: on local dev environments, private APIs, public APIs and third-party APIs from all over the world. Every time an API test is run, we store the results of those tests in a database. Customers can then review the logs and debug API problems or share results with other team members or stakeholders.

When we first launched API tests at Runscope two years ago, we stored the results of these tests in a PostgreSQL database that we managed on EC2. It didn’t take long for scaling issues to arise as usage grew: many tests run on a by-the-minute schedule, generating millions of test runs. We considered a few alternatives, such as HBase, but ended up choosing DynamoDB since it was a good fit for the workload and we already had some operational experience with it.

Migrating Data

The initial migration to DynamoDB involved a few tables, but we’ll focus on one in particular: the table that holds test results. Take, for instance, a “Login & Checkout” test which makes a few HTTP calls and verifies the response content and status code of each. Every time a run of this test is triggered, we store data about the overall result: the status, timestamp, pass/fail, etc.

Example Test Results table in DynamoDB.

For this table, test_id and result_id were chosen as the partition key and range key respectively. From the DynamoDB documentation:

To achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values.

We realized that our partition key wasn’t ideal for maximizing throughput, but it gave us some indexing for free: with test_id as the partition key, fetching every result for a given test is a single Query. We also had a somewhat idealistic view of DynamoDB as some magical technology that could “scale infinitely”. Besides, we weren’t having any issues initially, so no big deal, right?
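For reference, here’s a minimal sketch of what that original key schema looks like in boto3. The table name, attribute types, and throughput values are illustrative assumptions, not our production settings:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Illustrative sketch of the original schema: test_id as the partition
# (hash) key and result_id as the range (sort) key. Every result for a
# given test lands under the same partition key.
dynamodb.create_table(
    TableName="test_results",
    KeySchema=[
        {"AttributeName": "test_id", "KeyType": "HASH"},
        {"AttributeName": "result_id", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "test_id", "AttributeType": "S"},
        {"AttributeName": "result_id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
```

The upside of this layout is cheap reads: all results for a test sit together. The downside, as we were about to learn, is that every write for that test also hits a single partition key.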

Challenges with Moving to DynamoDB

Over time, a few not-so-unusual things compounded to cause us grief.

1. Product changes

First, some quick background: a Runscope API test can be scheduled to run up to once per minute, and we do a small, fixed number of writes for each run. Tests can also be configured to run from up to 12 locations simultaneously. So the number of writes per run, all landing within a small window, is:

<number_of_locations> × <fixed>

Shortly after our migration to DynamoDB, we released a new feature named Test Environments, which made it much easier to run a test against different, reusable sets of configuration (e.g., local/test/production). The response was great: customers condensed their tests and ran them more often now that they were easier to configure.

Unfortunately, this also further amplified the writes going to a single partition key, since fewer tests (on average) were now being run more often. Our equation grew to:

<number_of_locations> × <number_of_environments> × <fixed>
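To make that concrete with illustrative numbers: a test running from 12 locations across 3 environments, with 2 fixed writes per run, generates 12 × 3 × 2 = 72 writes per scheduled run, and every one of them targets the same partition key.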

2. Partitions

Today we have about 400GB of data in this table (excluding indexes), which continues to grow rapidly. We’re also up over 400% on test runs since the original migration. Due to the table size alone, we estimate having grown from around 16 to 64 partitions (note that determining this is not an exact science).
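The AWS documentation of the era offered a rough rule of thumb for this estimate, which is easy to capture in a few lines of Python (the capacity figures below are placeholders, not our real settings):

```python
import math

def estimate_partitions(table_size_gb, rcu, wcu):
    # Rule of thumb from the DynamoDB docs at the time: a partition
    # holds roughly 10 GB and serves ~3,000 RCU or ~1,000 WCU.
    by_size = math.ceil(table_size_gb / 10.0)
    by_throughput = math.ceil(rcu / 3000.0 + wcu / 1000.0)
    return max(by_size, by_throughput)

# ~400 GB of data alone implies at least 40 partitions.
print(estimate_partitions(400, 1000, 2000))  # -> 40
```

Since partitions split as data grows and never merge, the live count tends to run ahead of this back-of-the-envelope figure, hence our guess of 64.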

So let’s recap:

  • Each write for a test run is guaranteed to go to the same partition, due to our partition key
  • The number of partitions has increased significantly
  • Some tests are run far more frequently than others

Discovering a Solution

After capturing the throttled requests and sending them to Runscope for inspection, the issue became clear. Our schema design was concentrating writes on some partitions far more than others, producing a badly imbalanced distribution. This is commonly referred to as the “hot partition” problem, and it resulted in us getting throttled. A lot.

Effects of the “hot partition” problem: what partitions in DynamoDB look like after imbalanced writes.

One might say, “That’s easily fixed: just increase the write throughput!” The fact that we can do this quickly is one of the big advantages of using DynamoDB, and it’s something we used liberally to get out of a jam.

The thing to keep in mind here is that any additional throughput is distributed evenly among every partition. We were steadily doing 300 writes/second overall, but we had to provision for 2,000 just to give a few hot partitions an extra 25 writes/second each, and we still saw throttling. This is not a long-term solution, and it quickly becomes very expensive.

A DynamoDB table with 100 read & write capacity units and 4 partitions. As the number of partitions grows, the throughput each receives is diluted.
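The arithmetic behind that dilution is straightforward; here’s a sketch using our rough partition estimate (both figures are illustrative):

```python
# Provisioned throughput is divided evenly across partitions, so the
# ceiling on any one (hot) partition is the total divided by the count.
provisioned_wcu = 2000
partitions = 64  # illustrative, per the earlier estimate
print(provisioned_wcu / partitions)  # ~31 writes/sec per partition
```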

It didn’t take us long to figure out that using the result_id as the partition key was the correct long-term solution. This would afford us truly distributed writes to the table at the expense of a little extra index work.

Balanced writes — a solution to the hot partition problem.
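To sketch where this leads (the table and attribute names here are assumptions; Part 2 covers the actual migration), a write under the new schema might look like this:

```python
import uuid
import boto3

dynamodb = boto3.client("dynamodb")

# With a random result_id as the partition key, writes spread uniformly
# across partitions. Looking up all results for a test then relies on a
# secondary index over test_id (the "extra index work" mentioned above).
dynamodb.put_item(
    TableName="test_results_v2",
    Item={
        "result_id": {"S": str(uuid.uuid4())},  # random, so no hot partition
        "test_id": {"S": "login-and-checkout"},
        "status": {"S": "pass"},
    },
)
```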

Part 2: Correcting Partition Keys

In Part 2 of our journey migrating to DynamoDB, we’ll talk about how we actually changed the partition key (hint: it involves another migration) and our experiences with, and the limitations of, Global Secondary Indexes. If you have any questions about what you’ve read so far, feel free to ask in the comments section below, and I’ll be happy to answer them.
