Systems Design Notes: DynamoDB

Ivan Ramirez
Mar 31, 2023

What can we learn about systems design from a global, well-known, and massively scalable service like DynamoDB? A lot! And those learnings are available to anyone willing to read Amazon’s paper and watch their presentation at USENIX ATC 2022.

The premise of the paper and presentation is to share lessons learned over years of building and operating a service at this scale, with hundreds of thousands of customers and workloads peaking at 89 million requests per second during Amazon’s Prime Day.

It is clear that building a system for that scale didn’t happen overnight, and it is valuable to see how they evolved the architecture as they understood more about their customers’ needs.

Many engineering teams would do well to remember the principle of iterating on a system to make it more reliable, scalable, and easier to maintain. This iterative approach requires rigorous system data collection, customer feedback, empathy, and patience.

The following are my notes from the paper and presentation. I want to focus on three areas and some interesting ideas they implemented (there are plenty more):

1) Handling traffic patterns

DynamoDB is a distributed key-value store. An item is identified by a partition key and an optional sort key, and holds its value (a set of attributes). A table is a collection of items, split into partitions by applying a hash function to the partition key; partitions are stored on storage nodes and replicated across different availability zones. A node holds partitions for multiple tenants.
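
To make the partitioning idea concrete, here is a minimal Python sketch of hash-based partitioning. It is my own illustration (the partition count and the MD5 hash are arbitrary choices, not DynamoDB internals):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; real tables split and grow over time

def partition_for(partition_key: str) -> int:
    """Map an item's partition key to a partition by hashing it."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# An item: identified by a partition key (plus an optional sort key), holding its value.
item = {"partition_key": "customer#42", "sort_key": "order#2023-03-31", "value": {"total": 99.5}}
print(partition_for(item["partition_key"]))  # the partition whose storage nodes hold this item
```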

A request to access an item goes through the Request Router service, which looks up the location of the partition and routes the request to the corresponding storage node. The team went through a few iterations in how this access traffic is handled, starting with statically assigning capacity to the table. In that model, if a customer provisioned 400 read requests per second on a table with four partitions, each partition would get 100 requests per second by default.

The team quickly realised this was wasteful: some partitions might see very little activity while others would be very busy and have to throttle requests. To work around this, most customers would simply over-provision their capacity.
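
To see why the static split is wasteful, here is a toy example with made-up traffic numbers (my own illustration of the 400-requests, four-partition case above):

```python
TABLE_READ_CAPACITY = 400  # read requests per second provisioned on the table
NUM_PARTITIONS = 4
PER_PARTITION_LIMIT = TABLE_READ_CAPACITY // NUM_PARTITIONS  # 100 each, regardless of real traffic

# Skewed traffic: one hot partition, three nearly idle ones.
observed_rps = {0: 350, 1: 5, 2: 10, 3: 5}

for partition, rps in observed_rps.items():
    throttled = max(0, rps - PER_PARTITION_LIMIT)
    print(f"partition {partition}: serving {min(rps, PER_PARTITION_LIMIT)} rps, throttling {throttled} rps")

# Total demand (370 rps) is below the table's 400 rps, yet partition 0 gets throttled.
```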

The second iteration introduced bursting: the ability for a partition to use, for a short period, unused capacity (tracked in 5-minute windows) left over by other partitions on the node. Bursting proved helpful, but its benefits were short-lived.

So they introduced adaptive capacity: based on heuristics and monitoring, capacity is allocated dynamically according to usage patterns. Based on node utilisation, partitions identified as very busy are moved to other nodes to keep performance consistent and predictable. Adaptive capacity was a good approach, but it was reactive.

Adaptive capacity still had shortcomings, so they built a Global Admission Control (GAC) service. GAC meters consumption with tokens, using the token bucket algorithm: each Request Router instance keeps a local bucket, and GAC maintains a global one that is synchronised periodically. When a Request Router’s bucket is empty, it asks GAC for more tokens, which are granted based on usage.
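
Here is a simplified sketch of that idea. The GAC client interface (request_tokens) and the numbers are assumptions of mine, not the real protocol:

```python
class TokenBucket:
    """Local token bucket kept by each Request Router instance."""

    def __init__(self, tokens: float):
        self.tokens = tokens

    def try_consume(self, cost: float = 1.0) -> bool:
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RequestRouter:
    def __init__(self, gac_client, initial_tokens: float = 100.0):
        self.bucket = TokenBucket(initial_tokens)
        self.gac = gac_client  # hypothetical client for the Global Admission Control service

    def admit(self, request_cost: float = 1.0) -> bool:
        if self.bucket.try_consume(request_cost):
            return True
        # Local bucket is empty: ask the global bucket for more tokens.
        granted = self.gac.request_tokens(requested=100.0)  # grants are based on recent usage
        self.bucket.tokens += granted
        return self.bucket.try_consume(request_cost)
```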

On top of this token system, DynamoDB performs a lot of intelligent balancing that takes into account parameters like node size, node capacity, noisy neighbours, and so on. Partitions that consume too much capacity are split into new ones.

2) Availability

DynamoDB is a critical component for the hundreds of thousands of customers that run their businesses on the service. The challenge is keeping it available at the promised levels (99.99% for tables in a single region, 99.999% for global tables).

Every partition has replicas, and together they form a replication group. These replicas are spread across different availability zones. There are two types of replicas (a rough sketch follows the list):

  1. Storage replica comprises a write-ahead log and a B-tree with the key-value data.
  2. Log replica is part of the replica quorum but only keeps the write-ahead log.
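
As a rough mental model of the two replica types (my own sketch; the real storage engine is of course not a Python dict):

```python
from dataclasses import dataclass, field

@dataclass
class StorageReplica:
    """Full replica: a write-ahead log plus the B-tree with the key-value data."""
    write_ahead_log: list = field(default_factory=list)
    btree: dict = field(default_factory=dict)  # stand-in for the on-disk B-tree

    def apply(self, entry: dict) -> None:
        self.write_ahead_log.append(entry)
        self.btree[entry["key"]] = entry["value"]


@dataclass
class LogReplica:
    """Quorum member that keeps only the write-ahead log, no B-tree."""
    write_ahead_log: list = field(default_factory=list)

    def apply(self, entry: dict) -> None:
        self.write_ahead_log.append(entry)
```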

Replication groups use the Multi-Paxos protocol to maintain consensus. Leaders retain their status by periodically renewing a lease. When a leader is unreachable, any replica can propose a new leader election.

False-positive failure detection (caused by “grey” network failures) might trigger an unnecessary (and expensive) leader election. To avoid this, a replica that suspects the leader first “chats” with its peers to check whether they can still reach the leader before starting an election. Failure detection is one of the hardest problems in distributed systems.
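
A minimal sketch of that “check with your peers first” step (my own illustration; peer.can_reach_leader is an assumed interface, not DynamoDB’s):

```python
def should_start_election(peers, lease_expired: bool) -> bool:
    """Before triggering an expensive leader election, confirm that the leader
    also looks unreachable from the other replicas' point of view."""
    if not lease_expired:
        return False
    # "Chat" with peers: ask whether they can still reach the current leader.
    peers_seeing_leader = sum(1 for peer in peers if peer.can_reach_leader())
    # If any peer still sees a healthy leader, this is likely a grey network
    # failure local to us, so we skip the election.
    return peers_seeing_leader == 0
```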

DynamoDB offers two types of reads:

  1. Strongly consistent reads: served only by the partition’s leader.
  2. Eventually consistent reads: can be fulfilled by any of the replica nodes.

For writes, only the leader can receive “write” requests (the routing rule is sketched below).
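
A toy routing rule that captures the read and write rules above (my own naming: request.kind, request.consistency and the replication_group object are assumptions, not the real Request Router API):

```python
import random

def route(request, replication_group):
    leader = replication_group.leader
    replicas = replication_group.replicas  # includes the leader

    if request.kind == "write":
        return leader                   # only the leader accepts writes
    if request.consistency == "strong":
        return leader                   # strongly consistent reads go to the leader
    return random.choice(replicas)      # eventually consistent reads can hit any replica
```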

Given the size of this service, deployments are tricky, and there are multiple safeguards. Deployments start on a small set of nodes. Before a deployment, a suite of tests is run against both the upgrade (new version) and the downgrade (previous version, in case of a rollback).

Once a service is deployed, it is rolled back on purpose to guarantee that the operation works if issues appear. Nodes might run different versions at the same time, and a deployment sometimes changes message schemas, so keeping both backward and forward compatibility is essential.

To solve this, they use read-write deployments, which split a deployment into multiple phases: first a change that allows the current version to read the new message schema, and then a deployment that starts writing the new messages.
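
A sketch of why the “read” phase must ship first, using an invented message format (the real schemas are not public):

```python
def handle_message(msg: dict) -> dict:
    """Phase 1 ("read" deployment): every node learns to *read* both the old
    and the new message schema, but keeps *writing* the old one."""
    if msg.get("schema_version", 1) == 2:
        payload = msg["payload"]   # new field name
    else:
        payload = msg["body"]      # old field name
    return {"schema_version": 1, "body": payload}  # still emit the old schema

# Phase 2 ("write" deployment) only switches the writer to schema_version 2
# once the whole fleet is known to run the phase-1 reader.
```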

Lastly, automatic monitoring after deployments is essential for a service this large.

A service this size needs some external dependencies (such as authentication or fetching secrets), and when those services have issues, DynamoDB’s performance suffers. To keep running when some of these dependencies are impaired, and without jeopardising security, DynamoDB implements a statically stable design. The process looks roughly like this (sketched after the list):

  1. Cache these services’ responses for some time.
  2. Asynchronously fetch the latest version of the data on every request, keeping the cache fresh without adding latency to the client call.
  3. Continue operating on the stale data until it is no longer valid (by which time the impaired service’s issues may have been resolved).
  4. Any request whose data was not cached before the external service became impaired will fail.
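
A minimal sketch of that pattern, assuming a hypothetical fetch_credentials callable for the external dependency (locking and error handling are omitted for brevity):

```python
import threading
import time

CACHE_TTL_SECONDS = 3600  # illustrative lifetime for cached responses

_cache = {}  # key -> (value, fetched_at)

def _refresh(key, fetch_credentials):
    try:
        _cache[key] = (fetch_credentials(key), time.time())
    except Exception:
        pass  # dependency impaired: keep whatever we already have

def get_credentials(key, fetch_credentials):
    cached = _cache.get(key)
    # Refresh asynchronously on every request so the cache stays fresh
    # without adding the dependency's latency to the client call.
    threading.Thread(target=_refresh, args=(key, fetch_credentials), daemon=True).start()
    if cached is not None and time.time() - cached[1] < CACHE_TTL_SECONDS:
        return cached[0]  # keep operating on cached (possibly stale) data
    # Not cached (or no longer valid) while the dependency is impaired: the request fails.
    raise RuntimeError(f"{key} unavailable: no valid cached response")
```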

3) Durability and integrity of the data

The partition leader handles every write operation. When a write arrives, the leader creates a write-ahead log entry and replicates it to its followers. Only once they acknowledge the entry does the leader acknowledge the new record to the client.
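
A condensed sketch of that write path (my own simplification; follower.append_log is an assumed interface):

```python
def handle_write(leader, followers, key, value, quorum_size: int) -> str:
    entry = {"key": key, "value": value}
    leader.write_ahead_log.append(entry)

    # Replicate the log entry and count acknowledgements (the leader counts as one).
    acks = 1
    for follower in followers:
        if follower.append_log(entry):  # True once the entry is durably logged
            acks += 1

    # Acknowledge to the client only once a quorum holds the entry.
    return "OK" if acks >= quorum_size else "NOT ACKNOWLEDGED: quorum not met"
```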

For writes to happen, there must be enough replicas in the partition’s quorum; if not, all write operations for that partition are paused. When a replica in the group goes down, the system spins up a “log replica”: a follower that can join the quorum but keeps only the write-ahead log, not the B-tree with the data. Spinning one up is very fast and helps maintain the quorum.

Silent errors (errors that affect the integrity of the data) are hard to catch. They might originate at the code level or at the transport level, where data gets corrupted. That’s why all messages between nodes and other systems are validated using checksums.
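
For example, a checksum wrapper around every message (CRC32 here is just an illustration; the paper doesn’t specify the exact algorithm):

```python
import json
import zlib

def wrap(message: dict) -> dict:
    payload = json.dumps(message, sort_keys=True).encode("utf-8")
    return {"payload": payload, "checksum": zlib.crc32(payload)}

def unwrap(envelope: dict) -> dict:
    # Reject silently corrupted payloads (bit flips in transit or in buggy code paths).
    if zlib.crc32(envelope["payload"]) != envelope["checksum"]:
        raise ValueError("checksum mismatch: message corrupted")
    return json.loads(envelope["payload"])
```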

Equally, there’s continuous verification of data at rest: scheduled processes check that the data across replicas and live nodes is the same.

Backups and restores are generated from the write-ahead logs and pushed to S3. For point-in-time restores, a combination of partition snapshots and WAL is used.
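
A sketch of point-in-time restore as “snapshot plus replayed WAL” (illustrative data structures, not the real backup format):

```python
def restore_to_point_in_time(snapshot: dict, wal_entries: list, target_time: float) -> dict:
    """Start from the latest snapshot taken before target_time, then replay
    write-ahead log entries up to (and including) target_time."""
    table = dict(snapshot)
    for entry in sorted(wal_entries, key=lambda e: e["timestamp"]):
        if entry["timestamp"] > target_time:
            break
        table[entry["key"]] = entry["value"]
    return table
```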
