Consistent Hashing: Amazon DynamoDB (Part 1)

Aditya Shete
5 min read · Feb 9, 2023


A look at how consistent hashing is used in DynamoDB

I’ll be covering the concepts underlying DynamoDB, Amazon’s distributed hash table, in a series of posts.


Introduction

Amazon’s DynamoDB is a central offering in AWS’ cloud ecosystem. It is a great distributed system for understanding some of the tradeoffs made in large-scale systems. The system was designed to offer certain features required by Amazon’s internal services. The overall requirements can be summarized as follows:

  • A simple key-value store with no relational schema, complex queries, or joins. Each query targets a unique key. Values are stored as BLOBs, typically on the order of 1 MB.
  • High availability for writes. The system should accept writes even in the face of network partitions or node failures. In terms of the ACID properties of an RDBMS, this sacrifices some consistency in return for high availability.
  • Loose hardware requirements, which makes it practical to host the service on commodity hardware in a cloud environment. Latency targets are expressed at the 99.9th percentile.

It is further assumed that the service runs internally in a trusted environment, so certain security concerns can safely be left out of the system.

Now that we have the system’s constraints, we will try to build it up with a modular design so that concerns are separated. One of the first things a store should provide is a flexible way to scale. With a single node, we can have it store all the data; once that server fills up, we will want to add another node to serve requests. Looked at another way, to meet our latency requirements the first thing we would do is add hardware and distribute load across nodes. How do we do this?

In a hash table, we store values using the hashed key as the index, which lets us compute the position of a value deterministically. In a distributed hash table, we can similarly use the hashed key to determine the node where the value should live. This can be achieved with a simple mod operation:

node(key) = hash(key) mod N, where N is the number of nodes in the system

This scheme works until nodes fail and drop out of the system, or the need to scale makes us add more nodes. With a new value of N, the mod starts mapping keys to different nodes, and we have to reshuffle data so that the new mapping holds.
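A minimal sketch of the problem, assuming an MD5-based hash and arbitrary key names, counts how many keys change nodes when one node is added under the mod scheme:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using simple modulo placement."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user-{i}" for i in range(10_000)]

# Compare the mapping with 4 nodes against the mapping with 5 nodes.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change nodes after adding one node")
# Roughly 80% of keys move, even though capacity grew by only 25%.
```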

This is of course a fatal shortcoming that we need to fix.

Consider the following solution. The set of possible hashed key values forms a domain, and this domain is what we want to map dynamically to the servers. If we treat the domain as a ring, with its start and end joined, we can partition the domain and map the partitions to the servers. This would look something like the following:

An example distribution of keys and servers in the hashed key domain.

How do we know what range a server will have?

The idea is simple: after we have mapped the servers into the domain space, we can find a key’s server as follows:

  • Find the key’s hashed value, i.e., the point in the domain where it lies.
  • Move clockwise or anticlockwise (equivalent to incrementing or decrementing the value, wrapping around) until we reach a server’s hash value; the direction just has to be fixed in advance.

In this manner, every possible key is mapped to a server, and we can quickly calculate the server a request must travel to.

Determining the node to which a key is mapped is a simple operation.
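Here is a minimal sketch of that lookup, assuming MD5 for the hash and a sorted list with binary search to find the next server clockwise (names like HashRing are illustrative, not DynamoDB’s actual implementation):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Position on the ring: an integer in [0, 2**128)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Sorted positions of the servers on the ring, plus a lookup
        # from each position back to the server placed there.
        self._positions = sorted(_hash(n) for n in nodes)
        self._node_at = {_hash(n): n for n in nodes}

    def get_node(self, key: str) -> str:
        """Walk clockwise from the key's position to the next server."""
        pos = _hash(key)
        idx = bisect.bisect_right(self._positions, pos)
        if idx == len(self._positions):  # wrap around the ring
            idx = 0
        return self._node_at[self._positions[idx]]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user-42"))  # deterministically one of the three nodes
```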

How does this scheme solve our problem?

Consider the case where we need to add or remove nodes. Key-to-node mappings are still reshuffled, but under consistent hashing only a fraction of the keys need to be remapped: the keys in the ranges adjacent to the node that joined or left. This is because we are partitioning the domain rather than using a mapping that depends on the total number of nodes.
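To make that fraction concrete, here is a rough simulation (same MD5-based hashing and illustrative node names as before) measuring how many keys change owners on the ring when a fifth node joins:

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def owner(key: str, nodes) -> str:
    """Next node clockwise from the key's position on the ring."""
    positions = sorted((h(n), n) for n in nodes)
    idx = bisect.bisect_right([p for p, _ in positions], h(key))
    return positions[idx % len(positions)][1]

keys = [f"item-{i}" for i in range(10_000)]
old_nodes = [f"node-{i}" for i in range(4)]
new_nodes = old_nodes + ["node-4"]

moved = sum(owner(k, old_nodes) != owner(k, new_nodes) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding a node")
# Only the keys on the arc claimed by the new node move (about 1/N on
# average), versus the ~80% we saw with the modulo scheme.
```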

This is an obvious upgrade over our previous scheme.

There are still some issues that consistent hashing faces:

  • Uneven load distribution. Some keys are accessed more frequently or hold more data than average, and randomly placed servers can end up owning ranges of very different sizes. We want some mechanism to spread load more evenly.
  • Heterogeneity in node capacity is not factored in, which is an area for optimization. We want some way to distribute key ranges in proportion to each node's capacity.

The consistent hashing algorithm can be improved to address these concerns. We introduce the concept of virtual nodes: multiple points on the ring that all map back to the same physical node. This elegantly addresses both issues: higher-capacity nodes can be given more virtual nodes, and spreading each physical node across many points on the ring makes it less likely that all high-impact keys land on a small number of nodes.

The algorithm for determining which node holds a key is slightly modified:

  • While moving clockwise, we stop at the nearest virtual node and determine which physical node it corresponds to (see the sketch below).
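A sketch of this virtual-node variant, assuming each physical node is hashed onto the ring at several points in proportion to a capacity weight (the weights, naming scheme, and replica count are illustrative):

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class VirtualNodeRing:
    def __init__(self, capacities, vnodes_per_unit=100):
        # capacities: mapping of physical node -> relative capacity weight.
        # Higher-capacity nodes get proportionally more virtual nodes.
        self._ring = []  # list of (position, physical_node)
        for node, weight in capacities.items():
            for i in range(vnodes_per_unit * weight):
                self._ring.append((h(f"{node}#vnode-{i}"), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def get_node(self, key: str) -> str:
        """Stop at the nearest virtual node clockwise, return its physical node."""
        idx = bisect.bisect_right(self._positions, h(key)) % len(self._ring)
        return self._ring[idx][1]

# node-b has twice the capacity of node-a, so it owns roughly twice the key space.
ring = VirtualNodeRing({"node-a": 1, "node-b": 2, "node-c": 1})
print(ring.get_node("user-42"))
```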

How is the scheme used to replicate data in DynamoDB?

The physical node primarily responsible for a key range is known as the coordinator node for that range. The coordinator is responsible for replicating the data to, say, N other nodes. We can easily extend the previous scheme so that the coordinator replicates the data to the next N virtual nodes on the ring. Further, we can skip virtual nodes that belong to physical nodes already chosen, so that the data ends up on N distinct physical nodes. In effect, each node becomes responsible for storing the N preceding key ranges. We can go further and precompute this replication list, keeping N + M nodes available to the coordinator to account for failures.
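A rough sketch of building such a preference list on top of the virtual-node ring: starting at the key’s position, walk clockwise and collect the next distinct physical nodes, skipping virtual nodes whose physical node is already in the list. The value of N and the node names are illustrative, and DynamoDB’s actual implementation differs in its details.

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(capacities, vnodes_per_unit=100):
    """Return a sorted list of (position, physical_node) virtual nodes."""
    ring = []
    for node, weight in capacities.items():
        for i in range(vnodes_per_unit * weight):
            ring.append((h(f"{node}#vnode-{i}"), node))
    return sorted(ring)

def preference_list(key, ring, n):
    """First n distinct physical nodes clockwise from the key's position.

    The first entry acts as the coordinator; the rest hold replicas.
    """
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, h(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(idx + step) % len(ring)][1]
        if node not in chosen:  # skip vnodes of already-chosen physical nodes
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

ring = build_ring({"node-a": 1, "node-b": 1, "node-c": 1, "node-d": 1})
print(preference_list("user-42", ring, n=3))  # coordinator + 2 replicas
```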

The exact replication method and failure recovery scheme will be covered in future posts. Stay tuned!

References:

  1. Amazon DynamoDB
  2. The pictures are taken from: The Ultimate Guide to Consistent Hashing | Toptal®

