Anti-Entropy and Merkel Trees: Amazon DynamoDB (Part 4)

Aditya Shete
3 min readFeb 13, 2023

--

A look at how DynamoDB handles permanent failures.

(This post is part of a series of blogs looking at concepts underlying DynamoDB. Part 1, Part 2, Part 3)

We have covered most of the operational concepts that make Amazon DynamoDB tick. The next piece to understand the complete system is permanent failures. The previous post showed how the system uses sloppy quorum to function under transient failures. Under long-term failures, the system would start losing data and eventually become unusable for most use cases except maybe as a random value generator.

Anti-entropy in this context is to be understood as the mechanism that makes sure different replicas remain synchronized. Merkel trees are used to implement anti-entropy in DynamoDB.

Merkel Trees

A Merkel tree is defined as follows:

A tree in which every “leaf” (node) is labelled with the cryptographic hash of a data block and every node that is not a leaf (called a branch, inner node, or inode) is labelled with the cryptographic hash of the labels of its child nodes.

For example, this would mean something like this:

How is this data structure used for anti-entropy?

We want a quick way to compare replicas and merge divergent versions. A Merkel Tree makes it easy to compare individual branches independently without sending full replicas across nodes. The comparison algorithm is as follows:

  • A pair of nodes exchange parent node hash values.
  • If the parent hash values are equal, then we have identical Merkel trees.
  • Otherwise, the pairs exchange child hash values until we reach a set of leaf nodes that differ in the replicas.

This process minimizes the amount of data required to be sent for synchronization and the disk reads needed for retrieving data. Each node maintains a different Merkel tree for its different key ranges. This enables key ranges to be independently synchronized.

There is a problem:

Whenever we have a node dropping/scaling, there results in a rearranging of key ranges resulting in nodes recalculating Merkel trees for each such instance. As calculating Merkel trees is non-trivial, we should have some way to minimize this calculation.

The optimization that DynamoDB does towards this drawback is quite simple. If you refer to the consistent hashing covered in Part 1, we said that virtual nodes are assigned to nodes. These virtual nodes define key ranges to be mapped. The reason for recalculation was that upon addition or subtraction of nodes, the virtual nodes resized. The critical observation is the partitioning of keys and the keys’ placement were being done simultaneously. We can remedy our situation if we decouple them.

Here is how:

DynamoDB partitions the hash space into Q equal parts, typically an order or two higher than the system’s nodes or the system’s replication factor. These ranges are distributed randomly and equally to the member nodes of the system. This makes the load of the system evenly distributed as we wanted but also solved the Merkel tree recalculation problem, as we have already chosen to keep key range-bound trees.

--

--