Sloppy Quorum, Hinted Handoff: Amazon DynamoDB (Part 3)

Aditya Shete
4 min read · Feb 12, 2023


A look at the quorum-based system of DynamoDB.

(This post is part of a series of notes on DynamoDB. Part 1, Part 2)

In the first part of this series, I looked at how Amazon’s DynamoDB scales and handles failures with good old redundancy: data is replicated to N nodes on the hash ring. What was left out of that discussion was how the system actually behaves when some of those nodes fail.

Before we jump into the failure mechanism, let’s understand what happens when we execute the get(key) and put(key, value) operations. Say we want to perform some operation on key k; it will be handled by its respective node as determined by consistent hashing. This node is the coordinator and is typically the first in the preference list, which is simply the list of nodes where the key gets replicated according to the hash ring. These nodes are also the ones involved in completing the operation.
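To make this concrete, here is a minimal sketch in Python of how a coordinator and its preference list could be derived from a consistent-hash ring. The class and helper names are illustrative, not DynamoDB’s actual interfaces, and real implementations add details such as virtual nodes that are omitted here.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring; node and key positions come from MD5."""

    def __init__(self, nodes):
        # Sort nodes by their position on the ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def preference_list(self, key, n):
        """Return the N nodes that should hold replicas of `key`.

        The first entry is the coordinator: the first node reached when
        walking clockwise from the key's position on the ring.
        """
        start = bisect_right(self.ring, (self._hash(key), ""))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(n)]

ring = HashRing(["A", "B", "C", "D", "E"])
print(ring.preference_list("k", n=3))   # coordinator first, then the next two replicas
```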

The read and write operations are R and W for short. W_N and R_N are system-level configurations for the minimum number of nodes that must participate in a successful write and read, respectively.

Here is the write (W) algorithm:

  • The coordinator node receives a W request, upon which it stores the value locally and updates the key’s vector clock.
  • The coordinator sends the new version of the pair to its quorum, the remaining N-1 nodes on the preference list.
  • Those nodes respond to the coordinator’s request by storing/updating the pair locally and sending back acknowledgements.
  • The write is considered successful once the coordinator receives W_N-1 acknowledgements, since its own local write counts as the first.

The R operation is similar, except that when the coordinator asks its quorum to return k, it can receive multiple versions from the R_N nodes. In that case, the coordinator either reconciles the versions or returns the divergent versions to the client as context.
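The following sketch shows the shape of both paths, with made-up Replica and Coordinator classes standing in for real nodes: the coordinator writes locally, counts remote acknowledgements against W_N, and gathers at least R_N replies on reads before reconciling or returning the versions.

```python
class Replica:
    """In-memory stand-in for a peer node on the preference list."""

    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}

    def store_replica(self, key, value, version):
        if not self.up:
            return False                 # simulates an unreachable node
        self.data[key] = (value, version)
        return True

    def read_replica(self, key):
        return self.data.get(key) if self.up else None


class Coordinator:
    """Sketch of the quorum write/read paths run by the coordinator node."""

    def __init__(self, replicas, w_n, r_n):
        self.store = {}                  # the coordinator's own local storage
        self.replicas = replicas         # the other N-1 nodes on the preference list
        self.w_n, self.r_n = w_n, r_n

    def put(self, key, value, version):
        # Store locally first; `version` stands in for the updated vector clock.
        self.store[key] = (value, version)
        acks = sum(r.store_replica(key, value, version) for r in self.replicas)
        # The coordinator's own copy counts, so W_N - 1 remote acks suffice.
        return acks >= self.w_n - 1

    def get(self, key):
        replies = [self.store[key]] if key in self.store else []
        for r in self.replicas:
            reply = r.read_replica(key)
            if reply is not None:
                replies.append(reply)
        if len(replies) < self.r_n:
            raise RuntimeError("read quorum not met")
        # Distinct versions would be reconciled here, or handed back to the
        # client as context for it to resolve.
        return set(replies)


node_a = Coordinator([Replica("B"), Replica("C", up=False)], w_n=2, r_n=2)
print(node_a.put("k", "v1", version=1))  # True: the local write plus one remote ack
```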

This is similar to how most quorum-based systems work, where a minimum level of replication makes the system failure resistant. The typical trade-off when configuring the quorum is the choice of W_N and R_N, which form a proper quorum if:

write nodes (W_N) + read nodes (R_N) > N (replication factor)

For example, with N = 5, choosing W_N = 3 and R_N = 3 guarantees that every read quorum overlaps every write quorum in at least one node.

The R and W operations are bounded by the slowest responding node. In reality, DynamoDB operates under:

write nodes (W_N) + read nodes (R_N) < N (replication factor)

This is done to maintain the latency requirements of the system.

All good ...? Nah.

Now the scheme has the following flaw:

A quorum system becomes unavailable under network partitions and node failures, because every operation needs a minimum number of responses from the quorum to succeed.

How are we to keep the system ready for writes?

The critical idea that DynamoDB uses is as follows:

Rather than waiting for the first N nodes from the preference list, it uses the first N healthy nodes to form the quorum. It can happen that these first N healthy nodes differ from the preference list, and replication must now account for this case. This is where hinted handoff comes into play (a short sketch follows the list):

  • Detect the first N healthy nodes intended for the replication.
  • With each replica, include the identity of the node for which it was originally meant, the hint.
  • If a node receives a write intended for a different node, and that node is determined to be down, it stores the replica in a separate local database.
  • When the failed node comes back up, it receives the hinted replicas back from the nodes that held them.
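As a sketch of that node selection, reusing the HashRing class from the earlier snippet (the is_healthy check is a hypothetical stand-in for failure detection), the coordinator walks clockwise past failed nodes and pairs each substitute with the node it is standing in for, which becomes the hint:

```python
def sloppy_quorum(ring, key, n, is_healthy):
    """Pick the first N *healthy* nodes clockwise from the key's position.

    Each substitute (a node outside the ideal first-N set) is paired with the
    failed node whose replica it will hold, i.e. the hint carried with the write.
    """
    intended = ring.preference_list(key, n)                  # the ideal first N nodes
    candidates = ring.preference_list(key, len(ring.ring))   # the whole ring, in order
    healthy = [node for node in candidates if is_healthy(node)][:n]
    failed = [node for node in intended if not is_healthy(node)]
    hints = dict(zip([node for node in healthy if node not in intended], failed))
    return [(node, hints.get(node)) for node in healthy]


# Mark D as failed: if D was among the first N nodes for this key,
# a substitute further round the ring now carries a hint naming it.
ring = HashRing(["A", "B", "C", "D", "E"])
print(sloppy_quorum(ring, "k", 3, is_healthy=lambda node: node != "D"))
```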

Concretely, it looks something like this:

The key K is to be replicated to nodes B through D. Node D is temporarily failing, so the system sends the replica to node E instead; E hands the replica back to D once D is live again. Even the coordinator node can fail, in which case the next N healthy nodes take up the quorum.
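Here is a minimal sketch of the bookkeeping on the substitute node (E in the example): a replica tagged with a hint for another node goes into a separate local store and is handed back once the intended node is reachable again. The names are illustrative rather than DynamoDB’s real interfaces.

```python
class Node:
    """Sketch of a node that can hold hinted replicas for a failed peer."""

    def __init__(self, name):
        self.name = name
        self.primary = {}   # replicas this node actually owns
        self.hinted = {}    # intended node -> {key: (value, version)}, kept separately

    def store_replica(self, key, value, version, intended_node=None):
        # A write tagged with a different intended node is a hinted replica:
        # it goes into the separate store so it can be handed back later.
        if intended_node and intended_node != self.name:
            self.hinted.setdefault(intended_node, {})[key] = (value, version)
        else:
            self.primary[key] = (value, version)

    def hand_off(self, recovered):
        # Once the intended node is reachable again, push its replicas back
        # and drop them from the local hinted store.
        for key, (value, version) in self.hinted.pop(recovered.name, {}).items():
            recovered.store_replica(key, value, version)


# Node D is down, so the replica of K is written to E with a hint for D.
d, e = Node("D"), Node("E")
e.store_replica("K", "v1", version=1, intended_node="D")
e.hand_off(d)        # D is back: E returns the hinted replica
print(d.primary)     # {'K': ('v1', 1)}
```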

In this manner, a sloppy quorum is created, where nodes act as if they were part of the quorum so that operations continue. Thus, even under partial, temporary failures, the system can meet its high-availability agreements. Note that we have not made the system fully available unless we set W_N = 1; instead, we have made a trade-off under which the system is resilient to a class of failures.

Concluding Notes

DynamoDB can replicate data across several data centers if the nodes are situated accordingly. This achieves higher resiliency, as a new class of failures, such as power outages, natural disasters etc., is also taken care of. All of this comes down to carefully constructing the preference list of nodes that form the quorum.

These measures only take care of temporary failures; if the hinted replica itself fails, then there is no way to recover the data. For such cases there are anti-entropy algorithms that DynamoDB uses. I’ll cover these in the next post, stay tuned!
