Internals of DynamoDB

Amazon’s reliable and scalable storage solution

Udit Sharma
7 min read · Jun 16, 2020

DynamoDB is a key-value database solution from Amazon that aims to achieve high availability and scalability using several well-known distributed data system techniques. This article aims to shed some light on how Dynamo uses these techniques to provide such a solution.

When we talk about availability, it becomes important that we also mention consistency. As per the famous CAP theorem, a database system that must tolerate network partitions has to choose between consistency and availability. Dynamo sacrifices consistency under certain failure scenarios and uses object versioning to cope with the resulting inconsistencies.

Let’s go through the various techniques Dynamo has up its sleeve.

Partitioning

To achieve a scalable design, Dynamo needs a mechanism to dynamically partition data over the set of nodes in the system. Dynamo uses consistent hashing to distribute the load across multiple storage hosts.

Each node in the system is assigned a position on a circular hash ring. When a request for a key arrives, a hash is computed for it and mapped onto this ring. The first node encountered moving clockwise from the key’s hash is responsible for storing that key. The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbours; all other nodes remain unaffected.

Fig 1. Consistent Hashing with virtual nodes

But this basic form of consistent hashing can cause non-uniform data and load distribution, since a majority of the hashed keys may end up mapped to a single node. To solve this, Dynamo places virtual nodes on the hash ring: one physical node is mapped to multiple ring positions, for example by using multiple hash functions. This yields a denser ring with data spread more evenly across the nodes. As shown in Fig 1, positions 2, 7 and 10 act as virtual nodes for node B, while 4 and 9 act as virtual nodes for node A. Requests for keys whose hash values are 11, 0 or 1 are handled by position 2 (node B). This node is called the coordinator node and is responsible for handling requests and replicating data for the key.
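
To make this concrete, here is a minimal Python sketch of a consistent hash ring with virtual nodes. The MD5 hash, the node names and the number of virtual nodes per physical node are illustrative choices for the sketch, not Dynamo’s actual token assignment.

```python
import hashlib
from bisect import bisect_left

def _hash(key: str) -> int:
    """Map a string to a position on the ring (here: the 128-bit MD5 space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 3):
        self.vnodes_per_node = vnodes_per_node
        self.ring = {}        # ring position -> physical node name
        self.positions = []   # sorted ring positions

    def add_node(self, node: str) -> None:
        # Each physical node claims several positions ("virtual nodes") on the ring.
        for i in range(self.vnodes_per_node):
            pos = _hash(f"{node}#vnode{i}")
            self.ring[pos] = node
            self.positions.append(pos)
        self.positions.sort()

    def coordinator(self, key: str) -> str:
        # The first virtual node clockwise from the key's hash owns the key.
        pos = _hash(key)
        idx = bisect_left(self.positions, pos) % len(self.positions)
        return self.ring[self.positions[idx]]

ring = ConsistentHashRing()
for node in ("A", "B", "C"):
    ring.add_node(node)
print(ring.coordinator("user:42"))   # the coordinator node for this key
```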

Replication

To achieve high availability and durability, Dynamo replicates each data item across multiple nodes. Each item is stored on N distinct physical nodes (virtual node positions that map to an already chosen physical node are skipped), and each key is assigned to a coordinator node as described above. The coordinator handles reads and writes for the key and ensures the data is replicated to the remaining N-1 nodes. Together, these N nodes make up the key’s preference list.
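
Building on the ring sketch above (and reusing its hypothetical `_hash` and `ConsistentHashRing`), a preference list can be sketched by walking the ring clockwise and skipping positions that belong to a physical node we have already chosen:

```python
from bisect import bisect_left

def preference_list(ring: ConsistentHashRing, key: str, n: int = 3) -> list[str]:
    """Walk the ring clockwise from the key's hash and collect N distinct
    physical nodes; virtual nodes of an already-chosen node are skipped."""
    pos = _hash(key)
    idx = bisect_left(ring.positions, pos)
    nodes = []
    for step in range(len(ring.positions)):
        candidate = ring.ring[ring.positions[(idx + step) % len(ring.positions)]]
        if candidate not in nodes:
            nodes.append(candidate)
        if len(nodes) == n:
            break
    return nodes   # nodes[0] is the coordinator, the rest hold replicas

print(preference_list(ring, "user:42", n=3))   # e.g. ['B', 'C', 'A']
```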

Data versioning

Because Dynamo is a distributed system, multiple versions of an object can exist at the same instant. It therefore becomes critical to have a conflict resolution strategy that maintains data integrity in the system.

How can there be multiple versions of an object at the same time?

Let’s say client A writes x=a to node 1. Node 2 and node 3 are also part of the preference list, so they hold replicas of the key. At some point node 1 goes offline and node 3 takes over as the coordinator node. After this, x=b is written to both node 2 and node 3, leaving two conflicting values of x in the system.

Fig 2. Multiple versions of x in the system

Dynamo uses vector clocks to track object versions across replicas. A vector clock is effectively a list of (node, counter) pairs, and one vector clock is associated with every version of every object. Let’s walk through an example to understand this better [Fig 3].

Fig 3. Vector clock illustration

In the example above, node A is the coordinator node for object x. When node A receives the first write, it sets the vector clock to [Sa,1]. After a subsequent write, the vector clock becomes [Sa,2]. Before the third write, node A becomes unresponsive, so the write is handled by node B. Node B already knows about the previous version, so after the write it sets the vector clock to ([Sa,2],[Sb,1]). Because of a partition between nodes B and C, node C never receives the D3 update. When node C handles the next write, it sets the vector clock to ([Sa,2],[Sc,1]). When the client makes a read request, node A sees two versions of x, D3 and D4. It reconciles their vector clocks to ([Sa,2],[Sb,1],[Sc,1]) and returns a response containing both D3 and D4. The client resolves the conflict and writes the object back as D5, and the vector clock becomes ([Sa,3],[Sb,1],[Sc,1]). This reconciled version is then written to the nodes that do not yet have it.

As we can see, vector clocks capture the causal relationship between events in the system and help resolve conflicts. When a client reads an object, the response may contain multiple versions of it. If Dynamo can establish a causal relationship between them, it reconciles the versions automatically and keeps the latest one; otherwise, the client may have to reconcile them manually.
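
A toy vector clock, sketched below, makes the walkthrough above easy to replay; the class and method names are illustrative, not Dynamo’s actual implementation.

```python
class VectorClock:
    """A list of (node, counter) pairs, stored as a dict for convenience."""
    def __init__(self, counters=None):
        self.counters = dict(counters or {})

    def increment(self, node: str) -> "VectorClock":
        # The coordinator bumps its own counter on every write it handles.
        updated = dict(self.counters)
        updated[node] = updated.get(node, 0) + 1
        return VectorClock(updated)

    def descends_from(self, other: "VectorClock") -> bool:
        # True if self is causally newer than (or equal to) other.
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())

    def merge(self, other: "VectorClock") -> "VectorClock":
        # Reconciliation keeps the element-wise maximum of the two clocks.
        nodes = set(self.counters) | set(other.counters)
        return VectorClock({n: max(self.counters.get(n, 0),
                                   other.counters.get(n, 0)) for n in nodes})

# Replaying the walkthrough above:
d2 = VectorClock({"Sa": 2})                 # [Sa,2]
d3 = d2.increment("Sb")                     # ([Sa,2],[Sb,1])
d4 = d2.increment("Sc")                     # ([Sa,2],[Sc,1])
print(d3.descends_from(d4), d4.descends_from(d3))  # False False -> conflict, both returned
d5 = d3.merge(d4).increment("Sa")           # ([Sa,3],[Sb,1],[Sc,1]) after the client's write
print(d5.counters)
```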

Quorum Technique

When does Dynamo consider a read or write to be successful? For this we need to understand what a quorum is: the minimum number of members that must agree for an action to be marked successful. More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we read from must be up to date.
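
The condition itself is simple to express; the (3, 2, 2) configuration below is the common setting mentioned in the Dynamo paper.

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """The read set and write set must overlap in at least one node: w + r > n."""
    return w + r > n

print(quorum_ok(3, 2, 2))   # True  -> every read overlaps the latest successful write
print(quorum_ok(3, 1, 1))   # False -> a read may miss the most recent write
```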

Failure Handling

To achieve higher availability, Dynamo uses a sloppy quorum instead of the traditional strict quorum. Reads and writes still require r and w successful responses, but those responses may come from nodes that are not in the key’s preference list.

If a node fails, the replica that should live on it is temporarily persisted to another available node, along with a hint identifying the node for which the data was intended. Once the network partition heals or the replica node comes back online, the writes held by the stand-in node are handed back to it. This strategy is known as hinted handoff: a node acts as a temporary host to keep writes available.

Fig 4. Hinted handoff

In Fig 4, key k has to be replicated to nodes A and B. Node B becomes unhealthy, so its copy of the data is written to node C instead. When node B becomes healthy again, the data is handed back to it.
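
A small in-memory simulation, with hypothetical node names matching Fig 4, illustrates the idea:

```python
# Hypothetical in-memory simulation of hinted handoff.
stores = {n: {} for n in ("A", "B", "C")}   # per-node key/value store
hints  = {n: [] for n in stores}            # writes held on behalf of a down node

def write(key, value, preference_list, down=frozenset(), fallbacks=("C",)):
    for node in preference_list:
        if node not in down:
            stores[node][key] = value
        else:
            # The write meant for the failed node is parked on a stand-in,
            # tagged with a hint naming the intended recipient.
            stand_in = next(n for n in fallbacks if n not in down)
            stores[stand_in][key] = value
            hints[stand_in].append((node, key, value))

def handoff(stand_in):
    # Once the failed node recovers, the stand-in delivers its hinted writes.
    for intended, key, value in hints[stand_in]:
        stores[intended][key] = value
    hints[stand_in].clear()

write("k", "v", preference_list=["A", "B"], down={"B"})  # B is down, so C holds its copy
handoff("C")                                             # B recovers; C hands the data back
print(stores["B"])                                       # {'k': 'v'}
```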

Replica Synchronisation

Hinted handoff only works well if node failures are rare and transient, so that huge amounts of data are not shuffled across nodes. It is also possible that the hinted node goes offline before transferring the data back to the original replica node. To deal with this scenario and other, greater threats to durability, Dynamo uses an anti-entropy protocol to keep replicas synchronised.

To detect inconsistencies between replicas while transferring as little data as possible, Dynamo compares them using Merkle trees. A Merkle tree is a hash tree whose leaf nodes contain the hashes of individual keys; parent nodes higher in the tree are hashes of their respective children. If the root hashes of two trees are the same, the trees are equal. In Dynamo’s case, each node builds a Merkle tree over its copy of the data and compares it with the tree of another replica: if the root hashes match, the replicas are in sync; if they differ, the replicas are out of sync, and the corresponding child nodes are compared to find the point of discrepancy.

Each Dynamo node maintains a separate Merkle tree for every key range it hosts, which lets it check whether a particular key range is in sync. If a discrepancy is found, the trees are traversed child by child to locate its source, as shown in Fig 5.

Fig 5. Merkle trees for Node A and B
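
The sketch below builds a Merkle tree over a replica’s key range and descends only into subtrees whose hashes differ. It is a simplification (it assumes a power-of-two number of keys and SHA-256 hashing), not Dynamo’s implementation.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaves: list[bytes]) -> list[list[bytes]]:
    """Bottom-up Merkle tree; returns hash levels from the leaves up to the root.
    For brevity this assumes the number of leaves is a power of two."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(tree_a, tree_b, level=None, idx=0) -> list[int]:
    """Descend from the root, following only subtrees whose hashes differ,
    and return the indices of leaves that are out of sync between replicas."""
    if level is None:
        level = len(tree_a) - 1                 # start at the root
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                               # subtree in sync: prune it
    if level == 0:
        return [idx]                            # a differing key
    return (diff_leaves(tree_a, tree_b, level - 1, 2 * idx) +
            diff_leaves(tree_a, tree_b, level - 1, 2 * idx + 1))

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=stale", b"k3=v3", b"k4=v4"]
print(diff_leaves(build_merkle(replica_a), build_merkle(replica_b)))  # [1] -> only k2 differs
```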

Summary

Amazon DynamoDB is designed to deliver predictably high performance and to be cost efficient for workloads of any scale. The techniques above are used extensively in many distributed systems to build highly scalable and robust solutions. Cassandra and Riak use anti-entropy replica synchronisation, and Riak also uses vector clocks for conflict resolution. Consistent hashing is one of the most widely used algorithms, with applications in load balancing, data partitioning, routing, and more. Quorum techniques are used extensively in ZooKeeper, Kafka, Cassandra, and others.

References

[1]: DynamoDB Research Paper https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

[2]: Designing data intensive applications by Martin Kleppmann. A must read for all :P

[3]: Mastering DynamoDB by Tanmay Deshpande
