Dynamo: Design of a Key-Value Store
Today, I am writing about how Amazon designed a distributed key-value store, which is highly reliable, highly scalable, and decentralized.
Dynamo, which is different from DynamoDB, sacrifices consistency in some failure scenarios to achieve an always-on, available system. Why do we have to sacrifice consistency? The CAP theorem. The main design goal of Dynamo was “customer satisfaction”: the service should always be available. There will be some consistency issues, but they are mostly resolved in the background without the customer ever being aware of them, so the experience stays smooth.
Let’s summarize the design goals:
- Scalable: The system should be highly scalable. We should be able to add or remove machines as the demand on the service increases or decreases.
- Decentralized: To avoid a single point of failure, there should be no leader of any kind. If a node fails, its responsibilities should be taken over by the other nodes.
- Eventually Consistent: We accept a few consistency issues while writing into the key-value store and resolve them later. This reduces the write-time cost and makes the system more available.
What operations does Dynamo support?
- get(key): Retrieves the value stored against the key. In the background, it determines which server has the key and retrieves the value from that server. Since the system is not strongly consistent, there may be conflicts at any point, so instead of returning a single value, get can return a list of values (the conflicting versions stored in the system) along with a context object.
- put(key, value, context): Stores the value against the key. In the background, it determines which server the key has to be stored on and writes the value there. The context object is the same object returned from the get operation; much like a cookie in an HTTP request, it is used to verify the validity of the object sent in the put request.
In both operations, one common step is to use the key to identify the storage nodes; Dynamo achieves this using the MD5 hashing algorithm.
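To make that key-to-node step concrete, here is a minimal sketch of how a key might be mapped to a position on the hash space using MD5. The helper name is hypothetical and only illustrates the idea described above, not Dynamo's actual code.

```python
import hashlib

def key_to_position(key: str) -> int:
    """Hash a key with MD5 and interpret the 128-bit digest as a position
    on the hash space (hypothetical helper, for illustration only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)  # value in [0, 2**128)

# The client-facing interface, roughly as described above:
#   get(key)                 -> (context, [value, ...])  possibly several conflicting versions
#   put(key, value, context) -> None                     context comes from a prior get()
print(key_to_position("cart:alice"))
```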
As in the previous articles, we will follow the same approach here: first discuss the high-level design (HLD) and then deep dive into the individual components.
High-Level Design
At a high level, Dynamo is a Distributed Hash Table (DHT), and for high availability and fault tolerance the data is replicated and managed across a cluster of nodes.
- Data Distribution: Dynamo uses Consistent Hashing to distribute the data among the nodes. We will see what advantages Consistent Hashing gives us in future sections.
- Data Replication and Consistency: There are multiple strategies for replication and consistency; Dynamo replicates optimistically (what optimistic replication means, we will see in later sections).
- Fault Tolerance: We need a strategy for how many nodes to write to so that a write is fault tolerant. Dynamo uses a sloppy quorum for this, instead of a strict majority quorum. We will deep dive into these quorums in the next sections.
- Detection of Failure: Whenever a failure occurs, we need to detect it. Dynamo uses a gossip protocol for nodes to communicate with each other and detect whether a node is live or not.
- High Availability: As discussed earlier, we need a system that is highly available. One of the core ways Dynamo achieves this is by using hinted handoff.
- Conflict resolution: As discussed earlier, our system is not strongly consistent (consistency is traded off for high availability). Dynamo uses vector clocks and versions each write to handle any consistency issues.
If we want to check whether a node's complete data is in sync with another node, traditional mechanisms like comparing every byte will not work. We need a mechanism for quick detection, and Dynamo uses Merkle trees for this.
Let's now deep dive into each of the above-mentioned points.
Data Distribution
Since we have a huge amount of data, we obviously cannot store all of it on one node; it is neither practical nor fault-tolerant. So data distribution becomes a necessary requirement.
Two main questions arise:
- How do we know on which node what piece of data will be stored? How are we distributing the keys among all the available nodes?
- What happens when we add/remove nodes? What data has to be moved from existing nodes to new nodes? How to minimize the data movement, and what model of data storage will be optimal for us?
One naive approach is to just use a hash function and determine the node from the key's hash. But what if a node goes down, or we add a new node? We would have to alter the hash function and remap keys again and again.
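A quick way to see the problem: with naive `hash(key) % num_nodes` placement, changing the node count remaps most keys. This is only a toy illustration of the issue, not Dynamo's code.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Naive placement: hash the key and take it modulo the node count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"key-{i}" for i in range(10_000)]
before = [node_for(k, 4) for k in keys]  # 4 nodes in the cluster
after = [node_for(k, 5) for k in keys]   # one node added
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(keys):.0%} of keys change owner")  # roughly 80%
```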
Dynamo uses Consistent hashing to solve these problems. This also ensures that when a new node is added/removed only a small set of keys has to migrate from one node to another.
Since Consistent hashing is a very popular mechanism I will explain it briefly here just for the sake of completeness. If you want a deep dive into consistent hashing there are multiple articles posted on the internet.
Consistent Hashing
Consistent hashing assumes that all our data nodes are placed on a ring. Each node in the ring is responsible for a range of data as shown in the diagram below.
The only change from the figure above is that our key is hashed using the MD5 hashing algorithm. The hash output has a range, and that range is divided into sub-ranges. Each node takes care of one or more sub-ranges; if a key falls into a particular sub-range, we can determine which node the key belongs to based on the sub-range that node manages.
How does the above scheme help us when we are scaling in or out? When a node is removed, the next node becomes responsible for all of the keys stored on the outgoing node. However, this scheme does result in the creation of hotspots: some nodes may get extra load compared to others. The hotspot issue can be solved using tokens.
What are tokens, and how do they help solve the scalability issue in data distribution? With just one minor modification: instead of mapping one sub-range to a physical node, we divide the sub-range into even smaller sub-ranges called tokens.
Now each physical node is responsible for multiple tokens, instead of only one big sub-range of the ring. These tokens are randomly mapped to physical nodes, ensuring that no two adjacent tokens are assigned to the same physical node. This distributes the data evenly amongst the nodes and solves the hotspot issue.
This scheme not only reduces the probability of hotspots (see the sketch below), but also makes recovery easier when a node fails and rejoins: because its data is spread across many other nodes, the rebuilding load does not fall heavily on a single server.
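Here is a minimal sketch of a ring with tokens (virtual nodes). The class name and token count are made up for illustration: it places several tokens per physical node on the ring and looks up a key by finding the next token clockwise. In the real system, token placement is coordinated so adjacent tokens land on different physical nodes; the random placement here is a simplification.

```python
import bisect
import hashlib

def ring_position(value: str) -> int:
    """Map a string to a position on the 128-bit hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy ring: each physical node owns several tokens, which spreads
    the key space more evenly and reduces hotspots."""

    def __init__(self, nodes, tokens_per_node=8):
        self._ring = []  # sorted list of (position, physical_node)
        for node in nodes:
            for t in range(tokens_per_node):
                self._ring.append((ring_position(f"{node}#token{t}"), node))
        self._ring.sort()

    def node_for_key(self, key: str) -> str:
        """Walk clockwise from the key's position to the next token."""
        pos = ring_position(key)
        idx = bisect.bisect_right(self._ring, (pos,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-1", "server-2", "server-3"])
print(ring.node_for_key("cart:alice"))
```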
Replication
We obviously cannot have a single copy of the data for a durable and highly available system. To handle temporary failures Dynamo uses Replication. Multiple copies of the same data are replicated across “N” nodes where “N” is the replication factor.
Each write for a key is accepted by a coordinator node (the node is determined by hashing the key, as discussed before), saved locally, and then synced to “N-1” follower nodes. The syncing to the follower nodes is asynchronous, which is why the system is called eventually consistent. This replication technique is called optimistic replication, meaning the replicas are not guaranteed to be in sync at all times. Below is a figure for reference.
Each node in Dynamo serves as a replica for a different range of data. As Dynamo stores N copies of data spread across different nodes, if one node is down, other replicas can respond to queries for that range of data. If a client cannot contact the coordinator node, it sends the request to a node holding a replica.
Preference List: The list of nodes responsible for storing a key is called the preference list. Dynamo is designed in such a way that every node in the system can determine the preference list for any key. This list holds more than “N” nodes to account for node failures.
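Building on the toy ring sketched above, a preference list could be derived by walking clockwise from the key's position and collecting the first N distinct physical nodes, skipping extra tokens of nodes already chosen. This is an illustrative sketch, not Dynamo's actual placement code.

```python
def preference_list(ring: ConsistentHashRing, key: str, n: int = 3):
    """Collect the first `n` distinct physical nodes clockwise from the key."""
    pos = ring_position(key)
    start = bisect.bisect_right(ring._ring, (pos,))
    nodes = []
    for i in range(len(ring._ring)):
        _, node = ring._ring[(start + i) % len(ring._ring)]
        if node not in nodes:  # skip tokens of nodes we already picked
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(ring, "cart:alice"))  # e.g. ['server-2', 'server-1', 'server-3']
```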
Sloppy Quorum and Hinted Handoff
We have to account for temporary node failures; we cannot assume that every node is reachable at every point in time, so we need a mechanism to handle such failures. Dynamo handles this using a sloppy quorum.
This approach is best understood using an example. Consider the diagram given below.
Let’s take an example with replication factor N=3. If Server 1 is down, its writes will be accepted by Server 4. How did we determine Server 4? Dynamo forwards the requests intended for Server 1 to the next available node on the ring that does not already hold a replica for the range between Server 6 and Server 1.
The replica sent to Server 4 carries a hint in its metadata that records the intended recipient of the replica (in this case Server 1). This data is stored in a separate local database on Server 4. Once Server 1 is detected as recovered, Server 4 sends all the data back to Server 1 and, upon successful transfer, deletes it from its local store. This technique, where another node accepts the write on behalf of a node that is down, is called hinted handoff.
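A sketch of the separate local store the stand-in node (Server 4 in the example) might keep. The class and the `send` callback are hypothetical; the point is that each hinted write remembers its intended recipient so it can be replayed on recovery.

```python
import time

class HintedHandoffStore:
    """Keeps writes accepted on behalf of a node that is currently down."""

    def __init__(self):
        self._hints = []  # (intended_node, key, value, received_at)

    def store_hint(self, intended_node, key, value):
        """Record the write along with the node it was really meant for."""
        self._hints.append((intended_node, key, value, time.time()))

    def replay_hints_for(self, recovered_node, send):
        """Once `recovered_node` is detected as alive again, hand its data
        back; drop every hint that was delivered successfully."""
        remaining = []
        for node, key, value, received_at in self._hints:
            if node == recovered_node and send(node, key, value):
                continue  # delivered back to the intended node, drop the hint
            remaining.append((node, key, value, received_at))
        self._hints = remaining
```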
This makes Dynamo “always writeable”: every write request is accepted by the system and eventually made consistent across all replicas. Because the quorum is formed from whichever healthy nodes are available rather than a fixed set, this is a sloppy quorum and not a strict quorum.
Since we do not ensure the data is written successfully to a fixed majority of nodes, the data can diverge and multiple versions of the same data may exist in the system. To resolve conflicting versions, Dynamo uses vector clocks.
Vector Clocks and Conflicting Data
What are vector clocks? Dynamo uses vector clocks to capture different versions of the same object. A vector clock is essentially a list of (node, counter) pairs associated with every data object in Dynamo. By comparing the vector clocks of two versions of a piece of data, one can determine whether they are in sync or whether there is a conflict that needs to be resolved.
One can imagine them as analogous to how Git manages files and creates branches; vector clocks do something similar for each data object. One important point to note is that conflicts are resolved at read time instead of write time. Let’s consider the example below:
- Server A, gets a write request (k1, v1) and creates a vector clock for this object as [A:1]. This write gets replicated to Server B.
- Another write request arrives at A, and now it is (k1, v2) and the vector clock is created as [A:2]. This write is also replicated to Server B.
- Now let's say a network partition occurs and Server A and Server B cannot talk to each other.
- Server A now receives a write request for the same key; the value is updated to (k1, v3) and the vector clock [A:3] is associated with the data object. The data cannot be replicated to Server B, but it does get replicated to another server as part of hinted handoff.
- Now Server B sees a write to key k1 with value v4. Since it had already seen [A:2] before the partition, it creates the vector clock [A:2][B:1] for this write.
- Now, the network heals, and A and B can talk to each other.
- Either server gets a read request for the key k1. It sees the same key with two different versions, [A:3] and [A:2][B:1], but it does not know which one is newer.
- It returns both and tells the client to figure out the correct version and write it back into the system.
In such cases, we cannot decide on our own which write should win; an external party has to resolve the conflict. A real-life example would be a client ending up with two copies of a shopping cart on an e-commerce website; the client later has to decide which cart items should be persisted.
One clear problem with this approach is that if the system runs long enough, the vector clock can grow very large (e.g. [A:2][B:1][C:3]…..). Dynamo solves this by truncating the oldest entries of the vector clock. Although we lose some branching information, the Dynamo paper claims this has not caused any problems in practice.
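A small sketch of how two vector clocks could be compared, using plain dicts such as {"A": 3}. The helper names are made up; the logic is the standard "descends or concurrent" check, applied here to the example above.

```python
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` has seen everything in clock `b`
    (every counter in `b` is less than or equal to the one in `a`)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent versions, the client must reconcile

# The example above: [A:3] vs [A:2][B:1]
print(compare({"A": 3}, {"A": 2, "B": 1}))  # -> "conflict"
print(compare({"A": 2}, {"A": 1}))          # -> "a is newer"
```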
Alternative to Vector Clocks: Last-Write-Wins
Instead of vector clocks, Dynamo also offers a way to resolve conflicts automatically on the server side. Dynamo (and Apache Cassandra) often uses a simple conflict resolution policy: last-write-wins (LWW), based on the wall-clock timestamp. If there are two writes at the exact same time, one is chosen at random, or according to the preference list.
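A last-write-wins resolver is trivial to sketch; assume each version carries a wall-clock timestamp. Note that with skewed clocks LWW can silently drop one of two concurrent writes, which is exactly the trade-off being made.

```python
def last_write_wins(versions):
    """Pick the value with the highest wall-clock timestamp.
    `versions` is a list of (timestamp, value) pairs; ties and clock skew
    are simply ignored, which is the LWW trade-off."""
    return max(versions, key=lambda v: v[0])[1]

print(last_write_wins([(1700000000.1, "cart-v1"), (1700000003.7, "cart-v2")]))  # cart-v2
```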
Lifecycle of get() and put() operations in Dynamo
Before getting into the lifecycle of the operations, we need to clarify the following two points:
- Which coordinator node should the request be sent to? There are two ways. One is to have a generic load balancer spread the requests evenly across all the nodes; the problem is that the load balancer may not be aware of the preference list (which key belongs to which node), so an extra hop may be needed to reach the correct node. The other approach is for the client to send the request directly to a node from the preference list. This requires no extra hop, which is why Dynamo is also called a “Zero-Hop DHT”. The obvious downsides are that the client has to maintain its own copy of the preference list and Dynamo loses control over the even distribution of requests amongst the nodes.
- What consistency protocol does Dynamo use? Assume R and W are the minimum numbers of nodes that must participate in a successful read and write operation respectively; Dynamo then follows the rule
R + W > N
The common (N, R, W) configuration is (3, 2, 2). If we want faster writes we can tune the config to (3, 3, 1), but the system becomes less durable. If we want fast reads we can use (3, 1, 3), but writes will be slow.
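The rule is easy to sanity-check in code: R + W > N guarantees that every read quorum overlaps every write quorum, so at least one replica in any read has seen the latest successful write.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N means any read set of size R intersects any write set of size W."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 3, 1), (3, 1, 3)]:
    print((n, r, w), "overlapping quorums:", quorums_overlap(n, r, w))
```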
put() request steps
- The coordinator generates a new data version and the corresponding vector clock entry.
- The data is saved locally.
- The write request is sent to the N-1 highest-ranked nodes from the preference list.
- The coordinator waits for W-1 acknowledgements from those nodes to call the write successful, as sketched below.
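A minimal sketch of these steps, assuming hypothetical helpers `advance_vector_clock` and `send_write`; a real coordinator would send the writes in parallel rather than in a loop.

```python
def coordinate_put(key, value, context, replicas, w):
    """Write locally, forward to the other N-1 replicas, and report success
    once W acknowledgements are reached (the local write counts as the first)."""
    clock = advance_vector_clock(context)        # hypothetical: bump own counter
    acks = 1                                     # the coordinator's local write
    for node in replicas[1:]:                    # N-1 highest-ranked followers
        if send_write(node, key, value, clock):  # hypothetical replica RPC
            acks += 1
        if acks >= w:
            return True                          # quorum reached
    return acks >= w
```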
get() request steps
- The coordinator receives a read request for a key k.
- It requests the key “k” from the N-1 highest-ranked nodes in the preference list and waits for only R-1 replies.
- The coordinator uses vector clocks to work out the causal relationships between the returned versions.
- It returns all relevant data versions to the caller, as sketched below.
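A matching sketch for the read path, reusing the `descends` comparison from the vector clock sketch; `read_local` and `send_read` are hypothetical, and each reply is assumed to carry a `.clock` attribute.

```python
def coordinate_get(key, replicas, r):
    """Read locally, ask the other replicas, stop after R replies,
    then keep only the versions that are not dominated by another."""
    responses = [read_local(key)]          # hypothetical local read
    for node in replicas[1:]:
        reply = send_read(node, key)       # hypothetical replica RPC
        if reply is not None:
            responses.append(reply)
        if len(responses) >= r:
            break
    latest = []
    for v in responses:
        dominated = any(o is not v and descends(o.clock, v.clock)
                        and not descends(v.clock, o.clock) for o in responses)
        if not dominated:
            latest.append(v)               # keep causally latest / concurrent versions
    return latest
```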
All the above requests are handled via a state machine. Apart from the above steps, there are a few more steps that the state machine handles.
After the read response is returned to the caller, the state machine waits for a short period to receive any remaining replies. If a stale version was returned in any of the responses, the coordinator updates those nodes with the latest version. This approach is called “read repair” because it fixes stale versions held by the replicas.
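Read repair itself is a small step on top of the read path above: once the winning version is known, push it to every replica that answered with an older one (again using the hypothetical RPC helpers).

```python
def read_repair(key, winning_version, replies):
    """`replies` is a list of (node, version) pairs collected during the read."""
    for node, version in replies:
        if version.clock != winning_version.clock and \
                descends(winning_version.clock, version.clock):
            # the replica returned a stale ancestor, overwrite it
            send_write(node, key, winning_version.value, winning_version.clock)
```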
Any node can coordinate a put() request, which evens out the load among the nodes in the preference list. A small optimization helps Dynamo improve its consistency: a write usually follows a read, and the coordinator for the write is chosen as the node that replied fastest to the preceding read. That node is most likely to hold the data the client just read, increasing the chances of read-your-writes consistency.
Merkle trees
Resolving conflicts using vector clocks may not be the ideal approach when a node has fallen significantly behind. In such cases, it is better to detect and resolve the divergence automatically in the background, in a much faster manner.
Here Merkle trees come in: Dynamo uses Merkle trees to check whether replicas are in sync. A Merkle tree is a binary tree of hashes in which every internal node is the hash of its two children. If the roots of two replicas’ Merkle trees do not match, we can say for sure that the replicas are out of sync. We can then recursively compare the children to find exactly which ranges of data are out of sync and need to be transferred. One can read more about Merkle trees here.
This approach is better suited for cases where a node has been down for a long time and significant data has to be transferred; for a few keys and short-lived failures, resolving conflicts via vector clocks gets our job done. A small sketch of the comparison follows below.
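A toy illustration of the comparison: build a binary hash tree over per-range hashes for two replicas, compare the roots, and walk down only where they differ. Real implementations hash key ranges incrementally; this sketch just shows the structure.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def build_merkle(leaf_hashes):
    """Return the tree as a list of levels, from the leaves up to the root."""
    levels = [leaf_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[min(i + 1, len(prev) - 1)])
                       for i in range(0, len(prev), 2)])
    return levels

replica_a = build_merkle([h(b"range-0"), h(b"range-1"), h(b"range-2"), h(b"range-3")])
replica_b = build_merkle([h(b"range-0"), h(b"range-1"), h(b"CHANGED"), h(b"range-3")])

print("roots match:", replica_a[-1] == replica_b[-1])  # False -> replicas out of sync
# Comparing children level by level narrows the mismatch down to range 2,
# so only that key range has to be exchanged between the replicas.
```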
Gossip Protocol
Finally, we arrive at the last piece that completes the whole design: the gossip protocol. We need to keep track of all the nodes in the system and their status, and having a central coordinator do this job will not work, since one of our primary requirements is a decentralized system with no single point of failure.
The next logical idea is to have every node send heartbeats to every other node, but this causes too many messages: O(n²), where n is the number of nodes.
The gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and about the other nodes they know of. In each gossip round, a node sends its own state, along with the state information it holds about every other node, to one randomly chosen node. This way any new event eventually propagates through the system and every node in the cluster gets notified.
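A toy version of such a round, with made-up class and field names: every node keeps a heartbeat counter per peer and, on each tick, pushes its whole view to one random peer, which keeps whichever counters are newer. A node whose counter stops increasing for long enough would eventually be suspected as failed.

```python
import random

class GossipNode:
    """Minimal push-based gossip over per-node heartbeat counters."""

    def __init__(self, name):
        self.name = name
        self.peers = []                      # other GossipNode objects
        self.view = {name: 0}                # node name -> highest heartbeat seen

    def tick(self):
        self.view[self.name] += 1            # bump own heartbeat
        if self.peers:
            random.choice(self.peers).merge(self.view)  # gossip to one random peer

    def merge(self, remote_view):
        for node, heartbeat in remote_view.items():
            if heartbeat > self.view.get(node, -1):
                self.view[node] = heartbeat  # newer information wins

a, b, c = GossipNode("A"), GossipNode("B"), GossipNode("C")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
for _ in range(10):
    for node in (a, b, c):
        node.tick()
print(a.view)  # after a few rounds, A knows heartbeats for A, B and C
```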
The diagram below gives a detailed visual flow.
It was fun writing an article on Amazon’s Dynamo, and hope you enjoyed it too! Please post your feedback and thoughts in the comment section below.
References