This week I read up on Dynamo: Amazon’s Highly Available Key-value Store. Amazon’s platform is built upon different techniques working together to provide a single powerful, highly-available system. One of the core components powering this system is Dynamo. There are many services in the Amazon ecosystem which store and retrieve data by primary key. Take the example of the shopping cart service. Customers should be able to view and update their cart anytime. For these services, sophisticated solutions like RDBMS, GFS, etc are an overkill as these services do not need a complex query and data management system. Instead, they need a service which only supports read and write operations on a (key, value) store where the value is a small object (less than 1Mb in size) uniquely identified by the key. The service should be scalable and highly available with a well-defined consistency window. This is what Dynamo is: a scalable, distribute, highly available key-value store that provides an “always-on” experience.
Dynamo achieves high availability at the cost of weaker consistency. Changes propagate to the replicas in the background and conflict resolution is done at read time to make sure none of the write operations can fail. Dynamo uses simple policies like “last write win” for conflict resolution though applications using Dynamo may override these techniques with their own methods. Eg the cart application may choose to add items across all versions to make sure none of the items is lost.
A service could depend on multiple services to get its results. To guarantee that a service returns its results in a bounded time, each dependency in the service has to return its results with even tighter bounds. As a result clients enter into a contract with servers regarding service-related characteristics like expected request distribution rate, expected latency and so on. Such an agreement is called Service Level Agreement(SLA) and must be met to ensure efficiency. SLA apply in the context of Dynamo as well.
Dynamo supports incremental scaling where the system is able to scale out one node at a time. Moreover, all the nodes are symmetrical in the sense they have the same set of responsibilities. Since Dynamo is used only by Amazon’s internal applications, there are no security related requirements like authentication and authorization .
Dynamo exposes two operations: get() and put(). get(key) returns value or list of values along with context objects corresponding to the key. put(key, context, value) stores the value and the context corresponding to the key. context objects are used for conflict resolution.
To support incremental scaling, Dynamo uses consistent hashing for its partitioning scheme. In consistent hashing, the output range of a hash function is treated as a fixed circular space. Each node and data object is assigned a random value or position within this space. A data object is mapped to the first node which is placed clockwise to the position of the data object. Every data item is replicated at N hosts. So every time a data item is assigned to a node, it is replicated to N-1 clockwise successor nodes as well. The list of nodes storing a data item is called its preference list. Generally preference list contains more than N nodes to account for system and network failures. An example case is shown with N = 3. Any key between A and B would be mapped to B (by consistent hashing logic) and to C and D (by replication logic).
Each time data is created/updated, a new version of data is created. So for a given key, several versions of data (or value) can exist. For versioning, Dynamo uses vector clocks. A vector clock is a list of (node, counter) pairs. When a put operation reaches node X, the node uses the context from the put request to know which version it is updating. If there is an entry corresponding to X in vector clock, the counter is incremented else a new entry is created for node X with counter = 1. When retrieving value corresponding to a key, the node will resolve conflicts among all versions based on Dynamo’s logic or client’s logic. A likely issue with this approach is that the vector clock list may grow very large. To mitigate this, Amazon keeps evicting pairs from the list in ascending order of the time when the entry was created till the size reaches below a threshold. Amazon has not faced any issues related to loss of accuracy with this approach. They also observed that the % of data with at least 2 versions is about 0.06%
Dynamo uses a quorum system to maintain consistency. For a read (or write) operation to be successful R (or W) number of replicas out of N replicas must participate in the operation successfully with the condition that R+W > N. If some of the first N replicas are not available, say due to network failure, the read and write operations are performed on the first N healthy nodes. eg if node A is down then node B can be included in its place for the quorum. In this case, B would keep track of data it received on behalf of A and when A comes online, B would hand over this data to A. This way a sloppy quorum is achieved.
It is possible that B itself becomes unavailable before it can return the data to A. In this case, anti-entropy protocols are used to keep replicas synchronized. In Dynamo, each node maintains a Merkle tree for each key range it hosts. Nodes A and B exchange the roots of Merkle trees corresponding to set of keys they both host. Merkle tree is a hash tree whose leaves are hash values of individual keys and parents are hash values of children. This allows branches to be checked for replication without having to traverse the entire tree. A branch is traversed only when the hash values at the top of the branch differ. This way the amount of data to be transferred for synchronization is minimized.
The nodes in a cluster communicate as per a gossip-based protocol in which each node contacts a random peer and then the two nodes reconcile their persisted membership history. This ensures an eventually consistent membership view. Apart from this, some nodes are marked as seed nodes which are known to all nodes including the ones that join later. Seed nodes ensure that logical partitions are not created within the network even when new nodes are added. Since consistent hashing is used, the overhead of key reallocation when adding a new node is quite low.
There are 2 modes of routing requests in Dynamo. In the first mode, servers route the request. The node fulfilling the request is called coordinator. If it is a read request, any node can act as the coordinator. For a write request, the coordinator is one of the nodes from the key’s current preference list. So if the write request reaches a node which is not in the preference list, it routes the request to one of the nodes in the preference list.
An alternate approach would be where the client downloads the current membership state from any Dynamo node and determine which node to send the write request to. This approach saves an extra hop within the server cluster but it assumes the membership state to be fresh.
Apart from the architecture described above, Dynamo uses optimizations like read-repair where, during quorum, if a node returns a stale response for a read query, it is updated with the latest version of data. Similarly, since writes follow reads, the coordinator for read operation is the node that replies fastest to the previous read operation. This increases the chances of having read you write consistency.
To further reduce the latency, each node maintains an object buffer in its main memory where write operations are stored and written to disk by a separate thread. The read operations also first refer the in-memory buffer before checking the disks. There is an added risk of the node crashing before writing the objects from the buffer to the disk. To mitigate this, one of the N replicas performs a durable write — that is, the data is written to the disk. Since the quorum requires only W responses, latency due to one node does not affect the performance.
Amazon also experimented with different partitioning schemes to ensure uniform load distribution and adopted the scheme where hash space is divided into Q equally sized partitions and placement of partition is decoupled from the partitioning scheme.
Although Dynamo is primarily designed as a write intensive data store, N, R and W provides ample control to modify its behavior for other scenarios as well. For example, setting R = 1 and W = N makes it a high performance read engine. Services maintaining product catalog and promotional items can use this mode. Similarly setting W = 1 means a write request is never rejected as long as at least one server is up though this increases the risk of inconsistency. Given that Dynamo allows the clients to override the conflict resolution methods, it becomes a general solution for many more scenarios than it was originally intended for.
One limitation is the small size of data for which it is designed. The choice makes sense in the context of Amazon but it would be interesting to see how storing larger values affects its performance. The response time would obviously increase as more data needs to be transferred and in-memory buffers would be able to store lesser data. But using caching and larger in memory buffers, the response time may be brought down to the limit that Dynamo can be used with somewhat larger data objects as well.
Dynamo scales well for a few hundred of nodes but it will not scale equally well for tens of thousands of nodes because of the large overhead of maintaining and distributing the routing table whose size increases with the number of nodes. Another problem that Amazon did not have to face was a high conflict resolution rate. They observed that around 99.94% requests saw exactly one version. Had this number been higher, the latency would have been more.
All in all, Dynamo is not a universal solution for a distributed key-value store. But it solves one problem and it solves it very well.