Building DynamoDB brick by brick

Jay Pavagadhi
Published in The Startup
11 min read · Sep 22, 2019

Have you ever wondered how DynamoDB scales to virtually unlimited throughput and storage? Are you surprised by how it delivers millisecond responsiveness at almost any scale? Are you puzzled by DynamoDB's way of doing things, i.e., choosing hash keys and range keys, limited query & transaction support, local vs. global indexes, pre-configured hard limits for read & write capacity units, and more? Well, the best way to understand it is to build the whole system conceptually ourselves. That should give us a lot more insight into why things are the way they are!

Let us set our end goal to match DynamoDB’s claim — a key-value data store (1) that is scalable (2), durable (3), fast (4) and highly available (5).

1. Key-value data store

A key-value data store is a collection of key-value pairs where the key is typically a unique identifier. Here’s an example of data stored in a key-value store.

Figure 1 — Sample key-values

Designing a key-value store seems like an easier starting point than a traditional relational database.

  • One way to store key-value pairs is as a list in a file. Writing a new entry would be as easy as appending it to the list. The downside is that we would need to scan the whole file to read any single key. Can we do better?
  • Sure, how about storing the data in a hashmap? It seems a natural fit for this kind of data. However, we need to serialize it and store it on disk for persistence. Every time the database instance starts, we can load the map into memory. Reading a unique key from a map is super-fast, but the problem is with writes. Every time we write a new key-value pair, we would need to serialize the whole map and replace it on disk. Also, since we are loading the hashmap into memory, the amount of data is limited by the size of memory, not the disk. So, how can we have efficient reads and writes while utilizing the full storage capacity of the disk?
Figure 2 — Reading & Writing to key-value storage node
  • Let us store key-value pairs sorted by key on disk. We will also generate and store an index — very similar to what’s found on the end pages of a book. In our case, it’s a list of evenly spread file offset pointers pointing to key-value pairs in the table. The index is useful for quick access to data on disk. Since the index is meant to be much smaller than the original table, we can load it into memory. To read a key, we first look it up in the index, find the corresponding offset in the table, and keep reading sequentially until we reach the desired key. As you can see, this is much more efficient than reading the whole table. This setup is also called an SSTable (Sorted String Table). Since adding an entry to an SSTable basically requires rewriting it, we would instead do it periodically in batches. The idea is that for each write request, we store the key-value pair in memory in a hashmap and on disk in an append-only commit log to persist it. That makes the write operation fast and durable. It also means that for reads we need to check the hashmap as well, in case it holds more recent updates. Finally, we need to collapse the commit logs into SSTables periodically; once done, we can clean up the hashmap and logs and start over.
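A minimal Python sketch of this write path (all names are mine, and plain lists and dicts stand in for the on-disk commit log and SSTable files):

```python
class KVStore:
    """Toy LSM-style store: in-memory hashmap + append-only log + sorted tables."""

    def __init__(self):
        self.memtable = {}    # latest writes, fast in-memory lookup
        self.commit_log = []  # append-only log (a real store writes this to disk)
        self.sstables = []    # flushed sorted tables, newest last

    def put(self, key, value):
        self.commit_log.append((key, value))  # persist the write first
        self.memtable[key] = value            # then update the in-memory map

    def get(self, key):
        if key in self.memtable:               # check the latest writes first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then the SSTables, newest first
            if key in table:
                return table[key]
        return None

    def flush(self):
        """Collapse the memtable into a new SSTable, then reset map and log."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []
```

Reads always consult the memtable before the SSTables, so the freshest value wins even before a flush.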

💡 While creating a table in DynamoDB, you are asked to set a hash key and an optional range key. In simple terms, one can assume the hash key and range key together form the key we discussed earlier. So, all the data is first sorted by hash key and then by range key. The hash key is also used to define partition boundaries.

2. Scalable

Gone are the days when we tried to build more and more powerful machines to serve bigger databases. Scaling relational database systems across multiple machines didn’t work well either, as they were never designed to scale in the first place. So, we would use commodity hardware, and lots of it.

We need some way to store a lot of data. The obvious approach is to spread it across multiple machines — which I will call ‘nodes’ from now on. The question that arises is: how do we distribute data across nodes?

  • The simplest way would be to start with the first node and, once it’s full, move on to the next node. Another approach could be to store each key-value pair in round-robin fashion, where each node ends up storing an equal amount of data. The problem with both of these approaches is that every time a user queries for data, we’d need to scan all the nodes, and that’s very inefficient. We need to be smart here.
  • How about storing a predefined range of keys on each node? For example, the first node could store string keys starting with A–F, the next G–L, and so on. That way we can distribute data across nodes and at the same time query the data from the exact node that stores it. Considering that keys could be of different types, we’d use the hash of the key to determine the node. Great! That works as long as we have a predefined number of nodes. Given that we would want to add/remove nodes to scale dynamically, we would need to perform lots of reshuffling of data. As figure 3 shows, adding a new node requires all the nodes to reshuffle some data.
Figure 3 — Reshuffling of data in simple hash partitioning scheme
  • To limit the amount of data reshuffling, we need to be even smarter. How about treating the range as a circular space? Let us talk about consistent hashing. As in figure 4A, the output range of the hash function is plotted on a circular space, so each key yields a certain position on this ring. Each node is also assigned a random position on the ring. Walking the ring clockwise from the position of the key, the first node we encounter is where the key is stored. Therefore, each node is responsible for the region of the ring between its own position and its predecessor node’s. Now, if we add a new node as in figure 4B, we end up with uneven data distribution. To avoid that, instead of having one position per node, we give each node k positions, or points, randomly scattered on the ring (figure 4C). That means adding a new node requires reshuffling data on only k nodes. The higher the value of k, the better the data distribution, but also the greater the amount of reshuffling. Hence, the value of k should be chosen wisely.
Figure 4 — Consistent Hashing
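Here is a rough Python sketch of such a ring with k virtual points per node (the structure and the md5-based hash are illustrative choices on my part, not what DynamoDB actually uses):

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    """Map any string to a position on the ring [0, 2**32)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, k=8):
        # each physical node gets k scattered points on the ring
        self.points = sorted(
            (ring_pos(f"{node}#{i}"), node) for node in nodes for i in range(k)
        )

    def node_for(self, key):
        # walk clockwise from the key's position to the first node point
        idx = bisect.bisect_right(self.points, (ring_pos(key),))
        return self.points[idx % len(self.points)][1]
```

Adding a node only inserts k new points, so only the keys that fall just before those points move; everything else stays put.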

💡 As keys are sorted, DynamoDB can natively support prefix/range queries. If we want to randomly access any other attribute (for example, birthdate in figure 1) or perform a range query on it, DynamoDB offers global and local indexes for this purpose. As the names suggest, a global index sorts the keys across all the partitions, while a local index sorts the keys within a partition. So, if we have multiple records for a given hash key that need to be sorted by a non-range key, we should use a local index.

3. Durable

Nodes can go down for a wide variety of reasons, especially when we are dealing with lots of them. The odds are quite high that not all of them are up and running at any given time. If we cannot bring back dead machines, we end up losing partial data. To solve this problem, we will replicate each partition on multiple nodes. Therefore, a key is stored not just on the next node on the ring as in figure 4, but on the next N nodes, where N is the desired number of replicas.
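In code, picking the replicas is the same clockwise walk, skipping virtual points that belong to physical nodes we have already chosen. A sketch, where the ring is simply a sorted list of (position, node) pairs and the helper names are mine:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    """Map any string to a position on the ring [0, 2**32)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

def preference_list(points, key, n):
    """Walk clockwise from the key and collect the first n distinct nodes."""
    idx = bisect.bisect_right(points, (ring_pos(key),))
    replicas = []
    for step in range(len(points)):
        node = points[(idx + step) % len(points)][1]
        if node not in replicas:   # skip extra virtual points of an already-chosen node
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas
```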

4. Fast

Replication comes with its own set of problems. Let’s say we keep copies of each partition on N nodes. We write to at least W nodes before acknowledging success and read from at least R nodes before returning data. Say we keep 3 copies (N=3). For the highest throughput, we can set W=1 and R=1, which means we can read from any copy and write to at least one copy before acknowledging a successful operation, and let those nodes propagate changes to the other nodes in the case of writes. But that also means we may not read the latest data, or a write may never reach all copies if a node dies before propagating its changes. So, if we want better consistency, it’s a good idea to set W and R greater than 1, where we wait for at least a few nodes to commit a write before returning, and likewise read from multiple copies and respond with the data that the majority of copies agree on. Depending on data consistency vs. response time requirements, we should choose the values of W, R and N wisely.

Also, to boost write speed, there is one other smart thing we can do: since writing to memory is fast compared to writing to disk, we could wait for the majority of replicas to acknowledge memory writes and just one to acknowledge a disk write before returning success.
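The consistency condition boils down to whether the read and write quorums must overlap. A tiny sketch of that rule, with a few common N=3 configurations:

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """A read overlaps the latest write whenever R + W > N."""
    return r + w > n

# with N=3 copies:
assert not is_strongly_consistent(3, 1, 1)  # fastest, but only eventually consistent
assert is_strongly_consistent(3, 2, 2)      # balanced quorum
assert is_strongly_consistent(3, 3, 1)      # slow writes, fast consistent reads
```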

💡 DynamoDB’s read API offers a flag to enforce strong consistency, which I assume forces it to wait for responses from a majority of copies before returning.

💡 DynamoDB global indexes do not support strong consistency. Now we can guess why: because we don’t want to wait for all global indexes to be updated before acknowledging a successful write.

💡 DynamoDB requires upfront provisioning of read and write capacity for a table, as that may affect the number of partitions and the number of replicas.

💡 Also, now we can imagine why DynamoDB offers very limited query & transaction support compared to traditional relational databases. Thanks to partitioning & replication!

5. Highly Available

Transient failures, even more so than permanent failures, are quite common in distributed systems. To make sure we maintain a minimum of N copies of data when some nodes are unresponsive, we can always walk further along the ring and hand the data over to the next available node. We would also add some metadata indicating which unresponsive node the key was meant for, so that when it comes back up, the data can be transferred back (the Dynamo paper calls this hinted handoff).

Replicas can go out of sync due to various failure modes. Therefore, we need to run a background job to keep them in sync. It’s really expensive to go through every record just to compare replicas across nodes. How about comparing a hash of each replica instead? That helps to some extent, but with even a single record mismatch the hashes would differ, and we are back to square one. Therefore, we would use hierarchical hashes — and that’s what a Merkle tree is all about. It’s a hash tree where the leaves are hashes of the values of individual keys, and parent nodes are hashes of their children. The advantage of a Merkle tree is that each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set.

Figure 5 — Merkle Tree
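A small Python sketch of building such a tree and descending only into mismatched branches to find out-of-sync keys (a simplified construction of my own; a real anti-entropy protocol exchanges the hashes over the network instead of comparing trees in memory):

```python
import hashlib

def h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def build_merkle(leaves):
    """Hash tree built bottom-up; returns a list of levels, leaves first, root last."""
    level = [h(v) for v in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd-sized levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(levels_a, levels_b):
    """Descend from the root, following only branches whose hashes disagree."""
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(levels_a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if levels_a[depth][i] != levels_b[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(levels_a[depth - 1]):
                        next_suspects.append(child)
        suspects = next_suspects
    return [i for i in suspects if levels_a[0][i] != levels_b[0][i]]
```

If the roots match, the replicas agree and nothing else needs to be compared; otherwise only the few mismatched branches are walked.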

When a read/write request comes to the store, there are two ways it can reach the desired node. One is to have special-purpose nodes that are aware of the key ranges and the list of nodes serving them, so that they can forward requests appropriately. The other is to randomly pick a node to send the request to, and have it forward the request to the node where the key resides. In the latter case, each node should be aware of its peers’ key ranges. So, how would they figure that out? They do so by using what’s called a gossip protocol: each node sends its latest information to its known peers, and they propagate it further throughout the network. Such periodic communication is critical for syncing the latest key range distribution and ongoing failures.

Figure 6 — Illustration of Gossip Protocol
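A toy push-gossip simulation (the versioned-state scheme here is a common pattern of my choosing, not DynamoDB's actual wire format): each node tracks what it knows about every peer, tagged with a version, and merges in whatever newer information arrives.

```python
import random

class GossipNode:
    def __init__(self, name):
        self.name = name
        # everything this node knows: peer name -> (version, info)
        self.state = {name: (1, "up")}

    def merge(self, incoming):
        """Keep whichever version of each peer's info is newer."""
        for peer, (version, info) in incoming.items():
            if peer not in self.state or version > self.state[peer][0]:
                self.state[peer] = (version, info)

def gossip_round(nodes):
    """Each node pushes its full state to one random peer."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        peer.merge(node.state)

nodes = [GossipNode(f"n{i}") for i in range(8)]
for _ in range(10):   # a handful of rounds typically spreads everything
    gossip_round(nodes)
```

Because every node that learns a piece of information starts pushing it too, knowledge spreads roughly exponentially, which is why even large clusters converge in a small number of rounds.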

The proposed quorum system means there is a chance that multiple versions of the same record may exist, because of a variety of failure modes. The system will need to reconcile them at some point. An easy way to address this is based on timestamps: we can discard the older records. But clocks may be skewed across nodes. (💡 That’s exactly why Google uses atomic clocks and GPS receivers to keep its Spanner data centers in sync.) Another way is to use what’s called a vector clock. With each key-value record, we store its age: a vector of counters, one per node serving that key, where each node independently maintains its own counter. On receiving a record, a node takes the element-wise maximum of the received vector and its local vector, and the host node increments its own counter by 1 before writing the record to disk. By comparing the vectors of two records, in most cases it is easy to determine which one is the latest.

Figure 7 — Vector Clock stored as record metadata
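A minimal vector clock sketch in Python (dicts of per-node counters; the merge-then-increment flow follows the description above):

```python
def increment(clock, node):
    """The coordinating node bumps its own counter before writing a record."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Element-wise maximum, used when a node receives a remote vector."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    """True if clock a has seen everything clock b has (a is newer or equal)."""
    return all(a.get(n, 0) >= v for n, v in b.items())

v1 = increment({}, "node1")   # first write, coordinated by node1
v2 = increment(v1, "node1")   # an update on node1: supersedes v1
v3 = increment(v1, "node2")   # a concurrent update on node2

assert descends(v2, v1)                                # v2 clearly replaces v1
assert not descends(v2, v3) and not descends(v3, v2)   # true conflict
```

When neither vector descends from the other, the writes were concurrent and the system (or the client) has to reconcile the two versions.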

With all that, we have conceptually built major aspects of DynamoDB. Along the way, we were able to reason out its offerings and limitations.

Note: I don’t claim that DynamoDB works exactly as described here. The internals aren’t in the public domain. This comes from my understanding of the Dynamo, Bigtable and Cassandra papers. A lot of the inferences are based on the functionality DynamoDB offers.

References:

[1]: G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s Highly Available Key-Value Store,” ACM Symposium on Operating Systems Principles, 2007.

[2]: Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006, “Bigtable: a distributed storage system for structured data,” in Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation — Volume 7 (Seattle, WA, November 06–08, 2006). USENIX Association, Berkeley, CA, 15–15.

[3]: A. Lakshman and P. Malik, “Cassandra: a decentralized structured storage system,” SIGOPS Operating Systems Review, vol. 44, no. 2, 2010.

[4]: J.C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J.J. Furman, S. Ghemawat, et al., ‘‘Spanner: Google’s globally-distributed database,’’ in Proc. 10th Conf. Oper. Syst. Des. Implement. (OSDI), 2012.
