These are my jottings from https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf, the original 2007 whitepaper that detailed the Dynamo design, which later became the basis for Amazon DynamoDB and Apache Cassandra.
Performance, reliability, and efficiency are the key factors: Dynamo is described as a highly available and scalable distributed data store built for Amazon's platform.
Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning.
Wikipedia has a good description of the consistent hashing technique. Basically, it's a hashing scheme in which most existing keys keep their slots when the hash table is resized.
Very basic example: for a hash table with 10 slots, we might use the modulus operator, which maps the number 52 to slot 2 and 55 to slot 5, and the values are stored in those slots. However, if we grow the table to 20 slots, the hashes change to 12 and 15, which means existing keys have to be moved.
With consistent hashing, it's guaranteed that on average only K/n keys have to be remapped, where K is the number of keys and n the number of slots (n=10 in the first case above). It's usually achieved by mapping each entity to a point on a circle (or an angle).
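The modulo problem above can be demonstrated in a few lines (a toy sketch; the key values 52 and 55 come from the example, the rest are illustrative):

```python
# With modulo placement, growing the table remaps most keys.
keys = [52, 55, 7, 13, 99, 101]

old = {k: k % 10 for k in keys}   # table with 10 slots
new = {k: k % 20 for k in keys}   # table grown to 20 slots

moved = [k for k in keys if old[k] != new[k]]
print(moved)  # [52, 55, 13, 99] — 4 of 6 keys have to move
```

Only keys whose value is the same modulo 10 and modulo 20 (here 7 and 101) stay put; consistent hashing bounds this churn to roughly K/n keys instead.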
Do note that for Dynamo, each node is mapped to multiple points on the circle instead of just one, creating virtual nodes. This helps distribute load evenly when a node goes down or a new node is added.
Dynamo was designed with the following requirements:
- Query model: simple read and write operations on a value identified by a unique key; values are small (usually <1 MB)
- ACID properties: weaker C (consistency), no guaranteed I (isolation), and only single-key updates
- Read/write conflict resolution: Amazon required that all writes be accepted no matter what the system state was ("always writeable"), forcing conflict resolution to happen at read time, in contrast to most popular databases, which resolve conflicts at write time.
- Incremental Scalability — scale one node at a time
- Symmetry (& decentralization): every node in Dynamo has the same set of responsibilities as its peers.
- System interface: an MD5 hash of the key generates a 128-bit identifier, which is used to determine the storage nodes responsible for storing/serving the key
- Partitioning algorithm: as mentioned earlier, consistent hashing treats the output range of the hash function as a ring. Each node is assigned a random value denoting its position on the ring, and an item is placed by hashing its key and walking the ring clockwise to find the first node with a position larger than the item's position. (Hence adding/removing a node only affects its immediate neighbors.)
To resolve load-distribution issues, Dynamo maps each node to multiple points on the ring, as described earlier.
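The partitioning scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: node names and the virtual-node count are made up, and `bisect` stands in for the clockwise walk.

```python
import bisect
import hashlib

def md5_pos(s: str) -> int:
    """128-bit ring position, mirroring Dynamo's MD5-based key hashing."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node is mapped to several ring points (virtual nodes).
        self._points = sorted(
            (md5_pos(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def node_for(self, key: str) -> str:
        # Walk clockwise: first point with a position larger than the
        # key's position, wrapping around past the top of the ring.
        i = bisect.bisect_right(self._positions, md5_pos(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["A", "B", "C"])
print(ring.node_for("user:42"))  # deterministic: same key always lands on the same node
```

Because each key's position is fixed by its hash, adding a node only claims the ring segments immediately counter-clockwise of its points; all other keys stay where they are.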
- Replication: data is replicated on N nodes. The coordinator node is responsible for replicating the data between itself and the other N-1 nodes.
- Data versioning: since Dynamo uses eventual consistency, replicas can serve stale reads. Dynamo associates a vector clock (a list of (node, counter) pairs) with every version of every object.
- get() and put() operations: on a create, a new vector clock is generated and the object is stored locally by the coordinator, which replicates it and considers the write successful once W-1 other nodes respond. A get() waits for R responses and, if the versions are not the same, returns all causally unrelated versions. On an update, the client is expected to send back the vector clock it previously read.
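The vector-clock bookkeeping can be sketched like this (a minimal illustration; the node names Sx/Sy are hypothetical, and real clocks also carry timestamps for truncation):

```python
# A vector clock here is a dict of node -> counter, one per object version.
def increment(clock, node):
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def descends(a, b):
    """True if version a is an equal-or-later descendant of version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "Sx")    # first write, coordinated by node Sx
v2 = increment(v1, "Sx")    # second write on Sx: supersedes v1
v3 = increment(v1, "Sy")    # concurrent write on Sy, also derived from v1

print(descends(v2, v1))                     # True  -> v1 can be garbage-collected
print(descends(v2, v3), descends(v3, v2))   # False False -> conflict
```

When neither version descends from the other (v2 vs. v3 above), Dynamo returns both on a get() and lets the client reconcile them on the next put().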
- Handling Failures — Hinted handoff: Dynamo uses a "sloppy quorum": reads and writes are performed on the first N *healthy* nodes, which may not be the first N nodes encountered while walking the ring. A node that accepts a write on behalf of a failed node keeps a hint and hands the replica back once the intended node recovers. Replicas are spread across data centers connected through high-speed network links.
- Handling Failures — Permanent — Replica synchronization: Dynamo implements an anti-entropy protocol to keep replicas in sync. It uses Merkle trees (hash trees where the leaves are hashes of the values of individual keys and parents are hashes of their respective children), so two replicas can compare roots to check a whole key range cheaply.
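A minimal Merkle-tree sketch of that comparison (illustrative only; the hash choice of SHA-256 and the sample values are assumptions, and Dynamo maintains one tree per key range rather than rebuilding from scratch):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values):
    """Leaves are hashes of individual values; parents hash their children."""
    level = [h(v) for v in values]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [h(a + b) for a, b in zip(level[::2], level[1::2])]
    return level[0]

replica_a = [b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v1", b"v2", b"vX", b"v4"]  # one divergent key

print(merkle_root(replica_a) == merkle_root(replica_a))  # True  -> range in sync
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> descend the tree
```

Equal roots mean no data needs to be exchanged; unequal roots lead the nodes to compare children recursively until they isolate the divergent leaves, which bounds both data transfer and disk reads.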