Notes on distributed system implementation using Dynamo DB and the CAP theorem

Background

  • most production systems store and retrieve data through primary keys and do not require complicated queries

For distributed systems, the CAP theorem limits which guarantees can hold simultaneously:

  • Consistency — a read will return the most recent write
  • Availability — a non-failing node will return a reasonable response within a reasonable amount of time
  • Partition Tolerance — the system will continue to function when network partitions occur

Since network partitions are bound to occur, a distributed system has two options:

  • CP: choose consistency over availability when a network partition occurs. Until the partition is resolved, nodes may return an error. Use case: systems that require atomic reads/writes
  • AP: choose availability over consistency when a network partition occurs. Writes and reads are accepted even during the partition -> reads can be stale but will eventually become consistent

Amazon’s Dynamo DB Implementation:

  • Key/value storage system, distributed
  • sacrifices consistency (on certain failure instances) for availability
  • uses object versioning and application-assisted conflict resolution
  • demonstrates that an eventually-consistent storage system can be used in production
  • shows how combining different techniques can provide a single highly-available system
  • addresses: partitioning, replication, versioning, membership, failure handling and scaling

Dynamo Features

  • since Dynamo favors availability over consistency, writes are always allowed even when data conflicts; conflicts are resolved by the application or by the default "last write wins" policy
  • Dynamo does not focus on the problem of data integrity and security and is built for a trusted environment.
  • achieves low latency by avoiding multi-hop request routing: each node keeps enough routing information locally to forward a request directly to the right node

To achieve scalability and availability, Dynamo uses:

  • data partitioning and replication through consistent hashing
  • consistency via object versioning
  • consistency among replicas during updates -> through a quorum-like technique and a decentralized replica sync
  • failure detection and membership protocol is gossip-based
  • Dynamo is completely decentralized

Dynamo Interface:

  • Put -> put(key, object, context)
  • Get -> get(key) -> returns the object together with its context, including all conflicting versions (see the sketch after this list)
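
A minimal in-memory sketch of what this interface might look like; the class and its behaviour are illustrative, not Dynamo's actual implementation. The context stands for the versioning metadata (e.g. a vector clock) that a client reads out and passes back unchanged on its next put for that key.

```python
# Hypothetical single-node sketch of the get/put interface described above.
class TinyDynamoStore:
    def __init__(self):
        # key -> list of (object, context) pairs, one per conflicting version
        self._data = {}

    def get(self, key):
        """Return all conflicting versions of the object plus their contexts."""
        versions = self._data.get(key, [])
        objects = [obj for obj, _ctx in versions]
        contexts = [ctx for _obj, ctx in versions]
        return objects, contexts

    def put(self, key, obj, context):
        """Store a new version; reconciling old versions is left to the caller."""
        # A real Dynamo node would use the context to decide which versions
        # this write supersedes; this sketch simply replaces them.
        self._data[key] = [(obj, context)]

store = TinyDynamoStore()
store.put("cart:42", {"items": ["book"]}, context=None)
print(store.get("cart:42"))
```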

Partitioning Algorithm

  • the partitioning scheme relies on consistent hashing to distribute data across multiple storage hosts
  • each node is assigned a random value which serves as its position on the ring (Pnode)
  • each data item, identified by a key, is assigned to a node by hashing the key to yield its position on the ring (Pdata)
  • the ring is walked clockwise to find the first node with Pnode > Pdata, which becomes the coordinator for that key (see the sketch after this list)
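
A small consistent-hashing sketch under stated assumptions: MD5 is an arbitrary hash choice, and each node gets a single ring position, whereas real Dynamo assigns many virtual positions per physical node to spread load more evenly.

```python
import hashlib
from bisect import bisect_right

def ring_position(value: str) -> int:
    """Hash a string onto the ring (MD5 is an arbitrary choice for this sketch)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # One position (Pnode) per node; virtual nodes are omitted here.
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def coordinator(self, key: str) -> str:
        """Walk the ring clockwise to the first node with Pnode > Pdata."""
        p_data = ring_position(key)
        idx = bisect_right(self._positions, p_data) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.coordinator("cart:42"))  # the node responsible for this key
```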

Replication

  • data is replicated on N hosts; the coordinator node stores each key it is responsible for locally and also replicates it on the neighboring nodes that follow it on the ring
  • a preference list determines which nodes store a given key
  • every node can determine which nodes are on the preference list for any key, so requests can still be served when nodes fail (see the sketch after this list)
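
A sketch of how a preference list could be built on the ring above: the coordinator for the key plus its N-1 clockwise successors. The node names and the helper are illustrative.

```python
import hashlib
from bisect import bisect_right

def ring_position(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def preference_list(nodes, key, n):
    """The coordinator for `key` plus its N-1 clockwise successors on the ring."""
    ring = sorted((ring_position(node), node) for node in nodes)
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(n)]

# Coordinator plus two replicas for this key (N = 3).
print(preference_list(["node-a", "node-b", "node-c", "node-d"], "cart:42", n=3))
```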

Data Versioning

  • provides eventual consistency
  • each put is treated as a new, immutable version of the data
  • vector clocks: a list of (node, counter) pairs; one vector clock is associated with each version
  • a truncation scheme limits the size of vector clocks over long usage (see the sketch after this list)
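
A minimal sketch to make the (node, counter) idea concrete: a version descends from another if all of its counters are at least as large; if neither descends from the other, the writes were concurrent and the application must reconcile them. Function names are illustrative.

```python
def increment(clock, node):
    """Return a new vector clock with `node`'s counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if version `a` descends from `b` (a has seen everything b has)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflict(a, b):
    """Neither clock descends from the other: the writes were concurrent."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "node-a")   # {'node-a': 1}
v2 = increment(v1, "node-b")   # descends from v1
v3 = increment(v1, "node-c")   # also descends from v1, but concurrent with v2
print(descends(v2, v1))        # True
print(conflict(v2, v3))        # True -> application must reconcile
```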

Consistency

  • Quorum system
  • a coordinator node coordinates read and write requests. Writes are immediately done on the top healthy nodes of the preference list (sloppy quorum); see the sketch after this list
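
A sketch of the coordinator's write path under stated assumptions: a hypothetical `send_write()` stand-in for the network call, the first N healthy nodes taken from the preference list, and success reported once W acknowledgements arrive.

```python
def coordinate_write(preference_list, healthy, send_write, key, value, n, w):
    """Sloppy-quorum write: target the first N *healthy* nodes from the
    preference list (not strictly the first N) and report success once
    W of them acknowledge the write."""
    targets = [node for node in preference_list if healthy(node)][:n]
    acks = sum(1 for node in targets if send_write(node, key, value))
    return acks >= w

# Toy usage: node-b is down, every reachable node acknowledges the write.
ok = coordinate_write(
    preference_list=["node-a", "node-b", "node-c", "node-d"],
    healthy=lambda node: node != "node-b",
    send_write=lambda node, key, value: True,  # stand-in for a network call
    key="cart:42",
    value={"items": ["book"]},
    n=3,
    w=2,
)
print(ok)  # True
```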

Handling failures

  • hinted hand-off: a write intended for an unreachable node is stored on another node together with a hint naming the intended owner, and is handed back once that node recovers (see the sketch after this list)
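
A rough sketch of that idea with hypothetical helpers: unreachable targets get their replica parked on a fallback node with a hint, and recovery triggers the hand-off back.

```python
# Each hint records a replica held on behalf of a node that was unreachable.
hints = []  # dicts: {"intended": ..., "holder": ..., "key": ..., "value": ...}

def write_with_handoff(target, fallback, alive, key, value):
    """Write to `target` if it is up; otherwise hand the replica to `fallback`
    along with a hint naming the node it was really intended for."""
    if alive(target):
        return target  # normal write path
    hints.append({"intended": target, "holder": fallback, "key": key, "value": value})
    return fallback

def on_node_recovered(node):
    """When `node` comes back, its holders deliver the hinted replicas to it."""
    for hint in [h for h in hints if h["intended"] == node]:
        # A real holder would transfer the replica over the network here.
        hints.remove(hint)

write_with_handoff("node-b", "node-d", alive=lambda n: n != "node-b",
                   key="cart:42", value={"items": ["book"]})
print(len(hints))        # 1 hint waiting for node-b
on_node_recovered("node-b")
print(len(hints))        # 0 once node-b is back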

Permanent failures

  • replicas are synchronized by comparing Merkle trees between neighboring nodes; only the key ranges whose hashes differ need to be transferred (see the sketch after this list)
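
A simplified sketch of the comparison using a very shallow tree (one level of leaves under a root): roots are compared first, and only key ranges whose leaf hashes differ need repair. The bucket count and hashing choices are arbitrary for illustration.

```python
import hashlib

def digest(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def bucket_of(key: str, buckets: int) -> int:
    """Deterministically map a key to one of the tree's key ranges."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def merkle_leaves(replica, buckets=4):
    """One hash per key range: the leaves of a (very shallow) Merkle tree."""
    ranges = [dict() for _ in range(buckets)]
    for key, value in replica.items():
        ranges[bucket_of(key, buckets)][key] = value
    return [digest(repr(sorted(r.items()))) for r in ranges]

def ranges_to_sync(replica_a, replica_b, buckets=4):
    """Compare roots first; only key ranges whose leaf hashes differ need repair."""
    leaves_a = merkle_leaves(replica_a, buckets)
    leaves_b = merkle_leaves(replica_b, buckets)
    if digest("".join(leaves_a)) == digest("".join(leaves_b)):
        return []  # identical roots: the replicas are already in sync
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]

a = {"k1": "v1", "k2": "v2"}
b = {"k1": "v1", "k2": "stale"}
print(ranges_to_sync(a, b))  # only the range containing k2 needs transferring
```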

Membership and Failure detection

  • a node contacts a nearby node and the two reconcile their membership change histories to sync membership
  • seed nodes prevent logical partitioning (parts of the ring not knowing about each other)
  • failure detection also works through gossip: a purely local view of a neighbor's status is enough, since eventually all nodes learn which nodes have joined or left (see the sketch after this list)
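
A toy sketch of gossip-style reconciliation, assuming each node keeps a map of peer -> (status, version) and adopts whichever entry carries the higher version when two nodes exchange views. The node names and versioning scheme are illustrative.

```python
import random

def reconcile(view_a, view_b):
    """Merge two membership views, keeping the higher-versioned entry per node."""
    merged = dict(view_a)
    for node, (status, version) in view_b.items():
        if node not in merged or merged[node][1] < version:
            merged[node] = (status, version)
    return merged

def gossip_round(views):
    """Each node reconciles with one random peer; repeated rounds eventually
    spread every join/leave to every node."""
    nodes = list(views)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = reconcile(views[node], views[peer])
        views[node], views[peer] = dict(merged), dict(merged)

views = {
    "node-a": {"node-c": ("left", 2)},   # node-a saw node-c leave
    "node-b": {"node-c": ("alive", 1)},  # node-b still thinks node-c is up
    "node-c": {},
}
gossip_round(views)
print(views["node-b"]["node-c"])  # converges to ('left', 2)
```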

Relevant Riak DB Config Values

  • N: number of replicas of each key
  • W: number of replicas that must acknowledge a write before it succeeds
  • R: number of replicas that must respond to a read before it succeeds (how they interact is sketched below)
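
A quick check of the usual rule of thumb: when R + W > N, the read and write quorums overlap, so a read sees at least one up-to-date replica. The combinations below are example values, not Riak defaults.

```python
# Illustrative N/W/R combinations (example values, not Riak defaults).
configs = [
    {"N": 3, "W": 2, "R": 2},  # balanced: read and write quorums overlap
    {"N": 3, "W": 1, "R": 1},  # fast, but a read may miss the latest write
    {"N": 3, "W": 3, "R": 1},  # durable writes, cheap reads
]

for cfg in configs:
    overlaps = cfg["R"] + cfg["W"] > cfg["N"]
    print(cfg, "quorums overlap:", overlaps)
```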

Source: http://robertgreiner.com/2014/08/cap-theorem-revisited/
