Vector Clocks: Amazon DynamoDB (Part 2)

Aditya Shete
4 min readFeb 11, 2023

--

A look at how vector clocks help reconciling divergent versions.

Amazon DynamoDB icon

This is a blog is part 2 of a series of posts covering the concepts underlying Amazon DynamoDB. (Part 1 on consistent hashing)

In the last post we looked at how DynamoDB uses consistent hashing to enable scaling of nodes in the system. This helped us manage various properties of the system such as the load on each node, the distribution of keys-value pairs and gracefully managing node failures.

So, our system is now scalable, is able to handle writes at the 99.9th percentile, what should be the next concern that we should solve? Looking at the system requirements that we have, we see that DynamoDB requires that it should always be available for writes. This availability cannot be compromised in face of network partitions or node failures. The replication we achieved in the last section is good to recover in case of node failures, but it doesn’t guarantee that the system is highly available.

To understand the situation, we can imagine a system under temporary network partition. This system according to our requirements is taking in writes on both sides of the partition. Supposing that a client updates the same key-value pair on both sides of the partition, we have two versions of the pair in the system. There are some obvious problems here:

  • How does the system reconcile the divergent versions when the partition has recovered?
  • When an external system reads the key which version is returned?

Let us take a look at how DynamoDB responds in such a situation:

DynamoDB is an eventually consistent system. In line with the CAP theorem which makes it that an available system under partitions cannot simultaneously also be consistent. The consistency is achieved at a later point, or eventually, by asynchronously communicating local changes to nodes.

It is assumed that each updated is a new version of the data. There are situations where even under partitioning the versions of pairs subsume each other in which case versions can be merged without conflict. The key idea is regarding how to version systems and what to do when systems diverge?

Vector Clocks

I have already covered the basic idea behind vector clocks and how they help us establish causality in a distributed system. Basically, in absence of a central timing mechanism to establish the concept of later/earlier we have to use some other mechanism to establish the causality of events. Vector clocks is one manner in which we can determine causality. This helps us use the simple merging property that the latest copy of a key-value pair should exist in the system.

We define a vector clock as a list of (server, version) pairs, so that when a server handles a write, it increments the corresponding version. Something like this list:

vectorClock = [{S0, ver0}, {S1, ver1}, ....]

In a single server situation establishing which update is latest is a simple exercise, we merely compare the versions. Extending this to a distributed system, a version is deemed to be later if all the pairs are such that the version are either greater or equal.

Consider the following divergence in the system:

A possible evolution of the system under network partition

We see that the divergent versions cannot each be said to be later that the competing version. Under such a situation DynamoDB sends both conflicting versions as context to the client. Any update using this context is determined to have resolved the conflict and the system will merge to the new update. The client is free to use domain specific knowledge so it can determine which version is appropriate to keep or whether it would like to merge parts conditionally.

The key design decisions that were reflected in this discussion are:

  • To preserve write latency and availability, the system is sacrificing consistency requirements.
  • Further, the consistency checks are made at read time, instead of write time with view of maintaining system availability.
  • The system is flexible enough that syntactic merging, i.e., strictly later versions subsuming earlier versions, is implemented to cover most scenarios.
  • In case of domain specific merges, i.e., semantic merging, where divergent versions conflict and require external intervention, are passed on to client for resolution.

Concluding Notes

One possible drawback of the system outlined here for versioning could be that eventually the vector clock becomes intractable. The paper mentions that such issues have never been observed under operation. Further, the vector clocks are truncated beyond certain threshold, sacrificing accuracy in merging for performance.

It possible that the schema that we have specified here results in some deleted objects resurfacing. This is because we might have resolved that the latest system object is one where the deletion did not take place.

--

--