Consistent Cuts in Dynamo-style Systems

Demand consistency metrics and consistent cuts from your eventually consistent systems

Fahd Siddiqui
Jul 21, 2019

Microsoft’s Azure Cosmos DB packs quite a punch with its spectrum of consistency level choices. A less celebrated feature, albeit a very powerful one, is that it exposes consistency metrics, such as probabilistic bounded staleness (PBS), out of the box. These metrics answer a simple question: how eventual is your eventual consistency? In other words, how far back do we have to go to guarantee that all reads have a consistent view of the world, without any gaps? Note that this is different from plain replication latency.

What is a consistent cut?

Consistency metrics are not just about monitoring the health of the system, although that is important. Being able to answer the above questions leads us to, perhaps, the most interesting consistency level choice that Cosmos DB offers — consistent prefix. A consistent prefix is defined as some prefix of all the updates, with no gaps. In distributed systems, this is the same as a consistent cut or a consistent snapshot. A consistent cut is a set of events that are causally ordered with no gaps. Formally, a consistent cut in an event-based distributed system, E, is defined as any subset F ⊆ E, such that

f ∈ F ∧ e → f ⟹ e ∈ F

For F to be a consistent cut, it should satisfy the condition: if event f belongs to F and event e happens before event f, then e also belongs to F.
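To make the condition concrete, here is a minimal Python sketch; the event names and happens-before edges are invented purely for illustration:

# "happens_before" holds direct (e, f) edges meaning e happened before f.
def predecessors(event, happens_before):
    # Every event that (transitively) happens before `event`.
    stack, seen = [event], set()
    while stack:
        current = stack.pop()
        for e, f in happens_before:
            if f == current and e not in seen:
                seen.add(e)
                stack.append(e)
    return seen

def is_consistent_cut(F, happens_before):
    # F is a consistent cut iff it is closed under happens-before.
    return all(predecessors(f, happens_before) <= F for f in F)

edges = {("w1", "w2"), ("w2", "w3")}            # w1 -> w2 -> w3
print(is_consistent_cut({"w1", "w2"}, edges))   # True: a prefix with no gaps
print(is_consistent_cut({"w1", "w3"}, edges))   # False: w2 is missing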

Chandy and Lamport were the first to propose an algorithm, back in 1985, to record such a consistent snapshot in a distributed system. As a side note, Dr. Leslie Lamport now works at Microsoft, though I can only guess whether he had anything to do with presenting the consistent cut as a consistency level choice.

Probabilistic Bounded Staleness

Peter Bailis et al. introduced Probabilistic Bounded Staleness (PBS) in their paper “Probabilistic Bounded Staleness for Practical Partial Quorums”. PBS estimates the probability of reading the latest committed version a given amount of time after the commit. You can see a cool demo of PBS in action. What you get from PBS is a chart like the following:

[Chart: probability of reading the most recent write vs. time since the commit. Source: http://pbs.cs.berkeley.edu/#demo]

In the chart above, by about 25 ms after a given eventually consistent commit, the probability of reading that commit reaches 100%.

You have at least a 90.64 percent chance of reading the last written version 10 ms after it commits.

If we are able to get a consistent cut for our system, we can also derive a consistency metric that tells us the most recent consistent cut available to us. Let’s call this the full consistency lag (FCL). Assuming no clock skew, FCL can provide us with a consistent prefix guarantee. An extension of this is bounded staleness, where reads may lag behind by at most K updates or a time interval T, but are guaranteed to form a consistent cut, i.e., to follow the total global order without any gaps.

Use-cases for consistent prefix and tracking staleness in eventually consistent systems

Tracking staleness allows you to provide guarantees to your clients that would have been otherwise hard to do. Equipped with the staleness guarantee, clients can solve issues without always resorting to strong consistency. Here are some of the common use cases that call for a consistent cut or a staleness guarantee.

When you need total causal order

With consistent cuts, you get the throughput of eventually consistent writes, but your reads will only show you a consistent state at the expense of it being slightly stale. This property is useful for any use case where globally consistent causal ordering matters, even if it isn’t strictly real-time.

Say I would like to track the score of a soccer game¹ and I am ok with it having a little lag. However, I would never want it to lie to me. Now, consider the following sequence of the score: 0–0 → 1–0 → 2–0 → 2–1. While it’s ok to report that order at whatever lag, it is not ok to report something like this: 0–0 → 0–1 → 2–1. The final score of 2–1 is eventually consistent with reality, but it is unforgivable for a soccer fan to see a score of 0–1 that never happened. There was no point in the game when the score was 0–1, yet we reported it as such. This calls for a consistent snapshot, or a consistent cut, of the system, even if that snapshot is stale. A fan following the score on her phone may have a 10-second delay but knows that the score is always true, in the right order, and that every subsequent read is increasingly up to date. To be complete, the last property is known as monotonic reads.

¹Inspired by the baseball example in Microsoft’s paper “Replicated Data Consistency Explained Through Baseball”, changed to soccer for simplicity’s sake.

Synchronize cross-region events

Consider cache invalidation across multiple data centers. To keep a global cache correct, the data center that takes an update has to tell the other data centers to invalidate their cached copies. If the update is eventually consistent, it is possible to get into a race condition where a data center invalidates its cache before the update has reached it, misses the update, and ends up caching stale data again. If the full consistency lag is known, data centers can delay invalidating their caches until they are sure they have the most recent update. This way clients can keep a global cache synced up without resorting to strong consistency.

Note that the above example of invalidating caches is really an instance of synchronizing events across data-centers without taking the performance hit of strong consistency.
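As a sketch of the idea, the snippet below delays a local invalidation until the update is guaranteed to have arrived; get_full_consistency_lag and invalidate_local_cache are hypothetical hooks standing in for your own metric source and cache client:

import time

def get_full_consistency_lag():
    # Placeholder: in practice, derive this from PBS or hint polling (see below).
    return 0.5  # seconds

def invalidate_local_cache(key):
    print(f"invalidated {key}")

def on_remote_update(key, update_timestamp):
    # Wait until the update is guaranteed to be visible in this data center.
    wait = (update_timestamp + get_full_consistency_lag()) - time.time()
    if wait > 0:
        time.sleep(wait)  # in a real service, schedule this instead of blocking
    invalidate_local_cache(key)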

Compaction use case

If you have multi-master writes, chances are you need a consistent cut for some of your use cases.


Consider a multi-region distributed inventory system — EU and US. Both regions are selling the same product and write to their respective regions in a multi-master way. In order to have non-blocking writes, each writer simply writes an update of a new stock reservation as orders stream in. For example:

{"reserve_stock_EU": 1}
{"reserve_stock_US": 2}
{"reserve_stock_EU": 3}

As the number of these records can grow rapidly, the read latency for aggregations can become prohibitively high. So we want to periodically consolidate them, replacing the documents above with the following single “compaction” document that aggregates the counts per region:

{
  "total_reserved_count_US": 2,
  "total_reserved_count_EU": 4,
  "compacted_updates": 3
}

We store the count of all updates that were compacted in the “compacted_updates” field. This helps us choose the most recent compaction record, as we will see shortly.

Multiple processes in any region can do this compaction in a non-blocking way only if they are assured of a consistent cut, even if it’s stale. Since all of the compactors work with a consistent cut, the records are read in the same global order without any gaps. It is possible that one cut is more recent than the other. In the event of multiple compaction records, we simply choose the one with the greatest number of compacted_updates.
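Here is a small Python sketch of that compactor; it assumes the reservation records were read from a consistent cut, so every compactor sees the same prefix:

def latest_compaction(compaction_records):
    # With several compaction records, the one covering the most updates wins.
    return max(compaction_records, key=lambda c: c["compacted_updates"], default=None)

def compact(reservations, compaction_records):
    totals = dict(latest_compaction(compaction_records) or {
        "total_reserved_count_US": 0,
        "total_reserved_count_EU": 0,
        "compacted_updates": 0,
    })
    for record in reservations:                 # e.g. {"reserve_stock_EU": 1}
        for field, count in record.items():
            region = field.rsplit("_", 1)[-1]   # "EU" or "US"
            totals["total_reserved_count_" + region] += count
        totals["compacted_updates"] += 1
    return totals

records = [{"reserve_stock_EU": 1}, {"reserve_stock_US": 2}, {"reserve_stock_EU": 3}]
print(compact(records, []))
# {'total_reserved_count_US': 2, 'total_reserved_count_EU': 4, 'compacted_updates': 3}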

How to implement a consistent prefix guarantee in Cassandra

Cassandra is a Dynamo-style system, meaning it follows, loosely, the concepts laid out in the Dynamo paper. To provide a consistent prefix guarantee, we would need a way to guarantee that a given subset of multi-master writes are causally ordered with no gaps. This requires a transaction log of updates, so we can cut off at a place where we are certain that there are no gaps. An easy way to model this in Cassandra is to create columns where the column names are timestamps, and the values are the updates. Equipped with this model and assuming no clock skew, we have the following two options to implement a consistent prefix guarantee in Cassandra:

Use PBS

Once we determine the lag at which the probability of reading the latest write is close to 100%, we only consider columns whose timestamps are older than the current time minus that lag.
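For instance, assuming the timestamp-named-column model above and a PBS lag (in milliseconds) at which reads are effectively 100% consistent, the consistent prefix is just the columns older than now minus that lag:

import time

def consistent_prefix(columns, pbs_lag_ms, now_ms=None):
    # columns: {timestamp_ms: update}, following the model described above.
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    cutoff = now_ms - pbs_lag_ms
    return {ts: update for ts, update in columns.items() if ts <= cutoff}

columns = {1563700000000: {"reserve_stock_EU": 1},
           1563700000900: {"reserve_stock_US": 2}}
visible = consistent_prefix(columns, pbs_lag_ms=25)  # 25 ms, as in the PBS chart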

Exploit Hinted Handoffs

Cassandra implements the notion of hinted handoff, as detailed in Amazon’s Dynamo paper, as a way of handling failure; this is what Cassandra leverages for fault-tolerant replication. If a write is made to a replica node that is unavailable, the coordinator node stores a “hint” and replays the write to that replica later, within a configured window of time.

The main idea here is to poll all the nodes to see if they have any pending hints for other nodes. The time when they all report zero (tₐ) is when we know that there are no failed writes, and the only pending writes are those still in flight. So, subtracting the Cassandra timeout (rpc_timeout) from tₐ gives us our full consistency timestamp (FCT). Any column whose timestamp is older than the FCT is considered part of the consistent cut.
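A minimal sketch of that calculation, assuming a hypothetical pending_hints(node) helper that reports each node’s pending hint count (for example, via JMX or nodetool) and an rpc_timeout_ms matching your Cassandra configuration:

import time

def full_consistency_timestamp(nodes, pending_hints, rpc_timeout_ms):
    # If any node still holds hints, some failed writes may still be replaying.
    if any(pending_hints(node) > 0 for node in nodes):
        return None
    t_a = int(time.time() * 1000)   # all nodes report zero pending hints
    return t_a - rpc_timeout_ms     # beyond FCT, only in-flight writes remain

# Any column whose timestamp is older than the returned FCT falls inside
# the consistent cut.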

Conclusion

A consistent cut or a distributed global snapshot is usually thought of in the context of backups or disaster recovery. However, for a multi-master eventually consistent system, a consistent cut or a “consistent prefix” is imperative for satisfying a lot of use cases, without compromising write throughput.
