Demystifying Consistency and Isolation for a Distributed Systems Engineer

Abhishek Rai
Sep 11, 2017

ACID has served as a convenient taxonomy for database semantics for decades. But tradeoffs in modern database system design often extend beyond just an A-C-I-D classification. High availability (partition tolerance), performance, session guarantees, etc., are other important factors in designs today, popularized by the new distributed data stores powering modern applications. This overlap between databases and distributed systems has been an active area of research and has yielded interesting results around isolation, which we cover here.

Isolation Levels

Atul Adya’s PhD thesis in 1999 is a seminal work presenting a combined taxonomy of consistency semantics in databases. When talking of different types of consistency, what we are really talking about is different types of “weak” consistency. Different weak consistency models form a partially ordered hierarchy with strong consistency at the top (specifically: “strong one-copy serializability”). Each type of weak consistency presents some opportunity to trade off consistency for availability and performance. In this discussion, as in the literature, consistency and isolation are used interchangeably.

Below we present isolation levels while calling upon standard terminology where relevant. We start by covering the ACID / ANSI SQL isolation levels. Note that this is not the only taxonomy of isolation levels. For example, modern applications additionally need to support session semantics, which take a slightly different view of isolation. We will see each of these categories below.

An alternate way of organizing isolation levels is in terms of “anomalies” that they avoid. An anomaly is undesired behavior which happens when transactions are not well isolated. For example, “dirty write” is an anomaly which is addressed by the “read uncommitted” isolation level. Indeed, isolation levels are typically defined in the literature by the anomalies they avoid more than anything else. We introduce various anomalies as we introduce individual isolation levels.

Note that this post has two interwoven threads — isolation and availability. You will notice that as we go from weak to strong isolation, our high availability options diminish — there is an inherent tradeoff between isolation and availability. So, alongside each isolation level, we include a blurb about availability under a section titled HAT (Highly Available Transactions). We recommend skipping these sections on your first read. Once you have gone through the various isolation levels and read the overview of HAT towards the end of this post, re-read the descriptions of the isolation levels, this time focusing on the HAT discussion.

Isolation related to ACID / ANSI SQL

1. ANYTHING IS POSSIBLE

No isolation is the most basic isolation level. Concurrent transactions in this model read and write shared data items willy-nilly, without any synchronization or ordering.

Next

The simplest possible improvement over this state of chaos is to honor the concept of a transaction even at the most primitive level. To that end, we will control access to uncommitted values, which can be done in two steps: prevent writes over uncommitted values, and prevent reads of uncommitted values. The first of these brings us to read uncommitted.

2. READ UNCOMMITTED / Adya’s PL-1 / NO DIRTY WRITES

  • Disallow (over)writes to uncommitted values, OR
  • In-flight transactions cannot overwrite items written by each other, OR
  • Disallow “dirty writes” anomaly (writes over uncommitted writes).
  • Interestingly, the common approach to providing this most basic isolation level, “last writer wins” (details below), requires determining a total order of all transactions in the system, and this feature (a total order) is applied in many other contexts, e.g. session guarantees and eventual consistency (convergence). This order is also sometimes referred to as the “read uncommitted order”.

Details

A dirty write is when a transaction overwrites an uncommitted value written by another in-flight transaction. At the time of commit, integrity constraints between different objects may be violated. Note that this is a problem even if none of the transactions were to rollback. A handy way of detecting this is by finding a cycle in a graph of all transactions where an edge A → B exists iff A overwrites a value written by B.

Implementation choices

  • If txn A needs to overwrite an uncommitted value, abort A → unavailability → bad.
  • If txn A needs to overwrite an uncommitted value, block A; concurrently B may get blocked on A on another dirty value → deadlock → bad.
  • Buffer all writes of a txn and then write atomically to all values, but how to order txns?
  • Following up from the cycle observation, establish a total order between all txns, and let the “last writer wins” (preferred).
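Following the preferred choice above, a “last writer wins” register might be sketched as below. This is an illustrative sketch, not the post’s exact algorithm: the (timestamp, node_id) tag is one assumed concrete realization of the total order, and `LWWRegister` is a hypothetical name.

```python
# Sketch of "last writer wins" using a coordination-free total order.
# Assumption (not from the post): each write carries a (timestamp, node_id)
# tag, and lexicographic comparison of tags yields the total order.

class LWWRegister:
    """A register where the write with the highest (timestamp, node_id) wins."""

    def __init__(self):
        self.value = None
        self.tag = (0, "")  # (timestamp, node_id); compared lexicographically

    def write(self, value, timestamp, node_id):
        tag = (timestamp, node_id)
        # Apply the write only if it is later in the total order;
        # otherwise it is silently superseded (the "last writer" wins).
        if tag > self.tag:
            self.tag = tag
            self.value = value

    def read(self):
        return self.value


# Two replicas applying the same writes in different orders still converge,
# because each applies the write that is highest in the total order.
a, b = LWWRegister(), LWWRegister()
a.write("x=1", 1, "n1"); a.write("x=2", 2, "n2")
b.write("x=2", 2, "n2"); b.write("x=1", 1, "n1")
assert a.read() == b.read() == "x=2"
```

Because each server can apply this rule locally, no cross-server coordination is needed at write time, which is what makes this approach attractive for HAT (see below).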

HAT

✅ If a total order can be established between transactions without requiring coordination (see HAT section below), then “last writer wins” can be implemented at each server without coordination and hence in a HA manner.

Notably, this isolation level is confusingly called READ UNCOMMITTED instead of, say, DISALLOW DIRTY WRITES. Isolation levels are named in the affirmative. So, instead of being identified by the anomaly it avoids (e.g. DISALLOW DIRTY WRITES), it is named (ALLOW) READ UNCOMMITTED since it does NOT disallow reads of uncommitted values; that happens to be the target of the next higher isolation level, READ COMMITTED.

Next

Now that we have restricted writes over uncommitted values, the next step is to extend this restriction to reads of uncommitted values, which brings us to read committed.

3. READ COMMITTED / Adya’s PL-2 / NO DIRTY READS

  • Disallow “dirty reads” anomaly
  • Disallow all accesses (writes and reads) of uncommitted values.
  • Default isolation level in most DBMS.

Details

Reading an uncommitted value implies accessing a potentially inconsistent version of the database where integrity constraints on objects may be violated. A transaction based on this inconsistent view may break any integrity constraints during its writes.

Implementation choices

  • Most obvious choice is to buffer all writes of a txn until it’s ready and commit atomically.
  • In case of large writes, multi-versioning may be a viable choice as well.
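The write-buffering choice can be sketched as follows; `Store` and `Txn` are illustrative names, and single-process dicts stand in for the database.

```python
# Sketch of write buffering for READ COMMITTED: a txn's writes live in a
# private buffer, invisible to other txns, until commit installs them.

class Store:
    def __init__(self):
        self.committed = {}  # only committed values are ever visible here

class Txn:
    def __init__(self, store):
        self.store = store
        self.buffer = {}  # uncommitted writes live only inside the txn

    def write(self, key, value):
        self.buffer[key] = value  # not visible to other txns yet

    def read(self, key):
        # A txn sees its own buffered writes; otherwise, committed state only.
        if key in self.buffer:
            return self.buffer[key]
        return self.store.committed.get(key)

    def commit(self):
        # Install all buffered writes at once (single-threaded sketch;
        # a real store would make this installation atomic).
        self.store.committed.update(self.buffer)
        self.buffer = {}

store = Store()
t1, t2 = Txn(store), Txn(store)
t1.write("x", 10)
assert t2.read("x") is None  # no dirty read: t1's write is still buffered
t1.commit()
assert t2.read("x") == 10    # visible only after commit
```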

HAT

✅ Write buffering is easily amenable to HAT.

Next

Now that we have ensured that all reads and writes access only committed values, what could go wrong?

Answer: we need to provide some sanctity on the committed values themselves. For example, which transactions should be visible, can older or newer transactions be read, are the reads stable, etc.? Of these, the most basic is ensuring read stability, which brings us to repeatable read.

4. REPEATABLE READ / ITEM-CUT ISOLATION / NO FUZZY READS

  • Disallow “fuzzy reads” anomaly, i.e. re-reading same object during a txn yields different values.
  • Reading same item multiple times from a txn (unmodified by txn) should return same value. (item-cut isolation)

Details

Confusingly, repeatable read has several definitions in the literature. Here, we are referring to the ANSI standard definition, which we have used above. All other definitions, including Adya’s, are more advanced.

Implementation choices

Client to buffer all values read by a txn until end of txn.
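Client-side read buffering for item-cut isolation might look like this minimal sketch, assuming a shared dict as the committed state; the names are illustrative.

```python
# Sketch of read buffering for REPEATABLE READ (item-cut isolation):
# the first read of each item is cached, so repeat reads never change
# mid-txn even if other txns commit new values underneath.

committed = {"x": 1}  # stand-in for the shared committed state

class Txn:
    def __init__(self, db):
        self.db = db
        self.read_cache = {}  # first read of each item is remembered

    def read(self, key):
        # Serve repeat reads from the cache: no fuzzy reads.
        if key not in self.read_cache:
            self.read_cache[key] = self.db[key]
        return self.read_cache[key]

t = Txn(committed)
assert t.read("x") == 1
committed["x"] = 2            # another txn commits a new value
assert t.read("x") == 1       # fuzzy read avoided: same value as before
```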

HAT

✅ Read buffering is easily amenable to HAT.

Next

Extend to predicate based reads

5. REPEATABLE READ / PREDICATE-CUT ISOLATION / NO PHANTOM READS

  • Disallow “phantom reads” anomaly, i.e. while a txn is doing a predicate-based read, another txn commits changes that alter the predicate’s result; see Contradictions below.
  • Repeatable read over a predicate e.g. select where … (predicate-cut isolation)

Implementation choices

Client to buffer all values read by a txn until end of txn.

Contradictions

Repeatable Reads as defined above cannot prevent phantoms per the definition in [2]. On the contrary, [3] claims that this definition of repeatable reads does not suffer from phantoms. We agree with [2]: if, while a txn is doing a large predicate read, another txn commits updates to some of the yet-to-be-read values, the approach proposed in [3] (caching middleware) will not help, because you have to finish the predicate read before you can cache it. [3] also proposes an alternate implementation which uses multi-version reads and has a better chance of preventing phantoms.

HAT

✅ Read buffering is easily amenable to HAT.

Next

Now that reads are stable, we should ensure that only the effects of expected transactions are visible to an in-flight transaction. For example, can it read values written by an older transaction which have since been modified (and committed) by a later transaction? In the strictest sense, ensuring that we cannot read older values at all requires recency information and linearizability, as we will see below (definition: the order of transactions matches wall-clock order). But for now we will start with a more basic constraint on what is visible — monotonic atomic view.

6. MONOTONIC ATOMIC VIEW (MAV) / some protection from READ SKEW

  • Provides some protection from read skew anomaly, not complete protection, see below.
  • Once some effects of a txn A are observed by txn B, then all effects of A are visible to B.
  • Isolation effect of “Atomicity” in ACID — how much of a committed transaction should be visible to others — all or nothing.

Details

  • MAV does not prevent reading very old data which has since been overwritten — it’s not a recency constraint.
  • MAV does not prevent reading a more recently committed version of an item. For example, txn A wrote x, y. Txn B read x written by A, but when it gets around to reading y (say after indefinite delay), it reads a newer value of y committed by some other txn.
  • MAV is orthogonal to repeatable read (item-cut isolation) but their combination is sufficient to provide powerful features:
    — Maintaining foreign key constraints
    — Consistent global secondary indexing
    — Maintenance of derived data

All of these behaviors relate to dependencies between records. Repeatable Reads and MAV don’t offer anything special for them, but they are a necessary low bar to ensure a minimally consistent view of the DB for the txn that is maintaining derived data. For example, RR and MAV do not prevent lost updates, which, strictly speaking, do not conflict with the above use cases and hence are not a requirement for supporting them; RR and MAV, however, are.

Contradictions

[3] claims that MAV prevents read skew. Read skew is when a txn reads different keys whose values were last committed by different txns, and thus ends up with a view that violates some integrity constraint between the keys. For example (invariant x+y=100): txn A writes x=10, y=90; txn B writes x=20, y=80; txn C reads x=10, y=80, violating x+y=100. With MAV, read skew is only partially prevented: a txn cannot read older values, but the anomaly remains possible because it can read newer values.
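The read-skew scenario can be traced in a few lines; this is a bare illustration of the anomaly itself, not of MAV:

```python
# Illustration of read skew with the invariant x + y == 100.
# A single committed dict stands in for the database.

committed = {}

def commit(writes):
    committed.update(writes)

commit({"x": 10, "y": 90})   # txn A commits a consistent state
x = committed["x"]           # txn C reads x from A's state
commit({"x": 20, "y": 80})   # txn B commits while C is still running
y = committed["y"]           # C now reads the NEWER y (allowed under MAV)

# C's view (x=10, y=80) never existed in the database:
assert x + y == 90           # the invariant x + y == 100 is violated
```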

Implementation choices

  • Locking, OK in a single-node setting, but inefficient and non-HA in a distributed setting.
  • For an HA algorithm, refer to the section below on HAT algorithms.

HAT

✅ Achievable, algorithm in [3].

Next

Now that we’ve ensured that a txn can access only stable, committed values from other wholly committed txns, what’s still missing?

One thing we are missing is a guarantee that the state accessed (read) by a txn ever existed in the database at any single point in time. Consider the contrived case of a txn stretching 1000 years: at various points it may read new items, receiving, say, recently committed values. Had the txn executed instantly at its start, it would have read the older values that existed in the database at that time. This leads us to snapshot isolation.

7. SNAPSHOT ISOLATION WITH LOST UPDATES / NO READ SKEW, NO PHANTOM READS

  • Disallow lost update and phantom reads anomaly.
  • A txn executes against a (fully committed) snapshot of the database (normally: as it existed when the txn started).
  • Suffers from lost-updates: concurrent txns using the same snapshot can commit different values for some common key.

Details

  • Read skew is prevented because all reads are served from an actual committed state of the system at which point all integrity constraints were valid.
  • Phantom reads are prevented because predicate reads are satisfied by a snapshot of the whole system, not a continuously changing system underneath like in the case of Repeatable Reads and MAV above.
  • Improvement over MAV but lost updates is an issue.

Implementation choices

  • When starting the txn, identify a global timestamp (vector clock, version, etc.) that will be used for satisfying all reads. Specifically, no reads newer than this snapshot/version. Note that MAV already ensures that no reads older than this. Provide this version in all reads to the DB.
  • A hallmark of the implementation of this and higher isolation levels is that they need to keep the snapshot version “pinned” for the duration of the transaction. In a distributed setting, they would need to coordinate with background garbage collection of older versions.
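Snapshot reads over a multi-versioned store might be sketched as below, assuming monotonically increasing commit timestamps; `MVStore` and `read_at` are hypothetical names.

```python
# Sketch of snapshot reads over a multi-versioned store. Assumption
# (not from the post): commit timestamps increase monotonically and
# old versions are retained (i.e. the snapshot stays "pinned").

class MVStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), in ts order

    def commit(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read_at(self, key, snapshot_ts):
        # Return the latest value committed at or before the pinned snapshot;
        # nothing newer than the snapshot is ever visible to the txn.
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= snapshot_ts:
                result = value
        return result

db = MVStore()
db.commit("x", "old", ts=1)
snapshot = 1                  # txn pins its snapshot version at start
db.commit("x", "new", ts=2)   # a later commit must not leak into the txn
assert db.read_at("x", snapshot) == "old"
```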

HAT

✅ Achievable; similar to the MAV algorithm above, but replicas return only the specific requested version.

Next

Next step would be to get rid of lost updates.

8. SNAPSHOT ISOLATION / NO LOST UPDATES

  • Disallow lost updates anomaly
  • Snapshot Isolation with the guarantee that the txn itself will successfully commit only if no updates it has made conflict with any concurrent updates made since the starting snapshot.

Implementation choices

In addition to discussion above around snapshot isolation with lost updates, we additionally need to detect conflicts at commit time. This probably needs some centralized gatekeeper.
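A commit-time conflict check (“first committer wins”) could be sketched as below, with a single in-process gatekeeper as an assumed stand-in for whatever centralized component a real system would use; all names are illustrative.

```python
# Sketch of commit-time conflict detection for snapshot isolation:
# a txn may commit only if none of its written keys were committed
# by someone else after the txn's starting snapshot.

class Gatekeeper:
    def __init__(self):
        self.last_commit = {}  # key -> timestamp of its last committing txn
        self.clock = 0         # logical commit clock

    def begin(self):
        return self.clock  # the txn's snapshot timestamp

    def try_commit(self, snapshot_ts, write_keys):
        # Abort if any written key was committed after our snapshot.
        if any(self.last_commit.get(k, 0) > snapshot_ts for k in write_keys):
            return False
        self.clock += 1
        for k in write_keys:
            self.last_commit[k] = self.clock
        return True

gk = Gatekeeper()
s1 = gk.begin(); s2 = gk.begin()         # two concurrent txns, same snapshot
assert gk.try_commit(s1, ["x"]) is True  # first committer wins
assert gk.try_commit(s2, ["x"]) is False # conflicting update is aborted
```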

HAT

🚫 HAT systems cannot prevent lost updates. Conflicting concurrent txns served by different servers may be allowed to commit without coordination.

Next

We seem to be in a perfect world with all txns working off snapshots and aborting on conflicts during commit. What else can go wrong?

The problem is that the aborts are based on conflicting updates to the same key(s). What if two concurrent txns update a different set of keys and still conflict? For example, customer has two accounts with balances 20 and 80; goal is to charge a fee of 10 from customer; two concurrent txns are doing it based off the same snapshot (20,80); one txn deducts fee from 20 and another from 80; both commit without conflicts. This anomaly is called Write Skew. It’s similar to read-skew in that the skew is across multiple keys, except that this one is caused by writes not reads.
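The double-fee scenario can be traced directly; because the two txns’ write sets ({acct1} vs {acct2}) do not overlap, per-key conflict detection at commit time sees nothing wrong:

```python
# The write-skew example from the text: both txns see snapshot (20, 80),
# each deducts the fee from a different account, and both commit because
# their write sets do not intersect.

snapshot = {"acct1": 20, "acct2": 80}
committed = dict(snapshot)

# txn A deducts the 10 fee from acct1; txn B, concurrently, from acct2.
committed["acct1"] = snapshot["acct1"] - 10
committed["acct2"] = snapshot["acct2"] - 10

# No per-key conflict was detected, yet the fee was charged twice:
# the total is 100 - 20 = 80, not the intended 100 - 10 = 90.
assert committed["acct1"] + committed["acct2"] == 80
```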

9. SERIALIZABILITY / NO WRITE SKEW

  • Disallow write skew
  • Snapshot isolation with full serialization → it’s hard to know how/where write skew will arise → serialize across the whole system.

Implementation choices

  • Single global lock or quorum.

HAT

🚫 HAT systems cannot prevent lost updates or write skew as they cannot use global/central control.

Next

What else can one need? As far as consistency and isolation are concerned, we are done. The next level of practical requirements are around recency. Does a txn read the most recent value of each data item? Can I bound that I read data no older than 5s?

10. LINEARIZABILITY / NO STALE READS

  • Each read of a data item returns the last committed value for that data item.
  • Recency constraint, the “C” in CAP theorem.

Implementation choices

  • Global lock per-transaction or quorum.
  • Note that even ZooKeeper doesn’t guarantee this by default despite using a quorum. A client may be connected to a follower, and the follower may be trailing behind the leader. So clients can invoke the “sync” operation, which forces the follower to fully catch up with the leader as of that time and ensures that subsequent reads from the follower are up to date as of the time of the sync.

HAT

🚫 In the presence of a long network partition, HAT cannot provide any guarantees about recency, hence HAT cannot support linearizability.

11. STRONG/STRICT ONE-COPY SERIALIZABILITY

  • One-copy serializability + Linearizability

Isolation related to Session Semantics

Session semantics are an alternate framework for defining isolation. Whereas ANSI SQL isolation deals with the actual state of the database and what each transaction observes, session semantics deal with behavior ACROSS TRANSACTIONS issued from the same session. Specifically, they define the behavior observed by the session as well as by anyone else observing the session’s behavior.

MONOTONIC READS / NO REGRESSIONS IN READ ORDER

  • In the context of a session, subsequent reads of an object should never return an old value.

Implementation choices

Reads from each item progress according to some total order, e.g. the order from Read Uncommitted / last-writer wins is good enough.
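One way to sketch monotonic reads: the session tracks the highest version it has observed per item (versions assumed to come from some total order such as the last-writer-wins order) and refuses any value that would regress. The names and the refuse-with-exception policy (versus, say, retrying another replica) are illustrative.

```python
# Sketch of enforcing monotonic reads within a session. Replicas are
# modelled as dicts mapping item -> (version, value).

class Session:
    def __init__(self):
        self.high_water = {}  # item -> highest version observed so far

    def read(self, item, replica):
        version, value = replica[item]
        # A value older than one already observed would be a regression.
        if version < self.high_water.get(item, -1):
            raise RuntimeError("stale replica: would violate monotonic reads")
        self.high_water[item] = version
        return value

fresh = {"x": (2, "new")}   # replica that has applied the latest write
stale = {"x": (1, "old")}   # lagging replica
s = Session()
assert s.read("x", fresh) == "new"
try:
    s.read("x", stale)      # the session refuses to go backwards
    rejected = False
except RuntimeError:
    rejected = True
assert rejected
```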

MONOTONIC WRITES / OBSERVER SEES IDENTICAL WRITE ORDER

  • A global observer sees the same order of transactions in which they were committed in the session.

Implementation choices

Enforce read-uncommitted order / last writer wins, or any other scheme.

READ YOUR WRITES

  • Whenever a client reads a data item after updating it, the read returns the updated value (or a value that overwrote it).

Implementation choices

  • Client caching its writes is an option sometimes but not always, e.g. if dependent txns are expected from the server then client side caching will not help.
  • Servers need to synchronize, problematic with HAT except under sticky availability, see below.

PIPELINED RANDOM ACCESS MEMORY / PRAM

  • Provides the illusion of serializing each of the operations (both reads and writes) within each session.
  • Monotonic Reads + Monotonic Writes + Read your writes

WRITES FOLLOW READS / HAPPENS BEFORE

  • Happens-before relationship across transactions preserved across sessions.
  • If a session observes an effect of transaction T1 and subsequently commits transaction T2, then another session can only observe effects of T2 if it can also observe T1’s effects.

CAUSAL CONSISTENCY / PL-2L

  • PRAM + Writes follow Reads

HAT

To check for HAT, imagine what happens when client sends successive operations to different servers. How do we ensure that session semantics are maintained without synchronization?

Lower bounding of reads is not sufficient. Imagine a DB with a single object and reads lower bounded to version V: subsequent reads of the object from different replicas may return V, V+1, V, which meets the lower-bounding constraint but violates monotonicity. Instead, we can adopt the Write Hiding technique from the HAT section below: force servers to wait to reveal new writes (say, by buffering them in separate local storage) until each write’s respective dependencies are visible on all replicas.

✅ Monotonic Reads: OK with total order of txns and write hiding.
✅ Monotonic Writes: OK with total order of txns and write hiding.
✅ Writes follow reads: OK with write hiding
🚫 Read your writes: NO; if a client commits an update on a server and then tries to read it from another server, it may not see the updated value.
🚫 PRAM: NO (no read your writes)
🚫 Causal Consistency: NO

HAT Under sticky availability

✅ All of session semantics and causal consistency support HAT semantics.

Other Semantics

Continuing our exploration of classifying isolation levels from ANSI SQL and Session Semantics, we cover additional consistency options in this section.

EVENTUAL CONSISTENCY / CONVERGENCE

In the absence of new mutations to a data item, all servers should eventually agree on the value for each item.

Implementation Choices

  • This is typically accomplished by any number of anti-entropy protocols, which periodically update neighboring servers with the latest value for each data item.
  • A final value can be established using a total ordering of all txns, e.g. last-writer-wins / read-uncommitted txns.

HA Transactions (HAT)

HA is defined as the ability of a system to satisfy any client’s requests as long as at least one server is functional. In other words, regardless of network partitions or extensive server outages, the system should remain available as long as any one server is functional.

MENTAL TRICK

When thinking of HAT algorithms, imagine that the client sends each read/write within a transaction, or each successive transaction, to a different server each time. There are many servers in the system, each of which is frequently stopping and restarting. Will the algorithm work in this environment?

Common Techniques

HAT algorithms share some common techniques which we list here.

Synchronization and Recency

By definition, HAT algorithms cannot rely on synchronization between servers, as that would be vulnerable to network partitions. But synchronization is a spectrum. Even though real-time synchronization may not be required, some background synchronization is useful for making progress. Synchronization can be quantified using a “recency” measure: more frequent sync provides more recent data to clients, but relying on more recency hurts high availability. A hallmark of HAT is that sync/recency may be nice to have but is not critical, whereas non-HAT algorithms like linearizability require strict synchronization.

Total ordering of txns

Even though global synchronization is expensive and non-HA, computing a total ordering of all txns is trivial and HA compliant. For example, txns can be ordered by a client or server generated txn ID of the form generator-id, generator-local-sequence. Total ordering can help resolve all sorts of conflicts.
Usage:
- dirty writes in READ-UNCOMMITTED
- total ordering in MONOTONIC READS
- total ordering in CONVERGENCE / EVENTUAL CONSISTENCY
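A coordination-free total order can be sketched with IDs built from a generator ID and a generator-local sequence. Note that the sketch compares the sequence component first so that sorting interleaves generators (one assumed concrete choice, not the only valid one); in practice a timestamp often plays the sequence role.

```python
# Sketch of coordination-free total ordering of txns: each generator
# (client or server) mints IDs (local_sequence, generator_id) with no
# coordination; tuple comparison then yields one global total order.

import itertools

def make_txn_id_factory(generator_id):
    seq = itertools.count(1)  # generator-local sequence, starts at 1
    return lambda: (next(seq), generator_id)

next_a = make_txn_id_factory("server-a")
next_b = make_txn_id_factory("server-b")

ids = [next_a(), next_b(), next_a()]
# Any two IDs are comparable, so every node sorts them the same way,
# giving a system-wide total order without any synchronization.
assert sorted(ids) == [(1, "server-a"), (1, "server-b"), (2, "server-a")]
```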

Lower bounding reads

HAT algorithms can operate without recent data due to the lack of synchronization, but often they need to ensure that all the data they are reading is consistent at least. Consistency has various shades too, but the HAT compliant ones usually require only that they don’t go backwards, for which lower bounding is useful.
Usage:
- MAV algorithm below

Write Hiding

Some isolation levels require monotonicity in a client’s observed behavior across accesses to different servers (see Monotonic Reads, for example). Short of an MVCC datastore, there is no solution to this problem that can be implemented entirely at read time (see the example in the Monotonic Reads section). So the trick is to prepare for this during the writes themselves.
Usage:
- Session semantics (Monotonic read/write + writes follow reads)

When a server receives a new write, it stores it but does not make it immediately available for reads. Such writes are called pending stable. When the write has been similarly stored on each replica of the item (orchestrated by the client or the service), a two-phase commit (2PC) moves the write from pending stable to committed, making it available for reads. The latency of 2PC does not affect HA since HAT txns do not care about recency. So, as long as some background protocol ensures liveness of the 2PC, this is a viable strategy for HA.
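The pending-stable/reveal dance might look like the sketch below, with the two 2PC phases reduced to explicit `prepare` and `reveal` calls and the coordinator elided; all names are illustrative.

```python
# Sketch of write hiding: a write is stored as "pending stable" and
# revealed to readers only after every replica of the item holds it.

class Replica:
    def __init__(self):
        self.pending = {}    # pending stable: stored but hidden from readers
        self.committed = {}  # revealed, visible to readers

    def prepare(self, key, value):
        self.pending[key] = value        # phase 1: store, do not reveal

    def reveal(self, key):
        self.committed[key] = self.pending.pop(key)  # phase 2: reveal

    def read(self, key):
        return self.committed.get(key)   # pending writes stay invisible

replicas = [Replica(), Replica()]
for r in replicas:
    r.prepare("x", 42)
# Until the coordinator learns every replica holds the write,
# no reader on any replica can observe it.
assert all(r.read("x") is None for r in replicas)
for r in replicas:
    r.reveal("x")
assert all(r.read("x") == 42 for r in replicas)
```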

MAV Algorithm

From [3].

  1. Read Uncommitted (last-writer-wins, total order of txns) plus
  2. Read Committed (buffer all writes in client), plus
  3. Write Hiding: Replicas wait to reveal new writes to readers until all of the replicas for the final writes in the txn have received their respective writes (are pending stable).
  4. Clients include additional metadata with each write: a single timestamp (vector clock) for all writes in the txn (e.g. as in read-uncommitted) and a list of items written to in the txn.
  5. When a client reads, the return value’s timestamp and list of items form a lower bound on the versions that the client should read for the other items.
  6. When a client reads, it attaches a timestamp of its request representing the current lower bound for that item.
  7. Replicas use this timestamp to respond with either a write matching the timestamp or a pending stable write with a higher timestamp.
  8. Servers keep two sets of writes for each data item: the write with the highest timestamp that is pending stable and a set of writes that are not yet pending stable.
  9. Entirely master-less and operations never block due to replica coordination.

References

[1] Towards an Isolation Level Standard, Adya et al, 2000
[2] A Critique of ANSI SQL Isolation Levels, Berenson et al, 1995
[3] Highly Available Transactions: Virtues and Limitations, Bailis et al, 2014
[4] The Morning Paper (Adrian Colyer): posts on A Critique of ANSI SQL Isolation Levels; Quantifying Isolation Anomalies; Highly Available Transactions: Virtues and Limitations; Generalized Isolation Level Definitions
