4 Stereotypes of Multi-Region Database Clusters

Dichen Li
11 min read · Aug 14, 2023


When it comes to building a cross-region database cluster, we face some harsh realities: a cross-region network roundtrip takes tens to hundreds of milliseconds depending on the distance between the two regions [1], yet data needs to be replicated cross-region to be highly available and durable. Because of this, we need to make some difficult trade-offs among availability, durability and latency in a cross-region DB cluster.

In my opinion, there are only 4 stereotypes of cross-region DB clusters that actually work [2]:

  1. Single writer with synchronous standbys.
  2. Multi-writer quorum cluster.
  3. Single writer with asynchronous read replicas.
  4. Multi-writer with conflict-free replicated data types (CRDTs).

I use these simple stereotypes as my mental model when reasoning about database replication [3]. Other variants may look different, but they are eventually just one of these, and any attempt to break out of these 4 stereotypes is going to be broken (i.e. may cause split brain or unexpected data loss). So when people talk about a new “cross-region DB cluster” design, I try to find out which stereotype (or none of these) they are really talking about.

Stereotype 1: Single Writer with Synchronous Standbys

In this stereotype, there is a single primary node and multiple cross-region synchronous standbys. With synchronous replication, transactions must first be replicated cross-region before the commit is acknowledged to the client, so all committed data remains durable through a region outage.

Failover

With this setup, it is feasible to achieve high availability (HA) and durability (in other words, low RTO and 0 RPO) on a regional outage simply by automated failover that promotes a standby to be the new writer.

Database failover requires two fundamental building blocks to work correctly: quorum and fencing.

First, it requires a quorum (like Paxos) of “failover agents”. This agent decides who is the DB writer, and executes failover when needed. Without a distributed quorum, the failover agent is either a single point of failure (i.e. unavailable when a single node fails) or incorrect (it can lead to split brain). This quorum can be either embedded into the DB engine or run as a standalone cluster. One example of a failover agent is MySQL Orchestrator. Another good example of an embedded quorum is Meta’s MySQL Raft. Some failover agents are built on top of a distributed config store like etcd. There are other variants, too.
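
To make the quorum requirement concrete, here is a minimal sketch of a majority-vote failover decision. The names (health_check, promote, replayed_lsn) are made up for illustration; this is not how Orchestrator or MySQL Raft actually implement it:

```python
# Hypothetical sketch of a quorum-based failover decision; the agent list,
# health_check() and promote() are illustrative, not a real tool's API.

def primary_is_down(peer_agents, health_check):
    """Each failover agent votes on whether the primary is unreachable."""
    votes = sum(1 for agent in peer_agents
                if health_check(agent) == "primary_unreachable")
    # Require a strict majority so a single partitioned agent (which merely
    # lost its own path to the primary) cannot trigger a failover alone.
    return votes > len(peer_agents) // 2

def maybe_failover(peer_agents, health_check, standbys, promote):
    if primary_is_down(peer_agents, health_check):
        # Promote the most up-to-date standby to be the new writer.
        new_primary = max(standbys, key=lambda s: s.replayed_lsn)
        promote(new_primary)
```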

Second, it also requires a fencing mechanism, so that after a new writer is promoted, the old writer (in case it somehow auto-recovers) must be blocked from committing new writes from clients. This is essential to prevent split brain. In a synchronous replication cluster, fencing can be achieved simply by disconnecting all synchronous standbys from the demoted primary: since we enforce synchronous replication, all writes are blocked if no replica can accept them. [4]
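
As a rough illustration of why this works, here is a simplified, hypothetical commit path on the primary (not any specific engine’s implementation): once its synchronous standbys are disconnected, the demoted primary can no longer commit anything.

```python
# Simplified, hypothetical commit path of a primary with synchronous standbys.
# The standby objects and their ack() method are illustrative.

REQUIRED_SYNC_ACKS = 1  # at least one synchronous standby must confirm

def commit(txn, connected_standbys):
    """Acknowledge a commit only after enough synchronous standbys persist it."""
    acks = sum(1 for standby in connected_standbys if standby.ack(txn.wal_record))
    if acks < REQUIRED_SYNC_ACKS:
        # Fencing in action: after failover, every standby is disconnected from
        # the demoted primary, so this branch always fires and the old writer
        # can no longer commit anything.
        raise RuntimeError("commit blocked: not enough synchronous standbys")
    txn.mark_committed()
```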

Limitations

The biggest drawback of this stereotype is high latency: any write transaction requires at least one cross-region network roundtrip due to synchronous replication. Furthermore, if the client is in a different region from the writer, it takes extra network roundtrips to execute queries as well.

This stereotype is also “modal” in the sense that once a failover happens and the primary region changes, the physical distance between clients and the DB primary can suddenly change dramatically, resulting in sudden added latency and stress on the application layer. This may uncover new cascading failure modes that you would normally not test for or see in production, and thus turn into a larger, surprising outage.

Stereotype 2: Multi-Writer Cluster with Quorum Coordination

With this stereotype, each member of the cluster can accept writes from clients, and every single write requires conflict resolution under a distributed consensus quorum before it is committed.

A typical example of this is MySQL Group Replication: to resolve concurrency conflicts, a Multi-Paxos-like algorithm called XCom was introduced to determine a global total ordering of all transactions before they are committed or rejected due to conflicts.
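
The sketch below shows the core idea of certification after total ordering, in heavily simplified form (it is not XCom’s or Group Replication’s actual implementation): each transaction is checked against transactions ordered after its snapshot, and rejected if their write sets overlap.

```python
# Simplified certification-style conflict resolution after total ordering.
# The Txn fields and in-memory history are illustrative; real systems use
# GTIDs and write-set hashes, and garbage-collect the history.

from dataclasses import dataclass, field

@dataclass
class Txn:
    snapshot_seq: int                              # last commit the txn observed
    write_set: set = field(default_factory=set)    # primary keys it modifies

committed: list = []  # (commit_seq, write_set) pairs, in global total order

def certify(txn: Txn, next_seq: int) -> bool:
    """Commit iff no already-ordered txn touched the same rows after this
    txn's snapshot; otherwise reject it (first committer wins)."""
    for commit_seq, ws in committed:
        if commit_seq > txn.snapshot_seq and ws & txn.write_set:
            return False
    committed.append((next_seq, txn.write_set))
    return True
```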

On a single node failure, the other two writers can simply continue to work as usual because the quorum stays alive despite the partial failure. This is an advantage compared to single-writer (stereotype 1), as it significantly reduces the concern about modal behavior discussed above.

Bring the writer near clients

With this multi-writer setup, at least one cross-region roundtrip is still required for synchronous quorum coordination; there is no magic to avoid that. However, because all DB nodes can accept writes, we can place database nodes closer to end users to avoid cross-region network roundtrips between the client and the DB server. This is a significant advantage compared to stereotype 1 above, especially for interactive transactions with multiple queries, where each query requires a client-to-DB roundtrip.

Single Writer: multiple cross-region network roundtrips
Multi-Writer: one cross-region network roundtrip
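
A back-of-the-envelope model (with assumed, illustrative numbers) puts figures on this difference for an interactive transaction of several queries:

```python
# Back-of-the-envelope latency model for an interactive transaction.
# All numbers below are illustrative assumptions, not measurements.

CROSS_REGION_RTT_MS = 30.0   # roundtrip between two regions
LOCAL_RTT_MS = 1.0           # roundtrip within a region
QUERIES_PER_TXN = 5

# Stereotype 1, client far from the single writer: every query crosses
# regions, plus one cross-region replication roundtrip at commit.
single_writer_ms = QUERIES_PER_TXN * CROSS_REGION_RTT_MS + CROSS_REGION_RTT_MS

# Stereotype 2, writer in the client's region: queries stay local, and only
# the commit needs one cross-region quorum roundtrip.
multi_writer_ms = QUERIES_PER_TXN * LOCAL_RTT_MS + CROSS_REGION_RTT_MS

print(single_writer_ms, multi_writer_ms)  # 180.0 vs 35.0 with these numbers
```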

Because of these two advantages (bringing the writer near clients and preventing modal behavior), I think multi-writer is more suitable than single-writer in most cross-region use cases, despite its additional architectural complexity.

Stereotype 3: Single Writer with Asynchronous Read Replicas

This is probably the most common stereotype offered by cloud service providers today. It is extremely simple: a single writer in one region, with multiple asynchronous read replicas in other regions. The main advantages of this stereotype are that writes are very fast in the primary region (assuming the client is in the same region), and that it is the simplest to build and operate as a service.

However, there is no high availability (HA) or durability guarantee whatsoever. On a writer-region disaster, no auto failover takes place (RTO is unbounded), and there can be data loss (RPO > 0) due to asynchronous replication. Some would argue that this does not even count as a “cluster” due to the lack of high availability.

Why not auto failover?

So why not just support auto failover, like stereotype 1 above? One obvious reason is to prevent data loss. Region outages are (most likely) recoverable: once the region comes back, you can recover the lost data. But once a failover happens, the lost data is no longer recoverable, because it may conflict with the new writer’s data (risk of split brain). [5]

Another less obvious, but more fundamental reason is that there is no feasible way to build fencing. As discussed in stereotype 1 above, with synchronous replication we simply need to force-disconnect all replicas to fence the writer. But this doesn’t work for asynchronous replication: a demoted writer will happily commit more writes and cause split brain with the new writer. [6]

(Not quite) low latency

With asynchronous replication, one might expect low write latency to be a major benefit. This is absolutely true for use cases where most clients are in the same region as the DB writer. But just like stereotype 1 above, latency can be high if the client is in a different region, especially for interactive transactions. As discussed above, a synchronous, multi-writer cluster (stereotype 2) may be even faster in these cases by placing the writers near clients.

Stereotype 4: Multi-Writer with Conflict-Free Replicated Data Types (CRDTs)

What if we allow multiple writers to all accept writes at the same time, allow each to commit locally, and just replicate to other peers later?

This is a perfect solution for latency: writers are placed globally near clients, and a write transaction requires zero cross-region roundtrips to commit. Write latency will be low everywhere, and overall database performance will be really, really fast.

CRDTs and CALM theorem

However, this comes with a major drawback: transaction conflicts must be resolved after they are committed. There is no way to prevent data divergence unless the database relies only on “conflict-free replicated data types” (CRDTs). Examples are counters, sets, last-write-wins registers (with a global atomic clock), etc. When two clients write CRDT data concurrently, no coordination is needed to “merge” the results, and the end state is deterministic regardless of the order in which these concurrent changes are applied to each DB node.
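
For a concrete feel, here is a minimal sketch of two such types, a grow-only counter and a last-write-wins register; the key property is that the merge function is commutative, associative and idempotent, so replicas converge no matter in which order updates arrive:

```python
# Minimal CRDT sketches: a grow-only counter (G-Counter) and a
# last-write-wins (LWW) register. Merge is commutative, associative and
# idempotent, so replicas converge regardless of delivery order.

def gcounter_increment(counter, node_id, amount=1):
    """counter maps node_id -> total increments contributed by that node."""
    counter[node_id] = counter.get(node_id, 0) + amount

def gcounter_merge(a, b):
    """Element-wise max; applying merges in any order yields the same state."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def gcounter_value(counter):
    return sum(counter.values())

def lww_merge(a, b):
    """Each value is (timestamp, payload); the later write wins. This only
    makes sense with well-synchronized (ideally atomic) clocks."""
    return a if a[0] >= b[0] else b
```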

The CALM theorem provides a great mental model for this: a DB workload is conflict-free if and only if it is monotonic. Monotonicity is a workload property where, for any input sets S and T, if S is a subset of T, then the output of the program on S is a subset of the output on T. This essentially means that if a DB write decision is made with limited information, it remains valid as more information (like transactions committed by other nodes) becomes available.

The 2007 Dynamo paper gives a typical example of such a “conflict-free” application: the shopping cart. If a customer adds items to the shopping cart from multiple places, the end result is a deterministic union of them all.
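
Assuming an add-only cart (which is what makes this workload monotonic), the merge is just a set union, as in this small sketch:

```python
# Dynamo-style cart merge, assuming an add-only shopping cart. Removals would
# make the workload non-monotonic and need extra machinery (e.g. tombstones
# in an OR-Set), which is omitted here.

cart_on_node_a = {"book", "laptop"}
cart_on_node_b = {"book", "headphones"}

merged = cart_on_node_a | cart_on_node_b   # set union, order-independent
print(merged)  # {'book', 'laptop', 'headphones'} on every replica
```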

Developer challenges

As you can see, the conflict-free model described by the CALM theorem has very limited use cases, which makes it suitable only for some purpose-built data processing applications like shopping carts. Often, some features and customer experiences need to be sacrificed because they cannot be supported by the database model. In fact, even the AWS DynamoDB service did not follow the programming model from the original 2007 Dynamo paper.

This database architecture also requires application developers to have an in-depth understanding of the programming model. They must handle corner cases where a change is committed but later mysteriously disappears (e.g. lost in conflict resolution due to last-write-wins). Such complexity can result in extremely nasty, hard-to-test bugs that make application development extremely challenging and slow. Nowadays developer time and intelligence are often the scarcest resources in an organization, so I think it is usually not a good trade-off to make code development harder and slower for the extra DB performance.

In Conclusion

Each of these 4 stereotypes is “possible”, in the sense that they can deterministically prevent unresolvable conflicts and data divergence.

  1. Multi-writer (stereotype 2) wins over single writer (stereotype 1) in that it brings the writer closer to the client and avoids modal behavior. Both are significant concerns in cross-region replication, where regions can differ in geo-location, hardware capacity and reliability, so changing the writer has a big impact on the application layer.
  2. A multi-writer cluster is also better than an asynchronous cluster (stereotype 3) in that it adds support for disaster recovery with 0 RPO and low RTO. This comes with the trade-off of one extra network roundtrip.
  3. Stereotype 4 (asynchronous multi-writer clusters with CRDTs) is only suitable for limited, purpose-built use cases. It also requires application developers to have a deep understanding of the data model, and can be hard to debug in many corner cases. This is a significant trade-off of developer velocity for DB performance.

In fact, cross-region latency may not be as bad as it sounds. For example, the network roundtrip latency between London and Paris on AWS is only ~10 ms. In comparison, MySQL commit latency on a single-region, multi-AZ cluster is typically about 5 ~ 10 ms: they are close enough. Also, in most use cases, TPS (transactions per second) is much more important than latency. Databases with cross-region replication may actually achieve very high TPS in many cases, as long as you don’t hit some performance bottleneck too early (like cross-region network bandwidth).

Nevertheless, stereotypes are, as the name suggests, over-simplified mental shortcuts. As an engineer, you always want to work backwards from your specific user experiences and business requirements, and reason from ground truths, not shortcuts.

Footnotes

  1. For example, an AWS regional cross-availability-zone (AZ) network roundtrip takes 100 µs ~ 1 ms, whereas a cross-region network roundtrip takes from 10 to hundreds of milliseconds. So latency is not a big concern for a regional, multi-AZ cluster, but it is a significant concern for cross-region databases. This website gives a heuristic view of cross-region network latency among different AWS regions: https://www.cloudping.co/grid .
  2. “They will work” in the sense that they can deterministically prevent unresolvable conflicts on committed data sets that would cause split-brain issues. This is a common requirement for OLTP databases that store the “source of truth” of customer data. But sometimes this may actually not be a design goal. For example, it is often not a big concern for OLAP databases, which often extract and collect data from multiple “sources of truth” that don’t conflict with each other, so conflict resolution can be skipped in exchange for better performance.
  3. To simplify our mental model, I will only discuss single-partition DB clusters in this blog, where all nodes share the same data without sharding. I think this greatly simplifies the subject matter, and the main conclusions here apply, at the per-partition level, to multi-partition, horizontally scalable clusters as well (often seen in NoSQL and NewSQL databases). I will also assume a cluster has only 3 nodes in 3 regions. Adding more nodes, or multiple nodes per region, introduces new scenarios to discuss, but they do not fall out of the 4 stereotypes discussed here either.
  4. More generally, in an N-node cluster (N ≥ 3), fencing can be done if we require a transaction to be persisted on at least (N+1)/2 (a strict majority) of the nodes to be committed, and also require (N+1)/2 standby nodes to disconnect from the writer on failover.
  5. A 24-hour-long GitHub outage in 2018 vividly demonstrated such a scenario, where they had to make some difficult trade-offs between downtime and data loss.
  6. There is a middle ground between stereotypes 1 and 3, e.g. asynchronous replication that tolerates at most X seconds of replica lag before new writes are blocked. This way, automatic failover is feasible, with at most X seconds’ worth of potential data loss / divergence. Nevertheless, data divergence is not entirely prevented, so it does not count as another stereotype in this blog.

All opinions are my own.


Dichen Li

I'm a senior software engineer at Amazon. I've spent years building databases as cloud services. I'm currently looking for new job opportunities.