I recently came across a blog about the categorization of consensus algorithms in distributed databases. I agree with the two broad categories and most of the examples, and I love the detailed explanation. But I do not fully agree with the pros and cons mentioned in it, and I also felt the terminology was complex and led to some confusion. This is my attempt to explain the same concepts based on my personal experience and lots of hard-learned lessons.
Key concepts
What is a Database?
Honestly, I don’t have a good, simple engineering answer for this. Google says “A database is a collection of data that is organized for use by a computer”. That’s too broad. Does that mean a set of integers is a database?
We are more interested in database management systems, which do a lot more than a simple database: things like access control, backups, auditing, and high availability. The DB-Engines ranking tracks 423 of these! These are some of the most complex software programs in the world, more complex than operating systems.
What is High Availability?
Traditionally, databases ran on a single machine (most still do). These were not the typical “off the rack” servers we get in the cloud today; they were enterprise-grade servers with a much lower failure rate. But all hardware eventually fails: rats eat your power cords, sharks eat your network cables, your disks get headaches and stop spinning, or a ⚡ decides to help warm up your data center. But much more common than rats and sharks is everyone’s most and least favorite thing about software… Upgrades! Upgrades to the OS, drivers, networks, and the database itself. So despite best efforts, these applications go offline. And when that happens, the impact is massive, because important stuff is kept in databases. High availability of a database, in a broad sense, just means the database system can handle the failure of a machine, or some of its network, and still keep doing whatever it is that databases do. These are closed systems, so we are not concerned with bad actors, hallucinating networks, and other complex failure models.
The typical way to measure availability is by the percentage of uptime, like 99.99%. That’s roughly a 4-minute outage every month. Most commercial managed databases offer 99.995% or even 99.999% (which is just 26 seconds of outage every month!). This is our goal.
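To put those numbers in perspective, here is a quick back-of-the-envelope calculation of the monthly downtime budget for a given availability target (assuming a 30-day month):

```python
# Monthly downtime budget for a given availability target (30-day month assumed).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (99.99, 99.995, 99.999):
    downtime_min = MINUTES_PER_MONTH * (1 - availability / 100)
    print(f"{availability}% -> {downtime_min:.2f} min (~{downtime_min * 60:.0f} s) of downtime per month")
```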
Oh, and we cannot lose data. Customers don’t like a “shark ate your rows” RCA.
Replication
The smartest brains on the planet have concluded that the best way to survive the loss of a node (or machine) is to have a hot standby node with an identical copy of the data. It’s better if these nodes do not share the same power cords, network cables, buildings, zip codes, or even country. Standby means it can fully take over when the primary node goes offline, and hot means it can do so quickly. The challenge is in keeping an identical copy, meaning if you change the data, then you must do so on both of the nodes. This means actively replicating the data or its changes to the standby machine.
Why not just store the data in S3, NFS, or any other blob storage that does this already? Of course, the easiest way to do something is to make someone else do the work for you. But these systems have the same underlying challenges and use similar techniques to keep their data highly available and consistent. By the broad definition above, these are databases themselves.
Leader election
Now that you have two identical nodes, you have the hardest problem in software: naming things! One must be called a leader (or primary), and the other a follower (or standby). If you have more than two, then we shall call everything that is not the leader a follower to simplify our problem. At any point there must be only one leader, and the followers must simply follow it. Having more than one leader will result in a split brain and weird dilemmas.
Only the leader can make changes to the data and guarantee reads of the latest committed data. The standbys can only provide stale reads.
Consensus of commit
If data integrity and consistency are not hard requirements, then you can perform the replication asynchronously and skip the multi-node commits. You don’t even need the concept of a single leader for these systems. Any node can serve writes, and it will make a best effort to replicate them to all the other nodes. These systems have the highest level of availability and are classified as AP. The applications using these databases do not care about these transient inconsistencies and have extra logic to eventually reconcile the data across the nodes.
On the other end of the spectrum (CP), we have synchronous replication with a single leader. Here, the leader needs to work with its followers to commit the changes across nodes before it can acknowledge back to the client. With a full commit, every node has to persist the change. Quorum or majority commit requires a majority of machines to persist the change. When electing a new leader, we make sure to pick the one that was part of this majority. Consensus of the committed data is a must to guarantee consistency and high availability!
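To make the difference concrete, here is a minimal sketch of full vs. quorum commit. The names (`persist_on`, `healthy`) are made up for illustration; a real system would replace them with synchronous replication RPCs:

```python
# Illustrative sketch of full vs. quorum commit. All names are hypothetical.

def persist_on(follower, change):
    """Pretend to synchronously replicate `change` to a follower."""
    return follower["healthy"]

def commit(change, followers, policy="quorum"):
    acks = 1  # the leader persists the change locally first
    acks += sum(1 for f in followers if persist_on(f, change))
    cluster_size = len(followers) + 1
    needed = cluster_size if policy == "full" else cluster_size // 2 + 1
    return acks >= needed  # only acknowledge the client once this holds

# With 1 of 2 followers down, a quorum commit still succeeds but a full commit fails.
followers = [{"healthy": True}, {"healthy": False}]
print(commit("row-42", followers, policy="quorum"))  # True
print(commit("row-42", followers, policy="full"))    # False
```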
You also have systems in the middle that try to get the best of both by having leaders and async replication. But anything that is not purely CP is AP. And we don’t care about the ones without leaders because they are boring.
Shards/Replication units
Typically customers have more than one database that can share the nodes. Fully managed database vendors manage millions of databases running on thousands of machines. A replication unit can be a group of databases, an individual database, an individual table, or a subset of a table. We don’t care about the dependencies and relationships between these units, so I am going to refer to each of these replication units as a shard.
Each shard has its own leader, followers, and one replication stream. Each copy of a shard on a node is called a peer. The leader peer replicates to the follower peer(s). This guarantees ACID properties only within a shard. To ensure atomicity across multiple shards, you need a transaction system on top of it.
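In code-shaped terms, the bookkeeping for a shard looks roughly like this (the type and field names are invented for illustration):

```python
# Illustrative model of shards and peers; names are made up.
from dataclasses import dataclass, field

@dataclass
class Peer:
    node: str              # machine hosting this copy of the shard
    is_leader: bool = False

@dataclass
class Shard:
    shard_id: str
    peers: list[Peer] = field(default_factory=list)

    def leader(self) -> Peer:
        # Exactly one peer per shard is the leader at any point in time.
        return next(p for p in self.peers if p.is_leader)

orders = Shard("orders-part-7", [Peer("node-a", is_leader=True), Peer("node-b"), Peer("node-c")])
print(orders.leader().node)  # node-a replicates to node-b and node-c
```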
Two types of leader election
So essentially the problem is that of leader election. All the nodes have to come to a consensus about who is the current leader for each shard. There are lots of ways to achieve this.
The easiest is what I call Light bulb and knob in Times Square. Put a light bulb for each node and a knob that sets the current leader in the middle of Times Square. If a bulb goes out, then some passerby would turn the knob and save the world. There are people in Times Square 24/7 so we can expect six nines of availability. But the rent is high there, so this option is not practical.
The economical option is to use 3 or more nodes and a consensus algorithm, Paxos or Raft, to pick the leader. Since you have more than one shard, this gives you two options for where to run these consensus algorithms: per Shard Leader election (SLe) and Centralized Leader election (CLe).
Per Shard Leader election (SLe)
Each shard is responsible for picking its own leader. To do this, each shard runs the consensus algorithm itself. These algorithms are very complex. So complex that if you decide to implement them from scratch, no one is going to easily trust that you did it correctly. The consensus algorithm can handle both the leader election and the replication problem. But if you want to, you can split them and do the replication yourself (e.g., Neon). The consensus algorithms use heartbeats between peers to detect the failure of the leader and run an election amongst the remaining peers to pick a new leader. No external entity is needed to tell the shard when and what to do.
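As a flavor of what that looks like, here is a heavily simplified sketch of the per-shard failure detection and election rule, loosely modeled on Raft’s randomized election timeout. The constants and helpers are illustrative; real implementations also track terms, compare logs, and handle split votes:

```python
# Heavily simplified per-shard leader failure detection and election rule.
import random

ELECTION_TIMEOUT_RANGE_MS = (150, 300)  # randomized so peers rarely start elections in lockstep

def new_election_timeout() -> float:
    return random.uniform(*ELECTION_TIMEOUT_RANGE_MS)

def leader_suspected(ms_since_last_heartbeat: float, timeout_ms: float) -> bool:
    # A follower only starts an election once the leader has been silent too long.
    return ms_since_last_heartbeat > timeout_ms

def won_election(votes_received: int, cluster_size: int) -> bool:
    # A candidate becomes the new leader only with a strict majority of the shard's peers.
    return votes_received > cluster_size // 2

timeout = new_election_timeout()
if leader_suspected(ms_since_last_heartbeat=450, timeout_ms=timeout):
    print("start election; won:", won_election(votes_received=2, cluster_size=3))  # won: True
```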
I am going to use the same list of examples from the other article (except Neon which I think belongs here).
- Aurora DSQL
- CockroachDB
- CosmosDB
- DynamoDB
- MongoDB
- Neon
- Redpanda
- Spanner
- TiDB
- YugabyteDB
Most SLe systems have one or more special shards that perform operations like load balancing of other shard leaders, data load balancing of shard peers, and storing global metadata like the location of all shard leaders.
Centralized Leader election (CLe)
As the name suggests, there is one central decision maker (Leader Manager), who is responsible for assigning leaders for all the individual shards. All shard peers heartbeat to the Leader Manager so that it can detect failures. On the failure of one shard leader, the Leader Manager will talk with all remaining shard peers and pick the peer with the most caught-up data as the new leader.
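Here is a hand-wavy sketch of that failover decision; the peer records and the `log_position` field are invented for illustration:

```python
# Illustrative sketch of a centralized Leader Manager reacting to a dead shard leader.
def pick_new_leader(peers):
    # Ignore the dead leader and any peer that has stopped heartbeating, then
    # promote the surviving peer with the most caught-up replication position.
    candidates = [p for p in peers if p["alive"] and not p["is_leader"]]
    return max(candidates, key=lambda p: p["log_position"])

peers = [
    {"node": "node-a", "is_leader": True,  "alive": False, "log_position": 1042},
    {"node": "node-b", "is_leader": False, "alive": True,  "log_position": 1042},
    {"node": "node-c", "is_leader": False, "alive": True,  "log_position": 987},
]
print(pick_new_leader(peers)["node"])  # node-b: the most caught-up surviving peer
```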
But! The Leader Manager needs to be highly available too. Who picks the Leader Manager? The same consensus algorithm is used here. But we just need to run it on one special shard in the system. This shard stores metadata only and performs less CPU-intensive work, so it can be hosted on smaller nodes.
And here are the examples:
- Apache Kafka
- AWS Aurora
- Azure SQL, Azure Synapse DW, Azure Fabric DW
- CitusDB
- Clickhouse
- ElasticSearch
- FoundationDB
- HDFS
- MemSQL/SingleStore
- Redshift
- PlanetScale
- Postgres
Which is better?
Finally! Let’s look at how the two compare.
Number of copies of data
Consensus needs 2f + 1 nodes to survive the loss of f nodes. So SLe requires 2f + 1 peers for every shard. CLe only needs f + 1 peers for shards with a full commit policy. So CLe needs fewer copies of data than SLe, right? But CLe systems like Azure SQL keep 4 copies, and Aurora keeps 6, whereas most SLe databases like YugabyteDB only keep 3. SQL Server Always On and Postgres with Patroni let you run with only 2 copies. But those run on enterprise-grade machines where the failure rate is low. In the cloud, failure rates are much higher. And more importantly, you (or something you depend on) upgrade all the time! If you want to keep the database running while 1 node is taken down for an upgrade, you need at least 2 more copies, since there is a high chance that one of those two will have a permanent disk failure. With 3 copies, one transient fault plus one permanent fault means a slightly longer outage. With 2 copies, it’s data loss! The extra node is well worth the added cost.
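Spelled out as arithmetic, the 2f + 1 rule from the start of this section looks like this (a trivial sketch, nothing database-specific):

```python
# Copies needed to survive the loss of f nodes, per the rule above.
def copies_needed(f: int, quorum_commit: bool) -> int:
    # Quorum commit needs a surviving majority: 2f + 1 copies.
    # Full commit only needs one surviving up-to-date copy: f + 1 copies.
    return 2 * f + 1 if quorum_commit else f + 1

print(copies_needed(1, quorum_commit=True))   # 3: typical SLe (and quorum-commit CLe) setup
print(copies_needed(1, quorum_commit=False))  # 2: CLe with a full-commit policy
```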
The other reason is latency. Cloud nodes and networks slow down transiently. You pack a lot of apps across customers onto under-allocated resources, betting that not all of them will spike at once. But sometimes they do, and your p99 latency suffers. With a full commit, even if one node is slow, the p99 has to take a hit. Quorum commit is more forgiving. So you are going to need at least 3 nodes for both SLe and CLe.
If reducing cost is the priority, then you can use heterogeneous nodes: 2 big nodes and 1 small node. If a big node fails, then you fail over to the most caught-up peer, and if that happens to be the small node, then gracefully switch to a bigger node. This can be done in both SLe and CLe. The complexity of implementation depends on the consensus algorithm, replication algorithm, and other factors. But the operational complexity of managing heterogeneous nodes can get far more difficult in the long term.
Time to recover from a failure
SLe should be faster to recover from the failure of the leader node. But CLe is not that far off. The difference between the shards themselves electing a leader vs. a central leader manager would only be a few milliseconds (network latency between shard peers and leader manager). But some consensus algorithms will have to perform multiple rounds of election to break ties or add random sleeps. So there won’t be any significant difference here.
If all peers are caught up (read heavy workloads), then CLe will be better at picking the optimal leader on the first attempt. SLe would pick a random leader and then require a load balancer to rebalance leaders evenly across nodes. 2 leader moves mean longer unavailability and more failed queries.
Simpler design
If you had to build these databases from scratch, which solution is simpler? Sadly, both are extremely complex to implement and run. Even though Raft helps simplify things quite a bit, it still takes years to get a working product out and even longer to squeeze out the bugs.
Replication technology
This, in my opinion, is the deal breaker. Building a replication technology for a database from scratch is HARD. But even harder than building a new replication technology is integrating a preexisting one with a different consensus algorithm (and trust me, it’s not fun)!
If you have existing replication technology, then reuse it with CLe. It’s easier to integrate any replication technique with CLe. No point throwing away the hard work and good code that is known to work well.
If you have to build one from scratch, you can choose either option. Raft handles both consensus and replication, which is what makes it so compelling. It’s so good that all the newer databases use it and end up in the SLe camp.
Extensible design
Eventually you are going to have to build async replication for DR purposes, but this is far simpler than the synchronous consensus replication. With CLe you can use the same replication technology, whereas SLe will require you to build and maintain two of them.
Conclusion
If you are starting with a database that already has a solid and performant replication solution, then use CLe. If not, then pick Raft, and hence SLe.
You are in for a challenging and wild ride!