If you like the series, check out my upcoming book on Database Internals!
This is a series of articles introducing Distributed Systems concepts used in databases. First article of the cycle, was about Links, Two Generals and Impossibility Problem. If you’re interested in the subject, you can also check series about Disk IO.
Main goal here is to help to build up knowledge that will help you to understand how databases work and what decisions database implementers make, to be able to better operate databases or get up to speed and start working on one.
Today we’ll continue building up prerequisites for going into details of advanced consensus algorithms, starting with Shared Memory and building blocks for the distributed storage, then introduce Linearizability, figure out ways to implement it such as Two/Three Phase Commit and finally move to the Atomic Broadcast.
For a client, distributed system responsible for data storage, looks like it has shared storage, as it acts similar to a single-node system and internode communication and message passing are abstracted away and are happening behind the scenes. This creates an illusion of shared memory. Single unit of storage, accessible by read or write operation is usually called Register. Shared Memory in a distributed database can be viewed as an array of registers. Operation is an action executed against a register, either changing its value or reading for the client.
Register may be accessed by a single or multiple readers and/or writers, so we distinguish registers by
(1, N) number of readers and writers. For example
(1, 1) is a single-writer, single-reader register;
(1, N) is a single writer, multiple-reader and
(N, N) is a multiple-writer, multiple-reader.
Read or write operations on registers are not immediate, their execution takes some time and concurrent read/write operations performed by different processes are not serial. Depending on how registers behave when operations overlap, they may return different results. Depending on register behavior in presence of concurrent operations, we distinguish safe, regular and atomic registers.
Every operation is identified by it’s invoke and complete events. Operation is said to fail if process that invoked this operation has crashed before the operation completed. If both invoke and complete events for one operation are happening before invoke for the other operation, this operations is said to precede the other one, and these two operations are then sequential, otherwise operations are concurrent.
Safe registers might return an arbitrary value within the range of the register during concurrent write operation (which does not sound very practical, but might describe semantics of an asynchronous system that does not impose serializability). Safe registers with binary values might appear flickering during reads concurrent with writes. For regular registers, guarantees are slightly stronger: reads will return only the newly written value or the previous one. In this case, system has some notion of operation ordering, but visibility is not simultaneous (for example, might happen in a replicated database where reads are served by nodes independently).
Atomic registers guarantee linearizability: during every write operation there exist a point in time before which every read operation returns an old value and after which every read operation returns a new one. This is a very important and useful property that simplifies reasoning about system state.
Probably the best definition of Linearizability was given by Maurice Herlihy and Janette Wing in their Linearizability: Correctness Condition for Concurrent Objects paper.
Linearizability is a correctness condition for concurrent objects that provides the illusion that each operation applied by concurrent processes takes effect instantaneously at some point between its invocation and its response, implying that the meaning of a concurrent object’s operations can be given by pre- and post-conditions.
In order to achieve linearizability, a system should guarantee there’s a “point of no return”, after which the whole system commits to the new value and won’t go back to revert it. Moreover, this all appears instantaneous (e.g. there’s no period during which the decision appears “flickering”).
While safe and regular registers are mostly described in terms of single-writer-multiple-readers, atomic registers work correctly and preserve visibility guarantees in multiple-writer-multiple-reader systems. This man be achieved with, for example, CAS (compare-and-swap) or mutual exclusion.
Implementing linearizability in a distributed system might be difficult and expensive, so there exist other useful models with weaker guarantees, such as sequential consistency, which states that result of the execution of operation set is the same as if all operations were executed in some sequential order.
Since, in order to guarantee sequential consistency, all the write operations are required to be seen in some order by all processes, this can be further relaxed with causal consistency, which specifies that only causally related writes must have a particular order.
Two Phase Commit
In order to implement atomic consistency / linearizability, one should use algorithms that preserve this property: changes to the register are atomic and become effective either for all processes simultaneously or to none at at all. Let’s now take a look at some algorithms that can help to make it happen.
It took us quite some time to cover most of the prerequisites and finally we’re getting to actual consensus algorithms. To recap, consensus algorithms in distributed systems allow multiple processes to reach an agreement on a data value.
Consensus has three properties:
- Agreement — decision value is the same for all processes
- Validity — decided value should be proposed by one of the processes
- Termination — all processes eventually reach the decision
Two Phase Commit (2PC) is probably the simplest algorithms for distributed consensus (however, one not without shortcomings). It’s often discussed in context of database transactions that either have to be applied atomically for all participants or not at all. 2PC assumes a leader (coordinator), that holds a state, collects votes and is a primary point of reference for the agreement round. Coordinator is just a temporary role and any process can take on it, the rest of the nodes are called cohorts. One important factor here is that value decided on is picked by the coordinator, and consensus has to be reached only on whether or not it’s accepted.
Decision round is split in two steps:
- Propose: coordinator distributes a proposed value and collects votes from the participants on whether or not this proposed value should be committed.
- Commit/Abort: if a number of collected votes satisfies coordinator’s commit criteria, decision is taken and all the processes are notified about it. Otherwise, all the processes are sent the abort message and transaction doesn’t take place.
Let’s consider several failure scenarios. For example, if a replica fails during propose phase, its “opinion” on whether or not decision should be taken by the coordinator is can not be taken into account, so it’s logical for the coordinator to abort the transaction.
Since the main idea behind the Two Phase Commit is a promise by the replica that, once it has issued a positive vote, it effectively promises to commit the transaction. It can’t go back on it’s decision and after vote is done, only the coordinator has a right to abort.
In the crash-failure scenario (when the replica fails and never comes back), algorithm has to proceed without it and its further participation is not affecting the outcome. What happens more often is crash-recovery, where the node comes back and needs to get up to speed with a final coordinator decision. Usually this is done by persisting the decision log on the coordinator side and replicating decision values to the failed or lagging participants. Until then, replica can not serve requests as it is in an inconsistent state.
Since only the coordinator has a right to make a final decision to commit or abort the transaction, its failure might cause much bigger problems. Since 2PC does not involve communication between replicas, transaction state can not be recovered after propose phase. Inability of the coordinator to proceed with a commit leaves the cluster in undecided state. This means that replicas will never be able to learn about the final decision in case of a permanent coordinator failure.
We’ll discuss other more resilient algorithms, where state is distributed and can be communicated between the peer nodes. But for now, let’s first take a look at the algorithm that can help to solve the 2PC problem with blocking coordinator.
Three Phase Commit
In order to make 2PC more robust to coordinator failures and avoid undecided states we can improve it by adding timeouts and an extra Prepare phase before Commit/Abort, which communicates replica states collected by coordinator during the Propose phase, allowing the protocol to carry on even in case of the replica failure. Other properties of 3PC and a requirement to have a coordinator for the round are similar to its two-phase sibling.
To recap, Three Phase Commit decision round consists of three steps:
- Propose: similar to 2PC, the coordinator distributes the value and collects votes from replicas. If the coordinator or one of the replicas crash during this phase, transaction is aborted on the timeout.
- Prepare: after collecting the votes, the coordinator makes a decision. If the coordinator decides to proceed with a transaction, it issues a Prepare command to the replicas. It may happen that the coordinator can distribute prepare messages to only a part of replicas or fails to receive their acknowledgements. In this case, transaction is aborted, since there’s no way to guarantee consistency in case of the coordinator failure. After all the replicas are successfully moved into Prepared state and coordinator had received their prepare acknowledgements, transaction will be committed even if either side fails. This can be done since all participants on that stage have the same view of the state.
- Commit: coordinator communicates the results of the Prepare step to all the participants, resetting their timeout counters and effectively finishing the transaction.
As we discussed previously, 2PC can not recover from the coordinator failure and will be stuck in a non-deterministic state if coordinator does not come back. 3PC solves this problem by adding timeouts to the protocol and ensuring that positive decision can be correctly taken even in case of the coordinator failure.
For 3PC, the worst-case scenario is a network partition: when some nodes have received Prepare message and successfully moved to the Prepared state, but some nodes can’t communicate with the coordinator. This results in a split-brain: according to the algorithm, the set of nodes in Prepared state should proceed with a commit and the ones that can’t reach coordinator some abort, leaving the cluster in an inconsistent state.
In both of the algorithms discussed today, participants have to keep a log of actions for execution to get failed processes up to speed and distribute the state where it’s missing. This brings us one step closer to our initial goal with consensus algorithms: replicated state machines, which build on the ideas we’ve discussed here and combine multi-step communication, action logs and timeouts to produce even more robust algorithms. Before we get there, let’s discuss a couple more important concepts.
Leader Election is used to determine which processes will have a special role of coordinating the steps of a distributed algorithm. Generally, processes are uniform and leadership role is somewhat arbitrary, as any process can take on it. Cachin in his book describes an example of Monarchial Leader Election, which uses a failure detector to maintain a ranked list of potential candidates, excluding crashed processes from this list and electing the process with highest rank from available remaining ones.
Tanenbaum and van Steen describe a similar Bully algorithm, where all processes try sending messages to processes with higher identifiers. Whenever no-one responds (there are no processes with a higher rank), process with a highest rank becomes a new leader. These algorithms are straightforward and easy to implement, but do not cope well with a case when highest-ranked process is flaky.
Election is triggered when the nodes are unaware of the current leader or can not communicate with it. Algorithm has to be deterministic and lead to election of exactly one leader process; this decision needs to become effective for all participants.
Liveness of the election algorithm implies that most of the time there will be a leader (until the next election round is required). Safety implies that there might be either zero or one leader at a time, so brain split situation (when two leaders are elected but unaware of each other) is not possible.
Many consensus algorithms, including Paxos, Zab and Raft rely on a leader for coordination. Having a leader might simplify the algorithm by having a single participant in charge of a subset of actions, but this also introduces a bottleneck and a point of failure.
And, finally, the last thing we need to discuss. Broadcast is a communication abstraction often used in distributed systems. Broadcast is used in order to disseminate information among a set of processes. There exist multiple broadcasts, similarly to what we’ve discussed with link previously, all making different assumptions and providing different guarantees. Broadcast allows sending a message to a group of processes, ensuring an agreement among the recipients about delivered messages.
The simplest way to broadcast messages is Best Effort Broadcast. It eventually delivers the message
M to all correct processes, ensuring that every message is not delivered more than once. This type of broadcast is using perfect links, so the message will be delivered given both sender and recipient are correct. When one of them fails, this type of broadcast will just fail silently.
For Broadcast to be Reliable, it would need to guarantee that all correct processes agree on the set of messages they deliver, even if the sender crashes during transmission. In order make broadcast reliable, we need a failure detector. When the source process is detected as failed, other processes detect that and broadcast messages instead of the original source, “flooding” the network with N² messages. Uniform Reliable broadcast, in addition to ensuring that correct processes deliver the same set of messages, guarantees that failed processes deliver a subset of messages delivered by correct ones.
Even stronger guarantees are given by the Atomic Broadcast, which guarantees not only causal order, but total order. While reliable broadcast ensures that the processes agree on the set of messages delivered within the system, atomic broadcast also ensures they agree on the same sequence of messages (e.g. message delivery order is the same for every participant). Messages here are delivered “atomically”: every message is either delivered to all processes or to none of them and, if the message is delivered, every other message is ordered before or after this message.
For total order to be possible, participants have to find common grounds for delivery ordering. Crash-recovered participants, along with the ones that have agreed on certain values, have to be able to learn and comply with ordering. Since atomically broadcasting series of messages is equivalent to deciding on each value one after the other, atomic broadcast problem reduces to a sequence of consensus problems.
As you can see, Distributed Systems are all about building blocks, failure scenarios, assumptions and guarantees. It is very important to understand semantics of the algorithm to be able to use it correctly.
With Linearizability, Leader Election and Broadcast we’ve covered most of the things needed to understand replicated state machine algorithms, such as Paxos and Raft, which we will explore in the next post.
If you find anything to add or there’s an error in this post, do not hesitate to contact me, I’ll be happy to make corresponding updates.