If you like the series, check out my upcoming book on Database Internals!
This is a series of articles introducing Distributed Systems concepts used in databases. First article of the cycle, was about Links, Two Generals and Impossibility Problem. If you’re interested in the subject, you can also check series about Disk IO.
Main goal here is to help to build up knowledge that will help you to understand how databases work and what decisions database implementers make, to be able to better operate databases or get up to speed and start working on one.
Distributed system can be thought of as multiple participants acting as one for an external user. It may consist of multiple moving parts, taking different roles at different times, covering for one another, cooperating, relaying messages, storing information for other party temporarily, but in the end all these parts serve the same purpose.
In a distributed system, we have several participants (Processes) and Links between them. Sometimes, in order to simplify reasoning and make proofs, certain assumptions are made about quality of links, communication patterns and even time. Unfortunately, often these patterns are bleeding into our applications and cause problems when we don’t expect it.
One of the core concepts in distributed systems, along with replication and fault-tolerance is consensus, a way for multiple participants to decide on some value in a reliable way. Reliable is a somewhat overloaded theoretical term and one might make different reliability assumptions. In general, reliable means assurance: a way to know that anticipated things actually happen. Before we discuss ways to agree, let’s review multiple ways to communicate and reliability guarantees coming with them.
We first start with two processes, connected with a uni-directional link, which means that only process A can send messages to process B, but not vice versa. Our communication “ether” is imperfect, so messages might get lost or delayed. Let’s see what kind of guarantees we get. After the message M is sent, from the senders’ perspective, it might end up in multiple states:
- not yet delivered to the process B (but will be in some point in time)
- irrecoverably lost during transport
Notice that sender does not have any way to find out if the message is already delivered. In distributed systems terminology, this is a Fair-Loss Link: if both sender and recipient are correct, and sender keeps re-transmitting the data, the message is eventually delivered. It’s a useful abstraction and a first building block for building communication systems with strong guarantees. We can assume that this link will not be losing messages between communicating parties systematically. In real world, this might remind you of a UDP protocol, which lets you send separate messages but does not have delivery guarantees on a protocol level.
In order to improve the situation and get some more clarity in terms message status, let’s introduce a concept of “acknowledgement”. For that, we need to add sequence numbers to our messages to be able to distinguish one from the other. Now, process A can send a message M(n) where n is a monotonically increasing message counter. As soon as process B receives message M(n), it will send an acknowledgement ACK(n) send back to A, such as when B receives a message M(n). Acknowledgement might get lost on the way, just as the original message. This slightly changes the picture and now our state diagram is slightly different:
- not yet delivered to the process B (but will be in some point in time
- irrecoverably lost during transport
- (and newly added) is successfully delivered to the process B
This is still not enough to call our communication channel “reliable”. In order to improve our link and give some more guarantees, we can add retransmits to it. That is, if after process A sends message M(n), it waits until timeout T is triggered and, if an acknowledgement ACK(n) was not received it tries sending the same message again. Process repeats until message reaches it’s destination. Given link between processes stays intact and partition and packet loss between the processes is not infinite, we can state that the message is:
- not yet delivered to the process B
- is successfully delivered to the process B
In distributed systems terminology, this sort of abstraction is called Stubborn Link. In textbooks, it’s called stubborn, since sender keeps re-sending the message again and again infinitely, but, since this sort of abstraction would be highly impractical, we jump to the optimized version that sends messages only up until the acknowledgement arrives.
There are two more problems here, though: messages might arrive out-of-order and, because of re-transmits, some messages might end up being duplicated.
Since messages have a monotonically incrementing sequence number, receiver can track the highest sequence number n_consecutive, specifying a sequence number of the messages up to which all messages were received consecutively. If the next message is has non-consecutive sequence number, receiver puts it in the reordering buffer. Since we’re building on top of a fair-loss link, we assume that messages between n_consecutive and n_max_seen will eventually get delivered.
Similarly, de-duplication works by checking if the message with a sequence number `n` has already been passed down the stack by receiver and omitting already processed messages. This brings us to Exactly Once Great Debate, in which some parties claim that exactly-once delivery is possible (and use de-duplication as an example if the conversation is about links or transaction commit if the conversation is about broadcast). Different perspective describes possible pitfalls that are extremely important for understanding. Discussion most likely comes from approaching the problem from different protocol or abstraction levels and defining Delivery. It’s not possible to build a reliable link without ever transferring the message more than once but we can create an illusion on the sender side by avoiding processing the message more than once. If the effects of processing are only visible atomically and all partial processing results are completely discarded, even if multiple processing attempts were performed, for an external observer it looks as if processing was done only once.
Loosely speaking, from distributed systems perspective this kind of abstraction is called Perfect Link. This might remind you of TCP protocol. Of course, the model represented here is only a simplified representation for illustration purposes only. TCP has more sophisticated model of dealing with acknowledgements called Flow Control, that is used to group transmitted packets and reduce amount of protocol-level overhead.
Two Generals Problem
The simplest and one of the most prominent descriptions of the way to agree (or, in other words, to achieve the consensus), is described by the thought experiment widely known as Two Generals Problem. This thought experiment shows that it is impossible to achieve an agreement between two parties if communication failures are possible.
Imagine two armies, led by generals, preparing to attack a fortified city. Armies are located on the two sides of the city and can succeed in their siege only if their attack is synchronized. They can communicate by sending messengers and already have a devised attack plan. Now they only have to agree on the fact that they both will proceed with the attack, otherwise the attack can not succeed.
General A sends a message MSG(attack at 7PM) stating that their army will proceed with the attack. Once messenger is dispatched, A doesn’t know whether messenger has arrived or no. General B, upon receiving the message, has to send an acknowledgement ACK(MSG(attack at 7PM)). However, messenger carrying this acknowledgement might get captured or fail to deliver, so now B doesn’t have any way of knowing if the messenger has successfully delivered it. To be sure about it, B has to wait for a second-order acknowledgement ACK(ACK(MSG(attack at 7PM) stating that A had received an acknowledgement for the acknowledgement.
No amount of further confirmations can solve the problem, as the generals will be one ACK away from knowing if they can safely proceed with the attack. Generals are doomed to wonder if the message carrying this last acknowledgment has reached the destination.
One of the important characteristics of the distributed system is it’s timing assumptions. In an Asynchronous System, relative speeds of processes are not known, message delivery is not guaranteed in a bounded time or a particular order. Process might take indefinitely long to respond and process failures can’t be identified.
Main criticism of the asynchronous systems is that these assumptions are not realistic and processes can’t have indefinitely different processing speeds and links won’t take indefinitely long to deliver messages.
From the practical perspective, algorithms developed for asynchronous systems are very general and have stronger guarantees. In order to loosen up assumptions, system may be considered to be synchronous and introduce hard timeouts. This would make best-case scenarios more robust but if timing assumptions don’t hold up, algorithm will fail.
In order for processes to agree, several invariants have to be persevered:
- value that’s being agreed on has to be “proposed” by one of the participants
- all active (non-crashed) participants have to decide on the value
- value they eventually decide on has to be the same for all processes
In a paper by Fisher, Lynch and Paterson, famously known as FLP Impossibility Problem (derived from the first letters of authors’ last names), authors discuss a weak form of consensus in which processes start with an initial value and have to achieve an agreement on a new value. This new value has to be the same for all non-faulty processes.
Paper concludes that in an asynchronous system, no consensus protocol can be totally correct in presence of even a single fault. If we do not consider an upper time bound for process to complete the algorithm steps and if process failures can’t be reliably detected, FLP paper shows that there’s no deterministic algorithm to reach a consensus.
However, FLP proof does not mean we have to pack our things and go home, as reaching consensus is not possible. It only means that it can’t always be reached in bounded time. In practice, systems exhibit partial synchronicity, which puts partially synchronous system between the cases of asynchronous and synchronous ones. Nancy Lynch, one of the FLP proof authors, has later authored Consensus in the Presence of Partial Synchrony paper, where several partially synchronous models are discussed, one of them holding timing assumptions that are not known in advance and the other one, where timing assumptions are known, but start holding up at an unknown time.
This brings us to an important subject of Failure Detectors, which are widely used in practical consensus algorithms and help to solve consensus problem in a partial synchronous or synchronous system. Failure Detector is an abstraction that helps to reason about liveness in the system, detect and mark participants as active or failed.
If processes A and B communicate through perfect link and all process B stops receiving messages from A and A does not receive any messages from B, most of the time from the process perspective it’s impossible to know whether B has crashed, B is simply running very slow or there’s a network partition. If two processes are separated by the network partition, for both of them it will seem as if the other process just crashed.
The simplest way for a process to fail is Crash-Failure, where the process stops executing steps required by the algorithm. Here, the assumption is that processes are executing algorithm correctly, but stop at some point and never recover. In real-life system, this type of failure occurs less often than, say, Crash-Recovery, where the process stops executing steps required by the algorithm, but recovers at the later point and tries to execute further steps. For correctness, some algorithms still assume crashed and recovered process as failed and further steps do not influence the outcome of the algorithm.
This means that the algorithm should be designed in a way that does not rely for on process recovery for correctness, since it may never recover or recover too late.
Another type of failure is Omission Fault. This failure model assumes that the process omits some of the algorithm steps, is not able to execute algorithm steps or this execution is not visible for other participants.
The hardest failures to overcome are Arbitrary or Byzantine Failures, where the process continues executing algorithm steps, but in a way that contradicts the algorithm in some way (for example, by deciding on a value that was never proposed).
Let’s recap the material discussed in this article. We’ve discussed a concept of a Distributed System, which involves multiple participants, Processes, communicating through Links. First link discussed here is a Fair-Loss Link, that does not give any visibility in terms of delivery for a sender. Stubborn Links keep re-transmitting messages indefinitely, hoping that eventually one of the messages will get delivered. Perfect Links eventually deliver messages and make sure this delivery is done at most once. Two Generals problem tells us that two communicating processes will always be one step away from being certain that the other party has received a derivative acknowledgement.
After that, we covered a concept of Synchrony and discussed assumptions that distinguish Synchronous systems from Asynchronous and looked at the FLP Impossibility problem, that states that reaching a multi-party consensus in a asynchronous system is not possible and, in order for consensus to be reachable, we need Failure Detectors. Lastly, we’ve looked at different types of failures one faces when building a distributed system.
These concepts build up a foundation for reaching consensus and in future posts we’ll take a closer look at failure detectors, broadcast, multi-step transactions and consensus.
If you find anything to add or there’s an error in this post, do not hesitate to contact me, I’ll be happy to make corresponding updates.