The Fault In Our States

Ali Işık
Published in MenaPay
10 min read · Sep 24, 2018

In order to work our way gently toward the inner machinery of a payments system based on a permissioned blockchain, it will help to first review, in broad strokes, some fundamental ideas behind fault-tolerant systems.

Replicating for Availability

Suppose we have a crucial network service that must always be available (for some definition of always). The availability you get from your application running on one lousy server just doesn’t cut it, insist the project stakeholders. You think then, well, we shall have two independently failing servers running our app and users will access whichever one they prefer. After all, redundancy is the beaten path to reliability.

However, you have a nagging question between your ears: will these two identical servers, handling requests in parallel, really behave the same as the single app behaves?

To remain ​consistent​ at all times, the two servers — let’s call them S1 and S2 — would have to communicate with each other about requested modifications. When a request comes in to S1 to change, say, the social network status of user Mike to “In the Maldives,” S1 would have to contact S2 before it implements the requested modification on its side. It would have to ask S2, “Yo, this dude wants to change his status to ‘In the Maldives.’ Are you cool with that?”​ Only when S2 considers and responds ​“Sure, let’s go ahead and make that change,”​ could both sides make the change and start letting the world know that Mike is chillin’ in the Maldives.
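
Here is roughly what that little negotiation could look like in code. This is only a toy sketch under invented names: the Replica class is made up for illustration, and the two “servers” talk through direct method calls rather than over a network with retries and timeouts.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.state = {}          # e.g. {"mike": "At home"}
            self.peer = None         # the other replica

        def propose(self, key, value):
            # Ask the peer first; commit on both sides only if it agrees.
            if self.peer.acknowledge(key, value):
                self.state[key] = value
                self.peer.commit(key, value)
                return True
            return False

        def acknowledge(self, key, value):
            return True              # a real replica might refuse if it is mid-update

        def commit(self, key, value):
            self.state[key] = value

    s1, s2 = Replica("S1"), Replica("S2")
    s1.peer, s2.peer = s2, s1
    s1.propose("mike", "In the Maldives")
    assert s1.state == s2.state      # both replicas now agree on Mike's status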

To be fair, nobody really cares about the service being perfectly consistent about Mike’s social status. It will not be the end of the world if, for a limited period of time, some of Mike’s friends continue seeing the old status while others learn of his cool vacation. For a system tracking monetary accounts and payments, or for safety-critical life-and-death systems, however, being lax about consistency could mean some people going bankrupt or getting hurt, and others landing in jail.

Hot Backups Suffice Against Crashes

The servers having to agree among themselves before committing modifications is probably more of a hassle than you, the devops guy, had in mind when you simply wanted to boost your system’s availability. And if you wanted to have more replicas, servers S3, S4 etc., then reaching agreement could get considerably more complicated. (We are using the terms server, replica and node interchangeably. We also don’t distinguish among requests, messages and commands.)

The difficulty here arises from the servers being equally accessible to clients. A simpler solution would allow the clients to interact with only one of the servers, say S1. S1 would then just keep informing S2 (and S3, S4 etc., if an even higher level of resilience is required), thus maintaining it as a hot backup ready to take the place of S1 as soon as someone somehow detects that S1 has crashed or is pining for the fjords. If that crash detection mechanism is simpler to implement than getting the servers to agree among themselves before every change — and it probably is — then this primary server with hot backups system (sometimes called primary backup or passive replication) should alleviate the availability concerns of your stakeholders as well as that uncomfortable feeling you had about consistency.
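
A toy sketch of passive replication might look like the following. The class names, the heartbeat timeout and the failover rule are all invented for illustration; a production system would be far more careful about how it declares a primary dead.

    import time

    class Server:
        def __init__(self, name):
            self.name = name
            self.state = {}
            self.last_heartbeat = time.monotonic()   # refreshed on every ping from this server

        def apply(self, key, value):
            self.state[key] = value

    class PrimaryBackupCluster:
        HEARTBEAT_TIMEOUT = 2.0      # seconds; a real deployment would tune this carefully

        def __init__(self, servers):
            self.servers = servers   # servers[0] plays the primary

        def handle_request(self, key, value):
            # Clients only ever talk to the primary; the primary keeps the backups hot.
            primary, backups = self.servers[0], self.servers[1:]
            primary.apply(key, value)
            for backup in backups:
                backup.apply(key, value)

        def maybe_failover(self):
            # The crash detector: a primary that has not pinged recently is presumed
            # dead and the first backup is promoted in its place.
            if time.monotonic() - self.servers[0].last_heartbeat > self.HEARTBEAT_TIMEOUT:
                dead = self.servers.pop(0)
                print(dead.name, "presumed crashed;", self.servers[0].name, "is the new primary")

    cluster = PrimaryBackupCluster([Server("S1"), Server("S2"), Server("S3")])
    cluster.handle_request("mike", "In the Maldives")
    cluster.maybe_failover()         # nothing happens while S1 keeps pinging on time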

Resilience Against Mad Replicas

Suppose, following multiple reports from customers about strange behaviour in your service, the stakeholders are again concerned. Sure, you have your nifty hot backup system in place, but a little reflection convinces you that a primary server with hot backups is no good against crazy failures like returning different responses to identical requests when the app is supposedly sitting in the same state. Something is wrong, but you don’t know what. Something is wrong, but the app is not just crashing, burning and stopping. You wish it did. The primary server, S1 in the discussion above, is misbehaving without tripping the crash detector. It could even be messing up the hot backups.

To attain some resilience against complex failures or malicious behaviour (often called Byzantine failure), perhaps we should go back to the configuration where all the servers are accessible to the clients. There will be no primary server and hot backups any more. Let’s see if we can tackle the complexities of reaching agreement among the servers in such a symmetrical case.

Hear Ye, Hear Ye!

To be able to reason sanely about this issue, we will require each client to send each one of its requests to ​all​ the servers. This admittedly puts a significant extra burden on the clients. In practice, perhaps a proxy could handle this responsibility, at the cost of adding a new point of failure. (Such a proxy would have to be a very simple, heavily battle-tested, tiny app.) On a large network, it would be impractical for the client (or a proxy) to connect directly to all the replicas. In that case, the client would contact a small number of the servers and a subsystem would have to be devised for that set of initial receivers to further disseminate the request throughout the network.

We will also assume that if multiple independent good copies of your app start in the same state and receive the same sequence of requests, they will all end up in the same state. While it may initially sound easy, this condition of ​determinism​ can be surprisingly difficult to achieve and verify in practice.
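
To make the point concrete, here is a deliberately tiny, hypothetical example: the first handler below is deterministic, while the second quietly consults the local clock and would let replicas drift apart.

    import time

    def apply_deterministic(state, request):
        # Same state, same request, same result -- always.
        state[request["user"]] = request["status"]

    def apply_nondeterministic(state, request):
        # Sneaking in a local clock read makes the result depend on which machine
        # ran the request, and when; replicas running this will quietly diverge.
        state[request["user"]] = (request["status"], time.time())

    requests = [{"user": "mike", "status": "In the Maldives"}]
    replica_a, replica_b = {}, {}
    for r in requests:
        apply_deterministic(replica_a, r)
        apply_deterministic(replica_b, r)
    assert replica_a == replica_b    # same requests, same order, same final state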

Order In The Courtroom!

Even with these two conditions in place, there is still no guarantee that all the good replicas will always be in the same state (or produce the same output). The problem is that when multiple clients make requests within a short period of time, different servers might receive those requests in different orders. The same requests executed in different orders will not necessarily cause a deterministic app to end up in the same state. For example, the two requests:

  1. Here, deposit $100 into my account (there are only pennies there now)
  2. Please send my landlady Ms. Smith $95 from my account

need to be executed in that order, otherwise Ms. Smith will receive nothing. We need to find a way to make sure not only that all the good servers receive all the requests, but also that they receive them in the same order or at least come to an agreement on that order without much delay.
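
A toy ledger, with made-up balances kept in integer cents, shows how much the order matters:

    def deposit(balance, amount):
        return balance + amount

    def pay(balance, amount):
        # The payment bounces if the funds are not there.
        return balance - amount if balance >= amount else balance

    start = 5                                             # only pennies in the account
    deposit_then_pay = pay(deposit(start, 10000), 9500)   # Ms. Smith gets her $95
    pay_then_deposit = deposit(pay(start, 9500), 10000)   # the payment bounces first
    print(deposit_then_pay, pay_then_deposit)             # 505 vs. 10005: two different states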

For example, in the Bitcoin network, the whole point of a miner broadcasting a freshly mined block of transactions is to help all the good nodes agree on the order of those transactions. When a miner broadcasts such a block, it is saying to them, in effect:

Hey, I know all of you guys have probably received the transactions in this block already, but God knows in what jumbled order you have received them. Here, since I spent all this electricity on my expensive specialized computing tower and solved the holy mining puzzle for this block — as you can easily verify for yourselves — I am in a position to call the shots now. The order of transactions you see in this block is the order that all of you shall obey.

And obey they do, before starting the race for the next block.

All Agreed, Then?

It turns out that, for most systems, the problem of reaching ​consensus​ among the replicas and the problem of broadcasting clients’ requests to the replicas in ​total order​ are equivalent problems. Equivalence means that if we have a solution to either one of these problems, then that solution can be used to solve the other problem easily.

While we are at it, here are two other interesting facts about distributed systems:

First, it is not possible to build a reliable distributed computing system on arbitrarily unreliable communication channels. Of course, we cannot expect a communication channel to be perfect, either — there is no such thing — but we need a guarantee that each transmission has some nonzero probability of getting through, so that by stubbornly resending the same message we will eventually see it delivered and eventually receive an acknowledgement, too.
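
As a toy illustration of that first fact, here is a retransmit-until-acknowledged loop over an invented channel that drops each message with some fixed probability:

    import random

    def lossy_send(message, drop_probability=0.3):
        # Our imaginary channel: each attempt gets through with probability 0.7.
        return random.random() > drop_probability

    def send_reliably(message, max_attempts=1000):
        for attempt in range(1, max_attempts + 1):
            if lossy_send(message):
                return attempt       # acknowledged on this attempt
        raise RuntimeError("channel is too unreliable to build anything on")

    print("delivered after", send_reliably("status update"), "attempt(s)")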

Second, it is not possible for the servers to reach consensus in a fault tolerant manner without synchronous operation (where all the servers maintain synchronized clocks) or at least time limits on operations. This is true basically because without time limits, it is not possible to distinguish a crashed unresponsive server from a merely slow one.

Finally, then, we are able to describe an infrastructure on which we can try to build a system that is resilient against crazy failures. We have replicas of a deterministic app starting out in a known initial state. The clients send each request to all of these replicas, and all the replicas receive the same sequence of requests from client space — they somehow agree on the total order of the requests before they execute them. This setup is sometimes called ​active replication​ or the ​order-execute​ architecture.

With this active replication setup, all that the client has to do is accept the majority opinion. If we want our system to be resilient against two crazy replicas, for example, then we will have five replicas and as soon as a client receives three identical responses, it will be able to trust that response.
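
In sketch form, the client’s side of this vote could look like the snippet below. The helper and the sample replies are invented; a real client would, among other things, also authenticate the responses.

    from collections import Counter

    def accept_response(responses, f):
        # Trust the first response that shows up f + 1 times; with at most f liars,
        # at least one of those f + 1 copies must have come from an honest replica.
        counts = Counter()
        for response in responses:   # responses arrive one by one
            counts[response] += 1
            if counts[response] >= f + 1:
                return response
        return None                  # not enough agreement (yet)

    replies = ["OK", "REJECTED", "OK", "OK", "REJECTED"]   # five replicas, two lying
    print(accept_response(replies, f=2))                   # "OK" wins with three matching replies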

Incidentally, if we weren’t concerned about crazy failures but just crashes, then a client would know the response to be good as soon as it received the first one. Three servers would be resilient against two crash-only failures. Of course, as we saw before, crash-only failures can be handled by passive replication, too.

The upshot here is that if we want to be able to tolerate ​f​ servers in an ensemble of replicas failing in arbitrary, complex ways, then we need to have at least ​2f+1​ servers in the request-​execution​ phase (the phase that starts after the requests have already been ordered somehow).

Now let’s concentrate on that ordering requirement, because pixies and elves sure ain’t ordering our messages for us. Ideally, running ​2f+1​ servers in the ordering phase would be sufficient as well, so that we could say that ​2f+1​ replicas suffice for the overall Byzantine fault-tolerant system. Let’s see if this is indeed the case.

When A Majority Is Not Enough

To be able to discuss the ordering problem, we can reduce it to the problem of a group of servers reaching consensus on the sequence number of a given message — a humble integer. (Now, that feels simpler, doesn’t it?) All we have to do is understand how a group of servers can come to an agreement on an integer when a certain number of those servers could be two-faced double-dealing backstabbers.

In such a situation, can the servers first of all reach an agreement on what everyone else thinks? If we had that, deciding which integer to pick could be stated as a simple rule. The servers could conceivably just pick the greatest number among the proposed values, for instance.

When each server tells all the others what ​it​ thinks the sequence number should be, all the servers will have a list of which server thinks what. However, since some of the servers could be lying, there is no guarantee that those lists will be one and the same. There has to be at least one more round of communication where each server shares with all the other servers the result of the first round seen from its own perspective.

If we want:

  1. All the good servers to agree on the exact same list of which server prefers which sequence number
  2. The entries in that list corresponding to the good servers to be exactly what those good servers think the sequence number should be

then it can be shown rigorously that a total number of ​3f+1​ replicas would be required to be able to tolerate ​f​ evil replicas. A simple majority is not enough. ​For example, if we want our system to be resilient against two crazy replicas, then we need to set up an ensemble of at least seven replicas to achieve a total ordering of incoming requests. 3f+1​ replicas are required even though we are not even trying to determine who the bad guys are.

The upshot here is that if we want to be able to tolerate ​f​ servers failing in arbitrary, complex ways, then we need to have at least ​3f+1​ servers in the ​ordering​ phase.
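
Putting the two bounds side by side, as plain arithmetic rather than any particular protocol:

    def replicas_needed(f):
        # Tolerating f arbitrarily faulty servers:
        #   ordering the requests needs 3f + 1 replicas,
        #   executing the already-ordered requests needs only 2f + 1.
        return {"ordering": 3 * f + 1, "execution": 2 * f + 1}

    for f in (1, 2, 3):
        print(f, replicas_needed(f))
    # f = 2 gives ordering: 7, execution: 5 -- the numbers quoted above.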

All This Says Nothing About Performance

Taking a step back and reviewing what all these algorithms are about, we need to remind ourselves that our aim was to improve availability in the first place. The extra servers do nothing for performance, basically because each node executes the same sequence of requests. There is no parallelism.

A Nanoplay

SCENE: An enterprise data center
Flourish. Enter Blockchain, Chaincode

BLOCKCHAIN
Another general payment!
I do believe that these transactions are
For some new honours that are heap’d on Central Bank.

CHAINCODE
Why, man, he doth bestride the narrow internet
Like a Volcker, and we petty apps
Walk under his huge legs and ping about
To find ourselves dishonourable containers.
Apps at some time are masters of their state machines:
The fault, dear Blockchain, is not in our states,
But in ourselves, that we are underlings.
Blockchain and Chaincode: what should be in that ​Central Bank​?
Why should that name be sounded more than yours?
Write them together, yours is as fair a name;
Sound them, it doth become the mouth as well;
Weigh them, it is as heavy; conjure with ‘em,
Blockchain will start mining as soon as Central Bank.
Now, in the names of all CS professors at once,
Upon what energy doth this our Central Bank feed,
That he is grown so great?

Exeunt
