Breaking It Down: Byzantine Fault Tolerance

The ELI5 (…Or more like ELI17) you needed

Published in

Coinmonks

10 min readMay 27, 2018

As distributed ledger networks flourish, the average layman like you and I attempting to break into one of those dense whitepapers to get a better understanding will often have a hard time understanding all the esoteric terms.

Byzantine Fault Tolerance is one you’ve likely heard of pretty often; but what does it really mean when a distributed network is byzantine fault tolerant, and why is it so important that it’s considered to be a fundamental characteristic of blockchains?

Byzantine Fault Tolerance is based on the Byzantine Generals’ problem. This is a universal problem for systems that span a distributed network of elements, in which the failure of some elements isn’t able to be reliably communicated to the rest of the system. Basically, how do we reach consensus on a decision when some elements in the system aren’t communicating the correct information as they should?

Byzantine Faults are considered to be the hardest fault types within a distributed computing system, because instead of one element simply failing to work, it actually appears to be working to some elements, and not working to others.

Before we delve into a multilateral Byzantine Generals’ Problem, let’s start off with the classic Two Generals’ Problem.

Two Generals’ Problem

Imagine you have two armies 1 and 2 planning to attack a hostile city. If both armies and all their loyal soldiers attack at the same time, they’re guaranteed victory. If not, they will fail. Therefore, they have to reach consensus on just one simple decision: attack, or retreat.

Here’s the catch though: these two armies are separated by the very city they’re planning on taking over, and to communicate with each other, they have to send messengers through the hostile city. The messenger could easily be intercepted.

So to solve this problem we would usually go about the following:

First, Army 1 (A1) sends a message to Army 2 (A2) saying A1 will attack only if A2 confirms they’ve received the message and also agree to attack. A2 receives the message and sends a back a reply.

Simple, right? Not really: A2 still doesn’t know whether their message has really reached A1 because the message could be intercepted by hostile citizens. If A1 never received the confirmation message and A2 assumed they did, A2 will attack alone and therefore end up defeated.

So this is what A2 does to counteract this: in the confirmation reply, A2 states that if A1 sends another message saying they’ve seen A2’s confirmation, A2 will go ahead with the attack. The same issue of uncertainty of the message being received ensues.

You can probably already see the chain of confirmation messages stretching on forever and ever. This is what the “solution” really ends up looking like:

In fact, in 1975 it was already proven that the Two Generals’ Problem has no possible “perfect” solution.

But don’t be discouraged yet!

Let’s try and reformulate the problem multilaterally instead.

Say we had some number of generals n. All of these individual generals would have to communicate between each other in the same kind of remote fashion as in the Two Generals’ Problem and consequently achieve consensus in some manner. Furthermore, a certain number m of these generals must be traitors.

Basically, we just have to make sure all of the loyal generals do end up achieving consensus on a decision; we don’t really care about what the traitors do. We only care about what the traitor generals communicate to the other generals. If the traitors selectively communicate different messages about their decision to different generals, it will severely impact the army’s ability to make one coherent and coordinated decision.

So before we do anything, what’s the minimum number of loyal generals we need to be able to achieve consensus, anyways?

Well, obviously not just one general in total; then there’s no consensus to be achieved, because there’s only one person.

If there’s two individuals in total and one is a traitor, it still doesn’t matter because there’ s still just one single loyal general. No consensus. The traitor’s opinion ultimately does not count towards the final decision the army will make.

Now let’s look at three individuals in total. Let’s say we have one commander and two lieutenants, waiting to receive the commander’s decision in order to add it to their respective collections of received decisions from the whole army (the reason for this collection of decisions being that each individual needs to figure out what the consensus is based on the what the majority of decisions was).

In the above diagram, consider that we’re currently sitting in Lieutenant 1's perspective. We receive one order from the commander to “attack”. Since we don’t know whether the commander is loyal or not, what we’ve done is set up a system where the lieutenants exchange with each other what the commander told them. So L1 receives L2’s “confirmation” that they actually received the order to “retreat” from the commander. Wait, what?

So obviously from the diagram we already know L2 was in fact the traitor, but L1 doesn’t know any of this; they’re going in blind. Okay, before we continue, let’s look at the other possible side of this situation; the case where the commander was the traitor.

Taking in our results from this case, we see that poor L1 ends up receiving the same result no matter who the traitor was; their results will always either be [“attack”, “retreat”] or [“retreat”, “attack”].

Ultimately, it’ll be pretty much impossible to determine which message L1 received was truly valid.

This leads us to the current prevailing theory; the army cannot be traitor-resistant if 1/3 or more of the the individuals in the army are traitors. In other words, the army can only achieve consensus if less than 1/3 of the army are traitors.

Source

As an example to prove this theory, consider that we had one commander and 3 lieutenants this time. If only 1 of these individuals were a traitor it would be possible for the army to achieve consensus.

Say the commander were the traitor in this case. The commander would send a different order to each lieutenant in order to confuse them. We’ll name these commands x, y, and z. Using the same info-sharing confirmation method as before, since all the lieutenants are loyal, they will all share with each other the same commands x, y and z.

This creates the same collection of commands for each lieutenant, which would be some variation in order of [“x”, “y” , “z”]. You might be noting with alarm that all these commands are different, but in reality as long the majority of the lieutenants received the same collection of commands, they can achieve consensus. The lieutenants could simply set in advance some kind of rule like, “if you receive a set of three different commands, just default on retreating”.

Let’s also look at a case where a lieutenant was a traitor:

Since the commander was loyal and sent the same order to all the lieutenants, each lieutenant would receive at least one command v.

Furthermore, since 2/3 of the lieutenants are loyal, each lieutenant would have received at least two commands v. Finally, since the traitor L3 will derail from what the commander sent, they will send to the other two lieutenants L1 and L2 the order x.

As a result, the majority (2/3) of the lieutenants would have received the set of commands [“v”, “v”, “x”]. Since only the loyal lieutenants’ decision matters, they have reached consensus and will therefore follow command v.

Cool! Now since we have the proof that less than 1/3 traitors => consensus is possible, let’s move on to discuss how blockchains (specifically Proof-of-Work blockchains like Bitcoin) are able to become Byzantine Fault Tolerant (a.k.a be a “traitor-resistant” army).

Back to Bitcoin

In this e-mail written by Satoshi Nakamoto to James A. Donald, Satoshi eloquently explains how the Proof of Work consensus algorithm offers a probabilistic solution to the Byzantine Generals’ Problem. I strongly suggest you read this e-mail before you continue — it’s not long!

So to put this in another way, let’s say all parties in the Byzantine Generals’ Problem now have access to high amounts of computing power. This time, let’s focus on the problem through the perspective that the only thing the two armies have to do is reach consensus. That’s it. They don’t care about whether they attack, or retreat, or what time they attack, only that they all do the same thing at the same time.

With this, the generals decide that the first random command broadcasted to the rest of the army will be the one they follow. Unfortunately, of course, it’s not that simple; since all the generals are remote, they will receive different commands at different times depending on their location. So if General 1 broadcasts the message “attack at dawn” just a couple seconds before General 12 broadcasts the message “attack at midnight”, the generals closer to G12 will receive “attack at midnight” as their first command even though G1 was the first to broadcast a command.

To counteract this, the generals use their computing power to solve an extremely difficult cryptographic problem called a proof-of-work problem. The problem is so hard, it’ll take each general at least, for example, 10 minutes to solve it.

This proof-of-work problem forces each general to compute a hash at a certain level of difficulty from the first command they received. (Need a primer on what cryptographic hash functions are? Check this article out!) This level of difficulty is the same for all the generals. It essentially confirms that the message is truly from a general of their army (think of the difficulty as a unique wax emblem representing the army, the emblem being very difficult to copy).

As an example, let’s say all the generals have to solve this proof-of-work problem in such a way that the hash they obtain must have 5 zeroes in front of it: 00000e7f89a892… You can think of these 5 zeroes as the “emblem”, the difficulty, of the army. This type of hash is so difficult to obtain that it’ll take, as mentioned before, each general at least 10 minutes to solve it.

So how does it work? Let’s zoom in into one specific general’s perspective. Say G2 receives the message “attack at dawn” first. They get to work immediately, and add a certain number called a nonce after the message. Now it looks something like this: “attack at dawn 218ef761g3”. Using their computer, they feed this message into the cryptographic hash function, and obtain a hash that doesn’t have five zeroes in front of it. So, they change the nonce a little bit and do this whole process over and over again until they obtain a combination of the command “attack at dawn” and some nonce that gives them a hash with five zeroes at the front.

The first general that was able to obtain this hash successfully then broadcasts both this hash and their original message to the rest of the network. All the other loyal generals that receive it now include this proof-of-work into the hash they’re currently working on. If some generals were working on a different “first command” before, they drop it immediately to follow all the others, because now the proof-of-work chain with the command “attack at dawn” is longer (meaning you’ll be able to tell that majority of the army is now following this command).

After all the generals have added their proof-of-work, they will have all reached consensus on the plan of action to “attack at dawn”.

But wait — how are we sure none of the commands have been tampered with? Well, since all of the generals have access to the same copies of the proof-of-work ledger, a hostile citizen trying to alter the messages without the generals realizing it would have to go into the chain and alter each individual general’s proof-of-work, which would take 10 minutes * the number of generals in the army — by the time this citizen has finished editing the command, the generals would have already decided on a common plan of action.

And voilà! There you have it, the Byzantine Generals’ Problem and Byzantine Fault Tolerance in all its headache-inducing glory.

I hope this was helpful! Please let me know any questions you may have in the comments, and any feedback as well if I’ve misrepresented any information. Thank you for reading! Be sure to follow this medium page for more articles like this.

Breaking It Down: Byzantine Fault Tolerance

The ELI5 (…Or more like ELI17) you needed

Two Generals’ Problem

But don’t be discouraged yet!

Back to Bitcoin

Written by Ramy Zhang