Finding Consensus 1/4: Byzantine Fault Tolerance

Till Antonio Mahler
blockwhat?
Published in
8 min readJan 4, 2019
Photo by rawpixel

Hello there and welcome back to a new post here at blockwhat?, it’s a pleasure to have you here.

Today’s story marks the beginning of a mini series of four posts, that will explore the way consensus is found in blockchain networks — an absolutely fascinating aspect of this nascent technology!

Join me for a captivating first post in which we will get to know a concept known as Byzantine Fault Tolerance, figure out what this comprises, the history behind it and what kind of solutions have been found in that area — absolutely mesmerizing stuff!

Introduction

Photo by John Baker

This story revolves around something that’s at the core of our human existence and lays the basis for everything — figuring out what we actually want and making a decision.

It turns out that figuring out what we want (aka finding consensus) is also the essential glue that makes blockchains so unique and powerful. Blockchains are essentially ledgers that are recorded and updated by independent computers (aka nodes) that are dispersed all around the globe — for a visual impression check out the nodes that are currently part of the Bitcoin network for example.

Source

One of the most tricky parts is to get all those independent nodes to find consensus on what the correct state of the ledger is (e.g. who owns which Bitcoins).

In distributed systems like Bitcoin, one that is comprised of many different parts that communicate with each other and need to coordinate their actions, there is a pretty nasty class of failures that can occur and lead to potentially extremely severe outcomes.

This type of failures was first described by the three computer scientists Lamport, Shostak and Pease in their 1978 paper “Reaching Agreement in the Presence of Faults”.

Upon originally publishing their findings, they called this failure class within finding consensus Interactive Consistency and it was only four years later that they published the same idea under the now commonly known name The Byzantine Generals Problem.

The Byzantine Generals Problem

Photo by Dean Hinnant

In their 1982 paper titled “The Byzantine Generals Problem” Lamport, Shostak and Pease devised an illustrious analogy to demonstrate the severeness of this problem.

The situation they came up with comprises a Byzantine army which is camped outside all around an enemy city. Their army consists of many different units, which all have their own commanding general.

After having carefully assessed the strength of their enemy who is defending the city, they know that they can only win if all units attack at the same time — if not, defeat is ensured.

Since the city is quite large and the army is encamped far away from each others positions, the generals can only communicate by messenger.

The essential task is to find a common strategy and execute it together — attack or retreat? It’s very important to note that the nature of the plan doesn’t matter at all — but rather that consensus on one common plan is found.

So far so good, now we introduce some potentially problematic variables into the situation though. What do we do when not all the generals or their messengers are loyal and honest?

For one, image a traitorous general who sends contradicting messages to the other generals, telling some to attack and some to retreat.

There is another factor that makes this situation arguably worse — potentially traitorous messengers. Since the generals can’t communicate directly with one another, they rely on their messengers. Instead of the generals who act dishonest, it could also very well be that some messengers will communicate faulty messages.

How can the Byzantine army make sure that all the generals still accept the same plan in spite of this challenge?

Source

These type of faults are called Byzantine faults and they refer to any fault that presents different symptoms to different observers. In the case described above, this refers to generals receiving contradictory messages either due to a traitorous general or a traitorous messenger.

If a system service that requires consensus to be performed is lost due to a Byzantine fault, it’s called a Byzantine failure. In this case, the failure to reach a common plan that is carried out by all the generals is the failed system service.

A system that can tolerate some of these faults and still come to a consensus is called Byzantine fault tolerant.

This problem relates directly to modern day distributed systems — the different computers are the generals and the communication systems they rely on are the messengers.

In the next section we will explore the implication of all this in distributed computer systems.

Distributed Systems

Photo by Brandon Mowinkel

Byzantine faults are considered one of the most difficult class of failures in distributed computer systems.

Instead of a so-called fail-stop, where a node simply stops working and therefore ceases to send any data at all, in the case of a Byzantine failure the faulty node in question can still generate data and can appear to be correctly functioning to other nodes in the system. This can and does easily confuses fault detection systems, leading to severe outcomes.

It’s important to understand, that these types of failures are not necessarily due to security problems, but they can be due to simple electrical failures as well.

A big and problematic myth is that Byzantine faults are so rare that they will occur only in isolation. This is due to the common fallacy that assumes that random events are distributed uniformly in time — for example if we have a 1 in 500 year weather event this year, that it will take about another 500 years before it will happen again.

The reality in electronic distributed systems is rather that the conditions that cause an event can well persist, meaning that the probability of clustering is high!

In fact, there are many examples where this type of failure had or could’ve had disastrous outcomes — so many, that NASA published this report, in which they detail some of them regarding a whole fleet of expensive aircraft and their very own Space Shuttle program.

There are many systems, in which redundant system components alone are not enough to ensure maximum security though.

Take airplanes for example — Airbus and Boeing both incorporate a Byzantine fault tolerant system design, called SAFEbus. The SAFEbus recipient nodes each receive four copies, and only record the message if all four are identical. Each transmitter controls its partner’s drivers and will silence the whole node if informed of too many errors.

The same level of scrutiny and reliability is needed also in nuclear power plants and of course spacecrafts. SpaceX’s Dragon spacecraft for example employs a Byzantine fault tolerant flight system design, since a radiation event may change memory or register values, potentially leading to disastrous outcomes.

Of course not only correct input is important, but also the processing of that information needs to be redundant — if you’re interested to go deeper into this topic, explore common mode failures.

After having understood the importance of this concept, we will now take a look at how it is applied in the context of blockchain technology.

Consensus On Blockchains

Photo by Kenna Phillips

“A blockchain based system is as secure and robust as its consensus model.”

— Dr. Arati Baliga

All the properties of blockchains that make them so unique and extraordinary, like immutability and auditability, are only safeguarded when the underlying assumptions are correct — and in order for them to be correct, consensus about this assertion needs to be found.

When it comes to blockchains, there are many different ways of finding consensus. Some approaches that build on Byzantine fault tolerance, are the following.

One of the earliest and most famous ones is the Practical Byzantine Fault Tolerance protocol, which was written by Miguel Castro and Barbara Liskov in 1999.

This protocol enables Byzantine fault tolerance for distributed systems, one of the better known solutions who make use of this approach is HyperledgerFabric.

It’s important to point out that this protocol does not work for permission-less blockchains (e.g. like Bitcoin or Ethereum), but is rather employed by private blockchains. This protocol is very communication heavy and therefore not really suitable for large scale blockchains.

Another approach is known as Federated Byzantine Agreement, which entails a federated voting process. This solution is used by Stellar and Ripple, two big blockchains.

The most famous and widely used consensus method, which is called Proof of Work (and is employed by many public blockchains, such as Bitcoin, Ethereum), is interestingly enough technically seen not really Byzantine fault tolerant. Rather, it is a probabilistic solution to the problem — but more on this in the next post!

Many people, including Bitcoin’s inventor Satoshi Nakamoto, mistakenly describe that particular consensus mechanism as being Byzantine fault tolerant — if you want to read Nakamoto’s own take on this, I highly recommend you to read this email that he send out in late 2008.

So far so good for an introduction to the way consensus is found in a blockchain context — thank you very much for powering through this post, I hope you enjoyed it!

Please let me know if you have any questions, remarks or feedback, I’d gladly hear from you. If you liked this post, I’d be more than happy if you leave a clap (or two), share it with your friends and follow the future stories here on blockwhat?.

All the best

Till

PS: If you’re looking for helpful and great resources to learn more about blockchain’s paradigm shifting technological potential, check out these awesome resources.

--

--

Till Antonio Mahler
blockwhat?

Technology enthusiast from Berlin. Lover of random yet mesmerizing knowledge. Curious about all aspects of life.