A Guide to 99% Fault Tolerant Consensus, What Does that Really Mean?

by Ed Kazmierczak and Steve Melnikoff

the Penta Global Blockchain Foundation

Introduction: Penta Global was founded to be the most economically inclusive blockchain platform for the digital economy. Its ‘universal connector’ mission is to be a global network that interoperates and interconnects with other projects and off-chain systems. As part of this mission we keep a close watch on new developments and emerging ideas from the blockchain community, towards making positive contributions to meaningful areas of research.

Ethereum’s co-founder, Vitalik Buterin, recently posted an article entitled, “A Guide to 99% Fault Tolerant Consensus”. We are publishing a friendly response here in Medium as a way to extend the conversation across the wider blockchain community, in hopes of furthering discussion on this topic and encouraging a productive exchange of ideas. Developing a high performance consensus mechanism that is scalable and secure is a key challenge facing the blockchain community. Robust discourse will help drive the technical development process forward.

Buterin’s August 7th 99% fault tolerance article examines the idea of achieving high levels of fault tolerance using Byzantine Agreement. The Penta Development Team is not feeling exposed or embarrassed to acknowledge that the post ignited some intense argument in-house. Our internal debate centered on the ‘tangled wires’ of what Vitalik was specifically claiming, especially in relation to some of the nomenclature used. What follows is our own effort to approach ‘precision’ in the relevant terminology, to resolve what is being claimed, what can be claimed, and inviting the community to pursue a clearer understanding of distributed consensus based on fault tolerance.

The Byzantine agreement problem has a thirty plus year history. Some of the earliest solutions proposed to it are due to Lamport, Shostak and Pease (1980 & 1982), Dolev and Strong (1983), and Fischer, Lynch and Merritt (1986). Even today, a substantial international research community is highly active in this area. To understand the context of these first papers, let’s begin with some of the language and concepts used in fault tolerance.

Failure — A failure occurs when the observable behavior of the system (network) deviates from its specification or its intended behavior. For instance, a network can fail by adding a transaction with an erroneous amount to a blockchain, or by allowing malicious nodes to add spurious blocks to the network. Both are incorrect behaviors with respect to the network’s specification. It can also fail by allowing network-controlling cartels to form, which is incorrect behavior with respect to the intent of a decentralized network. Note that what may be defined, as incorrect behavior for a decentralized network, may actually be correct for a distributed one.

Fault — A fault is the underlying condition that leads to failure. A fault can be configuration flaws, a coding flaw, a design flaw, or other relating factors that, when triggered, cause the system to fail. It is possible that faults can lay dormant for years before the right conditions activate a fault and cause a failure.

Error — The error refers to difference between actual output and expected output.

An important goal of fault-tolerant systems is to make sure that any faults, dormant or otherwise, are trapped, avoided or appropriately managed before the system fails. Fault tolerance is generally a means to achieving reliability and availability, but before entering into that discussion let’s start with a formal definition of fault tolerance.

Fault Tolerance — A system is tolerant of m faults if the system can experience m faults before it fails.

Reliability — Technically, reliability is the probability of failure free operation in a given environment for a given time. Normally a ‘per operational hour’ failure rate is used to estimate software reliability, and in turn, the growth in reliability when faults are detected and removed. It is important to note that the definition of reliability is used on system failures — the observable behaviour of the system — and not on faults, which are their causes.

Availability — Related to reliability, availability is the probability that a system will function as required, when required.

While opinions vary, let’s assume good reliability starts at around four 9’s confidence. Which means a 99.9999% chance (a probability of 0.999999) of functioning failure free for a given time. For critical systems (nuclear reactors, aircraft, autonomous vehicles, space stations etc.), a minimum twelve 9’s reliability, or a 99.999999999999% chance (a probability of 0.99999999999999) of operating failure free is often required. As an example, given a blockchain platform processing around 1,000,000 transactions per second with twelve 9’s reliability, then ‘on average’ we could expect one failure approximately every 3.2 years.

Good systems availability can also start at around four 9’s confidence. That is, the system is ready to perform its function 99.9999% of the time, which means about one hour per year for the system to be unavailable. Critical systems often require higher availability.

The early academic papers on Byzantine agreement referred to in [2], [3] and [4] were written with a perspective towards developing systems tolerant of faults. Significantly, systems that will not fail even though faults exist and, or, may be triggered. To achieve system reliability of four 9’s or better, in case one or more of the network nodes fail or become faulty, redundant nodes replicating important processing can be added to the system.

Nodes can fail in multiple ways. Examples include, to fail silent (processor effectively dies), or to fail to terminate, producing a stream of constant values or a stream of seemingly random values (forever). Even more malicious challenges to fault tolerance are possible, including non-accidental coordinated attacks: 51% attack, hard fork attack, double-spending attack, Sybil attack, DoS attack, etc. Inherently hostile in nature, such attacks can disguise themselves as normal processing.

With this in mind the opening paragraphs of the ’99% Fault Tolerant Consensus’ can now be more concisely restated as: (1) it is possible to achieve consensus with n nodes of which m are faulty if

in a synchronous network, but if

then a 51% attack is possible; and (2) a reliability of over 99% is possible if observers also actively watch (monitor) the consensus. Buterin’s article utilizes Lamport, Shostak and Pease’s ’82 signed messages algorithm (affectionately called SM(m) [2]) but modifies it in two critical ways: (1) each round of message exchanges has a specific timeout so that any message reaching a node after the timeout is rejected; and (2) observers can participate in the consensus in order to improve reliability. However, note that both network nodes and observers can be faulty (or worse, infected with botnets or other malicious software designed to penetrate networks; see for example [8]).

The kinds of high reliability (the desired twelve 9’s) for critical systems like blockchain transaction processing is usually achieved through a combination of two key design features: (1) redundancy; and (2) a ‘no single point of failure’ design. This is exactly what decentralized chain replication gives us! The combination of redundancy and no single point of failure assures us that even if nodes fail the transaction ledger is not lost and will remain consistent. The next leading question is then, how best to achieve consensus among the distributed nodes with respect to the true state (stored blocks) of the chain at any given time and in a timely manner. Buterin’s paper proposes timeouts for each of the rounds which was suggested in [2] as a practical means of solving the Byzantine Generals problem in more realistic networks. Early work on implementing Byzantine agreement in practical systems (see for example, Castro and Liskov [9] ) showed that it was possible for an asynchronous system of n processors to tolerate

(strong) adversaries if messages are authenticated but that the number of message exchanges involved make the early systems difficult to scale up to the numbers of nodes required in blockchain consensus. Current research is looking and using probabilistic algorithms to reach consensus in large networks. How then to best implement efficient consensus in decentralized blockchain systems is an on-going and contentious question, but something that we at Penta are always enthusiastic to be a part of.

We hope this paper will help the blockchain community delve deeper into the fundamental questions related to fault tolerance, just as it has helped us reexamine our own assumptions regarding Byzantine agreement. Our desire is to inspire further analysis and inquiry into the technical problems impacting consensus models for blockchain networks. Scalable solutions to consensus achieving both high reliability and high availability are crucial challenges to overcome if we are to make blockchain applicable to real world scenarios.

Even more of a difficult problem is implementing really fast, robust consensus models in decentralised networks. Decentralisation remains a core tenet for Penta, and for the Blockchain movement in general. As a consequence, our efforts are focussed on successfully achieving technical solutions that support the decentralised consensus models that are fundamental to decentralised blockchains.

We look forward to engaging in the evolving dialogue on fault tolerance and consensus models, as the blockchain community moves incrementally closer to reaching consensus on consensus.

References

1. ‘A Guide to 99% Fault Tolerant Consensus’, https://vitalik.ca/general/2018/08/07/99_fault_tolerant.html, August 2018.

2. ‘The Byzantine Generals Problem’, Leslie Lamport, Robert Shostak, And Marshall Pease, research.microsoft.com/users/lamport/pubs/byz.pdf, July 1982.

3. ‘Authenticated Algorithms For Byzantine Agreement’, Danny Dolev and H. Raymond Strong, SIAM Journal on Computing, 12(4):656–666, 1983.

4. ‘Easy Impossibility Proofs For Distributed Consensus Problems’, Fischer MJ, Lynch NA and Merritt M, https://doi.org/10.1007/BF01843568, Distrib Comput 1: 26; 1986.

5. ‘Understanding Blockchain Fundamentals, Part 1: Byzantine Fault Tolerance’, Georgios Konstantopoulos, https://medium.com/loom-network/understanding-blockchain-fundamentals-part-1-byzantine-fault-tolerance-245f46fe8419, December 2017.

6. ‘The Byzantine Generals Problem’, Mark Nelson, https://marknelson.us/posts/2007/07/23/byzantine.html, July 2007.

7. ‘Advanced Blockchain Concepts For Beginners’, Coral Health, https://medium.com/@mycoralhealth/advanced-blockchain-concepts-for-beginners-32887202afad, May 2018.

8. ‘Hacking Blockchains with Smart Contracts to Control a Botnet’ https://www.esecurityplanet.com/hackers/hacking-blockchain-with-smart-contracts-to-control-a-botnet.html, November, 2017.

9. ‘Practical Byzantine Fault Tolerance’, Castro, M., and Liskov, B., Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999.