Byzantine Failure — Why Blockchain Development is Difficult

Seung Woo Kim
CodeChain
Published in
6 min readMay 18, 2018

Since 2017, blockchain technology has become such a hot topic that it can be considered the trend of this era. As a result, even people who are not developers are discussing the endless possibilities of the blockchain. If you ask those same people why blockchains are difficult to grasp, they usually mention that blockchains are not just technological innovations, but are forms of money. However, this is far from the truth. The most famous blockchain system is known as Bitcoin, and since Bitcoin is a representative cryptocurrency, people usually have this misunderstanding. Blockchains do not have to be a form of currency.

If you ask someone who studied blockchains a bit further, they’d probably say that blockchains are difficult to develop because they require an incentive model. Blockchains are fundamentally made up of nodes that join the system voluntarily. Thus, in order to get these nodes to participate, an incentive model is necessary, which is more easily said than done because it steps foot within the realms of the human psyche. It is difficult to see through what a human being is thinking because most developers and computer do not possess psychic powers. Although developing an incentive model makes blockchain development challenging, it is not the fundamental reason that makes the entire development process a demanding one. Even without the incentive model, the blockchain system is a complicated one on its own.

In a distributed system, there exists numerous nodes, and through exchanging messages with one another, their states are changing. In this process of communication, we cannot assume that all nodes are working as intended. These kind of systems have a higher chance for problems to occur than systems where problems are resolved within a single node. Realistically, every problem cannot be solved, and thus, the system have a defined set of what kind of problems it can and cannot solve. If there are problems outside of this set, then the system is not guaranteed to work properly. The types of problems that the system can solve are defined within the system’s Failure Model.

Blockchains are difficult because it solves the hardest problems to solve in a distributed environment. Traditionally, when it comes to distributed systems, Failure Models are divided into six categories. Since these categories are hierarchical, the bigger category contains the smaller ones. Distributed systems set a Failure Model as an objective, and when a failure from a higher hierarchy occurs, it does not attempt to resolve it.

The simplest problem to solve is the Fail-stop Failure Model. In this model, the problematic node’s state no longer changes; however, information about the states are still returned. In a Fail-stop Model, all the requested messages arrive, and all the arrived messages are assumed to be valid, and thus, problems are solved easily. Realistically however, messages may never arrive, and this model does not consider the case where the server may crash. Thus, this model is not very commonly used.

Consequently, the next model considers crashes and it’s intuitively called the Crash Failure Model. In this model, every node that do not give a reply is assumed to be in a crashed state. These crashed nodes can be simply replaced with a new node and the problem can be resolved. However, even the Crash Failure Model is not realistic on an internet scale distributed system. The reasoning is that on an internet scale, a node may not reply for various reasons other than a crash. This is why the Crash Failure Model is used for communication between processors within multicore processors or distributed systems within a single data center.

The simplest model that can be used within an internet scale system is the Omission Failure Model. In this model, a specific unit of time T is defined, and all messages must arrive time T later. If it does not arrive in this time frame, then it is assumed that it will never arrive at all. This model simply represents a possible problem that can arise within a network.

However, during real world applications, a network can have more critical and complex issues. In reality, it is close to impossible to actually choose a maximum bound T for message arrival times. There may be problems that lead to slow message arrivals; however, a message just may arrive extremely late for other reasons not related to any type of failure. The Performance Failure Model considers exactly these type of circumstances. A distributed system that follows the Performance Failure Model must consider cases where the message does not arrive, or just arrives after a prolonged period of time.

All the problems discussed until now assumes that messages that actually arrived have no problems associated with them. In other words, the Fail-stop Failure Mode, Crash Failure Model, Omission Failure Model, and the Performance Failure Model trusts the messages that actually manage to arrive. However, as you’ve probably guessed it, there are times when even these messages cannot be trusted. For instance, a node may actually contain a bug, or due to hardware failure, the message may only have been partially sent. In the worst case, the message may be modified due to a Man-in-the-middle attack(a.k.a. MITM). In cases where there exists a node that consistently sends out wrong messages, the problem is easily solvable. However, if a node sends a mix of valid and invalid messages, the problem becomes much more difficult. A node that sends out such mixture of messages is called a byzantine node, and failures such as these are called byzantine failures.

Generally, it is difficult to handle every byzantine failure. Thus, the Authentication-detectable Byzantine Failure Model attempts to handle only those byzantine failures that can be verified by the system. This model solves the problem by integrating an error detection scheme or an asymmetric cryptography. Most internet scale distributed systems utilize the Authentication-detectable Byzantine Failure Model.

However, blockchains use a model that consider every byzantine failure, which is known as the Byzantine Failure Model. This is the hardest known failure model to utilize in a distributed system. Thus, most existing distributed systems have elected ignore byzantine failures. As a result, cases where the node shows irregular behavior due to unethical motives, being hacked by others, etc, were not considered.

The concept of blockchains is difficult to grasp because all blockchains utilize the Byzantine Failure Model. For instance, incentive model, which is difficult to implement, invites all participating nodes to not become a byzantine node, and the reason the consensus becomes difficult in blockchains is because it assumes byzantine failures. If blockchains stop utilizing the Byzantine Failure Model, then would the problem not become simpler?

Unfortunately, this is impossible because a blockchain is a network made up of self-participating, decentralized nodes. In decentralized networks, there are many reasons for causing a byzantine node. The most common reason is due to corrupt users. There are various reasons for their unscrupulous actions. These users may be making these attacks because it’s their hobby. Maybe it could be for personal economic gains. It could even be due to a personal vendetta. Whatever the reason, there has been many cases where another person attacks the system. In existing systems where users are not free to join a network as a node, people usually opt for a DoS attack. However, in a blockchain where anyone can join as a node, it is far more effective to join as a byzantine node to wreak more havoc upon the entire network.

However, byzantine nodes are not only created by evil users. It can also be done by lazy ones as well. In a centralized distributed system, upgrading is fast and easy. This is because the central administrator can upgrade the entire system him/herself. However, a decentralized network is different. Even though most nodes may have upgraded, a few may not be up to date due to inattentive users. In this case, the nodes that are out of date cannot understand messages coming from updated ones, causing a byzantine failure.

We have discussed the reasoning behind why blockchain development can be so difficult, and what exactly are Byzantine Failures. If you were to ask about blockchains to an engineer who has no interest in the subject, he/she would probably reply that blockchains are a type of distributed data store that stores all the history. However, this is true when looking at only a small aspect of blockchains. Rather, storing the history is a property that is used to recover from a byzantine failure. Blockchains must solve more fundamental problems than this.

I personally think that this difficulty is what makes blockchain development interesting. Things that were only being researched as theories are now being realized through actual implementation. In particular, the byzantine failure model is the failure model that most closely resembles reality. Consequently, distributed systems that are released in the future would most likely assume a Byzantine Failure Model. Therefore, even if you are not a blockchain developer, knowing what kind of research is being done in the blockchain field is probably important as a distributed system developer as well.

--

--