Fault Tolerance in a distributed system forming a blockchain

Published in

GAME

14 min readNov 1, 2018

Over the past two articles about distributed system, We have explained how to create a high-quality distributed system and blockchain.

Synchronization between nodes in a distributed system forming a blockchain
https://medium.com/mold-project/synchronization-609369558ce7
“Consistency and Duplication in a distributed system (What is the protocol MOLD needs?)”
https://medium.com/mold-project/consistency-e3e0fe41358d

Distributed systems are essential concepts for achieving high scalability, locality, and availability. On the other hand, however, a lot of ingenuity is required for the entire system to look consistent when viewed from the client. In addition, it is said that it is almost impossible to construct a distributed system with complete features, and it is necessary to select which performance should be emphasized by the application.
In addition to describing the characteristics of these distributed systems, we have also described the characteristic properties of blockchains with high performance. Finally, by summarizing the fault tolerance property, we will explore further greater potential that the blockchain have and would like to explain comprehensively the system that MOLD should aim for through discussion of each advanced blockchain project such as Tendermint.

1. Introduction (Outline of Fault Tolerance and Overall Flow)

Unlike a single system, distributed systems have partial failures. Overall failure of a single system tends to make the whole system down. On the other hand, in a partial failure, the system can continue to operate while recovering from a partial failure without seriously affecting the overall performance.

In this article, in following order, we will explain fault tolerance; a system can continue processing even if a part of the system fails.

What kind of properties will be fault tolerant
What kind of failure there are and how they can be classified
How fault tolerance is actually realized in a distributed system
About communication failure
“Reliable multicast” that increase process’s resistence
About Distributed Commit Problem

2. What is fault tolerance?

Fault tolerance

Fault tolerance is defined as follows

The ability to endure service even if failure occurs

In addition, a system with fault tolerance is sometimes called a high dependability system, and requirements related to dependability system are classified into the following four.

Failure model

Typical failure for processes in a distributed system are the following four:

Faults for a communication link are classisied as well.

For Byzantine failures, for example, delivery of false messages etc may occur, so it is the most bad and difficult to deal with.

Failure can be hidden by redundancy. This is easy to understand, for example considering that mammals have two eyes, ears, and lungs. Even if some of these distributed organs fail, you can use the system while hiding the breakdown. This is called physical redundancy. There are three types of redundancy: information redundancy, time redundancy, and physical redundancy.

3. Process resilience

Following the description of fault tolerance, we consider how fault tolerance is realized.

Process Replication

A typical method is process replication. Creating (duplicating) the same process in a group is called Replication. By replicating in the distributed system, it is possible to provide a service by a normal process even in case of a partial failure. We call a replicated process a replica.

As mentioned in the previous article on consistency, (https://medium.com/old-project/consistency-e3e0fe41358d)
There are two approaches to multiplexing (duplication) as follows.

Primary base protocol (Passive Replication)
Duplicate write protocol (Positive Replicationl)

In the forme one, only the primary replica handles messages from clients, and the other replicas back up the main processes. While there is no inconsistency in processing results between replicas and implementation of communication functions is easier, selection algorithms are required for failure of primary replicas, and the processing is somewhat complicated.

In the latter case, all replicas receive and process messages from clients. At this time, two properties of total ordering and atomicity are required for processing based on the message. Therefore, atomic multicastrequires more complicated communication function.

k Fault tolerance

In duplicate write protocol, it is said to have k fault tolerance, that k components move properly even if they fail. If you have a Byzantine fault, you need at least 2k + 1 processes to have k fault tolerance.

Atomic Multicast Problem

As a premise of the above replication model, there is a condition that all requests must arrive in all servers in the same order. This is called atomic multicast problem. This will be discussed in more detail in Chapter 5.

Agreement between processes

The problem of agreement between processes is fundamental and important for giving distributed systems fault tolerance. The purpose of the distributed agreement algorithm is to reach consensus in a finite number of steps for processes that are not failing among themselves, and there is a problem of General Byzantine in representative ones.

General Problem of Byzantine

In a system with k faulty processes, agreement is reached only when there are 2k + 1 or more normal processes and there are N =< 3k + 1 processes as a whole. In other words, agreement is only possible if more than two thirds processes are working correctly. (If it is less than that, it may be deceived by a failing process.)

**Addition: About the number of normal nodes required for fault tolerance

With many protocols, the maximum allowable number of nodes with Byzantine obstruction is said to be 1/3. The reason will be briefly described below.

Let “N” be the total number of nodes, “F” byzantine nodes, and “T” the number of nodes required to normally consensus.

For example, suppose that normal nodes of “N — F” are divided into the same number, and the number is expressed as follows.

(N−F) / 2

Since the Byzantine node of “F” has arbitrary behavior, in order to take consensus normally, it is necessary to satisfy the following expression.

T> (N−F)/2 + F ⋯①

Also, considering the case where all the Byzantine nodes of F are offline, the consensus can be taken by other normal nodes, so the following expression holds.

N−F ≧ T ⋯②

From ① · ②,

N−F > (N−F)/2 + F
∴F < N3

Based on the above, when the number of Byzantine nodes among the total nodes is less than 1/3, consensus can be taken normally.

4. Reliable client-server communication

So far, we discussed the fault-tolerance of processes in distributed systems and learned about replication. This chapter discusses the introduction of fault tolerance on communication link.

P2P communication

The basis of communication in a distributed system is point-to-point communication (one-to-one communication) connecting one process and another process.

TCP

TCP: Point-to-point communication that enables reliable communication
TCP has a mechanism such as sequence number, timer, checksum, acknowledgment, retransmission control, congestion control and so on. For example, an omission failure due to a missing message can be dealt with by an acknowledgment including a TCP sequence number and retransmission control based on the acknowledgment.

RPC (Remote Procedure Call) in the case of a fault

The purpose of RPC is to realize interprocess communication without being conscious of the communication part by the form of local procedure call. There are five obstacles that can occur in a distributed system using RPC.

The client can not locate the server.
The request message from the client to the server is lost.
The server crashes after receiving a request.
The response message from the server to the client is lost.
A failure occurs after transmitting a request message at the client.

As a countermeasure to each, there is a method of setting exception processing and a timer (time limit).

5. Reliable group communication

We focused on one-to-one communication in the previous chapter, so here we explain about high reliability of one-to-many multicast communication. In a distributed system, it is important that messages are sent without leakage including the order to each other’s servers.

Reliable multicast in the absence of failure

Consider delivering messages to each member in order.

The sender first saves the multicast message in the history memory at hand. Also, the sender receives a transmission confirmation notice (ACK) from the receiver. In the ACK, the last message identifier completed transmission is entered and returned. If the ACK containing the expected identifier can not be received due to message loss or the like, the sender retransmits the message.

Reliable multicast in case of failure

If a process fails in a distributed system, two guarantees are important.

Ensure that the message from the sender is delivered to the whole process or not delivered at all.
Assurance that messages from senders are delivered to all processes in the same order.

In a distributed system, not “a process”
Reliable multicast with the property that “when” sender “during message delivery fails, that message is delivered to all remaining processes or ignored” is called virtual synchronization .

Also, communication that is virtual synchronization and carries out message delivery in total order is called atomic multicast.

One implementation example of virtual synchronization is Isis. Isis keeps and transfers mmessage M to process until it knows that all members have received message M.

6. Distributed Commit

The problem that generalizes atomic multicast problem is called distributed commit problem.

Atom Commit

It is necessary to consistently judge that different site-like processes consistently commit or abort. Such an operation is called atomic commit.

6–1. Two phase commit

Two-phase commit protocol (2PC) is a typical method to realize atomic commit. As the name suggests, each phase consists of two steps and is organized as follows.

(Phase 1 【Voting phase】)

The coordinator sends a VOTE_REQUEST message to all participants
The participant who received the VOTE_REQUEST message sends a VOTE_COMMT message to the coordinator if it can commit its transaction and votes by sending a VOTE_ABORT message if it needs to abort.

(Phase 2 [commit phase])

3. The coordinator gathers votes from all participants. If all votes are COMMIT, we commit themselves and send GLOBAL_COMMIT message to all participants. If ABORT even more than one, it decides to abort the transaction and sends a GLOBAL_ABORT message.

4. The participant waits for a message from the coordinator, if it is GLOBAL_COMMIT locally, it commits, if it is GLOBAL_ABORT it discards the transaction.

Throughout, the coordinator and the participants make state transitions as follows.

SKEEN, D “Nonblocking Commit Protocols.” Proc. SIGMOD Int’l Conf. On Management Of Data. ACM, 1981

Blocking commit protocol

There is a big problem with the above two phase commit protocol. When the coordinator fails in Phase 3 and all participants are waiting for messages from the coordinator. , Participants can not decide cooperatively the decision of the action which should be finally taken. From this, two-phase commit is said to be a blocking commit protocol.

Actually, blocking itself in 2-phase commit rarely occurs, so it is not used much, but 3-phase commit protocol is devised as a solution to avoid blocking.

6–2. Three-phase commit

Unlike the two-phase commit protocol, the three-phase commit protocol satisfies the following two conditions. It is indicated by [Skeen and Stonebraker, 1983] that these two conditions are necessary and sufficient for a commit protocol without blocking.

There is no such situation as going directly to COMMIT state or ABORT state.
There is no possibility of making a final decision and there is no such state as transitioning to the COMMIT state.

SKEEN, D. and STONEBRAKER, M “A Formal Model of Crash Recovery in a Distributed System.” IEEE Trans. Softw. Eng., Mar. 1983

Specifically, a PRECOMMIT state is provided between two phases of two-phase commit.
Throughout the participants and the coordinator change state as follows.

DISTRIBUTED SYSTEMS “Principles and Paradigms” Chapter7 CONSISTENCY AND REPLICATION / Andrew S.Tanenbaum, Maarten Van SteenX

The big difference from two phase commit is that all processes return to INIT, ABORT, PRECOMMIT state. Since it never stays in the READY state, the remaining process always makes a final decision and can act as a non-blocking protocol.

The three-phase commit is merely a concept presentation, and there is no mechanism yet to work properly even if a coordinator fails. However, after the appearance of blockchain, its history will move greatly. The Tendermint project realizes the non-blocking protocol by adopting three-phase commit in the block chain. The details of tendermint will be explained at the end of this article.

7. Fault Tolerance in Block Chain

Finally, based on the above, we will also refer to the fault tolerance in the distributed blockchain system.

7–1. Blockchain fault tolerance

The fault tolerance of the blockchain is high. Let’s take a closer look at the nature of the blockchain based on the four high requirement of dependability classified in Chapter 2.

The time and number that a blockchain system stops functioning is small. Especially in the Bitcoin network, it can be said that there are rarely high availability and reliability in that it realizes zero downtime and continues to operate normally even if some nodes are out of order.
Next, regarding safety, when the system is not operating properly in a blockchain network, problems like “Transactions are not processed and clogged”, “Information is not shared between nodes in the network and get the blockhain forked” will arise. The latter problem is highly likely to lead to major troubles.
Regarding maintainability, it can be said that communities are easy to divide in case public blockchains like Bitcoin, and recovery from it is difficult. The Bitcoin network can be highly appreciated in that it has high availability and reliability so that there is no need for recovery, but if you want to have maintainability you should consider choosing a private chain or consortium chain.

Also, the blockchain is very meaningful in that it presents effective solutions for byzantine fault, which are considered to be the most difficult to deal with. Specifically, it is a consensus algorithm typified by PoW etc… PoW deal with the Byzantine general problem by forming an incentive structure; argorithm that miner cam gain more profit by maintaining / contributing rather than actions that destroy the network based on game theory. It should be noted that new problems such as hard forks are occurring, however, it can be said that it has achieved certain success. Besides, the PBFT adopted by Hyperledger also achieves high Byzantine fault tolerance by setting leader node confirming the vote.

7–2. Blcokchain Process resilience

Consider how fault tolerance is realized following the description of fault tolerance.

First, there were two approaches to process replication.

Primary base protocol
Duplicate write protocol

A primary one that adopts the primary base protocol of 1 is a blockchain based on the PoW consensus algorithm. In the case of PoW, it is the specification of the local write protocol, among the primary base. Miner who succeeded in finding the nonce value of PoW as the exclusive control (leader selection algorithm) gains the right to add the block as the primary server. However, when a node with the right to become the primary server appears simultaneously, the blockchain forks.

On the other hand, the one that adopts the duplicate write protocol of 2 is the blockchain based on PBFT. Various PBFT-based consensus algorithms including Tendermint do not have a primary server that first executes updating of each data responsibly, and all participating nodes can perform write operations in the same period. That is, it can be said that the PBFT type consistency protocol is similar to the active replication protocol of the duplicate write type.

Details of these consistency protocols are summarized in more detail in an article on consistency in distributed systems (https://medium.com/mold-project/consistency-e3e0fe41358d).

7–3. Blockchain high reliability communication

I have mentioned the process of blockchain, but this time I will focus on the communication link.

In blockchain, each node participating in the network performs P2P communication and shares data. In addition, the primary server selected by the leader selection algorithm performs multicast in order to share information of a newly added block to each participating node, for example, when a nonce is found. At this time, it is important to realize atomic multicast, which is virtual synchronization and carries out message delivery in total order, considering the case where a failure occurs in a communication link or a node.

So, how is the atomic multicast problem and the distributed commit problem solved in blockchain?

In public chains adopting PoW like Bitcoin, atomic multicast has not been realized. Therefore, frequent forks can occur. Since each node shares data correctly over time, consistency is established, but it takes more than 10 minutes to confirm that the transaction is stored in the block.

Here, We would like to pay attention to the Tendermint consensus algorithm. In general, there is a 2PC(two-phase commit) as a method to realize atomic commit, and a 3PC method as an improved version has been proposed, but both were incomplete. Therefore, Tendermint realized atomic commit by blending the blockchain with the 3PC method and adding constraints on the node under the round robin method. I will explain the approach to this exciting new innovative distributed commit problem in the next chapter.

7–4. Distributed Commit in Tendermint (Innovative Three-Phase Commit Model)

First, Tendermint is PBFT type. In Hyperledger, the validator as a leader is always the same process, but Tendermint has a leader selection algorithm, and a leader is determined deterministically by the round robin method. The leader collectively proposes the next block of transactions stored in mempool. With this proposal, the Tendermint consensus implements 3PC(three phase commit) and realizes atomic multicast. The Tendermint consensus algorithm can be roughly divided into three states.

Propose
Proposal of block by validator set deterministically selected by round-robin method by leader selection algorithm based on stake quantity. Start of voting in this state.
PRE-VOTE
The first vote for the proposed block. As soon as two-thirds or more of approval is obtained, we will move on to the next step, but wait until the limit time until all votes are gathered. Because of this time limit, it can be said that Tendermint is a partial asynchronous consensus algorithm. In addition, this voting algorithm has 1 / 3k fault tolerance. (Mentioned in Chapter 3)
PRE-COMMIT
The second voting for the block agreed by more than 2/3 in PRE-VOTE. At this time, as described below, Tendermint’s smart part is a measure when the votes of 2/3 or more are not gathered.

As mentioned in Chapter 6, by setting the PRECOMMIT phase for three-phase commit, it was possible to realize the blocking protocol if the following conditions are satisfied.

There is no state that directly transits to COMMIT state or ABORT state
There is no possibility of making a final decision and there is no such state as transitioning to the COMMIT state.

In Tendermint, the validator voted in the second voting phase, Pre-Commit, is locked and can only vote for locked blocks or blocks with more than 2/3 votes in Pre-Vote. By the treatment of locking, the above two conditions are satisfied. In other words, since each validator can only vote in Pre-Commit to one block at all times, it realizes no fork mechanism.

In other words, “Tendermint consensus ensures that the operation of adding blocks is done on all nodes in the network, or no nodes at all; the next generation consensus protocol that realized the finality.

Tendermint Documents “https://tendermint.readthedocs.io/en/master/introduction.html"

— — — — — — — — — — — — — — — -
Cosmos Gaming Hub Project(Former MOLD project)
CEO & Co-Founder

Takumi Asano

For all game enthusiasts