Why N = 3f+1 in the Byzantine Fault Tolerance system

In a distributed system, it is not guaranteed that there is nothing wrong with the message sent by the other node. When a wrong node is sending wrong messages to everyone else, the problem can be easily solved by blocking the node itself.

However, when the right message is sent to partial nodes and wrong message sent to other, it becomes hard to find the problem because in a distributed system, each node doesn’t know the state of every other node. Nodes sending wrong messages to only some nodes is called the byzantine failure model.

Byzantine failure model is the most difficult model to solve when it comes to networks, and at the same time, a problem that absolutely must be solved. Especially in p2p, where other nodes cannot be trusted, byzantine failure model must be assumed before going on to resolve abnormal situations.

Just because a distributed system consists of only certified nodes, it does not mean that byzantine failure model assumption can be ignored. The node itself can be managed by a certified person, but it could have been hacked, sent the wrong message due to bugs, or there could have simply been some sort of hardware failure.

Byzantine fault tolerance (a.k.a. BFT) is a system that operates normally within a byzantine failure model. However, even BFT does not operate when there exists numerous faulty nodes. Thus when discussing BFT, the amount of faulty nodes that can be tolerated must be mentioned. For example, if N = 5f, even if ⅕ of the nodes suffer from byzantine failure, the entire system operates properly. Likewise, if N=3f+1, ⅓ of the nodes could suffer from byzantine failure and the entire system will still operate with no issues.

If we have two systems that are BFT, then obviously the one that can handle more faulty nodes is the superior system. However there is no system that assumes a faulty node of greater than ⅓ of the entire system. This is because theoretically, ⅓ is the greatest amount of faulty nodes that the system can handle.

There are 2 possible problems that can arise when a node undergoes byzantine failure. The first problem is not sending a message at all. More specifically, within a distributed system, in order for N nodes to function properly while having f nodes suffering from byzantine failure, a consensus has to be reached with N — f messages. Simply put, N — f nodes are required for quorum.

The second problem is when a node in byzantine failure maliciously sends different messages. In an extreme scenario, let’s say that among the N — f nodes that achieved quorum, f were sent by byzantine failure. Even in this case, the system has to operate normally, and thus (N — f) — f messages must be greater than f messages(sent by nodes suffering from byzantine failure).

In order to resolve the two problems above, (N — f) — f > f. N > 3f, which means when there are f nodes that has a byzantine failure, there has to be more than 3 f nodes in order for the system to be byzantine fault tolerant. The smallest N value here is 3f + 1. Thus, in a system that is made up of 3f + 1 nodes, the greatest amount of faulty nodes that can exist is f .

Originally published at medium on April 26, 2018.