Consensus in Distributed systems: An overview

Srikant Viswanath
Architectural Tapas
2 min readJun 21, 2020

Consensus is one of the most important and fundamental problems in the distributed systems that seems deceptively simple.

In this post we will take a look at what it is and what are the criteria for a solution.

What is it?

The problem is simple — Getting several nodes to agree on something.

Consensus algorithms are used to determine which one the mutually incompatible operations should be the winner.

Some end user use cases:

  • Several people grabbing for that last seat on the airplane
  • Multiple users trying to grab the same username — Note that this is subtly different problem to checking if a username is taken which is usually solved using bloom filters

Time to get formal

Lets take a closer and formal look at the above problem

One or more nodes proposes values and the consensus algorithm settles down on one of those values

Any consensus algorithm formally must satisfy all of the following characteristics:

Integrity: No node decides twice

Uniform agreement: No two nodes eventually settle on different values

The above 2 define the core idea of consensus — everybody decides on the same outcome followed by the fact that they cannot change their mind once settled on a value.

Validity: If a value was settled upon, it must have been proposed by some node

What if the algorithm always decides on a trivial value like null ? Validity property safe guards against it.

Integrity, Uniform agreement and Validity properties are usually group together as Safety properties

Termination: Every live node must decide on a value, i.e, even if it might have different opinions earlier, once consensus arrived it must abide by it.

This captures the essence of fault tolerance — a consensus algorithm cannot simply wait forever and do nothing, it must make progress. If one node fails, other nodes must reach a consensus anyway.

We can quickly realize how the 2 phased commit(2PC) violates termination property. In other words:

Consensus model assumes that when a node crashes — it is not coming back. Any algorithm that waits for a crashed node to recover does not fit termination property and thus consensus algorithms’ criteria

It is not going to work if we take this idea to the extreme, i.e., no consensus would be arrived at if all nodes failed. Of course, there is a minimum number of nodes that need to be alive which is referred to as quorum

An entire subset of consensus algorithms are called fault tolerant consensus algorithms which ensure termination property is satisfied. This is too deep a topic to warrant its own post.

Popular fault tolerant consensus algorithms are — View stamped replication, Paxos, Raft, Zab

--

--