Gossiping algorithms: Even they gossip, that too in an amazing and fascinating way

Don't remember the exact problem mentioned by that senior engineer in a particular tech talk, which they faced while working on some product. But it drilled down to something: 
Given a cluster of nodes or servers, how to efficiently get the information about current status of a particular node .i.e is it up and running, dead, active, inactive etc. 
As far I remember, the bigger picture was data retrieval or update from a distributed cluster of thousands of nodes, which came down to a particular problem statement failure detection in a really large cluster of nodes.

Failure detection in Distributed systems

Given that you have multiple servers and clients in a cluster working together for serving a definite purpose be it a database, search engine, object storage, a load balancer or maybe something else.

Failure detection is one of the challenges which always crops up and has to be addressed. Different architectures address it differently.

But the very first approach which could be thought of is to use centralized monitoring for the cluster(distributed system), 
which does seem a fair choice if you can afford a single point of failure and certainly, there are other issues apart from availability, latency may be one of them.

It may not be the obvious choice when your scale is of 10 raised to powers of 3 or 4, or way beyond that because then things start getting complex.

When you need to scale

A single point of failure is hardly an option in any system, doesn’t matter the scale. 
Optimized use of various resources like network bandwidth, memory, CPU utilization, and latency is a must.

Although there may be multiple solutions and approaches to address the problem for failure detection, Gossiping algorithms/protocols in such use cases prove to be really effective and efficient (which was the solution used by them as well for their purpose).

The major advantage of these is the Distributed monitoring, preventing a single source of information and a single point of failure both.

What exactly are Gossiping protocols

The super cool and super simple concept is: 
Each node or server in cluster sends whatever information it has about the cluster state, to other nodes in the cluster on continuous intervals.
This way the information spreads like an epidemic or rumor, that's why its also termed as Epidemic protocol sometimes. 
This sort of setup has eventual consistency .i.e one will get correct information but with some delay.
Although gossip protocols have fast convergence.

1000 feet view of Gossiping protocols

  • In the cluster, every member keeps a list of known members, their addresses, an integer which is the heartbeat counter.
  • Every T seconds each member updates its list of neighbors’ heartbeat counters based on health pings checks or pings and 
    selects other members at random to send the updated information to.
  • Upon receipt of such a gossip message, a node merges the list in the message with its own list and adopts the max heartbeat counter for each member.
  • As long as heartbeat counter keeps on increasing for a node, it is guaranteed to be healthy and is considered dead if heartbeat hasn’t increased for more than TFail seconds.

The mechanism can be implemented easily by maintaining some straightforward information on every node:

  • Healthy node list: Nodes which are assumed to be up and working as per current information.
  • Suspected node list: Nodes which may be down as per current information, i.e ones whose heartbeat hasn’t increased for last TFail seconds. They are meant to be forgotten and deemed dead after TForgot seconds.
  • Heartbeat counters list: For every healthy node, last updated timestamp according to current information.

The information mentioned here is received from multiple sources(nodes) and is updated accordingly in the local memory of a node.

For making the communication possible, there will be a need for a daemon or service running on every machine which can interact with other nodes in the cluster for:

  • sending information about its health status
  • sending all the information it has about the cluster
  • receiving information about some node’s health status
  • receiving information about the cluster from some other node

A lot of corner cases and unanswered questions

Its a 1000 feet view of working of Gossiping algorithms, there are tons of unanswered questions and corner cases, still remaining which obviously are beyond the scope of this write-up. Few of them being:

  • How to decide if a node is up or not.
  • What ports to use to send or receive information
  • What to do with a failed node
  • Which protocol to use to transfer the data among nodes
  • Impact on network bandwidth
  • What to do in case of network failure
  • Still, there are many remaining :P

Useful links which might be helpful to dive deeper