Distributed system models

Mohammed Ragab
Nerd For Tech
Published in
5 min readApr 30, 2021

In distributed system world nothing 100 % reliable so we always consider faults so when designing a distributed system model is how we designate our premises about faults.

In general, we can categorize models into three

1- Network behavior

2- Node behavior

3- Timing behavior

Network behavior

As no network is reliable in the real-world even if we designed systems with replicated network links because maybe someone easily unplugs the network cable wrongly, the network may be overloaded or network may be attacked, and so on. As we deal with reliability problems in network and message loss we can select one of the following three models that indicate how reliable we want.

1- Perfect link or reliable link

In this model, a message is received only if is send so it is a high level of grantee that we will receive the message correctly maybe messages re-ordered but in fact, in the real world this model is not easy to be assumed.

2- Fair-loss links

In this model, messages may be lost or duplicated wrongly and re-ordered as well but if you keep retrying to send messages it eventually get through but we can not identify the time to be through but theoretically if we keep retrying in an open time maybe 100 years :) the message will get through and we can turn fair-loss into a reliable link by keeping retrying this means that any network partition (network interruption) will last only for a finite period of time, but not forever, so we can guarantee that every message will eventually be received.

3- Arbitrary

This model is exactly what happens in communication over the internet when we use the public unsecured networks link in the Starbucks or any coffee shop the network operator can potentially interfere with and manipulate your network packets this also known as an active adversary but also possible we can turn it into fair-loss links by using cryptographic to secure our connections and network packets such as transport layer security (TLS) protocol to prevent an active adversary from dropping or spoofing our traffic but also TLS can not prevent communication blocking and after turning into fair-loss we can eventually turn fair-loss into reliable.

Thus, the assumption of a reliable network link is perhaps not as unrealistic as it may seem at first glance generally it is possible for all sent messages to be received, as long as we are willing to wait for a potentially arbitrary period of time for retries during a network partition

Node behavior

In this category we have to determine that the node is faulty, Each node executes a specified algorithm from the following

1- Crash-stop ( fail-stop)

In this model, a node is faulty if it crashes at any time and after the crash, it stops respond or executing jobs forever and it never recovers this is reasonable for unrecoverable hardware such as if your hardware got fired or dropped under the train and etc. if we assumed crash-stop the algorithm became more simple.

2- Crash-recovery (fail-recovery)

In this model, a node may crash at any time as well but it may resume again after sometimes such as operating system stuck or Kernel bugs then the node need restarting to re-join the distributed system again and resume processing after a crash and any unrecoverable storage will be lost such as memory but we can avoid this by using recoverable storage also we should take in consideration the node maybe crash-recover mode in infinite and this may lead to fail-stop as well.

3- Byzantine (fail-arbitrary)

This is the most common node behavior, in a byzantine problem the faulty node may not only crash but as every node in your system should follow an algorithm we can consider a node is faulty if it deviates from the algorithm in unpredictable or arbitrary ways such as malicious behavior as we mentioned before also a node implementation bugs can be classed as Byzantine fault.

In the case of the network, it was possible to turn from one model into another. This is not the case with the different models of node behaviour. For example , if an algorithm designed for a fail-recovery may look very different from a Byzantine algorithm and so on

Timing behavior assumptions

The timing behavior is about synchrony assumption so we have three models synchronous, asynchronous, or partially synchronous but take into your consideration that the definitions of these terms differ somewhat across different parts of computer science so we are defined them in the distributed computing field in order not to mix things up.

1- Synchronous

In this model, we have a max time for the message to be delivered so the message latency is no greater than a known upper bound and we can consider as well that the node executes the algorithm at a known speed and everyone who deals with distributed systems loves this model and many problems in distributed computing are much easier if you assume a synchronous system :) but unfortunately most of the time the nodes do not have the same behavior and algorithms designed for synchronous models often fail disastrously if the assumptions of bounded latency and bounded execution speed are violated.

2- Asynchronous

In this model we make no timing assumptions at all, for example, we allow messages to be delayed unpredictably and the nodes can execute the algorithm at different processing speeds from each other. asynchronous-based algorithms are very strong because they are unaffected by network interruptions or latency but on another side, some distributed computing problems can not be got solved by Asynchronous algorithms as well so we have the third model partially synchronous come to terms.

3- partially synchronous

In this model, we assume that our system is synchronous and behaves perfectly most of the time, but sometimes it may turn into an asynchronous mode in which all timing guarantees are off, and this can happen unpredictably.

Resources:

Introduction to Reliable and Secure Distributed Programming book

--

--