The Hitchhikers Guide to Distributed Systems: Part 1

Published in

Weaver Labs

6 min readJun 29, 2021

A consistent, available and partition tolerant ATM

In part 1 of our series on distributed systems, we will give an overview of what a distributed system is and also go over one of the fundamental theorems that must be considered when understanding how they work.

We are often faced with situations that aren’t suited to the design of a single computer system, for example, most websites aren’t only hosted on single servers and many challenging mathematical or physical problems aren’t simulated or solved on single CPUs. Generally, we require the robust and flexible nature of many computers and devices working together as distributed systems.

Furthermore, a single computer can only do so much…

1. Physically, we can only put so many transistors on a single microchip, unless we figure out how to make electrons travel faster. What this means is that CPUs might be limited in terms of performance improvements over time.

2. A single computer system can only be so reliable. What happens if the power gets cut?

3. Many applications aren’t suited to a single machine, so even with your new 400,000 core CPU (see Cerebras Systems Wafer Scale Engine), our requirements might still not be met (e.g. an application might require local devices in more than one single geographical area or require the redundancy and reliability of several backup machines)

So how do we overcome these issues? Enter the distributed system.

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. (Distributed systems: principles and paradigms.)

Distributed Systems can be used to tackle large computational problems or create resilient data storage solutions. It seems simple enough, if you want to backup a large data-set across several different machines then all you have to do is load the data onto each machine one by one and you’re set. However, what happens if the data goes through high churn, while at the same time requiring frequent access? If a user sends a request to a particular machine for part of the data-set, how can that user be sure that the data is the most up to date? What if that particular machine hasn’t been updated with the most recent data yet? This sort of problem is generally understood to be the Achilles’ heel of distributed, among others!

An archaic example of a distributed system that we all use is our banks network of Automated Teller Machines (ATMs). ATMs allow us to grab cash from pretty much anywhere (see point 3 above). ATMs demonstrate the core paradigm that all distributed systems face; the CAP Theorem (Consistency, Availability and Partition Tolerant). The CAP theorem states that in a distributed system, only two of the following properties can be true:

Consistent: Do all the computers in the Distributed System see the same information all the time?

Available: All the computers in the network must be online and ready to be used at all times

Partition Tolerant: A distributed system is partition tolerant if it continues to run correctly despite any amount of failure within the system (so long as a single machine/node continues to run). Imagine there are a group of 5 nodes and 2 of the nodes become disconnected from the rest, but remain in communication with another. If the distributed system is partition tolerant, it will continue to run despite the fact that the two disconnected nodes won’t be able to maintain an up to date state with the other 3 nodes.

In order to understand why these three properties of the CAP theorem can only exist as pairs, lets consider our above example of a distributed system, the ATM.

Given that only two properties of the CAP theorem can be true at any time, lets examine each of those cases.

Case 1: Consistent, Available but not Partition Tolerant.

• Suppose you are at a single ATM that is part of a larger system of ATMs (the distributed system). If all the ATMs in the network are required to be online, then data can only be consistent across the ATMs if the network is not Partition Tolerant. This is the case because partition tolerant networks are ones that continue to operate despite nodes within the network being temporarily out of sync in regards to instructions or state. Therefore, a distributed system can only be consistent and have all nodes available if there are no partitions in the network.

Case 2: Consistent and Partition Tolerant, but not Available.

• In order to fulfil the requirement of Partition Tolerant, we must continue to operate correctly, despite a failure within the system. A system failure can include a single node going offline (note: This is actually a challenging problem as it is difficult to determine whether a node is offline or simply not responding to messages). Regardless of the type of failure, the distributed system with N nodes must continue to operate so long as N-(N-1) > 0 are available. In the above ATM example, you clearly wouldn’t be able to withdraw cash from a powered off ATM and therefore, the availability property isn’t satisfied. Consistency can be maintained since the failed/offline nodes won’t be returning inconsistent data, or any data at all! Therefore, when a distributed system is consistent and partition tolerant, we cannot expect all nodes to be available.

Case 3: Available and Partition Tolerant, but not Consistent.

• Imagine a set of ATMs in which a single ATM temporarily gets disconnected from the distributed system network. For the Availability property to remain true, that single ATM would need to continue to dole out cash! However, given that this particular ATM is no longer in communications with the rest of the ATMs, the balance of any given user wouldn’t be consistent across the larger network of ATMs.

As a further example, lets consider an situation in which you and a lucky accomplice are looking for some free money. If the CAP theorem didn’t hold, I’d recommend you take advantage of Case 3 above. In case 3, you would look to find ATMs that are available, but partitioned from the network. The network wouldn’t be consistent and a result it wouldn’t be accurately keeping track of all of your withdraws across ATMs. Therefore, you and your accomplice could simultaneously withdraw cash from two machines in the network that are available and exist in different partitions. We are not ATM experts but I’m sure this case has been handled properly by our friends at the bank.

A slightly less fun, but interesting example is one in which the network of ATMs is consistent and partition tolerant…Case 2, but therefore not “available”. In this case, any of the individual ATMs that is on the wrong side of the partition will be forced to be unavailable as we require consistency across the network.

Finally, an example of Case 1 is one in which the network works as expected, but cannot handle nodes being partitioned from one another. This is because in the partition tolerant scenarios, we must either sacrifice consistency in which the partition creates multiple states of the network or availability where a set of nodes that are partitioned will be forced to be offline or inaccessible.

In all of the above examples we have only consisted the situation in which the network of distributed systems is one in which all the nodes and devices trust one another. Since the creation of blockchain systems, we have seen many distributed systems deployed that are able to operate in trust-less environments, which can create substantial performance issues for the distributed system. In a further article, we will example these issues in more detail.

At a high level the CAP theorem introduces the subtle complexities and challenges faced when trying to use or deploy a distributed system. In our upcoming overview of distributed systems we will look at topics such as consensus, routing and some interesting use cases.

The Hitchhikers Guide to Distributed Systems: Part 1

Written by Weaver Labs