Why is it Important to Learn Distributed Systems?
Over the years, I have realized that no matter what technology I try to learn, I learn it better, and enjoy it far more, when I have properly met its prerequisites.
I have also come to realize that for some niches, the general fundamentals never really go away: the better one understands those fundamentals, the better one can perform in the depths of that niche.
Given that distributed systems are at the backbone of:
- Cloud computing
- Blockchain technology
- System design
- Software architecture
… etc., I deeply feel we must understand distributed systems properly to understand everything that hinges upon them.
In software engineering interviews, system design questions often show up — especially in interviews for senior positions. Recently, even for less senior positions, like mid-level engineering roles, many technical recruiters touch upon system design. So, understanding distributed systems is vital not only for architecting better software in our day-to-day job, but also for standing out as better engineers during the interview process and qualifying for higher positions in the corporate engineering hierarchy.
Article Outline
In this blog post, we will be going over some basic but very, very, very crucial concepts of distributed systems. The main topics covered in this article include:
Part-0: Understanding Distributed Systems & Their Need
0-A. What is a Distributed System?
0-B. The Need for Distributed Systems
Part-I: Understanding the Design & Algorithms of Centralized Vs. Decentralized Distributed Systems
1. Centralized Vs. Decentralized DSs
2. Leader Election in Centralized DSs
3. Centralized Vs. Decentralized Failure Detection
4. Centralized Vs. Decentralized Distributed Consensus
Part-II: Understanding the Main Trade-Offs in Distributed Systems
5. Partition Tolerance & The CAP Theorem
6. Trade-Off: Availability Vs. Consistency
7. Quorums – Write & Read Quorums
8. Trade-Off: Latency Vs. Fault Tolerance
Part-III: Beginner-Friendly Misc. Topics in Distributed Systems
9. Consistent Hash Ring & Virtual Nodes
10. Distributed Time Synchronization Via Logical Clocks
So, without any further ado, let’s dive in! But before we look at the individual topics, we need to see what a DS really is.
Part-0: Understanding Distributed Systems & Their Need
0-A. What is a Distributed System?
A distributed system (abbreviated as “DS”), putting it as simply as I can, is a computer system in which multiple physical computers work together — communicating with each other over a network — to run one single software application.
In other words, the components of a single application are spread over multiple machines, but they are presented to us after being aggregated from multiple sources, appearing as if they come from a single source.
For Example: If we are using Facebook, and we open up our profile page, we see multiple components, like our timeline pictures, timeline videos, friends list, “people you may know” suggestions, past events and check-ins, etc. All of them are coming from different computer services (called microservices), but to us it appears as if a single server is serving us this entire web page — the timeline.
0-B. The Need for Distributed Systems
The population of the world is around 8 billion. The total number of people with access to smartphones and the internet is around 6.4 billion. This means one thing: more and more users accessing application software over the network, globally. Hence, we need to make our app’s infrastructure scalable to accommodate all these users. Distributed systems help us achieve that.
There are several other benefits to using DSs too, like:
- Partition-tolerant systems
- Fault-tolerant systems
- Highly available systems
- Highly durable systems
- Low latency, owing to multi-zone replication
… and the list goes on. Suffice it to say that none of the big tech software services — Microsoft, Meta, Amazon, Apple, Alphabet, Netflix, Google, Oracle — could exist had distributed cloud computing not been invented.
Part-I: Understanding the Design & Algorithms of Centralized Vs. Decentralized Distributed Systems
01. Centralized Vs. Decentralized Distributed Systems
DSs can be centralized as well as decentralized. We must understand that distribution has absolutely nothing to do with decentralization.
Centralization refers to the degree of “control” within the system, whereas distribution refers to having more than one physical machine, which may or may not be located in different geographic locations.
So, if one central entity exerts all the control in that system — like the central coordinator of the CNS (Central Nervous System) of human beings, i.e., the brain — then the system will be centralized regardless of its distribution status.
For Example: Imagine we have lots of users in the system, and we want to scale up the DB system. So, we create two additional copies of our DB, thereby replicating the data. Now, let us say that this system allows all three DB instances to serve read requests for all our clients. But if we put a constraint on our system that only the DB instance “A” can accept write requests from the users, and that it will afterwards forward the newly written data to the replicas “B” and “C” too, then it is a centralized DS, where the central point of the DS is the DB instance “A”.
DB server A, therefore, is called the “leader” of this DS. Centralized DSs are also often called “leader-based” DSs. Decentralized DSs are sometimes called “leaderless” DSs.
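Here is a minimal sketch, in Python, of what such a leader-based write path might look like. All the class and variable names are made up for illustration; this is not how any particular database implements it.

```python
# A minimal sketch of a leader-based (centralized) replication scheme:
# only the leader accepts writes and forwards them to its follower replicas.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply_write(self, key, value):
        self.data[key] = value


class LeaderReplica(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers  # e.g. replicas "B" and "C"

    def handle_write(self, key, value):
        # 1. The leader applies the write locally ...
        self.apply_write(key, value)
        # 2. ... then forwards it to every follower, so reads from any replica agree.
        for follower in self.followers:
            follower.apply_write(key, value)
        return "OK"


# Any replica (leader or follower) may serve reads.
followers = [Replica("B"), Replica("C")]
leader = LeaderReplica("A", followers)
leader.handle_write("user:42", {"name": "Jason"})
print(followers[0].data)  # {'user:42': {'name': 'Jason'}}
```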
The Next Question: Now, the question is, how do we elect a leader for our DS in a centralized DS? Because that is the only way we can even start such a centralized DS!
02. Leader Elections in Centralized DSs
So, if there is a centralized DS, then there must be a way to pick the leader of that DS, right? We have algorithms for that. There are two reasons why we need an algorithm to pick the system’s leader:
- The leader is prone to failure and crashes — so somebody should fill in for them as the new leader. After all, we must have a leader who can decide for all how to proceed forward.
- The algorithm would take care of automating the system for choosing a new leader, i.e., in case the leader crashes, the system won’t need any human intervention for choosing a new leader.
The Problem: The problem, however, is that all the replicas (servers) in our DS often have the exact same power and the exact same resources (RAM, CPU power, etc.), which is literally why they are called “replicas.” So, how do we choose the leader in such a situation?
Remember, the “old” leader has crashed. So, there is literally nobody to tell the “followers” what to do!
The Solution — Bully Algorithm: We can simply assign IDs to all the servers. The server with the biggest ID (the “big” guy — the rascal, the bully!) automatically becomes the leader in case there is no leader or the leader has crashed. In case they crash, the next biggest guy detects that failure and becomes the leader.
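A minimal sketch of the Bully algorithm’s core idea follows. The real protocol also involves election and coordinator messages between servers; here we only show the “highest live ID wins” rule, with made-up server IDs.

```python
# Bully algorithm, reduced to its core rule: among the servers that are still
# alive, the one with the highest ID declares itself the new leader.

def elect_leader(server_ids, alive):
    """server_ids: all known IDs; alive: set of IDs that still respond."""
    candidates = [sid for sid in server_ids if sid in alive]
    if not candidates:
        raise RuntimeError("no live servers to elect")
    return max(candidates)  # the "bully": the biggest ID wins


servers = [1, 2, 3, 4, 5]
print(elect_leader(servers, alive={1, 2, 3, 4, 5}))  # 5 is the leader
print(elect_leader(servers, alive={1, 2, 4}))        # 5 crashed -> 4 takes over
```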
Note: There are other algos out there too for choosing a leader of a centralized DS, but for a beginner’s series, that would be a bit out of scope — I think. Therefore, we are not going into the details of those algos.
It is worth mentioning, though, that if we have a system where one server is disproportionately bigger than the others, then it naturally becomes the leader, and an ID-based scheme might not be needed. We can hardcode in all the servers which replicas are the strongest. If the leader crashes, the second strongest replica naturally becomes the new leader.
The Next Question: Now, the next question is, how do we even detect that our leader server has failed / crashed? Because that is the only way we can even start the new leader elections!
03. Centralized Vs. Decentralized Failure Detection in DSs
To start a new leader election in a centralized DS, we first need a way to detect that the leader has crashed. There are two methods we can use. But before we go into the details, we need to see what a “heartbeat” is.
Heartbeat: This refers to an “empty” or a “bogus” message sent by a computer service, say a server, to another computer service, say another server, to let it know that it is up and running and that its “heart” hasn’t stopped “beating” yet. The second service will therefore know that the first server — the leader of a centralized DS in our case — hasn’t crashed.
Now, let us see the two very famous methods of detecting failures.
Method # 01 — Centralized Failure Detection: Our leader can continually send heartbeats to a very, very highly available distributed service, say “ZooKeeper”, which is essentially a KV store. As long as those heartbeats keep arriving, we know there is no need for a new election, as the leader is alive and healthy.
Method # 02 — Decentralized Failure Detection: The problem with the first approach is that it relies on ZooKeeper, which acts as a single point of failure (even though, as we mentioned, it is itself a very highly available, distributed service), and can, hence, fail to detect crashes in the system. Furthermore, it is an external system — using it incurs extra charges for my client, and as an architect I would always like to reduce the costs incurred by my clients. So, what do we do then? Well, let’s say that we make our system do what we, ourselves, should never do in our lives — especially with our colleagues in an office setting: gossip! Basically, we allow all our servers in the DS to send heartbeats to each other, so that they can all communicate or “gossip” with each other and successfully keep track of all those who have crashed. This is called the “Gossip Algorithm.”
Note: Gossiping this way, with every server talking to every other server, is an N-squared algorithm. It will consume a lot of memory and a lot of time in a large distributed system. Therefore, usually an optimized form of gossiping is used, where each server talks to only a subset of the servers in the entire system.
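Below is a minimal sketch of heartbeat-based failure detection, assuming a simple timeout rule (names and the 5-second timeout are illustrative). In a gossip setup, every node would run a table like this one and exchange it with a subset of its peers.

```python
# Each node remembers the last time it heard a heartbeat from each peer.
# A peer that has been silent for longer than a timeout is suspected crashed.

import time

class FailureDetector:
    def __init__(self, timeout_seconds=5.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}  # peer name -> last time we heard from it

    def record_heartbeat(self, peer):
        self.last_heartbeat[peer] = time.monotonic()

    def suspected_crashed(self):
        now = time.monotonic()
        return [peer for peer, seen in self.last_heartbeat.items()
                if now - seen > self.timeout]


detector = FailureDetector(timeout_seconds=5.0)
detector.record_heartbeat("server-A")
detector.record_heartbeat("server-B")
print(detector.suspected_crashed())  # [] right now; if "server-B" goes silent
                                     # for more than 5 seconds, it shows up here
```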
The Next Question: Now, the next question is, if we have an alive and healthy leader (or if we don’t have a leader at all in our DS), how do we go about deciding on things so as to reach a system-wide consensus and not stay stuck?
04. Centralized Vs. Decentralized Distributed Consensus
In a distributed system, we have several instances of a single component running on many different servers. So, if more than one possibility is present for our system at any point, how do we reach a consensus on which path to take?
Centralized Consensus: For a centralized DS, the answer is crystal clear: the leader will decide everything.
But things become way more democratic and way more complex in a decentralized, leaderless distributed system where everybody truly acts like everybody else’s peer, like in the global blockchain networks.
Decentralized Consensus: For decentralized DSs, some kind of “voting” has to be performed to decide in which direction the system should move. There are several algos, like:
- Raft
- Paxos
… which can help us reach consensus in a very democratic and non-authoritative way. For a beginner’s high-level blog series, we will DELIBERATELY avoid going into the details of these algos. Similarly, in the blockchain world, we use algorithms, like:
- PoS (Proof-of-Stake) on Ethereum chain
- DPoS (Delegated Proof-of-Stake) on Hive chain
- PoW (Proof-of-Work) on Bitcoin chain
… etc. to reach consensus about which block to accept and add to the existing chain of blocks. Once again, we will completely avoid diving into these algorithms, but I have tried to give the reader at least some awareness of the subject by sharing the problem and the names of the solutions.
Part-II: Understanding the Main Trade-Offs in Distributed Systems
05. Partition Tolerance & the CAP Theorem
Partition: In a DS, there can be a partition because of unexpected node (a group of servers, a data center) crashes. We would define a partitioned DS as one in which a certain part of the DS is not able to communicate with another part of the same DS.
Imagine we have 3 nodes, A, B and C, in our system. Nodes A and C are indirectly connected with each other via node B. Node B, however, is directly connected with both A and C. Now the only way A and C can communicate with each other and be aware of each other is via node B. If B goes down (crashes), there will be no way for A and C to communicate with each other. Hence, we say there is a “partition” in the system, as one node (or a group of nodes on one side of the DS) cannot communicate with the other node (or a group of nodes on the other side).
Partition Tolerance: If we allow our distributed system to operate regardless of having internal partitions, then our DS is a partition tolerant DS. If we completely shut down the DS in the face of partitions, then it means we are not partition tolerant, as we expect the entire system to be perfectly operational internally to be available to the external world.
The CAP Theorem: This is perhaps the MOST IMPORTANT aspect of distributed systems, as it lays a solid foundation for understanding the trade-off between two very important DS elements. This theorem says:
In a distributed, partition-tolerant system, we can either have very, very strong data consistency at the expense of availability or very, very high availability at the expense of data consistency, but not both at the same time.
To understand the theorem better, we need to understand consistency and availability at first.
Consistency: Consistency here refers to data consistency. If we have multiple replicas of our DB present on different physical servers, then the data on all of them must be exactly the same, i.e., they must all be in a consistent data state.
Availability: Availability means that the system must be up and running. Meaning, if there is some node unavailable in the system, then the traffic should automatically get routed to the other nodes and the system should present itself as available.
The Next Question: The question now is, why the heck is there a trade-off between consistency and availability? What is the CAP theorem really trying to say?
06. Trade-Off: Availability Vs. Consistency
There is a fundamental trade-off between availability and consistency of a distributed data storage system.
Imagine we have a replicated distributed database — say two nodes of a NoSQL store like DynamoDB or MongoDB — that has become partitioned (the nodes cannot reach each other). Now, if a client sends a write request to one of the two nodes, that node has two options:
Option-01 — Favour Consistency: The node can reject the client’s write request, presenting itself as unavailable, because it cannot replicate the data (to ensure consistency) to the other node, as it is not connected to it (there is a partition in the system). Given that it cannot replicate the data, accepting the write request would make the system’s data state inconsistent. So, we keep the state consistent by not allowing anybody to write anything. In such a system, both nodes will be unavailable (for writing).
Option-02 — Favour Availability: The node can accept the write request, presenting itself as available, thereby making this particular replica of the DB the “latest truth” of the data, leaving the other replica in the “old” state, and leaving the whole DB system in an inconsistent state. That’s why there is a trade-off.
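Here is a minimal sketch of that choice in code. The “CP”/“AP” modes and class name are purely illustrative — this is not how any real database is configured — but it shows the decision a partitioned node has to make.

```python
# During a partition, a node must either refuse writes (favour consistency)
# or accept them locally and reconcile later (favour availability).

class PartitionedNode:
    def __init__(self, mode):
        self.mode = mode          # "CP" favours consistency, "AP" availability
        self.local_data = {}
        self.partitioned = True   # we cannot reach the other replica right now

    def handle_write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Favour consistency: refuse the write rather than let replicas diverge.
            return "ERROR: unavailable (cannot replicate during partition)"
        # Favour availability: accept the write locally; replicas will be
        # reconciled later (eventual consistency), so reads may briefly diverge.
        self.local_data[key] = value
        return "OK (written locally, will replicate when partition heals)"


print(PartitionedNode("CP").handle_write("x", 1))  # rejected
print(PartitionedNode("AP").handle_write("x", 1))  # accepted
```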
Partitions are Inevitable: We cannot guarantee both availability and consistency, because that would require the probability of a partition to be exactly zero. But partitions are commonplace; they are inevitable; nothing has zero probability of failure. So, we always assume that our system will get partitioned sooner or later because of some fault or crash, and we decide beforehand whether our system will choose consistency or availability when that happens.
Eventual Consistency Model: Usually, most systems prefer availability to consistency. The data eventually gets replicated across all nodes, and the entire system becomes consistent over time. This is the eventual consistency model of distributed systems.
DynamoDB by AWS usually converges to a consistent state within about a second. AWS S3, on the other hand, historically took up to several minutes to reach a globally consistent state (it now offers strong read-after-write consistency).
The Next Question: The question now is, how do we go about tuning the availability or consistency of the system? Well, that we can do through distributed “quorums”.
07. Quorums – Write & Read Quorums
A quorum refers to a subset of servers — or nodes — in our DS. Usually, such a subset has to agree over a certain action for it to be performed successfully.
Write Quorum: The write quorum refers to a subset of nodes that must replicate the incoming write request’s data before sending the confirmation to the client that the data has been written.
Read Quorum: The read quorum refers to a subset of nodes in our DS that must agree upon the consistency and correctness of a value (read from the DB) for the incoming read request of the client before sending the response back to the client.
Effects of Having Large Quorums: If we have large read and write quorums, they will have the following effects on our system:
- Strong Consistency: We replicate data internally to multiple nodes before acknowledging a write, or we read the same key from multiple replica nodes and send back the most consistent and correct value
- Increased Latency: We make the client wait while we replicate the data, or while we confirm the read value from multiple replicas
- Possible Unavailability: If the write and read quorums require 50% of the nodes, and 90% of the nodes have crashed, then the quorums cannot be met, and we present the system as completely unavailable for both writing and reading.
On the other hand, if the read and write quorums are very small, we will have the exact opposite effects: eventual consistency, low latency, high availability.
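A minimal sketch of quorum-based writes and reads over N replicas is shown below. The class and parameter names are made up; the point is the classic rule that with W + R > N, every read quorum overlaps every write quorum, so a read always sees at least one replica that holds the latest acknowledged write.

```python
# Quorum writes and reads over N in-memory "replicas" (plain dicts).

class QuorumStore:
    def __init__(self, num_replicas, write_quorum, read_quorum):
        self.replicas = [dict() for _ in range(num_replicas)]  # key -> (version, value)
        self.w, self.r = write_quorum, read_quorum

    def write(self, key, value, version):
        acks = 0
        for replica in self.replicas:
            replica[key] = (version, value)   # in reality this hop may fail or be slow
            acks += 1
            if acks >= self.w:                # acknowledge once W replicas confirmed;
                return "OK"                   # the rest can be replicated asynchronously
        return "FAILED: not enough replicas"

    def read(self, key):
        # ask R replicas and return the value with the newest version among them
        responses = [replica[key] for replica in self.replicas[: self.r] if key in replica]
        if not responses:
            return None
        return max(responses)[1]


store = QuorumStore(num_replicas=3, write_quorum=2, read_quorum=2)  # W + R > N
store.write("user:42", "Jason", version=1)
print(store.read("user:42"))  # "Jason"
```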
The Next Question: How can the read and write quorums help us understand the fundamental trade-off between latency and fault tolerance? Let us see how we can adjust the quorum sizes to tune the latency and our tolerance for faults.
08. Trade-Off: Latency Vs. Fault Tolerance
The effects of changing the write and read quorums present a fundamental problem: we can either have low latency or high fault tolerance, but not both.
Imagine we have a replicated distributed database — say a few nodes of a NoSQL store like DynamoDB or MongoDB. Now, if a client sends a write request to one of the nodes, that node has two options:
Option-01 — Low Latency, Low Fault Tolerance: The node can choose not to replicate the data internally and send the write confirmation back to the client immediately. In such a case we have very low latency, but given that our write quorum is only one node, we run the risk of losing this data, as this replica might crash before replicating it.
Option-02 — High Fault Tolerance, High Latency: In this case the node will choose to replicate the data internally first and then send the write confirmation back to the client. In such a case we have rather high latency, but given that our write quorum is more than one server, we do not run the risk of losing this data: the replica has already replicated it to a couple of other servers, and if it crashes, we can retrieve the data from one of them.
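The two options map directly to asynchronous versus synchronous replication. Here is a minimal sketch, with the replicas simulated as plain dictionaries and all names illustrative only.

```python
# Two acknowledgement strategies for the same write.

def write_async(leader_store, followers, key, value):
    """Low latency: acknowledge as soon as the leader has the data.
    If the leader crashes before replicating, the write can be lost."""
    leader_store[key] = value
    # replication to the followers would happen later, in the background
    return "ACK (not yet replicated)"

def write_sync(leader_store, followers, key, value):
    """Higher latency: acknowledge only after the followers also have the data,
    so a single crash no longer loses the write."""
    leader_store[key] = value
    for follower in followers:
        follower[key] = value   # each hop adds a network round-trip to the latency
    return "ACK (replicated to all followers)"


leader, followers = {}, [{}, {}]
print(write_async(leader, followers, "x", 1))
print(write_sync(leader, followers, "x", 2))
```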
Hence, it is very important to control our quorums and strike an appropriate balance between latency and tolerance for faults in our system; getting this balance wrong leads either to unnecessarily delayed responses or to the risk of complete data loss.
Part-III: Beginner-Friendly Misc. Topics in Distributed Systems
09. Consistent Hash Ring & Virtual Nodes
Oftentimes in a DS, we have to store static assets. Those static assets, like Netflix videos, can be very large — dozens of GBs. So, rather than storing the entire file on a single node, which may not even have enough space left, we spread the file across multiple nodes by splitting it up into multiple chunks.
Problem — Retrieving Chunks of Data: Well, if we have a file spread across servers in chunks, how do we find it the next time? The answer is hashing. We hash each chunk’s identifier, and the hash value (taken modulo the total number of nodes) gives us the server where that chunk lives.
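A tiny sketch of this naive “hash modulo N” placement follows; the chunk name is made up for illustration.

```python
# Naive placement: hash the chunk's identifier and take it modulo the node count.

import hashlib

def node_for_chunk(chunk_id, num_nodes):
    digest = hashlib.sha256(chunk_id.encode()).hexdigest()
    return int(digest, 16) % num_nodes   # node index in the range [0, num_nodes)

print(node_for_chunk("movie-123-chunk-0007", num_nodes=5))
# The problem: change num_nodes to 6 and most chunks map to a different node.
print(node_for_chunk("movie-123-chunk-0007", num_nodes=6))
```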
Problem — Addition of New Nodes: If we add a new node, almost all the previously placed data has to be remapped to the new set of nodes. So, the system won’t scale. The same problem occurs if we remove a node from the system. Therefore, we use a consistent hash ring, where all the nodes are placed on a circular ring. The way a hash ring works is that the positions on the ring are marked, say, from 0 to 99. Each node takes responsibility for the positions that come after its predecessor node. So, the following two problems get solved:
- Node Crashing: If a node crashes, we only remap its keys to the next server on the ring, moving in a clockwise fashion
- Addition of Node: If we add a new node, only a subset of the successor’s keys will be remapped to this new node.
So, such a ring accommodates addition and removal of nodes very well, thereby ensuring system scalability.
Problem — Unfair Load Balancing After Crashes: If there are only a few nodes in the system, or if one node controls a lot of keys, then upon crashing it dumps a very large burden on the next node, its successor on the ring. So, we use virtual nodes mapped to physical nodes in the system (see the sketch below). Each physical node appears at several positions on the ring as virtual nodes, and each virtual node maps back to a specific real node. Now, if one of the physical nodes crashes, its keys are spread across the virtual nodes of several different physical nodes rather than being dumped on a single successor, thereby distributing the load somewhat evenly.
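Below is a minimal sketch of a consistent hash ring with virtual nodes, reusing the 0–99 ring positions from the example above. The node names and the choice of three virtual nodes per physical node are illustrative assumptions.

```python
# Consistent hash ring with virtual nodes. bisect finds the first virtual node
# clockwise from a key's ring position.

import bisect
import hashlib

def ring_position(name, ring_size=100):
    digest = hashlib.sha256(name.encode()).hexdigest()
    return int(digest, 16) % ring_size

class ConsistentHashRing:
    def __init__(self, physical_nodes, virtual_nodes_per_node=3, ring_size=100):
        self.ring_size = ring_size
        self.positions = []            # sorted ring positions of all virtual nodes
        self.owner_at = {}             # ring position -> owning physical node
        for node in physical_nodes:
            for v in range(virtual_nodes_per_node):
                pos = ring_position(f"{node}#v{v}", ring_size)
                self.positions.append(pos)
                self.owner_at[pos] = node
        self.positions.sort()

    def node_for_key(self, key):
        pos = ring_position(key, self.ring_size)
        # walk clockwise to the first virtual node at or after this position,
        # wrapping around to the start of the ring if necessary
        idx = bisect.bisect_left(self.positions, pos) % len(self.positions)
        return self.owner_at[self.positions[idx]]


ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.node_for_key("movie-123-chunk-0007"))
```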
10. Distributed Time Synchronization Via Logical Clocks
A wall clock solves a fundamental problem for us: it tells us time in an ordered way. Because 4:01 PM always comes before 4:02 PM, looking at the clock we can understand how to order our life events in the past, present, and future. For example, if Jason takes a shower at 3:00 PM and has his lunch at 3:30 PM, we can easily order both these events in a topological order, where showering always comes before eating lunch in the graph.
In a DS, we have to order events. It is EXTREMELY important for us to get the order of events right to remove any data inconsistencies from the system. The only problem is that there is no absolute time in distributed systems that every node can rely on. Every node operates in its own geographic region, so its clock is not a single source of truth we can rely on. An event that happens at one instant is recorded at one local time in the UK and at a different local time in the US.
So, how do we synchronize our DS if our clocks are not synchronized? The answer is to use logical clocks, which help us order events based on what happened first, no matter at what local time each event took place. Time is represented as something like a version number of a document: 99 always comes before 100, so version 99 of the doc occurred before version 100.
For example, MongoDB and DynamoDB both use logical clocks to remove data inconsistencies. MongoDB allows multiple transactions to submit conflicting versions and then resolves the conflicts internally (MVCC). DynamoDB does something similar using vector clocks.
Time is represented as integers, not as absolute physical time. If one event’s integer is smaller, that event is ordered before the other one — that’s the rule in logical clocks.
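To make the rule concrete, here is a minimal sketch of a Lamport logical clock, one of the simplest logical-clock schemes (the node names are illustrative; real databases layer more machinery, such as vector clocks, on top of this idea).

```python
# Lamport logical clock: time is just an integer counter that is bumped on
# every local event and merged (max + 1) on every received message.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # attach the current logical time to the outgoing message
        self.time += 1
        return self.time

    def receive(self, message_time):
        # take the max of our clock and the sender's stamp, then step forward
        self.time = max(self.time, message_time) + 1
        return self.time


uk_node, us_node = LamportClock(), LamportClock()
t = uk_node.send()                  # UK node sends a message stamped with t
us_node.receive(t)                  # US node's clock jumps past t
print(uk_node.time, us_node.time)   # 1 2 -> the receive is ordered after the send
```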
Until next time!