Algorithms used in distributed systems!!

Noob Blogger
12 min read · Dec 21, 2022



Paxos is a distributed consensus algorithm that is used to ensure that the nodes in a distributed system agree on a single value, even when some nodes fail or messages are lost. It was originally developed by Leslie Lamport and is named after the island of Paxos in Greece.

Paxos works by having a node called the “proposer” suggest a value to a group of nodes called “acceptors.” Agreement on a value proceeds in three phases:

  1. Prepare phase: The proposer sends a “prepare” message to the acceptors, containing a unique, increasing “proposal number.”
  2. Accept phase: An acceptor that has not already seen a higher proposal number replies with a promise to ignore lower-numbered proposals, along with any value it has previously accepted. Once the proposer has promises from a majority, it sends an “accept” request with its proposal number and a value (the highest-numbered previously accepted value, or its own if there is none), and the acceptors accept it unless they have since promised a higher-numbered proposal.
  3. Learn phase: Once a majority of acceptors have accepted a value, the value is chosen, and the proposer (or the acceptors) announce the chosen value to all the nodes in the system.

An example use case for the Paxos algorithm might be a distributed database. Imagine a database replicated across three nodes, A, B, and C. Node A wants to write a new value, but it needs all the nodes to agree on that value. Node A can act as the proposer and run Paxos against the acceptors (in the simplest setup, all three nodes act as acceptors). If a majority of the acceptors (any two of the three) accept the value, the value is chosen, and node A can announce it to all the nodes in the system.
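To make the phases concrete, here is a minimal single-decree Paxos sketch in Python. It is purely illustrative: the Acceptor class, the propose function, and the direct in-memory calls standing in for network messages are assumptions made for this example, not a production implementation.

# Minimal single-decree Paxos sketch (illustrative only).

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1
    # Phase 1: prepare.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None                 # rejected; retry with a higher proposal number
    # If any acceptor already accepted a value, that value must be proposed.
    accepted = [(an, av) for an, av in granted if av is not None]
    if accepted:
        value = max(accepted)[1]
    # Phase 2: accept.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]      # nodes A, B and C as acceptors
print(propose(acceptors, n=1, value="x = 42"))  # -> "x = 42" (the chosen value)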

Raft is a distributed consensus algorithm that is used to ensure that all nodes in a distributed system agree on the same sequence of log entries, and therefore on the same state. It is designed to be easier to understand and implement than Paxos, and is often used as an alternative to it.


Raft works by having a single leader node that is responsible for accepting and committing new data to the distributed system. The leader communicates with the other nodes in the system (called followers) to replicate the data and ensure that all nodes have a consistent view of the data. If the leader fails, the followers can elect a new leader to take its place.

One example use case for the Raft algorithm is in distributed databases. A distributed database can use Raft to ensure that all nodes in the cluster agree on the state of the data and can recover from failures.

Here is an example of how Raft might be used in a distributed database:

  1. A client sends a write request to the leader node of the distributed database.
  2. The leader node receives the write request and appends it as a new entry to its log.
  3. The leader then sends the entry to the followers and waits for a majority of the cluster to acknowledge that they have written it to their logs.
  4. Once a majority has acknowledged the entry, the leader marks it as committed, applies it to its state machine, and informs the followers of the new commit index.
  5. The followers mark the entry as committed and apply it to their own state machines.

This process ensures that the write is replicated to a majority of the nodes in the cluster, and that all nodes have a consistent view of the data. If the leader node fails, the followers can elect a new leader and continue processing requests.
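Here is a simplified sketch of this replication flow in Python. It leaves out elections, terms, and persistence, and the Leader and Follower classes and their method names are assumptions made for this illustration:

# Simplified sketch of Raft-style log replication (no elections, terms, or persistence).

class Follower:
    def __init__(self):
        self.log = []
        self.commit_index = -1

    def append_entries(self, prev_index, entries, leader_commit):
        # Accept entries only if our log is long enough to match prev_index.
        if prev_index > len(self.log) - 1:
            return False
        self.log = self.log[:prev_index + 1] + entries
        self.commit_index = min(leader_commit, len(self.log) - 1)
        return True

class Leader:
    def __init__(self, followers):
        self.log = []
        self.commit_index = -1
        self.followers = followers

    def client_write(self, command):
        self.log.append(command)                        # step 2: append locally
        index = len(self.log) - 1
        acks = 1                                        # the leader counts itself
        for f in self.followers:                        # step 3: replicate
            if f.append_entries(index - 1, [command], self.commit_index):
                acks += 1
        if acks > (len(self.followers) + 1) // 2:       # step 4: majority acked
            self.commit_index = index                   # mark as committed
            for f in self.followers:                    # step 5: tell followers
                f.append_entries(index, [], self.commit_index)
        return self.commit_index == index

followers = [Follower(), Follower()]
leader = Leader(followers)
print(leader.client_write("SET x = 1"))  # True: replicated to a majority and committed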

The Two-phase commit (2PC) algorithm is a distributed transaction algorithm that is used to ensure that a transaction is either fully committed or fully rolled back across multiple nodes in a distributed system. It consists of two phases: the prepare phase and the commit phase.


In the prepare phase, the coordinator node sends a “prepare” message to all participating nodes, asking them to prepare to commit the transaction. Each participating node then checks whether it is able to commit the transaction and sends a “ready” message back to the coordinator. If all participating nodes are ready to commit, the coordinator sends a “commit” message to all participating nodes, and the transaction is committed. If any participating node is not ready to commit, the coordinator sends an “abort” message to all participating nodes, and the transaction is rolled back.

The 2PC algorithm is often used in distributed systems to ensure that transactions are atomic (either fully committed or fully rolled back) and that data is consistent across all participating nodes.

Here is an example use case for the 2PC algorithm:

Suppose you have a distributed system with two nodes, Node A and Node B, and you want to transfer $100 from an account on Node A to an account on Node B. To do this, you can use the 2PC algorithm as follows:

  1. The coordinator node (Node A) sends a “prepare” message to both Node A and Node B, asking them to prepare to commit the transaction.
  2. Node A and Node B each check whether they are able to commit the transaction. If they are, they send a “ready” message back to the coordinator.
  3. The coordinator node receives the “ready” messages from both Node A and Node B, and sends a “commit” message to both nodes.
  4. Node A and Node B each commit their part of the transaction: Node A debits $100 from its account and Node B credits $100 to its account (a minimal sketch of this flow appears after these steps).
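Here is a minimal sketch of this flow in Python. The Participant class and its prepare/commit/rollback methods are illustrative names, not the API of any particular database:

# Two-phase commit coordinator sketch (in-memory, no timeouts or crash recovery).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        return self.can_commit          # vote "ready" or "no"

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")

def two_phase_commit(participants):
    # Phase 1: prepare (voting).
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every participant voted "ready".
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"

node_a = Participant("Node A")   # debit side
node_b = Participant("Node B")   # credit side
print(two_phase_commit([node_a, node_b]))  # -> "committed"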

Three-phase commit (3PC) is a distributed algorithm that is used to ensure that a transaction is either fully committed or fully rolled back across multiple nodes in a distributed system. It is an extension of the two-phase commit algorithm that adds an extra “pre-commit” phase, so that every participant learns the outcome before anyone commits; this reduces the chance of the protocol blocking indefinitely if the coordinator fails partway through.


The three phases of the three-phase commit algorithm are:

  1. CanCommit (preparation): The transaction coordinator asks all participating nodes whether they are able to commit the transaction. Each node checks and replies yes or no.
  2. PreCommit (decision): If every node replied yes, the coordinator sends a “pre-commit” message and each node acknowledges that it is prepared to commit. If any node replied no (or failed to reply), the coordinator sends an abort message instead.
  3. DoCommit (commit or rollback): Once all acknowledgments arrive, the coordinator tells the nodes to commit; otherwise they roll back. Because every node already learned the decision in the pre-commit phase, participants can still make progress if the coordinator fails at this point.

Here is an example use case for the three-phase commit algorithm:

Suppose you have a distributed system with three nodes, A, B, and C, and you want to transfer $100 from your bank account on node A to your savings account on node B, with node C acting as the transaction coordinator. To ensure that the transaction is either fully committed or fully rolled back, you can use the three-phase commit algorithm as follows:

  1. CanCommit: The coordinator (node C) asks nodes A and B whether they can commit. Node A checks that it has sufficient funds to transfer $100, node B checks that it can accept the deposit, and both reply yes.
  2. PreCommit: The coordinator sends a pre-commit message to nodes A and B, and both acknowledge that they are prepared to commit.
  3. DoCommit: The coordinator sends a commit message, and nodes A and B update their balances and commit the transaction (see the sketch after this list).
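A compact coordinator-side sketch of these three phases, under the same illustrative assumptions as the 2PC example (no timeouts or recovery logic, which are exactly the parts that make real 3PC non-blocking):

# Three-phase commit coordinator sketch (illustrative only).

class Account:
    def __init__(self, name):
        self.name = name

    def can_commit(self):
        return True                     # e.g. check funds or constraints here

    def pre_commit(self):
        return True                     # acknowledge the pending commit

    def do_commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: aborted")

def three_phase_commit(participants):
    # Phase 1: CanCommit -- ask every participant whether it can commit.
    if not all(p.can_commit() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"
    # Phase 2: PreCommit -- announce the decision and collect acknowledgments.
    if not all(p.pre_commit() for p in participants):
        for p in participants:
            p.abort()
        return "aborted"
    # Phase 3: DoCommit -- every participant commits.
    for p in participants:
        p.do_commit()
    return "committed"

print(three_phase_commit([Account("Node A"), Account("Node B")]))  # -> "committed"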

Vector clocks are a distributed mechanism used to track the causal order of events and to detect conflicts in distributed systems. They allow nodes to determine whether one event happened before another, or whether two updates are concurrent and therefore potentially conflicting.

Each node in the system maintains a vector of counters, with one entry for each node in the system. When a node performs an event, it increments its own entry. When it sends a message, it attaches its current vector; when a node receives a message, it merges the incoming vector by taking the element-wise maximum with its own and then increments its own entry.


For example, consider a distributed system with three nodes (A, B, and C). Each node maintains a vector clock with three entries, one for each node in the system, all initially zero:

Node A: [0, 0, 0]
Node B: [0, 0, 0]
Node C: [0, 0, 0]

Node A performs an action and sends a message (with its updated vector) to nodes B and C. A increments its own entry, and B and C each merge the incoming vector and then increment their own entries:

Node A: [1, 0, 0]
Node B: [1, 1, 0]
Node C: [1, 0, 1]

If node B then performs an action and sends a message to nodes A and C, the same rules apply:

Node B: [1, 2, 0]
Node A: [2, 2, 0]
Node C: [1, 2, 2]

Vector clocks can be used to determine the causal relationship between events in a distributed system. If every entry of one event’s vector is less than or equal to the corresponding entry of another’s (and at least one entry is strictly smaller), the first event happened before the second. If neither vector dominates the other, the events are concurrent, which is how conflicting updates are detected.
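A small Python sketch of these rules (the VectorClock class and the fixed node indices are assumptions made for this example):

# Minimal vector clock sketch for a fixed set of nodes.

class VectorClock:
    def __init__(self, num_nodes, node_id):
        self.clock = [0] * num_nodes
        self.node_id = node_id

    def local_event(self):
        # Increment our own entry for a local event (or before sending).
        self.clock[self.node_id] += 1
        return list(self.clock)          # vector to attach to an outgoing message

    def receive(self, incoming):
        # Merge: element-wise max with the incoming vector, then tick our own entry.
        self.clock = [max(a, b) for a, b in zip(self.clock, incoming)]
        self.clock[self.node_id] += 1

def happened_before(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

a, b, c = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
msg = a.local_event()        # A: [1, 0, 0]
b.receive(msg)               # B: [1, 1, 0]
c.receive(msg)               # C: [1, 0, 1]
print(happened_before([1, 0, 0], b.clock))   # True: A's event precedes B's
print(happened_before(b.clock, c.clock))     # False: the events are concurrent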

You can learn more about vector clocks and how they are used in distributed systems in this article from microservices.io:

https://microservices.io/patterns/data/vector-clocks.html

Lamport clocks, also known as logical clocks, are used to determine the order in which events occurred in a distributed system; vector clocks are a later generalization of them. They are named after their inventor, Leslie Lamport, who introduced them in a 1978 paper titled “Time, Clocks, and the Ordering of Events in a Distributed System.”


In a distributed system, each node maintains a logical clock that is incremented every time an event occurs on the node. The value of the logical clock is then included in the message that is sent to other nodes when the event is communicated to them. This allows the receiving node to determine the order in which the events occurred.

For example, consider a distributed system with three nodes: A, B, and C. Each node maintains a logical clock that is incremented every time an event occurs on the node. If node A sends a message to node B, it includes the value of its logical clock in the message. When node B receives the message, it sets its own logical clock to the maximum of its current value and the value in the message, and then increments it by one. This guarantees that the receive event is timestamped after the send event, so the logical clocks stay consistent with the causal order of events.
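A minimal sketch of this rule in Python (the LamportClock class is an illustrative name, not a standard library type):

# Minimal Lamport clock sketch.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Tick, then attach the timestamp to the outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Receive rule: max(local, received) + 1.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()            # A's clock: 1
print(b.receive(t))     # B's clock: 2 (guaranteed to be after the send)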

Lamport clocks are often used in distributed systems to order events, for example to timestamp operations in a log or to break ties consistently. Unlike vector clocks, they cannot tell whether two events were concurrent, so on their own they cannot detect conflicting updates. They are particularly useful in microservice architectures, where multiple services need to coordinate and communicate with each other.

The Chandy-Lamport algorithm is a distributed snapshot algorithm that is used to record a consistent global state (a “snapshot”) of a distributed system while the system keeps running. It works by having a node record its own state and then send a special “marker” message on each of its outgoing channels; every node that receives a marker for the first time records its own state, forwards markers on its outgoing channels, and records the messages that arrive on its other channels until it has seen a marker on each of them.


One common use case for the Chandy-Lamport algorithm is checkpointing. For example, a distributed stream-processing system or database can periodically take a Chandy-Lamport snapshot of the state of all nodes and the messages in flight between them; if a failure occurs, the system can recover from the most recent consistent snapshot instead of starting over. Snapshots are also used to detect stable properties of a system, such as deadlock or termination.
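Here is a simplified sketch of a single node’s role in a Chandy-Lamport snapshot. The SnapshotNode class, the list-based channels, and the numeric application state are assumptions made for this example; a real system would use actual network channels:

# Sketch of one node's marker handling in a Chandy-Lamport snapshot.

MARKER = "MARKER"

class SnapshotNode:
    def __init__(self, name, incoming_channels):
        self.name = name
        self.state = 0                          # application state (a number here)
        self.recorded_state = None              # our state at snapshot time
        self.recorded_channels = {}             # channel name -> in-flight messages
        self.awaiting = set(incoming_channels)  # channels still without a marker

    def start_snapshot(self, outgoing):
        # Initiator: record own state, then send markers on every outgoing channel.
        self.recorded_state = self.state
        for channel in outgoing:
            channel.append(MARKER)

    def on_message(self, channel_name, msg, outgoing):
        if msg == MARKER:
            if self.recorded_state is None:
                self.recorded_state = self.state        # first marker: record state
                for channel in outgoing:
                    channel.append(MARKER)              # and propagate markers
            self.awaiting.discard(channel_name)         # stop recording this channel
        else:
            self.state += msg                           # normal application message
            if self.recorded_state is not None and channel_name in self.awaiting:
                # An in-flight message that belongs to the snapshot.
                self.recorded_channels.setdefault(channel_name, []).append(msg)

    def snapshot_done(self):
        return self.recorded_state is not None and not self.awaiting

a_to_b = []                                    # channel A -> B modeled as a list
node_a = SnapshotNode("A", incoming_channels=[])
node_b = SnapshotNode("B", incoming_channels=["A"])
node_a.start_snapshot(outgoing=[a_to_b])       # A records its state and sends a marker
node_b.on_message("A", a_to_b.pop(0), outgoing=[])
print(node_b.snapshot_done())                  # True: B has seen markers on all inputs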

Byzantine fault tolerance is a type of distributed algorithm that is designed to handle arbitrary (Byzantine) failures in a distributed system. The term “Byzantine” refers to the idea that a node in the system may exhibit arbitrary, unpredictable behavior, such as sending conflicting messages to different nodes or lying about the state of the system.


Byzantine fault tolerance algorithms are designed to ensure that a distributed system can reach consensus and continue operating correctly even in the presence of Byzantine failures. This is typically achieved by requiring more than two-thirds of the nodes to agree before a decision is made: to tolerate f Byzantine nodes, a system generally needs at least 3f + 1 nodes in total, with agreement from at least 2f + 1 of them.
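As a quick illustration of this sizing rule (the helper below is just arithmetic, not part of any particular protocol):

def bft_sizes(f):
    # To tolerate f Byzantine nodes: at least 3f + 1 nodes, quorums of 2f + 1.
    return 3 * f + 1, 2 * f + 1

print(bft_sizes(1))  # (4, 3): a 4-node cluster tolerates 1 Byzantine node
print(bft_sizes(2))  # (7, 5)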

One example use case for Byzantine fault tolerance algorithms is in distributed consensus protocols, such as blockchains. In a blockchain, Byzantine fault tolerance is used to ensure that the network can reach consensus on the state of the blockchain, even if some nodes exhibit arbitrary or malicious behavior.

Reference: “The Byzantine Generals Problem” by Leslie Lamport, Robert Shostak, and Marshall Pease: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/The-Byzantine-Generals-Problem.pdf

Consistent hashing is a technique used to distribute keys evenly across the nodes of a distributed system. It works by hashing both the nodes and the keys onto the same circular space of hash values (a “hash ring”) and assigning each key to the first node encountered when moving clockwise around the ring from the key’s position. If a node is added or removed, only the keys in the section of the ring adjacent to that node need to be remapped, rather than all keys. This helps to distribute the load evenly across the nodes and to minimize the impact of adding or removing nodes.


Example: Suppose you have a distributed system with three nodes and you want to balance the keys across them. Each node is hashed to a position on a ring of hash values (say, 0 to 2^32 - 1). To store or look up a key such as “foo”, you hash the key to a position on the same ring and walk clockwise until you reach the first node; that node owns the key. For instance, if node A hashes to position 100, node B to 5,000, and node C to 90,000, and the key “foo” hashes to 4,321, then “foo” is stored on node B (the first node clockwise from 4,321).

The payoff comes when the cluster changes. If node B is removed, only the keys that lived on node B move (to node C); the keys on nodes A and C stay where they are. Contrast this with a naive hash(key) % N scheme, where changing N from 3 to 2 would remap almost every key.
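A minimal hash-ring sketch in Python. The use of MD5 and a single position per node are simplifications; real systems usually place several “virtual nodes” per physical node to smooth out the key distribution:

import bisect
import hashlib

def ring_position(value):
    # Map any string (node name or key) to a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        # Walk clockwise from the key's position to the first node.
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect(positions, ring_position(key)) % len(self.ring)
        return self.ring[i][1]

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("foo"))       # e.g. "node-b" (depends on the hash values)
ring.remove_node("node-b")
print(ring.node_for("foo"))       # only the keys that lived on node-b move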

MapReduce is a distributed algorithm for processing large datasets in parallel. It is a programming model that is designed to handle large amounts of data in a distributed environment and is often used for data processing tasks in distributed systems.

The MapReduce algorithm consists of two main phases: the map phase and the reduce phase. In the map phase, the input data is divided into chunks and each chunk is processed by a map function that performs some computation on the data. The output of the map function is a set of intermediate key-value pairs.


In the reduce phase, the intermediate key-value pairs are grouped by key and processed by a reduce function that performs a final computation on the data. The output of the reduce function is a set of final key-value pairs that represent the results of the computation.

MapReduce is designed to be fault-tolerant and scalable, and it is often used for tasks such as counting the occurrence of words in a large dataset or calculating the average of a set of numbers.

Here is an example use case for MapReduce:

Suppose you have a large dataset of sales data that you want to analyze. You want to calculate the total sales for each product category. You can use MapReduce to do this by writing a map function that processes each record in the dataset and emits a key-value pair, where the key is the product category and the value is the sale amount for that record. You can then write a reduce function that receives all the values for a given category and sums them, emitting a final key-value pair for each category, where the key is the category and the value is its total sales.
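Here is a small in-memory sketch of that job in Python. Real MapReduce frameworks shard the input and run the map and reduce tasks on many machines in parallel; the single-process “shuffle” below only illustrates the data flow:

from collections import defaultdict

# Toy sales records: (category, amount)
records = [("books", 12.5), ("games", 60.0), ("books", 8.0), ("games", 25.0)]

def map_fn(record):
    category, amount = record
    yield category, amount              # emit one (key, value) pair per record

def reduce_fn(category, amounts):
    return category, sum(amounts)       # total sales for the category

# Shuffle: group intermediate pairs by key (handled by the framework in real MapReduce).
grouped = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        grouped[key].append(value)

results = [reduce_fn(key, values) for key, values in grouped.items()]
print(results)  # [('books', 20.5), ('games', 85.0)]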

“MapReduce: Simplified Data Processing on Large Clusters” (https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf)


