The explosive growth of the internet and the worldwide reach of smartphones made distributed systems inevitable: they must serve billions of requests per second while keeping latency under a second. Ensuring consistency and availability across geographically distributed commodity Linux machines, spread over data centres around the world, requires a consensus algorithm. Consensus is the process by which machines agree upon a value to commit in a distributed transaction, even though both the machines and the network are prone to failures. The blockchain world is primarily built on top of consensus algorithms, and the choice of algorithm is the differentiating factor between various decentralized blockchain systems.
To gain availability, avoid a single point of failure and reduce latency, distributed systems replicate data across multiple machines, often across continents. In such a highly replicated environment, consensus algorithms are used to agree on committing a transaction in all the replicas.
A transaction has to remain consistent even when there are multiple concurrent write requests to the same object or file. In a replicated environment, consistency means that after a transaction every replica is in the same state: either all committed or all aborted. Various consensus algorithms solve this problem, and one of them is Paxos, a family of algorithms implemented with minor variations. But before we jump into Paxos, we must understand its predecessor: the two-phase commit (2PC) protocol.
The two-phase commit protocol, as the name suggests, works in two phases: 1. Prepare and 2. Commit. One of the participants is chosen as the Leader (or there is a static leader, depending on the implementation), and the others act as Acceptors throughout the leader's lease period if the leader is elected dynamically. Electing a leader and obtaining a leader lease is a whole different problem, which is out of the scope of this article.
The Leader first issues a ‘Prepare’ message to all the Acceptors, and each Acceptor responds with a ‘Prepared’ message if it has no other transaction in progress on the same file; if it does, it responds with an ‘Abort’ message instead. On receiving a ‘Prepared’ message from all the acceptors involved in the transaction, the leader begins the second phase of 2PC and issues a ‘Commit’ message to all the acceptors. The acceptors then commit the value agreed upon in the previous phase to persistent storage, optionally responding with a ‘Committed’ or ‘Aborted’ message depending on the implementation. If the leader receives an ‘Abort’ message from any one of the acceptors, it issues an ‘Abort’ message to all the other acceptors to cancel the transaction it initiated.
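The two phases above can be sketched as a minimal in-memory simulation. All names here are illustrative, not taken from any real implementation, and real 2PC would of course persist state and exchange messages over the network:

```python
# Minimal in-memory sketch of two-phase commit (2PC).
# Assumed/illustrative names: Acceptor, two_phase_commit.

class Acceptor:
    def __init__(self):
        self.in_progress = False   # an uncommitted transaction holds this "lock"
        self.value = None          # last committed value (durable in practice)

    def prepare(self):
        # Phase 1: refuse if another transaction is already in progress.
        if self.in_progress:
            return "Abort"
        self.in_progress = True
        return "Prepared"

    def commit(self, value):
        # Phase 2: apply the agreed value and release the lock.
        self.value = value
        self.in_progress = False
        return "Committed"

    def abort(self):
        self.in_progress = False
        return "Aborted"

def two_phase_commit(leader_value, acceptors):
    # Phase 1: the leader asks every acceptor to prepare.
    votes = [a.prepare() for a in acceptors]
    if all(v == "Prepared" for v in votes):
        # Phase 2: the vote was unanimous, so commit everywhere.
        for a in acceptors:
            a.commit(leader_value)
        return "Committed"
    # A single Abort vote cancels the whole transaction.
    for a in acceptors:
        a.abort()
    return "Aborted"
```

Note that the commit path requires a ‘Prepared’ vote from every acceptor, which is exactly the weakness discussed next.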
Problem with 2PC
2PC comes with problems that increase latency and cause frequent transaction failures. There are two major problems the protocol encounters in large-scale distributed systems. First, if the leader fails after the prepare phase, the acceptors are blocked indefinitely, or at least until the leader's lease expires. Second, if any one of the acceptors fails to respond or times out, the leader aborts the entire transaction.
Paxos is the successor to 2PC with some modifications and optimizations. In the Paxos algorithm, the participants are separated into three roles: Proposers, Acceptors and Learners. The significant optimization in Paxos over 2PC is that not all acceptors need to accept the proposal from the Proposer; consensus is reached once a majority of the acceptors accept it. The other major change is that there can be more than one proposer at a time, while 2PC allows only one leader. Read-only replicas can be treated as Learners, which do not participate actively in the commit protocol but learn the result from the proposer.
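The majority rule is simple but worth stating precisely: more than half of the full cluster must respond, not half of those currently reachable. A one-line sketch (the function name is mine, not from any Paxos paper):

```python
# Majority quorum check: Paxos makes progress as long as more than half
# of the cluster responds, unlike 2PC which needs every acceptor.
def has_majority(responses, cluster_size):
    return len(responses) > cluster_size // 2
```

With a cluster of five, for example, two failed acceptors still leave a quorum of three, so a transaction can commit where 2PC would abort.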
The Proposer chooses an Id to send to the acceptors. Ids are chosen so that they are globally unique and monotonically increasing across all participants. Google uses its TrueTime API, a globally synchronized clock API, to obtain timestamps that serve as Ids; this ensures that proposers never get duplicate Ids and that Ids follow a linear chronological order. The proposer then sends a ‘Prepare Id’ message to all the acceptors along with the chosen Id.
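Without a TrueTime-style clock, a common scheme for generating such Ids is to interleave a local round counter with a unique proposer index, so that Ids never collide across proposers and strictly increase for each proposer. This is a standard trick, not Google's method; the function name is illustrative:

```python
# Proposal Ids without a globally synchronized clock: round * N + index
# is globally unique (each proposer owns a distinct residue mod N) and
# strictly increasing per proposer as its round counter grows.
def make_id(round_number, proposer_index, num_proposers):
    return round_number * num_proposers + proposer_index
```

For example, with three proposers, proposer 1 generates 1, 4, 7, … while proposer 2 generates 2, 5, 8, …; no two proposers ever produce the same Id.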
On receiving the ‘Prepare Id’ message, each Acceptor checks whether it has already promised an Id greater than the one sent by the current proposer. If it has, the acceptor simply ignores the message. If it has not, whether because it has made no promise yet or only promises with lesser Ids, the acceptor promises not to accept any proposal with an Id less than the current one, and responds with that promise along with any value it has previously accepted.
On receiving ‘Promise Id’ messages from a majority of the Acceptors, the Proposer checks whether any of the promises carries a previously accepted value. If any do, the value associated with the highest Id is chosen as the value to propose; if none do, the proposer's own value is chosen. Having chosen the value, the proposer issues an ‘Accept’ message to all the acceptors along with it.
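The proposer's value-selection rule can be written in a few lines. Here each promise is modelled as an (accepted_id, accepted_value) pair, with accepted_id of -1 meaning the acceptor has accepted nothing yet; the representation and function name are my own sketch:

```python
# The proposer's rule for picking the value to send in the Accept phase:
# reuse the value accepted under the highest Id, else propose its own value.
def choose_value(promises, my_value):
    # promises: list of (accepted_id, accepted_value) pairs from a
    # majority of acceptors; accepted_id == -1 means "nothing accepted".
    accepted = [p for p in promises if p[0] >= 0]
    if accepted:
        # Ids are unique, so comparing pairs picks the highest accepted Id.
        return max(accepted)[1]
    return my_value
```

This rule is what makes Paxos safe under competing proposers: a value that may already have reached a majority is never overwritten by a later proposer's own value.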
On receiving the ‘Accept Id’ message with a value, each Acceptor checks whether it has since promised a higher Id, which would oblige it not to accept this one. If so, the message is ignored. If not, the acceptor accepts the value given by the Proposer, and consensus is reached once a majority of acceptors have accepted. On accepting a value, the acceptors broadcast the accepted Id and value to all the learners.
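Both acceptor-side handlers can be sketched in one small class. This is a single-acceptor state machine under my own illustrative names, with persistence and messaging left out; -1 again stands for "no Id yet":

```python
# Sketch of a Paxos acceptor covering both phases. It tracks the highest
# Id it has promised and the last proposal it has accepted.
class PaxosAcceptor:
    def __init__(self):
        self.promised_id = -1     # highest Id this acceptor has promised
        self.accepted_id = -1     # Id of the last accepted proposal (-1 = none)
        self.accepted_value = None

    def on_prepare(self, proposal_id):
        # Phase 1: promise only Ids higher than any promised before;
        # otherwise stay silent.
        if proposal_id <= self.promised_id:
            return None
        self.promised_id = proposal_id
        # The promise carries any previously accepted (Id, value) pair.
        return ("Promise", self.accepted_id, self.accepted_value)

    def on_accept(self, proposal_id, value):
        # Phase 2: accept unless a higher Id has been promised meanwhile.
        if proposal_id < self.promised_id:
            return None
        self.promised_id = proposal_id
        self.accepted_id = proposal_id
        self.accepted_value = value
        # In the full protocol this (Id, value) is broadcast to the learners.
        return ("Accepted", proposal_id, value)
```

A useful detail visible in the code: a later ‘Prepare’ with a higher Id is still answered after a value has been accepted, and the reply carries that accepted value, which is exactly what feeds the proposer's value-selection rule.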
Google & Paxos
Researching Google's various distributed storage systems, Paxos clearly stands out as the one common candidate used widely across all of them.
“While many systems use Paxos solely for locking, master election, or replication of metadata and configurations, we believe that Megastore is the largest system deployed that uses Paxos to replicate primary user data across datacenters on every write.” (from the Megastore paper)
Megastore uses Paxos not just for master election or metadata replication; it uses Paxos for every write. Yes, every single write. And not just in Megastore: it is used to reach consensus in almost all the Google storage systems I came across. Megastore, Spanner, F1 and Chubby all use it.
Luis Quesada Torres, in this video, explains Google's version of the Paxos implementation, contention prevention and concurrent consensus attempts by multiple proposers. If you are interested in further research on this topic, the references below provide greater insight into many distributed systems.
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services — https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36971.pdf
- Spanner: Google’s Globally Distributed Database — https://storage.googleapis.com/pub-tools-public-publication-data/pdf/65b514eda12d025585183a641b5a9e096a3c4be5.pdf
- The Chubby lock service for loosely-coupled distributed systems — https://storage.googleapis.com/pub-tools-public-publication-data/pdf/c64be13661eaea41dcc4fdd569be4858963b0bd3.pdf
- F1: A Distributed SQL Database That Scales — https://storage.googleapis.com/pub-tools-public-publication-data/pdf/41344.pdf