Consistency & Consensus for System Design Interview (6): distributed transactions
PREV | HOME | NEXT | System Design Resource List
Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep!
Check out ByteByteGo’s popular System Design Interview Course
Introduction
Consensus, in the context of distributed systems, means having several nodes connected by a network to agree on something. It may sound trivial but amid the mayhem in a distributed system with failing nodes, new nodes joining, network delays, queue delays, GC pauses and such, the task to bring agreement among all nodes in the system becomes complex. Some of the scenarios where consensus is critical are listed below:
Leader Election
In single leader architectures, it is imperative that all nodes agree on which node is the current leader and at all times there should remain a single leader. In case of network partitions or node failures, another node may be elected as the leader. When a failover occurs incorrectly it is possible that two or more nodes consider themselves as the leader — a situation known as split brain — causing permanent data loss and/or data inconsistency.
Atomic Commit
Databases may be partitioned to scale and transactions may be supported on partitioned databases. For a transaction to honor atomicity from the ACID properties, it is imperative that a transaction against the database is either committed by all partitions or rolled-back by all partitions. A transaction shouldn’t succeed at some partitions and roll-back at others. This requires that a consensus should be formed among all the partitions to either commit or roll-back a transaction. This is known as the atomic commit problem.
Committing in a distributed system
Recall from our previous discussions that atomicity in transactions ensures that a database doesn’t reflect partial results of a transaction, whereby some operations within a transaction execute while the rest abort. Transactions that involve modifying multiple objects in a database or cause changes that require updating secondary indexes, must have the entirety of their operations run or not run at all. On a single machine database, a request to commit a transaction involves two steps:
- Writes are durably stored on disk
- A commit record is appended to a log on disk
From the point of view of the database, a transaction is committed only after the commit record has been stored on disk. If the database crashes at this point, it can recover the transaction by reading the commit log and knowing that the transaction has already been committed. In contrast if the database crashes at any point before the commit record has been stored on disk, then on recovery the database can roll-back any of the writes from the uncommitted transaction since it’ll know from its commit log that the transaction failed to commit before the database crash.
On a single machine database, a transaction can be rolled-back or aborted at any time before the instant when the commit record is written to disk. Thereafter the transaction is committed and can’t be undone. Atomicity on a single machine boils down to a single disk controller attached to the machine that writes the commit record.
Land a higher salary with Grokking Comp Negotiation in Tech.
In contrast when we talk about atomicity across several nodes, a single request to commit a transaction at all the nodes doesn’t suffice for the following reasons:
- A node may abort a transaction if a constraint, e.g. uniqueness constraint or foreign key constraint is violated.
- A node may fail to receive a commit request due to network delays and the request may be abandoned for delivery due to timeout.
- A node may crash after receiving the commit request but before it gets a chance to durably record the commit. Upon recovery the node will not find the commit record and roll back the transaction that other nodes may have committed.
Thus there can be scenarios where some of the nodes commit a transaction while others don’t, leading to inconsistency among the nodes. And once a transaction has been committed, it can’t be undone or aborted. This is because the data committed in a transaction becomes visible to other transactions, which may use or depend upon the committed data. The read committed isolation level is based on the principle that changes effected by a committed transaction can be read by other transactions and reverting them after another transaction makes a read is violation of the read committed isolation level.
Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.
Two Phase Commit
Atomic transactions across multiple nodes can be achieved using the classic distributed systems algorithm called two-phase commit. As the name implies there are two parts to the algorithm. Two phase commit involves a coordinator also called a transaction manager that communicates with the participating nodes to ensure an atomic commit. The transaction manager is also responsible for assigning globally unique transaction IDs to distributed transactions. Note don’t confuse two-phase commit with two-phase lock which we covered earlier and is an entirely different concept.
Broadly speaking, the first step in the algorithm involves sending a “prepare” request by the coordinator to all the participants to get ready to commit a particular transaction. If all the participants reply that they are able to commit the transaction, then the coordinator issues the second request to all the participants to commit the transaction. If any one of the participants responds with its inability to commit the transaction, for example due to constraint violation, or that the coordinator doesn’t hear back from any one of the participants due to a crash or network delay after the prepare request then the coordinator issues an abort request. A node can decline to commit a transaction up until the point it sends a yes response to the “prepare” request. Thereafter, the transaction can still be aborted but it’ll be upto the coordinator.
Let’s do a detailed walk through of the working of a two phase commit algorithm:
Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job!
- An application initiates a distributed transaction and the coordinator assigns a unique global ID.
- The application begins a single node transaction at each participant node and executes reads and writes as desired. The transaction at each node is assigned the global transaction ID that was given out by the coordinator in the previous step. While executing the reads and writes if anything goes wrong, e.g. a node crash or a request timeout, the participant or the coordinator can abort the transaction.
- When the application signals it wants to commit the transaction, the coordinator sends out a “prepare” request containing the global transaction ID of the transaction to all the participant nodes. If for some reason the coordinator is unable to communicate with any of the participant nodes, the coordinator can subsequently issue an abort request.
- Upon receiving the “prepare” request, a participant node determines if it can definitely commit the requested transaction at a later time. Note that at this point the node doesn’t actually commit the transaction but simply checks if it can do so in future without violating any constraints or running into conflicts The node can either respond with a yes or a no. In case the node responds with a yes it surrenders the right to abort the transaction later on without actually committing the transaction.
- The coordinator receives responses to the “prepare” request from all the participant nodes. The coordinator makes a decision whether to commit or abort the transaction. The decision must be written to the transaction log so that in case the coordinator crashes it knows what it decided upon recovery from the crash. This is known as the commit point. Once the decision has been logged, it can’t be undone even if the coordinator crashes.
- If all the nodes respond with a yes, the coordinator can proceed to issue a commit request and if any one of the nodes responded with a no, an abort request is issued. Note that the commit request is only issued after a commit record has been added to the transaction log. Imposing the order for the commit record being added to the transaction log followed by the commit request being sent to all nodes, is important for the coordinator to recover from crashes. In case a participant node crashes or there’s a network partition, the coordinator keeps on retrying without giving up until the request is delivered to the non-responsive node. When the participant node recovers, it’ll either abort or commit the transaction as per coordinator’s direction.
If you are interviewing, consider buying our number#1 course for Python Multithreading Interviews.
In the two-phase commit protocol, note that a participant node can’t go back on its word, i.e. refuse to commit a transaction, once it responds with a yes to the “prepare” request and similarly the coordinator can’t rethink its decision once it has chosen to commit/abort a transaction and written it to the transaction log on disk.
The coordinator may crash or fail before it is even able to send out a “prepare” request. In that case, a node can unilaterally abort a transaction. However, once a node responds in the affirmative to a “prepare” request, it can’t take any action on its own until it hears from the coordinator. In fact, if the coordinator crashes or a network failure occurs, the node has no choice but to wait for the coordinator to get back to it. A transaction on the node in the state where the node is waiting to hear from the coordinator about the final fate of the transaction, is known as a transaction in doubt or uncertain transaction.
If the coordinator had indeed crashed after receiving responses from the participant nodes, then upon recovery, the coordinator can go through its transaction log and commit those uncertain/doubtful transactions for which a commit record exists, and abort those uncertain/doubtful transactions for which there is none.
Get insights on designing machine learning systems with Grokking Machine Learning Design.
It is important that the coordinator resolves in-doubt transactions, because in-doubt transactions can hold locks until resolved. Recall that transactions may acquire row-level exclusive locks to prevent dirty writes or hold shared locks on rows that have been read by a database using two-phase locking. If the coordinator crashes, then the in-doubt transaction will continue to hold the locks it has acquired, possibly preventing other transactions in the system from making progress, until the coordinator comes back-up and resolves the in-doubt transaction.
In practice there could be cases where a coordinator is unable to recover its state from the transaction log due to log corruption or a software bug. In such scenarios, in-doubt transactions may become orphaned and remain in the database forever until manually resolved. Restarting the database may not get rid of orphaned in-doubt transactions, since the locks held by such transactions are preserved across restarts of the database server. The solution generally involves human intervention where an administrator either rolls back or commits an orphaned in-doubt transaction at each participant node.
Your Comprehensive Interview Kit for Big Tech Jobs
0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.
1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.
2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.
3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.
4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.
5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions
6. Machine Learning System Design