Consistency & Consensus for System Design Interview (7): two phase commit

SystemDesign
Tech Wrench
Published in
7 min readMay 19, 2024

PREV | HOME | NEXT | System Design Resource List

Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep!

Check out ByteByteGo’s popular System Design Interview Course

Introduction

Distributed transactions based on two-phase commit can suffer from performance issues because of forced syncing to disk and additional round trips. Distributed transactions may be slower by several magnitudes than single node transactions for some databases. Distributed transactions can be categorized as follows:

  1. Database-internal distributed transactions: These are distributed transactions that occur in databases running on several nodes that support partitioning and replication. The distributed transaction is internal to the database and executes over the nodes of the database. In this case, all the nodes run the same software and are part of the same database system.
  2. Heterogeneous distributed transaction: These are distributed transactions that execute across disparate systems running different softwares. Here the transaction occurs between two different technologies, e.g. a message broker such as Kafka and a relational database such as MySQL. The challenge in these scenarios is to ensure that a transaction commits atomically across disparate technologies.
Grokking Modern System Design for Software Engineers and Managers

Distributed transactions that are internal to a database don’t have to work with other systems/technologies and are at freedom to implement optimizations and modify protocols as needed to improve performance. This makes distributed transactions internal to databases exhibit better performance than heterogeneous transactions. However, there are still uscases for heterogeneous distributed transactions. For example, consider a system where a Kafka message describes a credit card expense that is read by an application and the expense amount is updated in a relational database. An application should only acknowledge the Kafka message as read after it has already updated the expense in the database. The two operations, updating the database and acknowledging the Kafka message must succeed as a combination or be rolled-back. We can’t have one operation succeed and the other fail. This can be accomplished if the two systems support distributed transactions across system boundaries.

Note that the Kafka message is only delivered once to the application. If the update to the database fails, the message is not acknowledged and the message broker can redeliver the message. From the application’s point of view the message is processed exactly once.

Land a higher salary with Grokking Comp Negotiation in Tech.

A standard called, extended/open architecture, shortened as XA was introduced in 1991 for implementing two-phase commit across system boundaries. The standard has been widely adopted and implemented by several databases (PostgreSQL, MySQL, DB2 etc) and message brokers (IBM MQ, MSMQ, HornetQ etc). The goal of XA is to guarantee atomicity in “global transactions” that span heterogeneous components. XA is a specification that governs how an XA transaction manager can tell a database (e.g. Oracle, MySQL, PostgreSQL) what work is going on as part of what transaction, and how to conduct the two-phase commit (2PC) protocol at the end of each transaction. XA provides the glue between the transaction manager and the data stores for two-phase-commit.

Ace the machine learning engineer interview with Grokking the Machine Learning Interview.

An application interacts with the XA-compliant transaction manager via a client library. Database drivers or message broker drivers that are used by an application to communicate with databases/message-brokers support XA interface. The coordinator can invoke the callback functions on these drivers to ask the database/message-broker to prepare, commit or abort. The communication between the coordinator and the datastores happens through the driver and not directly.

In practice, the transaction manager is usually loaded along with the application as a library or as a separate process and not a separate service. The transaction manager tracks the participants in the transaction (i.e. the various data stores to which the application writes), and works with them to carry out the two-phase commit. The XA-compliant transaction manager maintains a log of its decisions to commit or roll back, which it can use to recover in case of a system outage. This log is maintained on the same server on which the application runs and in case the machine or the application process crashes, so does the coordinator. The currently executing transactions may become in-doubt transactions and when the coordinator recovers, such transactions get resolved by the coordinator, which invokes the driver’s XA callbacks to either commit or abort the transaction based off of the on-disk transaction log.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

Consensus Algorithm

A consensus algorithm means getting several nodes to agree on something. More formally, in a distributed system, several participant nodes propose a value and the consensus algorithm picks one of those values as the chosen one for all the nodes. Consider the consensus problems we discussed earlier, e.g. multiple clients trying to book the same hotel room at the same time, trying to book the same seat on a flight or trying to buy the same item online etc. All of these problems can be thought of as a consensus problem where each client proposes its ID to the consensus algorithm and the consensus algorithm picks one winner ID, which gets to book the resource.

A consensus algorithm should satisfy the following properties

1. Uniform agreement

All nodes decide/agree-upon the same value.

2. Integrity

Each node decides only once, i.e. it is not allowed to agree to a value and then change its mind.

3. Validity

A node can only decide on a value which has been proposed by itself or by another participating node.

4. Termination

A node that doesn’t crash should eventually decide on a value.

Grokking the Coding Interview: Patterns for Coding Questions

The first two properties, uniform agreement and integrity form the core of the solution to the consensus problem. Validity exists to rule out trivial solutions, such as the consensus algorithm always deciding on the null value or some other constant. The fourth property, termination, exists to make the consensus algorithm fault-tolerant. In case if a node or some nodes in the system crash, the rest of the nodes in the system should be able to reach a decision. The two-phase commit protocol, can’t tolerate crashes and if a node crashes, the protocol must wait indefinitely for the node to come back up again. The termination property mandates that the algorithm make progress in face of node crashes. The number of failures a consensus algorithm can tolerate and still make progress is limited. If all the nodes crash, obviously no progress can be made. It has been proven that a consensus algorithm requires at least a majority of nodes to be functioning for the algorithm to make progress. However, most consensus algorithm implementations honor the other three properties, when a majority or more number of nodes are down or unreachable, i.e. the algorithm stops processing new requests but at the same time doesn’t make invalid decisions.

Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job!

Following are some of the well-known consensus algorithms:

  1. Paxos
  2. Raft
  3. Viewstamped Replication
  4. ZAB (Zookeeper’s atomic broadcast protocol)

Recall that total order broadcast messages are delivered to all nodes in exactly the same order. Another way to look at total order broadcast is to visualize each value in the order as the outcome of a round of consensus among the participating nodes. To get the next value in the total order, each participating node proposes a value to the consensus algorithm, which decides on one of proposed values to be the next value in the total order. Thus we can consider the total order broadcast to be repeated runs of the consensus algorithm among the nodes. For each round of the consensus algorithm, the uniform agreement property assures that all nodes decide on the same value to deliver in each round and consequently all nodes receive messages in the same order, the integrity property assures that values/messages are never duplicated, the validity property assures that messages/values aren’t corrupt or made-up, and finally the termination property assures that messages aren’t lost.

Interviewing? consider buying our number#1 course for Java Multithreading Interviews.

The above listed algorithms (Paxos, Raft, etc) decide on a sequence of values rather than a single value which makes them total order broadcast algorithms. However, the algorithms Viewstamped Replication, Raft and ZAB don’t run multiple rounds of consensus to construct the total order as that is efficient, rather they implement the total order directly. In case of Paxos this optimization is called Multi-Paxos.

Your Comprehensive Interview Kit for Big Tech Jobs

0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.

1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.

2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.

3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.

4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.

5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

6. Machine Learning System Design

7. System Design Course Bundle

8. Coding Interviews Bundle

9. Tech Design Bundle

10. All Courses Bundle

Photo by Sajad Nori on Unsplash

--

--