Consistency & Consensus for System Design Interview (5): total order broadcast

SystemDesign
Tech Wrench
Published in
8 min readOct 27, 2023

PREV | HOME | NEXT | System Design Resource List

Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep!

Check out ByteByteGo’s popular System Design Interview Course

Introduction

Total order broadcast also known as the atomic broadcast is described as a protocol to exchange messages among the nodes of a distributed system. Note that the word “atomic” in the phrase “atomic broadcast” doesn’t imply atomicity in the sense as used in the context of transactions or operations. A total broadcast satisfies two requirements:

  1. Messages are guaranteed to be delivered. If a particular message is received by one node, then that message must have been received by every other node in the system eventually.
  2. All nodes receive messages in exactly the same order. A node can’t retroactively insert a message in between the order of already received messages. Another way to think about message delivery is that of an append-only log, where new messages can only be appended to an existing list of messages. Since all messages are delivered in the same order to all the nodes, this hypothetical log can be read by any of the nodes and the nodes will find the messages in exactly the same order.
Grokking Modern System Design for Software Engineers and Managers

The above two guarantees are ideal when replicating a database. If every node in the system receives each write that is directed to the database in exactly the same order then all nodes must eventually catch up to the latest state of the database and be consistent with each other. There might be temporary delays before each node achieves the latest state of database due to network delays or such. This is also known as state machine replication.

Ace Java Multithreading and Concurrency questions with our best-selling course for interview prep.

The algorithm that implements total order broadcast must satisfy the two conditions mentioned earlier. The algorithm should be resilient and attempt redelivery of a message in face of network failures until the message is successfully delivered. Zookeeper defines a protocol called ZAB (Zookeeper Atomic Broadcast) that ensures replication is done in order across all Zookeeper nodes and is also responsible for leader election.

Land a higher salary with Grokking Comp Negotiation in Tech.

There are different types of broadcast protocols that are listed below but we don’t cover all of them. In the context of distributed systems our focus will be on total order broadcast.

  1. Best-effort broadcast
  2. Reliable broadcast
  3. FIFO broadcast
  4. Causal broadcast
  5. Total order broadcast
  6. FIFO-total order broadcast
Ace the machine learning engineer interview with Grokking the Machine Learning Interview.

The broadcast protocol is implemented by a broadcast algorithm. A node participating in the broadcast protocol sends a message to every other node in the system, which constitutes a broadcast. The following image depicts nodes participating in a broadcast:

A message broadcasted by a node is also delivered to the broadcasting node itself by the broadcast algorithm. A node doesn’t know the position of the message it is about to broadcast in the total order and thus can’t simply append to the boradcast sequence the node maintains.

Now we’ll work through an example of a total broadcast. Consider three nodes A, B and C as shown below:

Say, the node A broadcasts two messages, m1 and m3 while node B broadcasts message m2. Furthermore, let’s assume that the total order has been determined as m1, m3 and m2 but the order in which the messages are sent out on a timeline is m1, m2 and m3.

Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.

  1. Node A broadcasts m1, which is delivered to itself and also to other nodes.
  1. Node B broadcasts m2, which is received by all the nodes successfully on the network but since m2 occurs later in the total order, the broadcast algorithm holds back message m2 from delivery to each node.
  1. Node A broadcasts m3, which is received on the network and then successively delivered to all the nodes.
  1. Finally, the message m2 is delivered to all the nodes and the total order broadcast is honored.

Equivalence to Consensus

Implementing total order/atomic broadcast is equivalent to implementing consensus among the participants of the broadcast protocol. In order for the conditions for atomic broadcast to be satisfied, the participants must effectively “agree” on the order of receipt of the messages. Participants recovering from failure, after the other participants have “agreed” to an order and started to receive the messages, must be able to learn and comply with the agreed upon order.

Grokking the Coding Interview: Patterns for Coding Questions

We can also build linearizable storage using total broadcast and vice versa. In fact building a linearizable compare and set operation and total order broadcast are both equivalent to the consensus problem. Note that total order broadcast and linearizability though related aren’t equivalent. Total broadcast order guarantees that messages are delivered reliably in a fixed order but makes no guarantee of when the messages get delivered. In contrast, linearizability is a recency guarantee, which implies that a read always returns the latest committed write.

Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job!

In the username example we discussed earlier, the problem was that the same username could be registered by two users concurrently. One way to solve this problem with total order broadcast is to have a variable (or register in the context of distributed systems) that denotes whether a particular username has been taken or not and is initialized to false. Additionally, if a linearizable compare-and-set operation exists, we can use the operation to claim a username and set the register to true, indicating the username has been taken. Once we set the register to true, every following read of the register will return a value of true and let the node attempting to register the same username know that the username is already taken. The hard part in this scheme is implementing the compare-and-set operation. Interestingly, if we have total order broadcast implemented we can build such a linearizable compare-and-set operation on top of it as follows:

  1. Node attempting to register a username broadcasts a message with the username it intends to register. This can also be thought of as the node appending a message to a append-only log. Remember the message may not be delivered instantaneously to the node since it may be held back by the broadcast algorithm to honor the broadcast order.
  2. The node waits for the message it broadcasted, to be delivered back to itself.
  3. Once the node’s own message has been delivered. It can go through the messages and see if the message it sent is the first message that has the username the node wants to register. If so, the node can claim the username. If the first message the node wants to register is from another node, then it implies that another node has already laid a claim to the desired username and the current node can abort the registration request.
Interviewing? consider buying our number#1 course for Java Multithreading Interviews.

Because all messages are delivered in the same order to all the nodes, the winner in case of concurrent writes is the first write in the broadcast order and the rest are aborted. This mechanism allows for linearizable writes but not linearizable reads. For linearizable reads we can follow the below options:

  1. Similar to how we implemented linearizable write using total broadcast, we can implement linearizable read by broadcasting a message which is akin to appending a message to an append-only log and waiting for the message to be delivered back. The read occurs at the point when the message is delivered back.
  2. Reads can be made against the leader or a replica that is synchronously updated on writes, thus always returning the latest value.
  3. If it’s possible to know the sequence number of the latest message that has been broadcast in a linearizable way, then a node can wait until it receives all the messages upto the sequence number of the latest message and then execute a read.

If you are interviewing, consider buying our number#1 course for Python Multithreading Interviews.

In the flip case, where we can assume to have linearizable storage, we can build total order on top of it. Imagine having a simple counter variable/register that can be incremented using a compare-and-set operation or an increment-and-get operation. Nodes participating in the broadcast protocol have access to this register and access it to get the next sequence number. The retrieved number is assigned to the message being broadcast. The register increases in value monotonically and the broadcast algorithm delivers the messages to each node consecutively by sequence number. If a node has received messages numbered upto 10 and then it receives a message numbered 15, the node knows that it has to wait for messages numbered 11, 12, 13 and 14 to be received first before the node can deliver the message numbered 15. The missing messages may have been delayed due to a variety of reasons such as network delay.

Deep dive into mastering dynamic programming interview questions

Your Comprehensive Interview Kit for Big Tech Jobs

0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.

1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.

2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.

3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.

4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.

5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

6. Machine Learning System Design

7. System Design Course Bundle

8. Coding Interviews Bundle

9. Tech Design Bundle

10. All Courses Bundle

--

--

SystemDesign
SystemDesign

Written by SystemDesign

The ultimate Poor man’s system design interview prep guide -- https://systemdesign.medium.com/membership

No responses yet