Distributed System Design Patterns

Nishant · May 7, 2022

Key patterns that address common design problems in distributed systems:

1. Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can return false positives but never false negatives, so it is useful where we only need to know whether an element might belong to the set.

In BigTable (and Cassandra), any read operation has to read from all SSTables that make up a Tablet. If these SSTables are not in memory, the read operation may end up doing many disk accesses. To reduce the number of disk accesses, BigTable uses Bloom filters.
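
As a rough illustration, here is a minimal Bloom filter sketch in Python; the bit-array size, the number of hash functions, and the `might_contain` name are arbitrary choices for this example rather than any particular system's API.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes over a fixed-size bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
print(bf.might_contain("row-key-42"))  # True
print(bf.might_contain("row-key-99"))  # Usually False (could be a false positive)
```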

2. Consistent Hashing

Consistent hashing makes it easy to scale the cluster and replicate data efficiently, which improves availability and fault tolerance.

Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position. That node becomes the location of the data item.

Consistent Hashing Ring

The principal advantage of consistent hashing is incremental stability: the departure or arrival of a node affects only its immediate neighbors on the ring; other nodes remain unaffected.
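
A minimal sketch of such a ring in Python (no virtual nodes; the class and method names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring without virtual nodes."""

    def __init__(self, nodes=()):
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _position(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._position(node), node))

    def remove_node(self, node):
        self._ring.remove((self._position(node), node))

    def get_node(self, key):
        # Walk clockwise: first node at or after the key's position, wrapping around.
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._ring, (self._position(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:1234"))  # only keys near an added/removed node ever move
```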

3. Quorum

In a distributed environment, a quorum is the minimum number of servers on which a distributed operation needs to be performed successfully before declaring the operation’s overall success.

In Cassandra, to ensure data consistency, each write request can be configured to be successful only if the data has been written to at least a quorum (or majority) of replica nodes.

For leader election, Chubby uses Paxos, which uses a quorum to ensure strong consistency.

Dynamo replicates writes to a sloppy quorum of other nodes in the system, instead of a strict majority quorum like Paxos. All read/write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.
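
A small sketch of the quorum arithmetic (function names are illustrative); the `r + w > n` check mirrors the Cassandra/Dynamo-style rule that overlapping read and write quorums give consistent reads:

```python
def majority_quorum(n):
    """Majority quorum for n replicas: more than half."""
    return n // 2 + 1

def quorums_overlap(n, r, w):
    # If read and write quorums overlap (r + w > n), every read contacts at
    # least one replica that saw the latest acknowledged write.
    return r + w > n

N = 3
print(majority_quorum(N))        # 2
print(quorums_overlap(N, 2, 2))  # True  (quorum reads + quorum writes)
print(quorums_overlap(N, 1, 1))  # False (single-replica reads and writes)
```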

4. Leader and Follower

To achieve fault tolerance in systems which manage data, the data needs to be replicated on multiple servers.

Select one server in the cluster as the leader. The leader is responsible for taking decisions on behalf of the entire cluster and propagating those decisions to all the other servers.

For small clusters of three to five nodes, as in systems that implement consensus, leader election can be implemented within the data cluster itself without depending on any external system. Leader election happens at server startup: every server starts a leader election at startup and tries to elect a leader. The system does not accept any client requests until a leader is elected.
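
A deliberately simplified toy election, assuming every server knows the full membership list and the highest-numbered reachable server wins (real systems run a consensus protocol such as Raft or Paxos for this):

```python
def elect_leader(node_ids, is_reachable):
    """Toy election: the highest-numbered reachable node becomes the leader."""
    live = [n for n in node_ids if is_reachable(n)]
    if not live:
        raise RuntimeError("no reachable nodes; cannot elect a leader")
    return max(live)

# Example: node 5 is unreachable, so node 3 wins this election round.
print(elect_leader([1, 2, 3, 5], is_reachable=lambda n: n != 5))  # 3
```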

5. Heartbeat

Servers periodically send heartbeat messages to indicate that they are alive. The Heartbeat mechanism is used to detect whether an existing leader has failed, so that a new leader election can be started.
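
A minimal heartbeat monitor sketch; the timeout value and class name are arbitrary for the example:

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat from each peer and flags suspected failures."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout        # seconds of silence before suspecting failure
        self.last_seen = {}

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def is_alive(self, node_id):
        last = self.last_seen.get(node_id)
        return last is not None and (time.monotonic() - last) < self.timeout

monitor = HeartbeatMonitor(timeout=1.0)
monitor.record_heartbeat("leader-1")
print(monitor.is_alive("leader-1"))   # True
time.sleep(1.1)
print(monitor.is_alive("leader-1"))   # False -> start a new leader election
```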

6. Fencing

In a leader-follower setup, when a leader fails, it is impossible to be sure that the leader has stopped working. For example, a slow network or a network partition can trigger a new leader election, even though the previous leader is still running and thinks it is still the active leader.

Fencing is the idea of putting a fence around a previously active leader so that it cannot access cluster resources and hence stops serving any read/write requests.

The following two techniques are used:

  • Resource fencing: The system blocks the previously active leader from accessing resources needed to perform essential tasks.
  • Node fencing: The system stops the previously active leader from accessing all resources. A common way of doing this is to power off or reset the node.
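
One common way to realize resource fencing is with fencing tokens: every newly elected leader is handed a token higher than any issued before, and the shared resource rejects requests carrying an older token. A minimal sketch (class and method names are illustrative):

```python
class FencedResource:
    """Shared resource that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token_seen = 0
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token_seen:
            raise PermissionError("stale fencing token: this leader has been fenced off")
        self.highest_token_seen = fencing_token
        self.data[key] = value

store = FencedResource()
store.write("x", 1, fencing_token=33)      # old leader, still valid
store.write("x", 2, fencing_token=34)      # new leader takes over
try:
    store.write("x", 3, fencing_token=33)  # old leader wakes up -> rejected
except PermissionError as err:
    print(err)
```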

7. Write-ahead Log (WAL)

Write-ahead logging is a sophisticated solution to the problem of file system inconsistency in operating systems. Inspired by database management systems, this method first writes a summary of the actions to be performed into a “log” before actually writing them to the disk. In the case of a crash, the OS can simply check this log and pick up from where it left off.
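
A minimal write-ahead-log sketch in Python; the file name and the `set` operation are placeholders for the example, not any real database's interface:

```python
import json
import os

class WriteAheadLog:
    """Append each operation to a log on disk before applying it in memory."""

    def __init__(self, path="wal.log"):
        self.path = path
        self.state = {}
        self._replay()                      # recover state after a crash

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                entry = json.loads(line)
                self.state[entry["key"]] = entry["value"]

    def set(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())            # make the log entry durable first...
        self.state[key] = value             # ...then apply the change

wal = WriteAheadLog()
wal.set("balance", 100)
print(wal.state)                            # state is rebuilt from the log on restart
```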

8. Segmented Log

Split log into multiple smaller files instead of a single large file for easier operations.

A single log file can grow and become a performance bottleneck while it is read at startup. Older logs are cleaned up periodically, and doing cleanup operations on a single huge file is difficult to implement.

The single log is split into multiple segments, and log files are rolled over after a specified size limit is reached. With log segmentation, there needs to be a straightforward way to map logical log offsets (or log sequence numbers) to the segment files.
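
A sketch of that offset-to-segment mapping, assuming fixed-size segments named after the first offset they contain (a naming convention Kafka also uses); the segment size here is arbitrary:

```python
SEGMENT_SIZE = 1000   # log entries per segment, chosen arbitrarily for the example

def segment_for_offset(offset, segment_size=SEGMENT_SIZE):
    """Map a logical log offset to (segment file name, position inside it)."""
    base_offset = (offset // segment_size) * segment_size
    return f"{base_offset:020d}.log", offset - base_offset

print(segment_for_offset(0))      # ('00000000000000000000.log', 0)
print(segment_for_offset(2500))   # ('00000000000000002000.log', 500)
```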

9. High-Water mark

Keep track of the last log entry on the leader, which has been successfully replicated to a quorum of followers. The index of this entry in the log is known as the High-Water Mark index. The leader exposes data only up to the high-water mark index.

Kafka: To deal with non-repeatable reads and ensure data consistency, Kafka brokers keep track of the high-water mark, which is the largest offset that all in-sync replicas of a particular partition share. Consumers can see messages only up to the high-water mark.
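
A small sketch of how a leader could compute the high-water mark from the replication progress reported by each replica (function and argument names are illustrative):

```python
def high_water_mark(replicated_indexes, quorum_size):
    """Highest log index known to be replicated on at least a quorum of nodes.

    replicated_indexes: the latest log index acknowledged by each replica,
    including the leader itself.
    """
    ordered = sorted(replicated_indexes, reverse=True)
    return ordered[quorum_size - 1]

# Leader holds entries up to 10; followers have acknowledged up to 7 and 5.
print(high_water_mark([10, 7, 5], quorum_size=2))  # 7 -> safe to expose to clients
```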

10. Lease

A lease is like a lock, but it works even when the client goes away. The client asks for a lease for a limited period, after which the lease expires. If the client wants to extend the lease, it can renew the lease before it expires.

Chubby clients maintain a time-bound session lease with the leader. During this time interval, the leader guarantees to not terminate the session unilaterally.
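
A minimal lease sketch; the duration and class name are placeholders for the example:

```python
import time

class Lease:
    """Time-bound lease that stays valid only if the holder keeps renewing it."""

    def __init__(self, holder, duration):
        self.holder = holder
        self.duration = duration
        self.expires_at = time.monotonic() + duration

    def is_valid(self):
        return time.monotonic() < self.expires_at

    def renew(self):
        # The holder must renew before expiry to keep exclusive access.
        self.expires_at = time.monotonic() + self.duration

lease = Lease("client-1", duration=0.5)
print(lease.is_valid())   # True
lease.renew()             # extend before it expires
time.sleep(0.6)
print(lease.is_valid())   # False -> the server may now grant the lease to others
```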

11. Gossip Protocol

A gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and about other nodes they know of.

Each node initiates a gossip round every second to exchange state information about itself and other nodes with one other random node.

Every second each server exchanges information with one randomly selected server
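
A toy gossip round in Python, assuming each node's state is just a heartbeat counter per peer and higher counters are fresher (real implementations exchange richer membership metadata):

```python
import random

def gossip_round(views, node_id):
    """One gossip round: node_id picks a random peer and both merge their views.

    views: {node: {peer: heartbeat_counter}}; higher counters are fresher.
    """
    peers = [n for n in views if n != node_id]
    target = random.choice(peers)
    merged = {p: max(views[node_id].get(p, 0), views[target].get(p, 0))
              for p in set(views[node_id]) | set(views[target])}
    views[node_id] = dict(merged)
    views[target] = dict(merged)

views = {"a": {"a": 3, "b": 1}, "b": {"b": 5}, "c": {"c": 2}}
gossip_round(views, "a")
print(views["a"])   # the freshest counters known to either node after the exchange
```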

12. Phi Accrual Failure Detection

This algorithm uses historical heartbeat information to make the failure-detection threshold adaptive. Instead of giving a binary answer about whether a server is alive, a generic Accrual Failure Detector outputs a suspicion level about the server.

Cassandra uses the Phi Accrual Failure Detector algorithm to determine the state of the nodes in the cluster.
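
A simplified sketch of an accrual detector, assuming exponentially distributed heartbeat gaps (the actual Phi Accrual algorithm and Cassandra use a windowed estimate of the inter-arrival distribution; the names and window size here are illustrative):

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Simplified accrual detector: phi grows as heartbeats become overdue."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean_gap = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # phi = -log10(P(gap >= elapsed)), with P modeled as exp(-elapsed/mean).
        return (elapsed / mean_gap) * math.log10(math.e)

d = PhiAccrualDetector()
for t in (0, 1, 2, 3):            # heartbeats arriving once per second
    d.heartbeat(t)
print(round(d.phi(3.5), 2))        # ~0.22: low suspicion, node looks alive
print(round(d.phi(11.0), 2))       # ~3.47: high suspicion, node likely failed
```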

13. Split-brain

The common scenario in which a distributed system has two or more active leaders is called split-brain.

Split-brain is solved through the use of a Generation Clock, which is simply a monotonically increasing number indicating a server’s generation.

Every time a new leader is elected, the generation number gets incremented. This means if the old leader had a generation number of ‘1’, the new one will have ‘2’. This generation number is included in every request that is sent from the leader to other nodes. This way, nodes can now easily differentiate the real leader by simply trusting the leader with the highest number.

Kafka: To handle Split-brain (where we could have multiple active controller brokers), Kafka uses ‘Epoch number,’ which is simply a monotonically increasing number to indicate a server’s generation.

HDFS: ZooKeeper is used to ensure that only one NameNode is active at any time. An epoch number is maintained as part of every transaction ID to reflect the NameNode generation.
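
A minimal sketch of a follower using a generation (epoch) number to ignore a deposed leader (class and method names are illustrative):

```python
class Follower:
    """Follower that rejects requests from leaders with a stale generation."""

    def __init__(self):
        self.current_generation = 0

    def handle_leader_request(self, generation, payload):
        if generation < self.current_generation:
            return "rejected: stale leader"          # request from before the election
        self.current_generation = generation
        return f"accepted (generation {generation}): {payload}"

f = Follower()
print(f.handle_leader_request(1, "write x=1"))   # original leader
print(f.handle_leader_request(2, "write x=2"))   # new leader after re-election
print(f.handle_leader_request(1, "write x=3"))   # deposed leader -> rejected
```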

14. Checksum

In a distributed system, while moving data between components, it is possible that the data fetched from a node may arrive corrupted.

Calculate a checksum and store it with data.

To calculate a checksum, a cryptographic hash function like MD5, SHA-1, SHA-256, or SHA-512 is used. The hash function takes the input data and produces a string (containing letters and numbers) of fixed length; this string is called the checksum.

When a system is storing some data, it computes a checksum of the data, and stores the checksum with the data. When a client retrieves data, it verifies that the data it received from the server matches the checksum stored. If not, then the client can opt to retrieve that data from another replica.

HDFS and Chubby store the checksum of each file with the data.
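
A small end-to-end sketch using SHA-256; the in-memory `blob_store` dict stands in for real storage:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Fixed-length checksum of the data (SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()

def store(blob_store, key, data: bytes):
    blob_store[key] = (data, checksum(data))         # keep the checksum with the data

def fetch(blob_store, key) -> bytes:
    data, stored_sum = blob_store[key]
    if checksum(data) != stored_sum:
        raise IOError("checksum mismatch: data corrupted, try another replica")
    return data

blob_store = {}
store(blob_store, "file-1", b"hello world")
print(fetch(blob_store, "file-1"))
```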

15. CAP Theorem

CAP theorem states that it is impossible for a distributed system to simultaneously provide all three of the following desirable properties:

Consistency (C), Availability (A), and Partition Tolerance (P).

According to the CAP theorem, any distributed system needs to pick two out of the three properties. The three options are CA, CP, and AP.

Dynamo: In CAP theorem terms, Dynamo falls within the category of AP systems and is designed for high availability at the expense of strong consistency.

BigTable: In terms of the CAP theorem, BigTable is a CP system, i.e., it has strictly consistent reads and writes.

16. PACELC Theorem

The PACELC theorem states that in a system that replicates data:

  • if there is a partition (‘P’), a distributed system can trade off between availability and consistency (i.e., ‘A’ and ‘C’);
  • else (‘E’), when the system is running normally in the absence of partitions, the system can trade off between latency (‘L’) and consistency (‘C’).

The first part of the theorem (PAC) is the same as the CAP theorem, and the ELC is the extension. The whole thesis assumes we maintain high availability by replication. So, when there is a failure, the CAP theorem prevails. But if not, we still have to consider the tradeoff between consistency and latency of a replicated system.

17. Hinted Handoff

When a node is down, the system keeps hints (or notes) of all the requests it has missed. Once the failed node recovers, the requests are forwarded to it based on the stored hints.

When a node is down, the leader writes a hint in a text file on local disk. This hint contains the data along with the node information to which it belongs. When the leader realizes that a node for which it is holding hints has recovered, it forwards the write request for each hint to the node.
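
A toy coordinator illustrating the idea; the in-memory hint list stands in for the on-disk hint files real systems use, and the names are illustrative:

```python
from collections import defaultdict

class Coordinator:
    """Stores hints for unreachable replicas and replays them on recovery."""

    def __init__(self, replicas):
        self.replicas = replicas                 # {node_id: is_up}
        self.hints = defaultdict(list)           # node_id -> missed writes

    def write(self, key, value):
        for node, is_up in self.replicas.items():
            if is_up:
                print(f"wrote {key}={value} to {node}")
            else:
                self.hints[node].append((key, value))    # keep a hint for later

    def node_recovered(self, node):
        self.replicas[node] = True
        for key, value in self.hints.pop(node, []):
            print(f"handoff: replayed {key}={value} to {node}")

c = Coordinator({"n1": True, "n2": False})
c.write("x", 1)            # n2 is down, so its copy is recorded as a hint
c.node_recovered("n2")     # the hinted write is now handed off to n2
```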

18. Read Repair

In Distributed Systems, where data is replicated across multiple nodes, some nodes can end up having stale data.

Repair stale data during the read operation, since at that point, we can read data from multiple nodes to perform a comparison and find nodes that have stale data. This mechanism is called Read Repair. Once the node with old data is known, the read repair operation pushes the newer version of data to nodes with the older version.

Cassandra and Dynamo use ‘Read Repair’ to push the latest version of the data to nodes with the older versions.
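
A sketch of the read path, assuming each replica stores (value, version) pairs and a higher version number means newer data:

```python
def read_with_repair(replicas, key):
    """Read from all replicas, return the freshest value, repair stale copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    freshest = max((resp for _, resp in responses if resp is not None),
                   key=lambda pair: pair[1])           # pick the highest version
    for replica, resp in responses:
        if resp != freshest:
            replica[key] = freshest                    # push the newer version
    return freshest[0]

r1 = {"x": ("new", 2)}
r2 = {"x": ("old", 1)}
print(read_with_repair([r1, r2], "x"))   # 'new'
print(r2)                                 # the stale replica was repaired in place
```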

19. Merkle Trees

Read Repair removes conflicts while serving read requests. But, if a replica falls significantly behind others, it might take a very long time to resolve conflicts.

A replica can contain a lot of data. Naively splitting up the entire range to calculate checksums for comparison is not very feasible; there is simply too much data to be transferred. Instead, we can use Merkle trees to compare replicas of a range.

A Merkle tree is a binary tree of hashes, where each internal node is the hash of its two children, and each leaf node is a hash of a portion of the original data.

Comparing Merkle trees is conceptually simple:

  1. Compare the root hashes of both trees.
  2. If they are equal, stop.
  3. Otherwise, recurse on the left and right children.

For anti-entropy and to resolve conflicts in the background, Dynamo uses Merkle trees.
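
A compact sketch of building and diffing Merkle trees over fixed data blocks; the block contents and helper names are made up for the example:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(blocks):
    """Build a Merkle tree over data blocks; each node is (hash, left, right)."""
    nodes = [(h(block), None, None) for block in blocks]
    while len(nodes) > 1:
        if len(nodes) % 2:                 # duplicate the last node if the count is odd
            nodes.append(nodes[-1])
        nodes = [(h((left[0] + right[0]).encode()), left, right)
                 for left, right in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]

def diff(a, b, path=""):
    """Compare two trees; return paths of subtrees whose hashes differ."""
    if a[0] == b[0]:
        return []                          # identical subtrees: nothing to sync
    if a[1] is None:                       # reached a differing leaf
        return [path]
    return diff(a[1], b[1], path + "L") + diff(a[2], b[2], path + "R")

t1 = build_merkle([b"block0", b"block1", b"block2", b"block3"])
t2 = build_merkle([b"block0", b"blockX", b"block2", b"block3"])
print(diff(t1, t2))   # ['LR'] -> only the second block needs to be transferred
```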
