System Design Refresher — Part 1, Basic Concepts

Ryan Huang
Mastering the System Design Interview
9 min read · Apr 15, 2021


The System Design Refresher series reviews fundamental knowledge about system design. I hope it can serve as a good refresher before you go into a system design interview. The series consists of 4 parts:

  • Part 1: Basic Concepts in Distributed System
  • Part 2: Data Replication (TBD)
  • Part 3: ACID Transaction
  • Part 4: Distributed Transaction

Terminology:

  • READ: a read operation; READ(k) reads variable k
  • WRITE: a write operation; WRITE(k, v) writes value v to variable k

What is a distributed system?

Why do we need to design distributed systems? Why can’t we just build applications on a single machine? The primary motivation is to provide service to billions of users on earth:

  • Provide reliable service: If one machine is unavailable or crashes, use another. Protect against corrupted data.
  • Scale the system as the number of users grows: One machine does not have enough CPU/memory/disk capacity to serve all users at the same time.
  • Serve users distributed across the globe: Place machines near users to cut down network latency. One machine can obviously only be in one place.

To achieve these goals, we need to replicate data across many machines. The moment we add a second machine, we have a distributed system. What makes a distributed system different from a single machine?

Figure 1. Distributed system with two replicas. Failures can happen independently anywhere in the system, e.g., OS, app, hardware, network links, etc.

When we write a program to run on a single machine, if any failure does happen, it is safe to assume that everything else fails too. If a disk seek fails, even though the CPU and memory are perfectly fine, we can treat the whole computer as dysfunctional. Technically, we say that these components share fate. Fate sharing cuts down immensely on the number of failure modes an engineer has to handle. In a distributed system, however, components fail independently. The system fails partially instead of crashing completely. If the network switch in one rack fails, machines on that rack are partitioned from other racks, but all machines are still functional. Because there are so many components in a distributed system (machines, network cables, switches, routers, etc.), there are “infinite” permutations of failures to handle. This post from the Amazon Builders’ Library⁷ has a good discussion of different failure modes.

To make things worse, we can never know exactly what has failed. If a client sends a request to a remote server but doesn’t get a response, what could be the problem? It could be network congestion; the packet could have been lost in the network and never reached the remote server; the remote server could have crashed before sending a reply; the remote server could be too busy to process the request; or the reply message could have been sent but lost in the network.

From the client side, if we don’t receive a response within a timeout, we can assume something failed. However, we can never know exactly what happened; all we observe is a high network latency before getting a response. With this observation, we can use network latency to model different modes of failure (a rough sketch follows the list below):

  • Transient network congestion, server-side overloading, GC pause, etc. = long-tail network latency (~50 ms).
  • Network packet loss = high latency from several TCP retransmissions until successful delivery (~60 s).
  • Network partition, hardware failure, server process crash = large delay, up to the duration of the repair (~10 min).
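
To make this mental model concrete, here is a minimal sketch of how a client might classify an observed latency into a likely failure mode. The function name and the thresholds are assumptions taken from the rough numbers in the list above, not part of any real library:

```python
# A minimal sketch of the latency-based mental model above. The thresholds
# (50 ms, 60 s, 10 min) mirror the rough numbers in the list; in a real
# system they would be tuned from measured latency distributions.

def likely_failure_mode(observed_latency_s: float) -> str:
    """Map the latency observed by a client to the most likely failure mode."""
    if observed_latency_s < 0.05:
        return "normal operation"
    if observed_latency_s < 60:
        return "transient congestion / server overload / GC pause (long-tail latency)"
    if observed_latency_s < 600:
        return "packet loss, waiting on TCP retransmissions"
    return "network partition, hardware failure, or server process crash"

if __name__ == "__main__":
    for latency in (0.002, 0.2, 75, 900):
        print(f"{latency:>8.3f}s -> {likely_failure_mode(latency)}")
```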

Consistency Model

Partial failures create the problem of data inconsistency. When we replicate data across two machines, the network could fail to deliver an update to the second machine after the state on the first machine has already been updated. Now the two machines hold inconsistent data, and the client needs to be extra careful when reading it. There are “infinite” permutations of failures like this. How can we guarantee that the result is always correct? To make the system easy to use, we need to establish a well-defined contract between the client and the system. This contract is often called the consistency model.

The strongest consistency model is called “linearizability”. If a distributed system claims to be linearizable, it behaves as if it were a single computer. We, as clients, observe that every WRITE and READ happens atomically, as if we were writing/reading a variable in memory. The system implementation hides all the weird failures discussed above.

A linearizable system is the dream world of every developer. Of course, linearizability is difficult, or rather very costly, to implement. To achieve linearizability, we need to sacrifice availability, scalability, throughput, and latency. Understanding these trade-offs is a fundamental skill, and we will discuss them further in later articles. Here, let’s first define “linearizability” formally:

There exists a total order on all operations such that each operation looks as if it were completed (atomically) at a single instant

“As if” does the magic here! In reality, operations can overlap with each other, but clients cannot observe such overlap. The end state looks “as if” all operations completed atomically, and every client observes the progression of the system’s state “as if” all operations were executed in sequential order. This definition is from the client’s perspective; it is independent of the concrete system implementation.

Figure 2. Linearizable system

As shown in Figure 2, a database is replicated to A, B, and C. All operations from clients X, Y, and Z happen atomically (one after another) in sequential order. Although the system has three replicas, it behaves as if it were a single machine. A WRITE from X is immediately replicated to all replicas and visible to clients Y and Z. Linearizability ensures that a “stale read” can never occur (e.g., the 3rd operation, Z READ(K)=V0, would never happen). Once X updates K to V1, we are guaranteed to read V1 until X updates it to a new value V2. This is an intuitive model, which makes it very easy for application developers to interact with the system.
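
To make the “as if” contract concrete, here is a toy check of a single-register history against this definition. It assumes a simplified setting where operations never overlap in time, so their real-time order already gives the required total order; checking linearizability of histories with overlapping operations is much harder and is not attempted here:

```python
# A toy check of the contract from Figure 2, under a simplifying assumption:
# operations do not overlap in time, so real-time order is the total order.
# Each READ must then return the value of the latest prior WRITE.

from typing import List, Optional, Tuple

# (kind, key, value): ("W", "K", "V1") is a write; ("R", "K", "V1") is a read
# together with the value it returned.
Op = Tuple[str, str, Optional[str]]

def is_linearizable_sequential(history: List[Op]) -> bool:
    latest = {}  # key -> value of the most recent write
    for kind, key, value in history:
        if kind == "W":
            latest[key] = value
        elif latest.get(key) != value:  # a read returned a stale or unknown value
            return False
    return True

# The operations from Figure 2, and a variant containing the forbidden stale read.
ok_history = [("W", "K", "V0"), ("R", "K", "V0"), ("W", "K", "V1"), ("R", "K", "V1")]
stale_history = [("W", "K", "V0"), ("W", "K", "V1"), ("R", "K", "V0")]

print(is_linearizable_sequential(ok_history))     # True
print(is_linearizable_sequential(stale_history))  # False: READ(K)=V0 after WRITE(K, V1)
```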

Linearizability gives us a strong model to reason about the correctness of distributed systems. If you want to learn more about it, there are good video courses from MIT² and the University of Cambridge¹.

Making Trade-offs

Linearizability is a clean contract that hides all the complexity of the distributed system. However, linearizability is not the best fit for many use cases. Modern web services are required to be always available, regardless of failures in the underlying infrastructure. It is impossible to achieve that goal while still maintaining linearizability. We need to adopt a weaker consistency model in favor of higher availability, scalability, throughput, etc.

Figure 3. System design is about making trade-offs

The CAP theorem³ captures the design trade-off between consistency and availability. What the CAP theorem says is that when the system is split into two parts due to a network partition (P), we can only choose between two options (a toy sketch follows this list):

  1. Choose Consistency (C) over Availability (A): Reject requests until the network partition is repaired. The system is not available during this period, but the two sub-partitions remain in a consistent state.
  2. Choose Availability (A) over Consistency (C): Allow both sub-partitions to continue processing requests. The system remains available, but data in the two sub-partitions could diverge. The system needs to reconcile the diverged data after the network partition is repaired.
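
As a rough illustration of the two options, the sketch below shows a hypothetical CP-style replica that rejects writes during a partition and a hypothetical AP-style replica that accepts them and reconciles later. The class names and the reconciliation logic are illustrative assumptions, not the design of any particular system:

```python
# Toy illustration of the two CAP choices. During a partition, the CP replica
# rejects writes it cannot replicate; the AP replica accepts them locally and
# reconciles after the partition heals.

class CPReplica:
    def __init__(self):
        self.data = {}

    def write(self, key, value, partitioned: bool):
        if partitioned:
            # Choose consistency: refuse a write we cannot replicate right now.
            raise RuntimeError("unavailable until the partition is repaired")
        self.data[key] = value


class APReplica:
    def __init__(self):
        self.data = {}
        self.pending = []  # writes accepted while partitioned

    def write(self, key, value, partitioned: bool):
        # Choose availability: always accept the write locally.
        self.data[key] = value
        if partitioned:
            self.pending.append((key, value))

    def reconcile(self, other: "APReplica"):
        # Naive reconciliation after repair: replay pending writes on the peer.
        # Real systems need a conflict-resolution policy (last-writer-wins,
        # vector clocks, CRDTs, ...), which this sketch leaves out.
        for key, value in self.pending:
            other.data[key] = value
        self.pending.clear()
```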

The CAP theorem defines consistency and availability from a theoretician’s point of view. It does not capture the full design space of a real-world system. In particular, latency is left out of the CAP theorem (see details in 4, 5, 6). To understand how to make design trade-offs in real-world systems, let’s first discuss consistency and availability from a practitioner’s point of view.

Consistency

In the CAP theorem, consistency refers to strong consistency (linearizability): the system is either consistent or it is not. In the real world, consistency is not a binary state. Even if a system is not linearizable, we can still define several shades of consistency. Depending on the use case, a weaker consistency model is often sufficient. Here are several well-known consistency models:

Data-centric Consistency Model:

  • Linearizability: strong consistency, as discussed in the previous section.
  • Causal consistency: Causally related writes must be seen by all clients in the same order. Concurrent writes may be seen in different orders on different machines. Causality between operations is often tracked with additional metadata.
  • Eventual consistency: In the absence of updates, all replicas eventually converge to the same state.

Client-centric Consistency Models (a small sketch follows this list):

  • Read My Own Writes: a READ(k) always returns the result of the latest WRITE(k) by the same process.
  • Writes Follow Reads: a WRITE(k) following a READ(k) takes place on the same or a more recent version of k.
  • Monotonic reads: once a process reads a data item, subsequent reads of that item return the same or more recent values.
  • Monotonic writes: a WRITE must be propagated to all replicas before a successive WRITE by the same process. In other words, on each replica, write operations from the same client are processed in the same order.
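
One common way to provide client-centric guarantees on top of weakly consistent replicas is to track versions in a client-side session. The sketch below is a simplified, hypothetical example (the Replica and Session classes are made up for illustration): the session remembers the highest version it has observed per key and refuses reads from replicas that lag behind, which gives a single client read-my-own-writes and monotonic reads.

```python
import random

class Replica:
    """Toy replica that stores, per key, the latest (version, value) it knows."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


class Session:
    """Client-side guard: never accept data older than what this client has seen."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.last_seen = {}  # key -> highest version this client has observed

    def write(self, key, value):
        version = self.last_seen.get(key, 0) + 1
        # The write lands on one replica; asynchronous replication to the
        # others is assumed to happen in the background (not shown).
        random.choice(self.replicas).write(key, value, version)
        self.last_seen[key] = version

    def read(self, key):
        for replica in self.replicas:
            version, value = replica.read(key)
            if version >= self.last_seen.get(key, 0):  # fresh enough for this session
                self.last_seen[key] = version
                return value
        raise TimeoutError("no replica has caught up to this session yet")
```

Note that the guarantee is per client: a fresh session with no history would happily read stale data from a lagging replica, which is exactly the behavior the data-centric models above still allow.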

Availability

In the web service industry, availability is tightly related to the Service-Level Agreement (SLA). In its SLA, Amazon S3 claims it is designed for 99.99% availability (about 5 min of downtime per month). If the accumulated uptime falls below 99.0% (about 7 hrs of downtime per month), AWS will give a 10% credit back. How do we categorize a service as up (available) or down (unavailable) at any moment? We usually define it using service operation metrics, and among them, latency is a very important one. All application clients have a timeout (e.g., 1 min). If a service can only complete a request after 2 min, we can’t say that it is available, because the clients only wait for 1 min. In practice, the service vendor kills a request (a.k.a. load shedding) if it takes too long to process, because a late response is useless to the clients anyway.
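
For reference, the downtime figures above follow from simple arithmetic over a 30-day month; the small helper below just spells it out:

```python
# The arithmetic behind the downtime figures above: the downtime budget that a
# given availability target allows over a 30-day month.

def downtime_minutes_per_month(availability_percent: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

for target in (99.99, 99.9, 99.0):
    print(f"{target}% availability -> {downtime_minutes_per_month(target):.1f} min/month")

# 99.99% -> ~4.3 min/month (roughly the "5 min" quoted above)
# 99.0%  -> ~432 min/month, i.e. about 7 hours
```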

Availability and latency are related concepts. We say a service is available if we receive a valid response within a pre-defined timeout, which depends on the use case. We say a service is unavailable if we receive an error or no response before the timeout. When we make trade-offs in favor of availability, it could mean improvement in two dimensions:

  • Under network partition, make the service continue processing requests
  • Under normal operation (without network partition), make the service respond faster within the client-side timeout threshold.

As discussed in the previous section, a network partition can also be modeled as a very high network delay. Then we can unite the two dimensions into one: higher availability means that operation latency is lower and less sensitive to network delay. We can gauge a system’s availability by how network delay affects the operation latency observed by the client. Here are some examples of how network delay can affect operation latency (a rough sketch follows this list):

  • Delay independent: an operation only depends on local state updates. The system is highly available.
  • LAN-delay dependent, WAN-delay independent: before completing an operation, the server communicates with its peers over the LAN.
  • WAN-delay dependent: different scopes of the network (e.g., LAN, WAN) have different average delays and distributions. WAN delay is orders of magnitude higher than LAN delay. If the correct operation of a system depends on communication across the WAN, the system is less available.
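
To see why this distinction matters, here is a back-of-the-envelope sketch. The delay constants are assumed values for illustration only; the point is that the number and the scope of network round trips dominate the latency a client observes:

```python
# Back-of-the-envelope latency for the three categories above. The delay
# numbers are assumptions for illustration, not measurements of any system.

LOCAL_US = 50        # local state update (e.g., in-memory or log append)
LAN_RTT_US = 500     # round trip inside a datacenter
WAN_RTT_US = 80_000  # round trip across regions

def operation_latency_us(lan_round_trips: int, wan_round_trips: int) -> int:
    """Rough client-observed latency for one operation."""
    return LOCAL_US + lan_round_trips * LAN_RTT_US + wan_round_trips * WAN_RTT_US

print("delay independent:  ", operation_latency_us(0, 0), "us")  # local work only
print("LAN-delay dependent:", operation_latency_us(2, 0), "us")  # e.g., in-datacenter quorum
print("WAN-delay dependent:", operation_latency_us(0, 2), "us")  # e.g., cross-region quorum
```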

Conclusion

In this article, we discussed how components fail independently in a distributed system. We built a simple mental model: use network latency to model all kinds of failures in a distributed system. We also discussed consistency models as a contract between the distributed system and the client. When designing a real-world system, we need to make smart trade-offs between consistency and availability. Consistency and availability are not binary properties; making trade-offs between them is about exploring design decisions on a continuous spectrum. Due to the size limit, we only covered basic concepts here. In Part 2, we will discuss implementation techniques for making trade-offs between consistency and availability.

Reference

  1. Linearizability, University of Cambridge, Martin Kleppmann: https://www.youtube.com/watch?v=noUNH3jDLC0
  2. Linearizability, MIT 6.824 (Lecture 7, 1:04:14), Robert Morris: https://www.youtube.com/watch?v=4r8Mz3MMivY&t=3852s
  3. Seth Gilbert and Nancy Lynch. (2002). Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33. 51–59. 10.1145/564585.564601
  4. Daniel Abadi. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer 45. 37–42. 10.1109/MC.2012.33
  5. Kleppmann, Martin. (2015). A Critique of the CAP Theorem.
  6. Brewer, Eric. (2012). CAP Twelve years later: How the “Rules” have Changed. Computer. 45. 23–29. 10.1109/MC.2012.37
  7. Jacob Gabrielson. Challenges with distributed systems, Amazon Builders’ Library
  8. Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (February 2013), 74–80. DOI:https://doi.org/10.1145/2408776.2408794
