Distributed systems are defined by the legendary Leslie Lamport as:
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
Facetious remarks aside, when talking about distributed systems we mean systems in which multiple machines take part in solving a problem. The field is complex and somewhat abstruse, making it difficult for people new to it to see the connection between theory and practice. Most college courses, in particular, focus on the theoretical aspects, lacking emphasis on how the concepts learned can be leveraged to build practical systems.
Yet the importance of understanding, if not mastering, distributed systems is more important today than ever before. Today’s big tech companies (Google, Facebook, Twitter, Amazon, etc.) operate systems that handle vast amounts of data. But even startups and small companies often need to design systems that can scale seamlessly as the number of users increases.
The aim of this post is to provide a (relatively) beginner-friendly collection of resources, mostly original papers, with about a paragraph of commentary on the high-level ideas presented in it. The papers are arranged in what seems to me an intuitive way, starting with the foundations, and progressing all the way to real-world systems.
Other reading lists of this sort have surfaced throughout the years, such as the ones from Cloudera’s Henry Robinson, and AWS’s Swami Sivasubramanian and Marc Brooker, with whom this list will have many recommendations in common. The reader is highly encouraged to consult the posts of these distinguished industry practitioners.
Needless to say, the post makes no pretense of being comprehensive, and there are many outstanding papers and resources that have been left out (such as Byzantine failures), since it would have otherwise become unwieldy. Furthermore, no blog post can be an adequate substitute for exposure to distributed systems in industry (or possibly academia). I hope, however, that this post, with its focus on concepts and their applications, will provide a good place to start for someone new to the field.
What makes distributed systems special
A good way to start with distributed systems is to get a feeling for what makes them so different from single-box systems.
- A Note on Distributed Systems: An excellent paper (from 1994), out of Sun Microsystems, a company renowned for distributed systems work. It pokes holes in the idea that local and remote interactions between objects can be treated in a unified way. A particularly important misconception is cleared up in the paper: namely, that latency is, counterintuitively, not among the most pressing problems in distributed systems. As any industry practitioner will agree, failures (in particular partial failures) pose a far greater challenge.
- Notes on Distributed Systems for Young Bloods: A great collection of practical technical advice on building distributed systems. It touches on a lot of common mistakes, and suggests the right way to go about distributed systems implementation. Among the more important tips are: avoiding process coordination whenever possible, implementing backpressure, and feature flags. A particularly important point is regarding metrics, specifically the preference for percentiles as opposed to the more intuitive averages.
- Fallacies of Distributed Computing Explained: This paper gives a high-level analysis of the well-known Fallacies of Distributed Computing. Much like the preceding two entries, it analyzes common misconceptions in distributed systems, among them assuming zero latency, a secure network, or network reliability.
- Distributed Systems for Fun and Profit: An concise introductory e-book on distributed systems, it has great chapters on time and ordering, as well as replication. In the section on time, it provides code snippets for Lamport and vector clocks. In the replication section, it gives a thorough comparison of replication approaches, and the trade-offs between consistency, latency, and throughput.
After the motivational section, the next couple of papers present some of the basic algorithms and concepts used in the design and implementation of distributed systems. They cover some of the basic problems that arise from the presence of multiple machines, namely the notion of time and ordering, routing (consistent hashing), and membership/failure detection.
- Time, Clocks, and the Ordering of Events in a Distributed System: The paper, by Leslie Lamport, addresses the problem of ordering events in distributed systems. This is important since we usually cannot rely on a global clock to provide an absolute ordering. The paper proposes the method of logical clocks, which generates a partial order of the events. Their generalization, vector clocks, are used in numerous real-world systems.
- Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications: While Chord, a relatively old peer-to-peer service, could hardly be called foundational, the paper is included here for an excellent introduction to consistent hashing. The key idea in consistent hashing is arranging machines in a virtual ‘ring’. The owner machine for a key is then determined by mapping the key to a position on the ring (usually using the modulo operation). This kind of routing is frequently used, and we’ll see examples in the ‘Putting it all together’ section. As an interesting aside, a more thorough treatment of Chord’s correctness guarantees is available here.
- A Gossip-style Failure Detection Service: An important problem in any system is detecting the failure of member nodes. While it’s possible to solve this by using a service such as ZooKeeper, but this becomes problematic as the system scales and changes in membership become far more frequent. The paper in question presents a gossip-based protocol for failure detection, in which nodes exchange their respective views of the current status of other members.
Consensus and Locking
Consensus is one of the most important problems in distributed systems, both in theoretical and practical aspects. The basic formulation of consensus is, roughly-speaking, getting a set of machines to agree on a value. Yet its most common use is state-machine replication, essentially replicating the same sequence of operations across machines in order to ensure fault tolerance.
- Any consensus paper: Paxos is the best-known consensus protocol, and is used frequently in real-world systems. Proposed by Leslie Lamport in the 1980s, it is widely perceived as complex, leading to papers with names such as Paxos Made Simple. A walk-through is available from the Paper Trail blog. Raft, another consensus protocol has been devised specifically to address this drawback. At the risk of sounding blasphemous, knowing the details of any particular consensus protocol is less important than intuitively understanding its use as a building block in distributed systems. We’ll go through examples in the ‘Putting it all together’ section.
- The Chubby lock service for loosely-coupled distributed systems: This paper describes Chubby, Google’s internal lock service, used in some of their best-known distributed systems, such as BigTable and the Google File System (GFS). Despite the somewhat counter-intuitive design decision of exposing locking as a primitive, it’s difficult to overstate its importance. The main use case is for electing leaders in leader-elected distributed applications. ZooKeeper is a widely used open-source system inspired by Chubby.
While the focus of this post is on the practical aspects of distributed systems, a couple of theorems are nonetheless important for analyzing existing systems.
- The CAP theorem: Formulated by Eric Brewer in the late 90s, it postulates the tradeoff between availability, consistency and partition tolerance (tolerance of network splits) in distributed systems. This plain English introduction to the theorem, with a practical example, is excellent. It is worth noting, however, that both the consistency and availability definitions used in the theorem are quite strong, and it’s therefore not unusual for real-world systems to fail to meet either of them. We should also keep in mind that the availability definition in CAP does not take into account latency at all, and that the definition of network partition doesn’t cover more general failure scenarios. A more formal treatment is available in Gilbert & Lynch.
- The FLP theorem: One of the fundamental theorems in distributed systems, the FLP paper (named after its authors), proves the impossibility of distributed consensus with one faulty process in an asynchronous system model (where messages can be arbitrarily delayed). While the theorem might seem to contradict the algorithms for consensus presented above, it’s important to note that the theorem only shows that it’s impossible to devise an algorithm that is guaranteed to reach consensus in every possible case. Furthermore, the system model used is less rich than in the real world. A more approachable walk-through is available here.
Putting it all together
After studying the preceding papers to get a reasonably good theoretical understanding, we can put all of the concepts together and start studying real-world distributed systems. As expected, most of the systems, as well as the corresponding papers, come from the giant tech companies.
- Dynamo: Amazon’s Highly Available Key-Value Data Store: Of all the industry papers called out in the list, this one probably gives the most bang for the buck. While it doesn’t introduce any new concepts, it ties them all together into a real-world system. The paper contains such important concepts as consistent hashing, Merkle trees, vector clocks, quorums, etc. which themselves can serve as important building blocks for other large-scale systems. It’s also notable for introducing a system that sacrifices consistency for availability (pushing reconciliation to the client). Dynamo itself is a distributed key-value database that has achieved great popularity. Note that there are significant differences between the paper and what AWS offers as DynamoDB, most importantly regarding reconciliation. Cassandra is a riff on this idea, though it relies on timestamps for conflict resolution rather than vector clocks.
- MapReduce: Simplified Data Processing on Large Cluster: Introduces the MapReduce system, as well as the programming model itself. Originating from functional programming, the model is based on two operations, map (which applies a transformation to the input data) and reduce (which aggregates the transformed data). The beauty of the model is that since each of the records in the input data set is independent, applying map to the input set can be distributed across a number of hosts, thereby parallelizing the computation. The paper starts with a simple example, calculating the words counts in a collection of documents, and touches on a couple of other interesting use cases, such as distributed grep, inverted indexes, and distributed sorting.
- Kafka: a Distributed Messaging System for Log Processing: Describes the architecture of Kafka, one of the most popular distributed message queue systems today, developed at LinkedIn. While the paper dives into the architecture and lists a couple of use cases, a more thorough exposition of the philosophy behind Kafka and its design, influenced by transaction logs in database systems, can be found in a post by co-creator Jay Kreps.
- Spanner: Google’s Globally-Distributed Database (advanced): Spanner is Google’s semi-relational database which is distributed across multiple data centers for increased availability and durability. The paper contains a lot of concepts: data is sharded across multiple Paxos clusters, with two-phase commit on top of those clusters to provide support for transactions. One of the more striking facts about the system is the way in which it handles time and ordering. While most systems work around the unreliability of clocks using such techniques as vector clocks, Spanner uses Google’s TrueTime API, which relies on GPS and atomic clocks. Spanner was recently released as part of the Google Cloud Platform. Its guarantees in terms of both availability and consistency are so strong that a separate paper was published explaining Spanner’s relation to the CAP theorem (by Brewer himself). In terms of real-world systems, CockroachDB is heavily inspired by the design.
Blogs to Follow
We’ve so far covered a lot of seminal papers, but the best way to keep up to date with developments in distributed systems is to follow the blogs of industry experts. A couple of blogs to start with include:
- https://blog.acolyer.org/ (the morning paper): The author (a former CTO of multiple companies, including VMWare) publishes a short summary of selected papers in computer science, on an (almost) daily basis. Though the papers he selects cover a broad area of computer science, there are also entries on a lot of distributed systems papers, including a lot of the ones listed in this post (e.g. Paxos made live, or Chubby). An excellent blog, even if your schedule prevents you from following it every day.
- http://highscalability.com/: A blog analyzing the architecture of many real-world distributed systems in a rather detailed way. Good examples include WhatsApp, Twitter, and YouTube. Even if the posts linked are from a couple of years ago, they still provide valuable tips about designing and implementing a large-scale distributed system.
- https://www.allthingsdistributed.com/: Written by Werner Vogels, CTO of AWS. A former professor and researcher in distributed systems (and co-creator of Dynamo), Werner covers plenty of technical topics on the blog, such as eventual consistency. He also provides extensive commentary on newly introduced AWS services and offerings (recently Amazon SageMaker).
- https://aphyr.com/posts: In one talk, the author described himself as ‘the guy who breaks databases’. Less belligerently put, he tests databases and other distributed systems to determine their failure modes. For example, CockroachDB or MongoDB.
Thanks to Alen Rakipovic and Stjepan Glavina for painstakingly reviewing and helping polish the drafts. Special thanks to Marc Brooker for valuable feedback and pointing out a couple of additional resources. Errors and omissions are purely of my own making.