A decade of distributed computing

and a brief tour of technologies


Much has happened in distributed computing in the last 10–15 years. It is worthwhile to learn about these advances to dig beyond the hype of “big data” and understand what these technologies do, how they are used, and where they are going. For a general audience, such learning can build intuition about the emerging capabilities of data engineering and why it has improved our lives in some ways (e.g., instant search, Netflix queues, relevant ads) and not others. For a technical audience, reading these papers can help develop intuition for solving related problems.

A lot of the best research is easily accessible and referenced in some great reading lists (Stanford, MIT, CMU, Rice, industry). Because “big data” often implies Hadoop, which at its core is concerned with parallel data processing, let’s start our tour there.

Data processing: It was only in 2004 that Google published the MapReduce paper. The canonical example is quickly counting how often a search term appears in a collection of documents (such as all the webpages on the internet): you map the counting work across N servers that have divided up the documents, then reduce those N partial results into a single result. This is an oversimplified example, as search in reality is much more complicated, but you can see how these kinds of operations play a role in how a search engine works, and this is the basic idea of MapReduce, which has spawned an ecosystem of Hadoop and friends. Nowadays, the new hotness is Apache Spark, a successor to MapReduce first introduced in the Resilient Distributed Datasets paper, which reports speedups of up to 40x over MapReduce thanks to its in-memory, fault-tolerant design.
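The map/shuffle/reduce phases can be sketched in a few lines of plain Python. This is a toy single-process simulation of the word-count example, not how Hadoop or Spark actually schedule work; on a real cluster the map calls run on many servers and the framework does the shuffle.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in one document shard.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Group all partial counts for the same word (the framework's job).
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    # Sum the partial counts for one word into a final total.
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(counts["the"])  # → 3
```

Because each map call touches only its own shard and each reduce call only one word's counts, both phases parallelize across machines with no coordination beyond the shuffle.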

File systems: Closely related to and necessary for MapReduce is the distributed file system. The Google File System paper describes how files are stored, replicated, and appended to. The paper itself is pretty high-level, but I appreciate the insight from the GFS engineers in a follow-up interview about design choices they have since reversed or revisited, like having only one master server and assuming large files (stored in 64 MB chunks). Just as Google MapReduce inspired Hadoop MapReduce, GFS has inspired HDFS. In both cases, it is much easier to directly access and learn about the open-source variants.

Internally within Google, there is a successor to GFS called Colossus, for which there is very little information published. It would not surprise me if its design eventually influences the direction of the open-source world, but for now the Hadoop ecosystem is still learning to drive business value from HDFS architectures.

Databases: Distributed databases, especially non-relational, have proliferated. There are too many successful databases to do them all justice, so I will only mention a few that happen to follow a progression.

Google’s BigTable builds on GFS; it is “a sparse, distributed, persistent multidimensional sorted map.” The example in the paper shows how BigTable can be used to store newer versions of webpages as they are updated. Spanner is “a scalable, multi-version, globally-distributed, and synchronously-replicated database.” Reading this paper really makes you feel how planet-scale Google is, particularly the parts on how Spanner can move data between datacenters to balance resource usage, and how it guarantees globally-consistent reads and writes by using synchronized atomic clocks.

F1 builds on Spanner to combine the scalability of NoSQL databases with the consistency of relational databases. Internally, Google migrated some major applications away from MySQL to F1. This sounds like a really awesome database, and its architecture is conceptually intuitive; I bet the diagram in the paper does not even make your head spin.

Consensus algorithms: Consensus is a fun topic of how multiple servers can elect one among them to be the leader that coordinates how they all keep track of data. This topic is more fun if you start with the very learnable Raft, and less fun if you start with Paxos. In fact, Raft was conceived as a response to how notoriously challenging Paxos is for the brain, and I find that I can describe the Raft protocol even if I have not reviewed it in months. Many of the best things are elegant in design. Implementations of consensus protocols are used in other things we have already mentioned, like BigTable and Spanner.
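The core of Raft’s leader election really is describable from memory: a candidate increments its term, votes for itself, and asks peers for votes; each peer grants at most one vote per term; a majority wins. Here is a toy sketch of just that voting rule, ignoring everything that makes Raft a real protocol (RPCs, randomized election timeouts, and the log-freshness check on votes):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.term = 0
        self.voted_for = {}  # term -> candidate we voted for in that term

    def request_vote(self, term, candidate):
        # Adopt any newer term we see, per Raft's term rules.
        if term > self.term:
            self.term = term
        # Grant the vote only if we haven't voted in this term yet
        # (or already voted for this same candidate).
        return self.voted_for.setdefault(term, candidate) == candidate

def run_election(candidate, peers):
    candidate.term += 1
    term = candidate.term
    candidate.voted_for[term] = candidate.name
    votes = 1  # the candidate votes for itself
    for peer in peers:
        if peer.request_vote(term, candidate.name):
            votes += 1
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2  # leader only with a majority

nodes = [Node(n) for n in "abcde"]
print(run_election(nodes[0], nodes[1:]))  # → True: uncontested candidate wins
```

The one-vote-per-term rule is what makes split brain impossible: two candidates in the same term cannot both collect a majority, because no node votes twice.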

Consistent hashing: Another problem that arises when storing data across many servers is efficiently finding which server holds some data of interest; checking each server one-by-one does not scale. Chord uses consistent hashing to place both keys and nodes on a ring, and each node stores a “finger table” that points lookups to the right section of the ring, so a query finishes in O(log N) hops. Consistent hashing also underlies Amazon’s Dynamo key-value store. As with many technologies, the story does not end when the paper is published: Pamela Zave later used lightweight formal modeling to find and correct subtle bugs in Chord’s ring maintenance that had until then been missed by reviewers, and engineers at Amazon have similarly used the specification language TLA+ to catch subtle bugs in their own distributed systems.
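The placement rule at the heart of both Chord and Dynamo fits in a few lines: hash keys and nodes onto the same ring, and store each key on its “successor,” the first node clockwise from the key. This is a minimal sketch with a made-up 16-bit identifier space and hypothetical server names; it walks the sorted node list directly, where Chord would use finger tables to find the successor in O(log N) hops:

```python
import hashlib

RING_SIZE = 2 ** 16  # toy identifier space; Chord's paper uses 160 bits

def ring_hash(name):
    # Map a key or node name to a position on the ring.
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest[:2], "big") % RING_SIZE

def successor(key_id, node_ids):
    # First node at or after the key's position, wrapping around the ring.
    candidates = sorted(node_ids)
    for node_id in candidates:
        if node_id >= key_id:
            return node_id
    return candidates[0]

nodes = {ring_hash(f"server-{i}"): f"server-{i}" for i in range(4)}
owner = nodes[successor(ring_hash("user:42"), nodes)]
print(owner)  # whichever server's hash follows the key's hash on the ring
```

The payoff of this scheme is what happens on membership change: adding or removing a server only reassigns the keys between it and its ring neighbor, instead of rehashing everything, which is exactly why it is called consistent hashing.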

Lots more: ZooKeeper is a popular tool for coordinating other distributed servers. Its internals involve atomic broadcast, to reliably send messages to all recipients, as well as leader election. (The internal workings of ZooKeeper are something I wish I understood better.) It has been successfully used in a lot of related technologies, like Apache Kafka. This article by the creator of Kafka is another great read that focuses on the log, an unassuming afterthought in many papers, as the centerpiece for understanding distributed computing. My favorite quote: “The purpose of the log here is to squeeze all the non-determinism out of the input stream.” Squeeze all the non-determinism, I gotta remember that one.
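The “squeeze out the non-determinism” idea is easy to demonstrate: once updates are captured as an ordered log, any replica that replays the log reaches the same state, no matter when it replays. The commands below are hypothetical, just enough to show deterministic replay:

```python
# An append-only log of commands is the single source of truth.
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 7)]

def apply_entry(state, entry):
    # Deterministically apply one log entry to a replica's state.
    op, key, value = entry
    if op == "set":
        state[key] = value
    elif op == "incr":
        state[key] = state.get(key, 0) + value
    return state

replica_a, replica_b = {}, {}
for entry in log:           # replica A replays immediately
    apply_entry(replica_a, entry)
for entry in log:           # replica B replays later, in the same order
    apply_entry(replica_b, entry)
print(replica_a == replica_b)  # → True: same log, same final state
```

All the hard distributed-systems work (consensus, atomic broadcast) goes into agreeing on the log’s order; once that order exists, keeping replicas identical is just deterministic replay.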

And there is plenty more; I haven’t even mentioned Bitcoin.

I don’t always read papers, but this is how to dig into a field with academic depth. It feels extremely encouraging to be able to follow along with reference-heavy talks such as “The Road to Akka Cluster and Beyond” by the creator of Akka. I also take further encouragement from Andrew Ng:

“If you seriously study half a dozen papers a week and you do that for two years, after those two years you will have learned a lot…. Chances are what you learned studying all Saturday won’t make you that much better at your job the following Monday. There are very few, almost no short-term rewards for these things. But it’s a fantastic long-term investment.”