How I am learning distributed systems

The Web being full of “how to…” type articles, I’ll start this one from the premise that what works for me might not work for you, hence the title “How I am…”.

This is an article about how I am currently being enlightened with regard to the craft of designing distributed systems, having spent the last few years focused mostly on concurrency (on a single machine).

But don’t worry, I’ll slip into the presumptive style soon enough.

How to start learning about distributed systems?

Requirement number one: spend about 4 years working on a large and complicated concurrent system.

This is not meant as a discouragement, it’s just my own experience.

Just four years ago, I remember reading all sorts of articles about distributed systems, and while I understood the English, I simply didn’t get the logic.

The reason: up to that point, I had only worked on sequential programs, in particular those MVC web apps where you use a SQL database as a massive lock around all your app’s state (and you’re blissfully unaware of that fact).

After that, I contributed to a large concurrent system for about 4 years. And I think it’s only in the fourth year that something started to change.

What changed is that when reading about a distributed system, I actually understood the logic of what I was reading. That’s not to say I completely understood everything, just that, apparently, some intuition had been built up over the previous years, and it helped (it also made me realize how much I hadn’t gotten it previously).

The relationship between “concurrent” and “distributed”

There are two things binding concurrency and distribution:

  1. The code running on a single machine as part of a distributed system is almost always concurrent. You usually need to interface with the network, run some separate business logic, run some timers, and so on. Stuff where concurrency is usually a handy tool.
  2. A distributed system is inherently concurrent, but it is also something more: a system where the concurrent units (themselves containing further concurrent sub-units, see 1) can go away, lose their state, and rejoin at any time.

If a sequential program is one dimensional, then a concurrent one would be two dimensional, and a distributed one would add a third dimension.

How so?

When writing a concurrent program, the logic needs to be robust to the potentially parallel execution of the program’s concurrent units, on the same machine but across potentially multiple CPUs.

When writing a distributed program, the logic also needs to be robust to the now guaranteed (since by definition different machines mean different CPUs) parallel execution of the different program units. And on top of that, the logic needs to be robust to those units of execution crashing independently, or disconnecting from each other (and later maybe reconnecting).

It’s basically “concurrency” + “deal with crashes” + “deal with network disconnects/reconnects” (to which might be added “deal with malicious or erroneous participants”, but I haven’t gotten to that part yet).

So, experience with concurrency would seem like a sensible prerequisite for distributed systems, since the code you’ll write will almost certainly be concurrent, and the system-wide logic will be “concurrent plus”.

Now, if you currently lack experience with concurrency, you can look for a large and complicated concurrent system, contribute to it for several years, and then come back to this article.

Those in more of a hurry can also just proceed to the next section.

How to really start learning about distributed systems?

Since 2014, unbeknownst to most, we have been living in a Brave New World, one in which mere mortals can begin to understand distributed consensus.

All of that is thanks to Raft, a consensus algorithm that comes with the novel property of being understandable.

The whole “Raft makes consensus finally understandable” thing is a bit of a meme; Paxos was already quite simple. Raft may indeed be a bit easier to get started with, at least for programmers used to thinking in a procedural fashion. Mathematicians and logicians may still prefer Paxos.

Therefore, your quest to learn about distributed systems will start with Raft.

More specifically, it will start with a series of articles explaining how to implement Raft in Go, and the repo that comes with it.

I can’t overstate the helpfulness of those, especially that of the extensive test suite, which will allow you to change the code (always a good way to learn) and verify the (in)correctness of your changes. It will also be a great way, if like me you haven’t already, to learn Go, indeed an excellent language for all things distributed.
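
To give a flavor of what the series walks you through, here is a minimal sketch of the RequestVote RPC types from Figure 2 of the Raft paper, plus a randomized election-timeout loop. The names and the server type below are my own, not the code of that series; treat it as an illustration of the shape of the thing, not a reference implementation.

```go
package raft

import (
	"math/rand"
	"time"
)

// RequestVoteArgs and RequestVoteReply mirror the RPC arguments and
// results described in Figure 2 of the Raft paper.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // current term, so the candidate can update itself
	VoteGranted bool // true means the candidate received this vote
}

// server stands in for whatever struct holds a node's Raft state.
type server struct{}

func (s *server) startElection() {
	// In a real implementation: increment the current term, vote for
	// yourself, and send RequestVote RPCs to all peers.
}

// electionTimeout returns a randomized timeout; the randomization is
// what keeps candidates from endlessly splitting the vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runElectionTimer sketches a follower's timer loop: if no heartbeat
// arrives before the timeout fires, become a candidate.
func (s *server) runElectionTimer(heartbeats <-chan struct{}) {
	for {
		select {
		case <-heartbeats:
			// The leader is alive; loop around and rearm the timer.
		case <-time.After(electionTimeout()):
			s.startElection()
			return
		}
	}
}
```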

What this will teach you is:

  1. What consensus is.
  2. How to implement it.
  3. How to apply it to solve the problem of replicating state machines.

And that is really useful in order to get an intuition for the subject that goes beyond the English found in the Wikipedia entry.
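
Point 3 in particular is worth making concrete: the end result is a replicated state machine, where every node applies the same committed log entries, in the same order, to a deterministic state machine, and therefore ends up with the same state. Here is a minimal sketch of that apply loop, using a toy key/value store of my own invention (not code from the series):

```go
package raft

// Command is whatever operation the state machine understands; here,
// a toy key/value "set".
type Command struct {
	Key, Value string
}

// kvStateMachine is deterministic: feed every node the same sequence
// of committed commands and they all end up with the same data.
type kvStateMachine struct {
	data map[string]string
}

func newKVStateMachine() *kvStateMachine {
	return &kvStateMachine{data: make(map[string]string)}
}

// applyCommitted consumes entries that consensus has marked as
// committed, in log order. This is the only way the state changes.
func (m *kvStateMachine) applyCommitted(committed <-chan Command) {
	for cmd := range committed {
		m.data[cmd.Key] = cmd.Value
	}
}
```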

What’s next?

Consensus might be one of the fundamental building blocks of distributed systems. A bit like atomic operations are the foundation of concurrent systems, one might say.

And just like in concurrent systems, the large majority of distributed code does not (and should not) use these basic building blocks directly.

For example, when starting work on a concurrent system, the first thing you say is not “let’s write our own mutex!”. And so with distributed systems, you will rarely implement your own consensus service, instead using the API offered by some other component, one that might use consensus internally to provide certain guarantees to its users.

You can read a few seminal papers about these kinds of components, such as:

  1. Chubby (and the experience of engineering/running it).
  2. Zookeeper.

How can these kinds of components be used?

Let’s say you are running a system where one machine is receiving client requests, performing some work, and responding to the client with the results.

Now let’s say you want to cache this result for subsequent client requests, and as a bonus you want to make your system highly available, operating with little to no downtime.

To start caching, you could keep results in memory on the machine performing the work. Obviously, though, if the program crashes, you lose the data.

How about storing it on disk? That would work until the disk is somehow destroyed, or until the machine is disconnected from the network and cannot handle incoming requests. Also, even if the data is not lost forever when the program crashes, your system would still be down until it reboots.

So one way to improve the situation would be to replicate a cache of the results on multiple machines. There’s a concept aptly called replication, and in this case it could involve having one machine act as the “leader”, with a set of “followers”. The leader would be the one handling client requests, performing the necessary work, and having the followers replicate this work. This could make it both harder to lose the data, and could also result in less downtime, since followers could also act as a “hot standby”.

Note: this could obviously increase the latency of handling a given client request, although whether that matters depends on your requirements.

For example, if all you want is to cache (on a “best effort” basis) the results of the work performed by the leader, you could first reply to the client, and then send the result to the replicas, thereby not actually influencing the latency of the request (although maybe lowering throughput a bit).

On the other hand, let’s say that once you’ve replied to the client, you absolutely want to make sure the work will not be performed again for the same request. Then you would first have to replicate the results, and only respond to the client once, say, a majority of followers have replicated them. That would definitely increase the latency of handling the client request, but it could allow you to give clients a guarantee that multiple requests for the same resource are idempotent, or that “work performed” is not lost (say the client wants to increment a counter reliably and consistently).
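
Here is a rough sketch of those two trade-offs, written against a made-up Replica interface (nothing below maps to a specific library or to the system described above): best-effort replication fires and forgets, while the stronger guarantee waits for a majority of acknowledgements before the leader replies to the client.

```go
package replication

import "fmt"

// Replica is a hypothetical follower that can store a cached result.
type Replica interface {
	Store(key, value string) error
}

// replicateBestEffort ships the result to the followers in the
// background: client latency is unaffected, but a crash can lose the
// cached result.
func replicateBestEffort(replicas []Replica, key, value string) {
	for _, r := range replicas {
		go func(r Replica) {
			_ = r.Store(key, value) // errors are tolerated: best effort
		}(r)
	}
}

// replicateMajority blocks until a majority of followers have
// acknowledged the write, so the result survives the leader crashing,
// at the cost of extra latency before replying to the client.
func replicateMajority(replicas []Replica, key, value string) error {
	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r Replica) { acks <- r.Store(key, value) }(r)
	}
	needed := len(replicas)/2 + 1
	acked := 0
	for i := 0; i < len(replicas); i++ {
		if err := <-acks; err == nil {
			acked++
			if acked >= needed {
				return nil
			}
		}
	}
	return fmt.Errorf("only %d of %d replicas acknowledged the write", acked, len(replicas))
}
```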

And how would you do this “leader election”?

You could enthusiastically write the code yourself, for example using your newfound knowledge of Raft.

However, that would mean additional error-prone code, and it might also result in extended downtime in practice: upon leader failure, you would first need a replica to realize the leader is down, then start an election, then wait for the election to complete, and only then would you be able to respond to client requests (maybe you could simply start responding without caching, and start caching once the election completes; yeah, that’s the sort of stuff you start thinking about…).

Another way could be to use Zookeeper for that. Indeed, the Zookeeper documentation has a recipe that starts with the words: “A simple way of doing leader election with ZooKeeper is to…”.

By the way, that page also mentions a bunch of other things you could do using Zookeeper, and they’re all eerily similar to concepts in concurrent programming. In fact, it almost reads as if using Zookeeper were just “regular” concurrent programming, and that’s mainly because Zookeeper hides the consensus aspect from the programmer.
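
To make that recipe concrete, here is a sketch of the leader election pattern it describes, written against a deliberately hypothetical, stripped-down client interface (it is not the API of any real Zookeeper library): every candidate creates an ephemeral sequential znode, the lowest sequence number wins, and every other candidate watches the znode just below its own.

```go
package election

import "sort"

// ZKClient is a hypothetical, stripped-down view of a ZooKeeper
// client, defined here only to illustrate the recipe.
type ZKClient interface {
	// CreateEphemeralSequential creates an ephemeral, sequential znode
	// under dir and returns its generated name (e.g. "n_0000000042").
	CreateEphemeralSequential(dir string, data []byte) (name string, err error)
	// Children lists the znode names under dir.
	Children(dir string) ([]string, error)
	// WaitForDeletion blocks until the given znode is deleted.
	WaitForDeletion(path string) error
}

// ElectLeader returns once this candidate has become the leader. The
// candidate with the lowest sequence number leads; everyone else
// watches the znode immediately preceding its own, so that a crash
// (which deletes the ephemeral znode) wakes up exactly one successor
// instead of the whole herd.
func ElectLeader(zk ZKClient, dir string) error {
	me, err := zk.CreateEphemeralSequential(dir, nil)
	if err != nil {
		return err
	}
	for {
		children, err := zk.Children(dir)
		if err != nil {
			return err
		}
		sort.Strings(children)
		if len(children) > 0 && children[0] == me {
			return nil // lowest sequence number: we are the leader
		}
		// Find the candidate immediately preceding us, wait for it to
		// go away, then re-check who leads.
		predecessor := ""
		for _, c := range children {
			if c < me {
				predecessor = c
			}
		}
		if err := zk.WaitForDeletion(dir + "/" + predecessor); err != nil {
			return err
		}
	}
}
```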

And for when it gets Kafkaesque…

The next system I recommend looking into is called Kafka. Its main use goes by the catchphrase “event streaming”, and saying that it’s like a “distributed message queue on steroids” would be doing it a disservice.

And as you start reading up on its design, it should come as no surprise by now to find out that Kafka actually uses Zookeeper to coordinate certain things.

And the interesting part is the plan to remove that dependency and re-implement it using a tweaked version of Raft.

The design docs around this plan are perhaps some of the most interesting live discussions on “applied distributed systems” you can find out there for now (let me know if you find better).

It also happens to be a great compliment to Raft, since one of its design goals is making it easier for people to tweak it while maintaining correctness (because it’s relatively easy to understand, it’s also relatively easy to change without breaking), and a large project like Kafka doing exactly that must mean something.

The beginning or the end

I can’t actually recommend this one yet, since I haven’t read it (beyond freely available previews), but it appears absolutely awesome and I’m saving it for the right occasion: “Introduction to Reliable and Secure Distributed Programming”, by Cachin, Guerraoui, and Rodrigues.

Update: I still haven’t bought the book, and in the meantime the lecture notes of a course at Cambridge taught by Martin Kleppmann are absolutely great: https://www.cl.cam.ac.uk/teaching/2021/ConcDisSys/dist-sys-notes.pdf

And with this, we conclude our introduction to how to start learning about distributed systems.

