How I am learning distributed systems

Oct 28, 2020

This is an article about how, having previously focused on concurrent systems, I am learning the craft of designing distributed systems.

The web is full of “how to…” articles, so I’ll start this one from the presumption that what works for me might not work for you; hence the title beginning with “How I am…”.

But don’t worry, I’ll slip into the presumptive style soon enough.

How to start learning about distributed systems?

To start, I’d say you should spend about 4 years working on a large and complicated concurrent system, and this is not meant as a discouragement.

Before working with concurrency, I read about distributed systems, understood the English, and had all sorts of ideas about “scalability”. But, as I later found out, I simply didn’t get the logic. Knowing only sequential programs — MVC web apps where I used a SQL database as a massive lock around my app’s state, while remaining blissfully unaware of it — I lacked intuition about the logic of distributed systems.

After that, I contributed to a large concurrent system for about 4 years, and something started to change only in the fourth year: from then on, when reading about distributed systems, I would understand (some of) the logic of the text. Why would understanding concurrent systems be a good starting point for understanding distributed ones?

The relationship between “concurrent” and “distributed”

If a sequential program is one-dimensional, adding concurrency to it adds a second dimension, and running copies of the program on multiple computers, which makes it a distributed system, adds a third.

When writing a concurrent program, the logic of the program needs to be robust to potential parallel execution — on the same machine but with potentially multiple CPUs — of the concurrent units of the program.
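To make “robust to potential parallel execution” concrete, here is a minimal sketch in Python. The `Counter` class and the `run` helper are hypothetical names invented for this illustration; the point is only that a read-modify-write step on shared state needs protection once several threads can reach it.

```python
import threading


class Counter:
    """A shared counter whose increment is a read-modify-write step."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the same old value,
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def run(workers=4, increments=10_000):
    """Start `workers` threads that each increment a shared counter."""
    counter = Counter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


total = run()
```

With the lock in place, `total` always equals `workers * increments`. Whether an unprotected version actually loses updates depends on the interpreter and the timing, which is exactly why such bugs are hard to find by testing alone.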

When writing a distributed program, the logic also needs to be robust to parallel execution (multiple computers imply multiple CPUs), and on top of that, it needs to be robust to those computers crashing or disconnecting from each other, and later perhaps reconnecting or restarting.
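One way to picture that extra requirement is a request whose reply gets lost: the caller cannot tell whether the work was done, so it retries, and the receiver has to cope with seeing the same request twice. The sketch below simulates this in a single process; `Account`, `LossyLink` and `deposit_with_retry` are hypothetical names made up for the illustration, not part of any real library.

```python
class Account:
    """Server-side state that remembers which request ids it already applied."""

    def __init__(self):
        self.balance = 0
        self.applied = {}  # request id -> balance right after applying it

    def deposit(self, request_id, amount):
        # A retried request must not be applied a second time.
        if request_id in self.applied:
            return self.applied[request_id]
        self.balance += amount
        self.applied[request_id] = self.balance
        return self.balance


class LossyLink:
    """Simulated network: delivers every request but drops the first replies."""

    def __init__(self, server, replies_to_drop):
        self.server = server
        self.replies_to_drop = replies_to_drop

    def call(self, request_id, amount):
        result = self.server.deposit(request_id, amount)
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            raise TimeoutError("reply lost")
        return result


def deposit_with_retry(link, request_id, amount, attempts=5):
    """Retry with the same request id until a reply arrives."""
    for _ in range(attempts):
        try:
            return link.call(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after %d attempts" % attempts)


account = Account()
link = LossyLink(account, replies_to_drop=2)
final_balance = deposit_with_retry(link, "req-1", 100)
```

The request reaches the server three times here, yet the balance ends at 100 because the server deduplicates by request id. This only covers lost replies; a server that crashes and loses `applied` would need that table to survive restarts as well.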

Therefore, experience with concurrency would seem like a sensible prerequisite for distributed systems: the code you’ll write will almost certainly be concurrent, and the system-wide logic will be concurrent plus.

Now, if you lack experience with concurrency, you can look for a large and complicated concurrent system that is open-source, contribute to it for several years, and then come back to this article.

Those in more of a hurry can also proceed to the next paragraph.

How to really start learning about distributed?

Since 2014, unknown to most, we have been living in a Brave New World: one in which mere mortals can begin to understand distributed consensus. All of that is thanks to Raft, a consensus algorithm whose novelty is being understandable to coders.

Therefore, your quest to learn about distributed systems will start with Raft. More specifically, it will start with a series of articles explaining how to implement Raft in Go, and the repo that comes with it.

I can’t exaggerate the helpfulness of those resources, especially that of the extensive test suite, which will allow you to change the code and assure yourself of the (in)correctness of your changes.

What this will teach you is:

What is consensus.
How to implement it.
How to apply it to solve the problem of replicating state machines.

And that will give you an intuition for the subject that goes beyond the English found in the Wikipedia entry.
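As a rough picture of what “replicating state machines” means, the sketch below assumes that consensus has already done its job and every replica holds the same log. `KVStateMachine` and the command format are hypothetical choices for this example and are not taken from the Raft articles or their repo.

```python
class KVStateMachine:
    """A deterministic key-value state machine driven by a log of commands."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0

    def apply(self, index, command):
        # Entries must be applied in log order, each exactly once.
        if index != self.last_applied + 1:
            raise ValueError("expected index %d, got %d"
                             % (self.last_applied + 1, index))
        op, key, value = command
        if op == "set":
            self.data[key] = value
        elif op == "delete":
            self.data.pop(key, None)
        else:
            raise ValueError("unknown operation: %r" % (op,))
        self.last_applied = index


# The log that consensus is assumed to have agreed on.
log = [
    ("set", "x", 1),
    ("set", "y", 2),
    ("delete", "x", None),
    ("set", "y", 3),
]

replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for index, command in enumerate(log, start=1):
        replica.apply(index, command)

states = [replica.data for replica in replicas]
```

Because `apply` is deterministic and every replica sees the same commands in the same order, all three end up in the same state. Getting the replicas to agree on that log despite crashes and lost messages is the part that a consensus algorithm such as Raft provides.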

What’s next?

If you are a programmer, Raft probably felt relatively easy to grasp. It provided you with a complete imperative solution to the problem of consensus — the kind of laundry list programmers are comfortable with.

Now it is time to break out of the coding box, and enter the realm of logic and abstract thinking. For that, I will provide you with an imperative laundry list myself:

1. Learn the basics of TLA+: study the first 7 chapters of “Specifying Systems”, and then write and verify some basic models using the TLC model checker in the TLA+ Toolbox (example).
2. Peruse “Time, Clocks, and the Ordering of Events in a Distributed System”. This should give you a sense that a distributed system is an abstract entity going beyond a bunch of code.
3. Go through the excellent CSE 128 Spring 2005 course by Keith Marzullo.
4. Study “Paxos Made Simple”, and then the original Paxos paper.
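For a taste of what “Time, Clocks, and the Ordering of Events in a Distributed System” is about, here is a minimal sketch of a logical clock. The class and method names are my own and hypothetical; the two rules they encode are that a process advances its clock between events, and that on receiving a message it moves its clock past the timestamp the message carries.

```python
class LamportClock:
    """A logical clock: a counter that orders events without wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Advance the clock for a local event.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the message carries the resulting timestamp.
        return self.tick()

    def receive(self, timestamp):
        # Move past both our own clock and the sender's timestamp.
        self.time = max(self.time, timestamp) + 1
        return self.time


a = LamportClock()
b = LamportClock()

a.tick()                        # a's clock: 1
a.tick()                        # a's clock: 2
stamp = a.send()                # a's clock: 3, carried by the message
b.tick()                        # b's clock: 1
received_at = b.receive(stamp)  # b's clock: max(1, 3) + 1 = 4
```

The receive event gets a larger timestamp than the send event even though `b` had seen fewer events than `a`. That is the property the clocks guarantee: if one event could have influenced another, it carries the smaller timestamp.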

At this point you may have an epiphany: a realization that Paxos describes the same concepts as Raft. And yes, leader election in Paxos is left as an exercise to the reader, and that’s ok.
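For comparison with the Raft code, here is a rough single-process sketch of the two phases of single-decree Paxos as I read them in “Paxos Made Simple”. `Acceptor` and `propose` are hypothetical names, messages are plain function calls that never get lost, and proposal numbers are assumed to be unique and chosen by the caller, so this is a thinking aid rather than an implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, number):
        # Phase 1: promise to ignore lower-numbered proposals and
        # report what, if anything, was accepted before.
        if number > self.promised:
            self.promised = number
            return True, self.accepted
        return False, None

    def accept(self, number, value):
        # Phase 2: accept unless a higher-numbered prepare was promised.
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False


def propose(acceptors, number, value):
    """Run both phases; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1

    replies = [acceptor.prepare(number) for acceptor in acceptors]
    granted = [prior for ok, prior in replies if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt the
    # value of the highest-numbered such proposal instead of its own.
    prior_accepts = [prior for prior in granted if prior is not None]
    if prior_accepts:
        value = max(prior_accepts, key=lambda prior: prior[0])[1]

    votes = sum(acceptor.accept(number, value) for acceptor in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "red")
second = propose(acceptors, 2, "blue")
stale = propose(acceptors, 1, "green")
```

The second proposer wanted “blue” but ends up re-proposing “red”, because a value had already been chosen; the stale proposal is rejected outright. Note that nothing here decides who gets to propose, which is the leader election left to the reader.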

You should also find comfort in this way of thinking. It was not a precise laundry list that you needed, but clarity of thinking about the topic at hand.

If you haven’t reached this way of thinking by the time you reach the last item, go to 1 and continue looping until that condition has been met.

The end, or the beginning?

I taught myself programming 10 years ago.

Since then, every five years I have been running into something new that feels like having to learn programming all over again. After using Python for five years — using Django to write MVC web applications — I learned Rust and programming in the large.

And now, five years later, I find learning TLA+ just as hard. The hard part is not the new syntax, but learning to think in a different way. In this particular case, learning to think in terms of logic — in the form of simple math — as opposed to code.

So, you may still be wondering: how do you learn about distributed systems? Although I hope this article can help, you’re going to have to figure it out for yourself. To inspire you, I’d just like to quote Leslie Lamport, who, in the introduction to his book “Specifying Systems”, wrote:

If your exposure to C++ hasn’t destroyed your ability to think logically, you should have no trouble filling any gaps in your mathematics education.

and change it into:

If your exposure to coding hasn’t destroyed your ability to think logically, you should have no trouble learning distributed systems.

And finally, thank you for reading.

Interested in distributed systems? Read more of my articles:

Understand Paxos with Rust, Automerge, and TLA+ — Part 1: The Synod.

Understand Viewstamped Replication with Rust, Automerge, and TLA+

Distributing Lamport’s bakery with Automerge, and a touch of TLA+

Want to start learning about programming (and not just coding)? Read my series of articles, starting with the foreword.

Python for Youngsters

Foreword: Programming in the Age of Artificial Intelligence

Written by Gregory Terzian