A course outline for Distributed Systems

Coder At Work · Published in Student Voices · 4 min read · Sep 2, 2016

I believe that a majority of large-scale distributed systems in use today are of the following type: a collection of independent processes connected by a network and interacting by passing messages. Processes may fail and restart. Messages may lag, be dropped, or be duplicated, but are never corrupted.

Programming model
Thus the most suitable programming model is that of asynchronous message passing programs with "fail-recover" fault tolerance. Table 1 shows some of the different process, communication and fault types described in the literature (Nancy Lynch, Attiya & Welch). You can construct different types of distributed systems by picking one property from each row of the table, e.g. a synchronous, shared-memory, fail-recover system.

Table 1:
Process types:
Synchronous / Asynchronous
Communication types: Shared memory / Message passing
Fault types: Fail stop / Fail recover / Byzantine
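As a toy illustration of the fail-recover model from the table, the sketch below (a hypothetical single-process simulation, not a real networked system) persists incoming messages to a durable log so that volatile state can be rebuilt after a crash:

```python
class Process:
    """A fail-recover process: volatile state is lost on a crash,
    but anything written to the durable log survives."""
    def __init__(self, pid):
        self.pid = pid
        self.volatile = {}   # in-memory state, wiped by a crash
        self.log = []        # stands in for durable storage

    def on_message(self, msg):
        self.log.append(msg)                    # persist before acting
        self.volatile[msg["key"]] = msg["value"]

    def crash_and_recover(self):
        self.volatile = {}                      # crash wipes memory
        for msg in self.log:                    # recovery replays the log
            self.volatile[msg["key"]] = msg["value"]

p = Process("p1")
p.on_message({"key": "x", "value": 1})
p.on_message({"key": "x", "value": 2})
p.crash_and_recover()
assert p.volatile["x"] == 2   # state rebuilt from the log
```

Real systems replace the in-memory `log` list with a write-ahead log on disk, but the recovery-by-replay pattern is the same.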

Algorithms
Within this asynchronous message passing model, it is necessary to have an understanding of algorithms for the following problems: mutual exclusion, consensus, leader election, fault detection and handling, and consistency models. The communication complexity of each algorithm needs to be kept in mind, but only so far as to avoid pathological choices, since much of the communication happens within a bounded network.
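As one example from this list, here is a minimal leader-election sketch in the style of the bully algorithm (a deliberate simplification: a real implementation exchanges ELECTION/OK/COORDINATOR messages over the network instead of inspecting a shared liveness table):

```python
def bully_elect(process_ids, alive):
    """Bully-style election: the highest-ID process that is
    still alive becomes the leader; None if all have failed."""
    candidates = [p for p in process_ids if alive[p]]
    return max(candidates) if candidates else None

# Process 3 has failed, so the highest-numbered live process wins.
alive = {1: True, 2: True, 3: False}
leader = bully_elect([1, 2, 3], alive)
assert leader == 2
```

The interesting difficulty in practice is exactly what this sketch hides: in an asynchronous system, "alive" can only be approximated by a fault detector, which is why election and fault detection appear together in the list above.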

Formal methods
The complexity and subtlety of distributed algorithms call for more formal methods to verify their design, going beyond manual tracing and writing tests. Process algebras have been invented to reason about them. Languages like TLA+ are starting to be used as well (How AWS uses formal methods).
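To give a flavour of what such tools do, the toy model checker below (a hand-rolled sketch, not TLA+ or a process algebra) exhaustively explores the state space of a naive two-process mutual-exclusion "protocol" and finds the invariant violation:

```python
def step(state):
    """Yield all successor states. Each process cycles
    idle -> trying -> critical -> idle with no locking at all."""
    nxt = {"idle": "trying", "trying": "critical", "critical": "idle"}
    for i in (0, 1):
        s = list(state)
        s[i] = nxt[s[i]]
        yield tuple(s)

def check_mutex():
    """Exhaustively explore reachable states, as a model checker
    would, and return the first state violating mutual exclusion."""
    seen, frontier = set(), [("idle", "idle")]
    while frontier:
        state = frontier.pop()
        if state in seen:
            continue
        seen.add(state)
        if state == ("critical", "critical"):
            return state          # invariant violated
        frontier.extend(step(state))
    return None

violation = check_mutex()
assert violation == ("critical", "critical")  # the naive protocol is broken
```

TLA+ performs the same kind of exhaustive search over a specification rather than code, which is why it catches design bugs that testing individual traces misses.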

Emergent behaviour
Beyond correctness, such distributed systems exhibit "emergent" behaviour at scale. This can include performance degradation and failures caused by unforeseen changes in system structure or system load, and by feedback loops (see the section titled "What Formal Specification Is Not Good For" in the AWS paper above). In messaging systems, such issues are usually dealt with using knowledge of queueing theory and statistics. That knowledge can also be used for capacity planning and for keeping the system operating smoothly.
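As a small illustration of why queueing theory matters here, the standard M/M/1 formulas show how steeply queueing grows with utilization (illustrative numbers only; the model assumes Poisson arrivals and exponential service times):

```python
def mm1_stats(arrival_rate, service_rate):
    """Steady-state M/M/1 metrics. Requires arrival_rate < service_rate."""
    rho = arrival_rate / service_rate   # utilization
    L = rho / (1 - rho)                 # mean number in system
    W = L / arrival_rate                # mean time in system (Little's law)
    return rho, L, W

# Doubling load from 40% to 80% utilization multiplies the mean
# number of queued requests sixfold (0.67 -> 4.0), not twofold.
rho1, L1, W1 = mm1_stats(40, 100)
rho2, L2, W2 = mm1_stats(80, 100)
```

This nonlinearity is exactly the kind of emergent behaviour that surprises teams who do capacity planning by linear extrapolation.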

Techniques from physics may also be applicable; an exposure to complex-system phenomena as they are studied in physics might come in handy.

Team work
For the foreseeable future, large scale distributed systems are going to be built by teams of people co-operating with each other. So the human aspect of software development cannot be ignored. To retain a holistic picture of project dynamics, some aspects of “systems thinking” can be applied.

History and evolution
A passing knowledge of the history of the field is nice to have. If the developer wishes to go beyond what the textbooks provide, there are many collections of papers available online (Christopher Meiklejohn, Dan Cres, a thread on Quora). However, it is advisable not to use them as the primary means of learning about distributed computing.

Implementation
At some point, the above algorithms have to be realized in code. The language of implementation needs to have good support for message passing, asynchronous programming, libraries related to distributed system primitives (Slide 17 from this AWS presentation), networking, messaging, and error handling. It is important not to overlook traditional software engineering concerns like type safety, object orientation, modular design, immutability and so on. The programmer should be able to address relevant systems programming issues as they arise — process scheduling, networking, I/O and so on.
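A minimal sketch of what such support looks like, using Python's asyncio for asynchronous message passing with timeouts and retries (hypothetical message format; note that retrying can duplicate requests, matching the "duplicated but never corrupted" model above):

```python
import asyncio

async def server(req_q, resp_q):
    """Toy server: reads requests and replies with doubled payloads."""
    while True:
        msg = await req_q.get()
        await resp_q.put({"id": msg["id"], "result": msg["payload"] * 2})

async def request_with_retry(req_q, resp_q, payload, retries=3, timeout=1.0):
    """Send a request; on timeout, resend. The server may therefore
    see duplicates, so the protocol must tolerate them."""
    for attempt in range(retries):
        await req_q.put({"id": attempt, "payload": payload})
        try:
            return await asyncio.wait_for(resp_q.get(), timeout)
        except asyncio.TimeoutError:
            continue
    raise RuntimeError("no reply after retries")

async def main():
    req_q, resp_q = asyncio.Queue(), asyncio.Queue()
    srv = asyncio.create_task(server(req_q, resp_q))
    reply = await request_with_retry(req_q, resp_q, 21)
    srv.cancel()
    return reply["result"]

result = asyncio.run(main())
assert result == 42
```

The same shape (queues, timeouts, retries, idempotent handling of duplicates) recurs in any language with good asynchronous-programming support.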

Distributed systems normally work in conjunction with databases and analytics products. So an understanding of the basics of these is going to be useful.

Applicable domains
It is worth asking whether there are indeed systems where all these ideas come into play. Most software work tends to be small and incremental in nature. Hard problems related to distributed systems often arise only when new systems are being conceived and designed, and in many such situations off-the-shelf solutions like Hadoop, ZooKeeper, HBase etc. will suffice. So where does one turn to work on such problems?

  • An obvious answer is the big cloud players (AWS, GCP, Azure), which continue to consolidate market share while growing in size.
  • Another area where such knowledge is needed is applications with critical reliability and performance needs, e.g. medical systems, avionics (see the section "The Value of Formal Methods for 'Real-world Systems'" in the AWS paper) and banking.
  • An emerging market for these ideas is services that manage clusters of computers, whether an on-premises cloud or one hosted with a major provider. These services help you provision resources optimally, deploy software, and monitor it for performance and cost. Examples are Kubernetes, HashiCorp's tools and Mesos. At their core, they are solving a distributed systems problem.
  • Even within application development companies that prefer off-the-shelf components, having access to lower-level building blocks, and the skill to use them well, can arguably lead to more innovative ways of combining them, greater customizability of existing systems, and consequent agility.
  • New business ideas might emerge where a structured, distributed systems solution can improve upon a hitherto ad hoc, inefficient solution in a dramatic fashion.

Online Courses
For good reason, no single online course can cover all of the above topics. Still, some courses that do touch upon the algorithms and protocols are the MIT OCW course on distributed algorithms and Prof. Indranil Gupta's distributed systems course at UIUC.
