Published in baseds

Modes of Failure (Part 1)

Modes of failure: how to identify different types of failure (part 1!)

How we talk about failure

Failure recap: how we go about classifying modes of failure.

We can classify failures by how the rest of the system perceives them: the way the other nodes in a system observe a failure tells us what kind of failure it is.

The main modes of failure that we’ll discuss.

Waiting for an untimely response

Timing/performance failures in a distributed system.
  1. The node could take too long to deliver a response (exceeding the upper bound of the expected time interval), but it could also deliver a response earlier than expected (falling below the lower bound of that interval).
  2. The node is actually delivering the correct value, so the value is not the unexpected part of the failure; what’s unexpected is the amount of time it took to deliver it!
Omission failures in a distributed system are a kind of timing failure.
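
To make the idea concrete, here is a minimal sketch of how an observing node might label a single response purely by its timing. The function name `classify_response`, the label strings, and the use of `None` to mean "no response ever arrived" are all assumptions made for illustration; they don't come from any particular library.

```python
from typing import Optional


def classify_response(elapsed: Optional[float], lower: float, upper: float) -> str:
    """Label a response by *when* it arrived, not by what it contained.

    elapsed: seconds until the response arrived, or None if it never did
             (hypothetical convention for this sketch).
    lower, upper: bounds of the expected time interval, in seconds.
    """
    if elapsed is None:
        # The response never showed up: an omission, which we can think of
        # as the extreme case of a late response.
        return "omission"
    if elapsed < lower:
        return "early"   # arrived before the lower bound
    if elapsed > upper:
        return "late"    # arrived after the upper bound
    return "on time"
```

Notice that the value of the response never appears in the function at all: with a timing failure, the value is assumed to be correct, and only the arrival time is in question.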

Omitted replies (or, tfw things come crashing down)

The two main forms of omission failures.
Crash failures in a distributed system.
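
One common way to state the difference between an occasional omission and a crash is that a crashed node, after its first missed reply, omits every reply that follows. The sketch below assumes that definition; the function name `classify_omissions` and the list-of-booleans history are hypothetical and only meant to illustrate the distinction from the point of view of an observing node.

```python
from typing import List


def classify_omissions(replied: List[bool]) -> str:
    """Label a node from a history of requests.

    replied[i] is True if the node answered request i, and False if the
    reply was omitted (hypothetical representation for this sketch).
    """
    if all(replied):
        return "no omission"
    first_miss = replied.index(False)
    # Crash-like: once the node misses a reply, it never answers again.
    if not any(replied[first_miss:]):
        return "crash-like"
    # Otherwise the node dropped some replies but kept answering others.
    return "omission"
```

The label is "crash-like" rather than "crash" on purpose: with only a finite history to look at, an observer can't tell a node that has crashed apart from one that just hasn't replied yet.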

Resources

  1. Fault Tolerance in Distributed Systems, Sumit Jain
  2. Failure Modes in Distributed Systems, Alvaro Videla
  3. Distributed Systems: Fault Tolerance, Professor Jussi Kangasharju
  4. Understanding Fault-Tolerant Distributed Systems, Flaviu Cristian
  5. Failure Modes and Models, Stefan Poledna
  6. Fault Tolerant Systems, László Böszörményi

Exploring the basics of distributed systems, every alternate Wednesday, for a year.
