Mental model for handling system failures: on-call, run-books, etc.

Oct 22 · 3 min read

I have found this mental model helpful in conversations about managing system failures. While concepts like fault-tolerant systems, automated operations, on-call rotations, pagers, and run-books are familiar to most, it is useful to have a picture showing how they fit together and relate to each other. For example, the model helps a team communicate more precisely what it will focus on to reduce the context switches engineers suffer when they are pulled into handling incidents while not on-call.

An escalation path of failure handling:

  • Failure is handled by the system itself. For example, a distributed system running on multiple servers may tolerate the failure of one host without any intervention from outside the system, as when Apache Kafka loses a broker.
  • Failure is handled by an external system without human intervention. For instance, a system that has stopped responding to API calls may be restarted by an external supervisor program, as when Kubernetes restarts a failed pod.
  • Failure is handled by an on-call person following simple steps. Typically, such handling requires light human judgement that is hard to automate, or the handling has not been automated yet for other reasons. A good rule of thumb is that the steps can be described in a short run-book that any member of the team can follow. To illustrate, an on-call person gets a page at night indicating that a data processing pipeline is lagging too far behind, and uses the run-book, which instructs them to restart the pipeline or increase its resources depending on the pattern in the lag size chart.
  • Failure is handled by a person who is an expert in the system that failed. Handling such failures requires a deep understanding of the system that the active on-call person may lack (and may not be required to have). This usually results in an expensive context switch for the expert, or may even escalate to an all-hands-on-deck situation. The failure could be previously unknown, or the run-book may exist but be intended for use only by an expert in the system. For example, a bug can put a distributed system into a split-brain state from which it is hard to recover without data loss unless someone familiar with the source code handles it.
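The escalation path above can be sketched as an ordered set of tiers. This is only an illustration of the model, not a prescribed implementation: the names `Tier`, `next_tier`, and `handling_tier` are hypothetical and do not come from any library.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Tier(IntEnum):
    """Escalation tiers from the list above, cheapest first (hypothetical names)."""
    SELF_HEALING = 1         # the system tolerates the failure itself
    EXTERNAL_AUTOMATION = 2  # an external supervisor handles it, no human involved
    ON_CALL_RUNBOOK = 3      # the on-call person follows a short run-book
    EXPERT = 4               # someone with deep knowledge of the system steps in


def next_tier(current: Tier) -> Optional[Tier]:
    """Return the tier to escalate to, or None if already at the top."""
    if current is Tier.EXPERT:
        return None
    return Tier(current + 1)


def handling_tier(can_handle: Mapping[Tier, bool]) -> Tier:
    """Return the cheapest tier that can handle a failure.

    `can_handle` is an assumed input that says, per tier, whether that tier
    knows how to deal with this kind of failure. If no tier claims the
    failure, it ends up with an expert, as in the last bullet above.
    """
    for tier in Tier:  # IntEnum iterates in definition order, cheapest first
        if can_handle.get(tier, False):
            return tier
    return Tier.EXPERT
```

For instance, a failure that only the run-book covers, `handling_tier({Tier.ON_CALL_RUNBOOK: True})`, lands on the on-call person, while a failure nobody has seen before falls through to `Tier.EXPERT`.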

The typical ways to de-escalate failure handling:

  • An expert writes a (missing) run-book to be used by a person on-call or to be further automated.
  • An expert implements automated handling of the system failure.
  • An expert improves the system to tolerate the failure.
  • A run-book intended for use by a person on-call is automated to not require human intervention.
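As a rough sketch of the last item, here is how the lagging-pipeline run-book from the earlier example might be turned into code. Everything specific here is an assumption made for illustration: the function name `runbook_action`, the threshold values, and the mapping of chart patterns to actions (a sudden jump in lag suggests a restart, gradual growth suggests more resources). A real run-book would define its own patterns.

```python
from typing import Sequence


def runbook_action(
    lag_samples: Sequence[float],
    lag_threshold: float = 1000.0,  # assumed: lag below this needs no action
    jump_share: float = 0.5,        # assumed: share of growth that counts as a "jump"
) -> str:
    """Pick an action from recent lag samples (oldest first).

    Hypothetical rules, standing in for "the pattern on the lag size chart":
    - latest lag under the threshold: nothing to do
    - lag is high but did not grow over the window: not covered, page a human
    - most of the growth happened in a single step: restart the pipeline
    - growth was gradual: increase the pipeline's resources
    """
    if len(lag_samples) < 2 or lag_samples[-1] < lag_threshold:
        return "no_action"

    total_growth = lag_samples[-1] - lag_samples[0]
    if total_growth <= 0:
        # The run-book does not describe this pattern, so escalate to a person.
        return "page_on_call"

    # Largest increase between two consecutive samples.
    largest_step = max(b - a for a, b in zip(lag_samples, lag_samples[1:]))
    if largest_step / total_growth >= jump_share:
        return "restart"
    return "increase_resources"
```

Note that the sketch still pages a person when the pattern is not one the run-book covers, so automating the run-book narrows what reaches the on-call person without removing the escalation path.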

When it comes to reducing operational load for humans, in addition to preventive de-escalation of failure handling, it is important to reduce the noise or avoid it altogether. Dealing with an incident that doesn’t have to be dealt with, or with one that is not an incident at all, has the lowest return on investment. Every page should be actionable.

The desired balance between failure handling modes may differ depending on the team or the system, but it is important to keep it in check. This model helps to structure the conversation around it, get everybody on the same page about any imbalance, and even organize more formal monitoring of it, for example by quantifying the engineering resources spent in each group.
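One simple way to quantify this is to tag each incident with the handling mode it ended up in and sum the hours spent. The sketch below assumes a hypothetical incident log of `(mode, hours)` pairs; the mode labels and function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed record shape: (handling mode, engineer hours spent on the incident).
Incident = Tuple[str, float]


def hours_by_mode(incidents: Iterable[Incident]) -> Dict[str, float]:
    """Sum engineer hours per handling mode."""
    totals: Dict[str, float] = defaultdict(float)
    for mode, hours in incidents:
        totals[mode] += hours
    return dict(totals)


def share_by_mode(incidents: Iterable[Incident]) -> Dict[str, float]:
    """Return each mode's fraction of the total hours (empty if no hours logged)."""
    totals = hours_by_mode(incidents)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {mode: hours / grand_total for mode, hours in totals.items()}
```

With a log such as `[("on_call", 2.0), ("expert", 6.0), ("on_call", 2.0)]`, the expert mode accounts for 60% of the hours, which is the kind of imbalance the model is meant to surface.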

Follow at @abaranau.