Runbooks are Toil

Jamie Allen
Site Reliability Engineering Leadership
Dec 22, 2020

Everyone knows we need Runbooks, and nobody would ever dispute their value. Runbooks provide prescriptive instructions for what to do when an incident occurs in an application, service or system. They also capture a lot of tribal knowledge that otherwise would be difficult to communicate to new team members. They must be included in any Definition of Done, as they are instrumental to keeping your Mean Time to Repair (MTTR) low.

As a side note, if you’re new to writing Runbooks and want a template for starting out, check out Caitie McCaffrey’s awesome template on GitHub.

Some runbooks aren’t automated, for whatever reason. When that is the case, the CI/CD build/release process cannot verify that the runbook was updated to reflect changes in a commit that affect the steps to take when an incident occurs. The goal should be to automate all runbooks.

But automation doesn’t solve the problem entirely. Automated runbooks are often scripts written in Bash or a dynamic language like Python or Ruby. If that is the case, it is imperative that your build/release process include tests that execute the runbook script and validate its ability to mitigate incidents. Some engineers are moving to compiled languages like Rust, where recompiling the runbook as a dependent of the changed source would at least catch changes to typed APIs, if not errors in the runbook’s logic. But Python is the lingua franca of SRE, so I don’t expect Rust to become widespread for Runbook automation.
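As a rough illustration of what such a test might look like, here is a minimal Python sketch. Everything in it is hypothetical: `FakeService`, its "100 queued items per worker" health rule, and the `mitigate_queue_backlog` runbook step are invented for the example and stand in for whatever your real service client and runbook script do. The idea is only that the test sets up the incident condition, runs the runbook, and asserts that the incident is actually mitigated.

```python
import unittest


class FakeService:
    """Hypothetical stand-in for a real service client (not a real API)."""

    def __init__(self, queue_depth: int, workers: int) -> None:
        self.queue_depth = queue_depth
        self.workers = workers

    def scale_workers(self, count: int) -> None:
        self.workers = count

    def is_healthy(self) -> bool:
        # Assumed health rule for this sketch: one worker absorbs 100 queued items.
        return self.queue_depth <= self.workers * 100


def mitigate_queue_backlog(service: FakeService, max_workers: int = 10) -> bool:
    """Hypothetical runbook step: add workers until healthy or the cap is hit."""
    while not service.is_healthy() and service.workers < max_workers:
        service.scale_workers(service.workers + 1)
    return service.is_healthy()


class MitigateQueueBacklogTest(unittest.TestCase):
    def test_backlog_is_mitigated(self):
        service = FakeService(queue_depth=450, workers=2)
        self.assertFalse(service.is_healthy())  # the incident precondition holds
        self.assertTrue(mitigate_queue_backlog(service))
        self.assertEqual(service.workers, 5)

    def test_reports_failure_when_cap_is_reached(self):
        # The runbook should say so when it cannot mitigate the incident.
        service = FakeService(queue_depth=5000, workers=2)
        self.assertFalse(mitigate_queue_backlog(service, max_workers=10))
        self.assertEqual(service.workers, 10)
```

Saved alongside the runbook script and run with `python -m unittest` as a build step, a test like this fails the build when a change to the service interface breaks the runbook, rather than leaving the breakage to be discovered during an incident.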

Write tests for your Runbooks, so that new members of your team aren’t left wondering why the automation doesn’t work in the face of an incident.

Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.