What is a Runbook and what would an SRE do with it?

Published in Fylamynt, Oct 29, 2021

What’s a Runbook?

Runbooks have a long history dating back to early systems, where system administrators and operators systematically described the process for fixing a specific problem. In the context of cloud operations, a runbook is a sequence of steps taken by SREs to perform a specific task. These tasks may include incident response, cost management, fixing performance bottlenecks, resolving security issues and more.

In this article, we will talk about what constitutes a runbook, and good practices in building and running runbooks for SREs.

A Runbook is a Workflow

A runbook is not simply a sequence of steps: it may include logic (e.g. if-else, loops) and other constructs (e.g. wait for a resource). More precisely, a runbook is a workflow structured as a directed acyclic graph (DAG) of actions. The arrows between actions determine the logic of the workflow.
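To make the DAG idea concrete, here is a minimal sketch in Python. It is an illustration only, not how Fylamynt models workflows: the action names (`check_status`, `restart_service`, `notify`) and the `run_workflow` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used to order the actions.

```python
from graphlib import TopologicalSorter


def run_workflow(actions, depends_on):
    """Run every action after all of the actions it depends on.

    actions:    maps an action name to a callable that takes a shared context dict.
    depends_on: maps an action name to the set of action names it waits for
                (these are the arrows of the DAG).
    """
    context = {}
    order = []
    # static_order() yields the nodes so that dependencies always come first.
    for name in TopologicalSorter(depends_on).static_order():
        actions[name](context)
        order.append(name)
    return context, order


# Hypothetical actions, hard-coded so the sketch runs without any services.
def check_status(ctx):
    ctx["healthy"] = False  # pretend the service check failed


def restart_service(ctx):
    # if-else logic inside the workflow: only act when the check failed.
    ctx["restarted"] = not ctx["healthy"]


def notify(ctx):
    ctx["message"] = "restarted" if ctx["restarted"] else "no action needed"


ACTIONS = {
    "check_status": check_status,
    "restart_service": restart_service,
    "notify": notify,
}
DEPENDS_ON = {
    "check_status": set(),
    "restart_service": {"check_status"},
    "notify": {"restart_service"},
}

context, order = run_workflow(ACTIONS, DEPENDS_ON)
```

Because the dependencies are declared as data, adding an action only means adding an entry to each dictionary; `graphlib` raises a `CycleError` if the arrows ever form a cycle.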

An example Runbook is below.

As you can see in the diagram, a runbook can have a sequence of actions that may interact with multiple services. We will talk about some of these actions in detail below.

Runbook for Incident Response

Incident response is one of the most common tasks SREs perform. Remember the Facebook outage of October 2021? What would an SRE do to respond to that outage?

1. Trigger

First, you need a trigger. This may be generated by monitoring services like DataDog, NewRelic or the AWS health service.

Here’s a sample status history from AWS that can be retrieved through AWS health service APIs.
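As a rough sketch of how such a status history could be pulled programmatically, the snippet below calls the AWS Health `describe_events` operation through `boto3`. The helper names `make_health_client` and `open_issues` are made up for this example, pagination via `nextToken` is ignored, and access to the AWS Health API generally depends on your AWS support plan, so treat this as a starting point rather than a complete integration.

```python
def make_health_client():
    """Create an AWS Health client (requires boto3 and AWS credentials)."""
    # Imported lazily so the rest of the sketch has no hard dependency on boto3.
    import boto3

    # The AWS Health API is served from a global endpoint in us-east-1.
    return boto3.client("health", region_name="us-east-1")


def open_issues(health_client, services):
    """Return a compact summary of open 'issue' events for the given services.

    Only the first page of results is read; a fuller version would follow
    the nextToken field of the response.
    """
    response = health_client.describe_events(
        filter={
            "services": services,
            "eventTypeCategories": ["issue"],
            "eventStatusCodes": ["open"],
        }
    )
    return [
        {
            "service": event["service"],
            "region": event.get("region"),
            "type": event["eventTypeCode"],
            "started": event.get("startTime"),
        }
        for event in response["events"]
    ]


# Example usage (needs credentials, so it is left commented out):
# for issue in open_issues(make_health_client(), ["EC2", "ROUTE53"]):
#     print(issue)
```

A non-empty result from `open_issues` is the kind of signal that could serve as the trigger for an incident response runbook.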

2. Troubleshooting

The next step is for the engineer to determine what happened and what services are affected. This can be done in various ways including running DataDog synthetic tests. Fylamynt’s DataDog integration allows you to do this out of the box.

3. Root cause analysis

It’s often hard to determine the exact root cause of an incident, but SREs have to make an effort to understand the reasons why the specific outage might have occurred. This is an extension of troubleshooting and uses metrics retrieved from services like DataDog, logs stored in services like Splunk and more.

4. Fix

After determining the right course of action, the SRE applies the fix. In the case of the Facebook outage, it may involve updating DNS records with the right information.

Best Practices for Runbooks

Building, running and maintaining runbooks over time is hard. Here, we list best practices to achieve high efficiency in your runbooks.

Codify every action: Don’t do actions manually. Always codify every step that you take in your runbook, even if it is as simple as checking a specific service status.
Run securely: Don’t execute runbooks from your laptop with credentials stored insecurely.
Consistent repeatable actions: Run actions in a way that can be consistently repeated by anyone else in your team.
Manage environments: Make sure you understand which environment (e.g. staging vs production) you are working with before executing a runbook against it.
Human in the loop: Even if you have runbook automation, always have a human in the loop at the right times (e.g. to approve a critical action like destroying a resource).
Cleanup: Clean up after the main actions are executed.
Post-mortem: Make sure to understand root causes and potential changes to product and infrastructure after the incident has been fixed.
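Several of the practices above (managing environments, keeping a human in the loop, and cleaning up) can be combined in one small wrapper. The sketch below is one possible shape for it; `run_guarded` and `ask_on_terminal` are hypothetical helpers invented for this example, not part of any particular runbook tool.

```python
def run_guarded(action, cleanup, *, environment, allowed_environments, approve):
    """Run `action` behind an environment check and a human approval gate.

    approve: a callable that receives a prompt string and returns True or False,
             so the approval channel (terminal, chat, ticket) can be swapped.
    cleanup: always runs once the action has been attempted, even if it failed.
    """
    # Manage environments: refuse to run against an environment not cleared for it.
    if environment not in allowed_environments:
        raise ValueError(f"runbook is not cleared to run against {environment!r}")

    # Human in the loop: ask before doing anything critical.
    if not approve(f"Run {action.__name__} against {environment}?"):
        return "skipped"

    try:
        action()
        return "done"
    finally:
        # Cleanup: runs whether the action succeeded or raised.
        cleanup()


def ask_on_terminal(prompt):
    """A simple approval channel that asks the operator on the terminal."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```

Passing the approval step in as a function keeps the action itself repeatable by anyone on the team: the same call works with `ask_on_terminal` for an interactive run or with a chat-based approver in an automated one.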

Automating Runbooks

As you can see, following the best practices is often hard to do manually. It’s important to automate your runbooks so that they can be run consistently without errors, while keeping humans in the loop.
