Building a Better Ops Runbook

What to do when it’s 3am and the servers are melting down

A runbook is an operational reference that describes an application in its deployed environment. It should be easy to read, consistent across all applications, and accurate. This is the document an on-call responder will reach for at 3am when a SEV1 alert wakes them up, so it should be as straightforward and to the point as possible. Although this article assumes a dedicated Operations team, a runbook is equally useful for DevOps teams, system administrators, or any developer who needs to understand the deployment environment. The runbook is also useful when auditing an application environment to confirm that the appropriate monitoring, backup procedures, and security policies are in place.

It doesn’t matter where you store your runbooks; just make sure they are easy to find, read, and edit. Usually that means putting them in whatever wiki your team or organization currently uses, but any tool that fits your regular workflow is usually the right tool for the job. I have provided a sample runbook in Markdown format, hosted on GitHub, as a complete example that you can print out or reference later. The rest of this post discusses each section of the runbook in detail.

Runbook Inventory

In addition to a landing page, leverage features of the operating system to direct users to the correct documentation. For example, if the applications are deployed on Unix hosts, the message of the day file can help ensure that admins know exactly where to look:

Message of The Day (/etc/motd)

Tip: Generate a MOTD file for each system. Figlet can be used to generate the ASCII word art. The giant “punch you in the face” font size helps ensure there’s no question about what system you’re logged in to.
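As a rough sketch of the tip above, the following Python snippet builds MOTD text that pairs a large banner with a pointer to the runbook. It assumes the `figlet` command may or may not be installed and falls back to a plain boxed banner if it is missing. The function names, the system name, and the wiki URL in the usage note are hypothetical placeholders, not part of any real tool.

```python
import shutil
import subprocess


def render_banner(text: str) -> str:
    """Return ASCII word art from figlet if available, else a plain boxed banner."""
    figlet = shutil.which("figlet")
    if figlet:
        result = subprocess.run(
            [figlet, text], capture_output=True, text=True, check=False
        )
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout.rstrip("\n")
    # Fallback when figlet is not installed or fails: a simple boxed banner.
    border = "#" * (len(text) + 4)
    return f"{border}\n# {text} #\n{border}"


def build_motd(system_name: str, runbook_url: str) -> str:
    """Combine the banner with a pointer to the runbook (hypothetical layout)."""
    banner = render_banner(system_name)
    return f"{banner}\n\nRunbook: {runbook_url}\n"
```

Calling `build_motd("PAYMENTS-PROD", "https://wiki.example.com/runbooks/payments")` returns text that could be written to `/etc/motd` by whatever configuration management you already use.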

And now, let’s get into the details of a runbook…

Anatomy of a Runbook

Support Contacts

Application Contact List

Support is often provided in tiers or levels. For example, Level 1 support might receive all initial reports. Their job would be to validate that the host is accessible from the network and that basic services are available. This is most often the role of an IT organization’s on-call staff. Level 2 support would provide more application-specific operational support. They have some understanding of the IT infrastructure, but they also have a deeper understanding of the application and can review logs, investigate performance concerns, and troubleshoot application issues. Level 3 would be the application experts: the people with the most authoritative understanding of the application, but also the most costly to contact.

In addition to providing support, members of the contact list are also users of the runbook and should be responsible for reviewing it for correctness. Each member of the contact list should review the runbook on a regular basis (perhaps yearly) and sign off to confirm that the information is correct and sufficient to allow other members of support to handle incidents.
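One way to keep the review cycle honest is to record each sign-off date and flag the contacts whose review has lapsed. The sketch below assumes a yearly interval and a simple name-to-date mapping; both are illustrative choices, and the names are made up.

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumption: reviews are due once a year, as suggested above.
REVIEW_INTERVAL = timedelta(days=365)


def overdue_reviewers(last_signoff: Dict[str, date], today: date) -> List[str]:
    """Return the contacts whose last sign-off is older than the review interval."""
    return sorted(
        name
        for name, signed_on in last_signoff.items()
        if today - signed_on > REVIEW_INTERVAL
    )


# Hypothetical sign-off records for two contacts.
signoffs = {"alice": date(2023, 1, 10), "bob": date(2024, 3, 1)}
overdue = overdue_reviewers(signoffs, today=date(2024, 6, 1))
```

Here `overdue` contains only `alice`, whose last sign-off is more than a year before the reference date.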

Overview

  • Links to the application website
  • Vendor information and vendor support contacts (if applicable)
  • General license information and renewal dates
  • Links to any internal documentation or project pages

Architecture

Architecture Diagram

Hosts

List of Application Hosts

Network

List of Network Ports and Protocols

Directory Locations

List of Key Application Directories

Monitoring

Monitoring Information

Hosts should be grouped by function, with direct links into the monitoring system if possible. Monitoring which is specific to that service should be documented, including the monitoring severity (how urgently someone needs to respond) and the type of action that can be taken to resolve the alert. For simple cases, it may be enough to state, “Check logs, restart the service.” However, in more complex situations such as a disk space issue, the reader will need to know what actions can be taken to resolve the issue. The resolution should contain direct links to documentation which describes detailed steps for resolving the alert.
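The monitoring section is essentially a lookup table from alert to severity, first action, and detailed documentation. A minimal sketch of that structure might look like the following. The alert names, severity labels, and links are hypothetical; substitute the terminology your organization has already agreed on.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AlertEntry:
    name: str
    severity: str      # how urgently someone needs to respond
    first_action: str  # short instruction for the responder
    doc_link: str      # detailed resolution steps


# Hypothetical catalog entries for illustration only.
CATALOG: Dict[str, AlertEntry] = {
    "service_down": AlertEntry(
        "service_down", "SEV1", "Check logs, restart the service.",
        "https://wiki.example.com/runbooks/app/restart",
    ),
    "disk_space_low": AlertEntry(
        "disk_space_low", "SEV2", "Run the documented cleanup steps.",
        "https://wiki.example.com/runbooks/app/disk-cleanup",
    ),
}


def lookup(alert_name: str) -> Optional[AlertEntry]:
    """Return the runbook entry for an alert, or None if it is undocumented."""
    return CATALOG.get(alert_name)


def summarize(entry: AlertEntry) -> str:
    """Format an entry as a single line a responder can scan quickly."""
    return f"[{entry.severity}] {entry.name}: {entry.first_action} ({entry.doc_link})"
```

An alert that returns `None` from `lookup` is a useful audit signal: something is being monitored that the runbook does not yet explain.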

The severity classifications may have a specific meaning or a service level agreement (SLA) within your organization, so it’s generally best to use the agreed-upon terminology within the runbook and then link to the internally recognized definitions for the novice reader.

Metrics

Grafana Dashboard

It’s worth noting that metrics and monitoring may not be the same thing. Although you may use a system like Prometheus to provide both metrics and monitoring, it is also possible that the long-term storage of these metrics is handled by a separate time series database. For example, data may be collected by Prometheus, but then shipped off to TimescaleDB/Grafana for long-term (aggregated) storage to be used for capacity planning and budgeting.

Log Aggregation

Kibana Discovery Page

Direct links to the log aggregation web interface should be provided whenever possible, including links to commonly used saved searches. Any commonly run queries should be documented here, along with a brief description of how and when they can be used. Anything that makes it easier for Operations to identify issues or narrow their investigation will save time during an outage.

Access Control

The objective of this section is to make it quick and easy for Operations to identify what could have gone wrong with the system if someone complains that they are not able to authenticate or do not have access to the necessary resources. It should also identify what group of administrative users can be contacted if special permissions are needed to investigate an issue.

Backup and Recovery

How to establish a disaster recovery plan is beyond the scope of this article, but there are plenty of resources available which describe such documents. Refer to Top 10 Free Disaster Recovery Plans or type “Disaster Recovery Plan” into your favorite search engine to get more information.

Maintenance and Cleanup

When a disk alert is received from your monitoring system, this section should explain what actions can be taken to provide immediate short-term relief. If the filesystem is 100% full, it may be necessary to cleanly shut down the application, increase the storage, and bring the application back online. In other cases, it may be possible to clear caches or execute cleanup scripts to bring disk, memory, or CPU usage back under control. Documenting how and when these cleanup activities should be executed will save critical time when responding to system alerts.
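The decision described above can be written down as a small rule so that responders do not have to improvise. The sketch below uses Python’s standard `shutil.disk_usage` to read usage for a path; the thresholds and action labels are assumptions chosen for illustration and should be replaced with values from your own runbook.

```python
import shutil


def disk_used_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def recommended_action(
    used_percent: float, warn: float = 85.0, critical: float = 98.0
) -> str:
    """Map disk usage to a runbook action (thresholds are hypothetical)."""
    if used_percent >= critical:
        # Effectively full: shut down cleanly, add storage, restart.
        return "shutdown-and-expand"
    if used_percent >= warn:
        # Short-term relief: clear caches or run the cleanup scripts.
        return "run-cleanup"
    return "no-action"
```

For example, `recommended_action(disk_used_percent("/var"))` would tell the responder which documented procedure applies to the filesystem that holds `/var`.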

Application Tuning

For example, if the application owners have developed guidelines for how to optimize the memory allocation based on the number of users, concurrent requests, or other observable data, that calculation can be included here to give the Operations team a sense of what is or is not appropriate.
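A tuning guideline of this kind is easiest to follow when it is expressed as an explicit calculation. The formula and every coefficient below are invented purely to show the shape such a guideline might take; the real numbers must come from the application owners.

```python
def recommended_heap_mb(
    active_users: int,
    concurrent_requests: int,
    host_memory_mb: int,
    base_mb: int = 512,              # hypothetical fixed overhead
    per_user_mb: float = 2.0,        # hypothetical cost per active user
    per_request_mb: float = 8.0,     # hypothetical cost per in-flight request
    max_host_fraction: float = 0.75, # leave headroom for the OS and other processes
) -> int:
    """Estimate a memory allocation in MB, capped at a fraction of host memory."""
    estimate = (
        base_mb
        + per_user_mb * active_users
        + per_request_mb * concurrent_requests
    )
    ceiling = host_memory_mb * max_host_fraction
    return int(min(estimate, ceiling))
```

With these placeholder coefficients, 1,000 users and 50 concurrent requests on an 8 GB host give 512 + 2,000 + 400 = 2,912 MB, comfortably below the 6,144 MB ceiling. Ten times that load would hit the ceiling, which is a sign that the host needs more memory, not just a bigger allocation.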

An Operations Runbook can take many forms, but the most effective are those that are readily available and easily understood. Remember, these documents are used in periods of extreme stress when the application or the infrastructure is in a bad state. The last thing anyone has time for is reading manuals or hunting around the filesystem looking for clues. Runbooks should be clear and concise reference materials. Keep them short and consistent across all applications. The more predictable the format, the better.

Release Engineer with an interest in pipeline traceability and observability.
