A Playbook for a Script
I saw a tweet from Sam Kottler that I tried to reply to but couldn’t in 140 characters or less, so I thought I would rant on Medium.
Having said this myself, done this myself, and later cursed at myself for doing it, all of my own accord, I thought I would write something here. I don’t know Sam but this tweet hit me right in the feels. You see, I too was in this same boat. Runbooks are a graveyard of despair: often neglected, crummy to follow, rarely updated, and sometimes damaging. But runbooks are still required. Why? Context.
Why spend time writing about the problem instead of just fixing the root cause? A lot of the time human judgment is needed to figure out what to do. We humans are apparently a lot smarter than these computer things, so why not wake someone up at 4am to figure out if the red switch or the green switch should be pulled? Sometimes automating away the need for that human judgment can be a significant amount of work, and may not even be worth the effort if the alarm only fires 2 times a year. It may also add complexity and more bugs to the system, inadvertently causing more issues. A key element here is time. It is often a lot quicker in the short term to discover and write down the steps to fix an issue than it is to permanently resolve the problem; otherwise a lot of runbooks simply wouldn’t exist.
I don’t think Sam is saying in the tweet that you shouldn’t have runbooks. I think the core of the point is: don’t just spend time writing down all the steps in a wiki when you could spend about equal, or sometimes less, time writing some code to make it better. If all the steps are the same every single time you need to run this, then why not put them in a script so they are run the same way every single time? I’ll spend the rest of this talking about a few points around this.
Use runbooks for context
I think runbooks are fantastic for context on the situation. I don’t want just a list of steps to follow, I want to know why something is firing, how critical it is, what the user experience is when this is firing, time bounds, and how the heck does this even work. There is all this metadata that is associated with this event.
Here is a template I use for runbooks: https://gist.github.com/pshima/2665fedbe0ae56f9fc3454d3fd1c0418
The core part of the runbook for me is not the steps used to resolve the issue. Those have to be simple: at 2am, someone who may be unfamiliar with the particulars of the system has to be able to follow them. Your runbooks should be optimized so that anyone on the team can resolve the issue (not just the alarm) without any additional context outside of the runbook and general team knowledge. This isn’t always how things work in practice but it’s a good goal to have.
I like runbooks for setting the stage for a human that has been contacted about the issue. I think it is often forgotten that a lot of alerts are going to be sent to humans and these need to be parsed easily by a human, not a robot. If you have a generic dashboard for a service-impacting issue, add it to the runbook! If you have design docs, deployment dashboards or other potentially relevant information, reference them in the runbook!
Writing the script isn’t the end game. You should always ask yourself why it is required and whether it is the best thing to do for the next week, month or year.
Exercise runbook scripts often
The problem I found with runbook scripts is the same as a common documentation problem. When your runbook script is part of your documentation and your documentation is always out of date, then it’s likely your script will be too. Create your script so that it is always running in a monitor, or have the app itself execute the code. If your app can enable it as part of its own library, do that instead of creating that artisanal bash that will be run once a year. I really liked Kelsey Hightower’s talk at Monitorama 2016 about bringing health checking as close to the app as possible. This is also a really good technique to ensure that scripts you are relying on in those critical situations are not stale. Deploy them with the app! Not everyone’s environment is the same (it works on my laptop), so running scripts in the same environment/context is also important.
The video is below, while not about runbooks, I think it is related. In this example /healthz doesn’t need to just apply to monitoring, apply it to your runbooks too! It is very rare to see a link to a runbook right in an app log file, but why not? I can dream of a world where the /healthz style endpoint had particular app specific debugging steps, or where the runbook interacted directly with the app in some sort of Rundeck/Exec style fashion. Today each is hand crafted and usually split from the actual application.
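Here is a rough sketch of what I mean by a /healthz style endpoint that points at its own runbook. This is not from the talk and not any particular framework's API; it only uses Python's standard library, and the runbook URL, the `check_disk` check and the `#disk-full` anchor are all made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical runbook location; substitute your own wiki or repo URL.
RUNBOOK_URL = "https://wiki.example.com/runbooks/my-service"


def check_disk():
    """Placeholder check; a real one would inspect the app's own state."""
    return True, "disk ok"


# Each check is paired with the (hypothetical) runbook section for debugging it.
CHECKS = {"disk": (check_disk, "#disk-full")}


def health_report(checks=CHECKS):
    """Run every check and attach a runbook link to each failing one."""
    results = {}
    for name, (fn, anchor) in checks.items():
        ok, detail = fn()
        entry = {"ok": ok, "detail": detail}
        if not ok:
            entry["runbook"] = RUNBOOK_URL + anchor
        results[name] = entry
    healthy = all(r["ok"] for r in results.values())
    return healthy, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy, results = health_report()
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        # 503 tells the monitor something is wrong; the body tells the human where to look.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the endpoint; call this from the app's own startup code."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Because the checks ship with the app, they get exercised on every poll from the monitor instead of once a year, and a failing check hands the on-call person a link straight to the relevant section of the runbook.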
For apps that don’t change often, or when the API/interface doesn’t change often (like doing an HTTP/DNS request), stale scripts aren’t as much of an issue. What will haunt you over time is creating bespoke scripts for issues that do not come up often. It is an absolute nightmare when you have a major issue and the runbook says to just run a script, and that script simply exits. At 3am you don’t want to spend time debugging a crappy perl script someone wrote in anger, you want to be spending time fixing the issue. In my experience a lot of these scripts are referencing log files, and if you need to do this across a lot of hosts, generic tools are incredibly valuable.
Honestly, set a standard for your scripts and runbooks. Should they have unit tests? Integration tests? Can they be run easily at 2am? Are they designed for human input/output? Whatever your standard, don’t let your runbook scripts repository explode into a mess of a million different languages and libraries, all of which you have to keep updated and all of which have to work at your most critical times. I find the quality of scripts produced for runbook actions is a lot higher when they are integrated into the application, or at least live in the same repository.
Make your runbook scripts do more than just echo your steps
It’s easy to just take a runbook and copy it into some bash, add a hint of variables and some error checking and call it done. Woohoo, a much improved runbook process!
But there is so much more you can do with these scripts than just simplify a process. Have your script echo output directly into your tracking issue. Have your script check other alarms or maintenance that may be related to the issue you are experiencing. Emit metrics from your scripts to track usage and track history of these. Add basic safety checks.
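As a minimal sketch of the "emit metrics" and "echo into the tracking issue" ideas, assuming a local JSON-lines file stands in for whatever metrics system you use: the file name, the field names and both helper functions below are hypothetical, and actually posting the comment is left out because it depends entirely on your issue tracker.

```python
import json
import time
from pathlib import Path

# Hypothetical location for the usage log; a real setup might send these
# events to a metrics system instead of a local file.
USAGE_LOG = Path("runbook_usage.jsonl")


def record_run(script, issue_id, ok, log_path=USAGE_LOG):
    """Append one usage event per run so history can be reviewed later."""
    event = {"script": script, "issue": issue_id, "ok": ok, "ts": time.time()}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event


def format_issue_comment(script, lines):
    """Build the text a script would post to the tracking issue."""
    body = "\n".join(f"    {line}" for line in lines)
    return f"Output of {script}:\n{body}"
```

A runbook script would call `record_run` once at the end of each run and hand the result of `format_issue_comment` to whatever client your tracker provides.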
If you have a potentially dangerous command in a runbook, put the safety checks for that command in a script. It is incredibly easy to add an “Are you sure Y/N?” to a script. Alternatively, set a standard that all your scripts are read only and any mutations require a code review. If you have a system with events, trigger read only scripts that put diagnostic information directly into the issue, so when someone is woken up in the night the information is already there.
Building libraries that make this stuff easy to use inside your runbook scripts will really help you leverage that power and move away from manual steps. When runbook script output is echoed directly into Slack, things are updated automatically, or a simple if/then saves you from worsening an issue, this quickly starts to pay for itself.
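A minimal sketch of that kind of safety check could look like the following. The function names are mine, `restart_service` is a stand-in for whatever dangerous action your runbook has, and defaulting to a dry run is just one possible convention.

```python
def confirm(prompt, input_fn=input):
    """Return True only if the operator explicitly answers y or yes."""
    answer = input_fn(f"{prompt} [y/N]: ").strip().lower()
    return answer in ("y", "yes")


def restart_service(name, dry_run=True, input_fn=input):
    """Hypothetical mutating action guarded by a dry run and a confirmation."""
    if dry_run:
        return f"DRY RUN: would restart {name}"
    if not confirm(f"Really restart {name}?", input_fn):
        return "aborted by operator"
    # The real restart command would go here; omitted in this sketch.
    return f"restarted {name}"
```

Anything other than an explicit yes aborts, including just hitting enter, which is what you want from a half-asleep operator. Passing `input_fn` in also makes the prompt easy to unit test.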
Don’t put a bandaid on a serious problem
It’s not uncommon for real issues to be masked by runbooks and humans taking action to resolve an issue. The classic case is just restarting the service. “Have you tried turning it off and on again?” is a classic anti-pattern. If someone is getting paged and manually having to run some actions in the middle of the night, this can quickly become more expensive at a human level than actually resolving the issue. Waking someone up in the middle of the night can be extremely impactful to their life.
A common pattern can be if we have a tool that can fix it, can we just run that tool automatically? In other words can we just put that service restarter in a cron every hour? This is another anti-pattern that can again mask much much larger problems that can lead to very large outages. Working on automating your bandaid script to solve the problem instead of the root cause can be a huge waste of time and a big source of frustration for folks.
My point here is to always think about what it would take to fix the source of the problem instead of automating around it. These things can quickly catch up with teams that tend to take shortcuts, slowing down features, frustrating customers and creating a lot of debt. Sometimes masking the pain, or not tracking it, is far worse than solving the problem. Someone who is taught to run a script to fix an issue, instead of investigating and solving it, is unlikely to move a permanent resolution of the problem forward. That is, until they get tired of running the script.
This alone is not enough. You have to keep improving your runbooks and processes. If you have the power, every few months take a look at the metrics for which runbooks are being used the most and the least. Can improvements be made to the runbook being used every week? Can we fix the root cause? Can we deprecate any old runbooks?
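If your scripts record a usage event each time they run, that periodic review can be a few lines of code. This is only a sketch: the event shape (a dict with a `runbook` key) and the function name are assumptions of mine, not a standard format.

```python
from collections import Counter


def usage_ranking(events, known_runbooks):
    """Return (runbook, count) pairs, most used first, including unused ones."""
    # Seed with zeros so runbooks that were never used still show up.
    counts = Counter({name: 0 for name in known_runbooks})
    counts.update(e["runbook"] for e in events)
    # Sort by count descending, then by name for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

The top of the list is where improvements or a root cause fix pay off most, and the entries with a count of zero are the candidates for deprecation.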
Every improvement made to runbooks can pay off for the next person using them and can be magnified for the amount of use. Have new staff update runbooks on their first use, or every time they are used.
Lastly, the team that owns the software should own the runbook. Who better to write the runbook on something than the people that are experts on the system?
Thanks to Sam for tweeting this and for the commenters on the tweet. One thing I have never mastered is knowing when to make these trade offs.
Follow/Tweet at me on Twitter: @petey5k