Shall we play a game?…
A sustainable on-call rotation is critical for any incident response process, providing the human filter for the reliability and availability of the services under the watchful care of the engineering team. The key word is sustainable. Sleep peppered with PagerDuty alerts and complex, adrenaline-pumping incidents can take their toll and lead to burnout. An on-call rotation that leaves engineers feeling underprepared, stressed and ready to quit is something we want to avoid, not only because it’s harmful to your engineers but also because it undermines the availability of your services. So let’s set the scene.
It’s 3am, and you and your family are fast asleep. Then BAM! You are woken by a PagerDuty alarm, ripping you from your dreams into what could potentially be a major incident. Your adrenaline is spiking, your eyes are still focusing and you haven’t even had a cup of coffee yet. Are you ready to jump in and triage the page? Do you know where to start? Have you seen this error before? Are you even familiar with that particular service?
So how do we become comfortable with the uncomfortable? It takes practice and experience responding to incidents until the response becomes muscle memory. However, repetition alone isn’t going to build this capacity or improve the team’s performance when responding to incidents. K. Anders Ericsson, a leading psychologist in expert performance, describes how elite performers practice differently from everyone else. They engage in deliberate practice, which has specific goals and a defined focus. Critical to this practice is immediate feedback, which allows lessons to be learnt from mistakes made in a safe environment.
A great way to direct this practice and cultivate a learning culture is by running regular disaster role-playing games. The role-playing game created by SREs, often called “Wheel of Misfortune” or “Walk the plank”, is an exercise that allows junior SREs or on-call engineers to walk through an incident and their response to it in a safe environment, with feedback and hopefully some fun thrown in. There are many ways you can run these games. Check out these resources if you want to know more:
Using SRE and disaster recovery testing principles in production | Google Cloud Blog
I am going to go through the variation we use to upskill our on-call engineers, which we call “The Kobayashi Maru”, a name borrowed from the Star Trek training exercise designed to test the character of Starfleet cadets.
What you need:
Games Master — You are responsible for guiding the scenario. This includes triggering the incident in PagerDuty, describing the scenario and providing context, as well as responding to questions from the on-call engineers.
Incident commander — You are the escalation point for the incident. Your directive is to:
- Stem the bleeding
- Communicate to stakeholders
- Resolve the Major incident
Primary On-Call — You are the primary point for all alerts. Your directive is to:
- Triage the incident
- Fix the incident if possible
- Escalate to an Incident Commander if required
Secondary On-Call — You are learning the ropes of being on-call. Your directive is to:
- Assist the Primary On-Call to resolve the issue
- Aid in collecting information for future forensics
The Primary and Secondary On-Call players each choose 3 Action cards and 2 Special Ability cards. The Incident Commander also receives 3 Action cards, and the Games Master receives 5 Play Now cards.
Each player must have their phone on them with PagerDuty to receive pages.
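If you track hands digitally rather than with printed cards, the allowances above can be written down as a small lookup table. This is only a sketch: the role and card-type labels below are illustrative names chosen for this example, not part of the game materials.

```python
from collections import Counter

# Hand sizes taken from the setup described above. The role and card-type
# labels are illustrative assumptions, not names from the game itself.
HAND_SIZES = {
    "primary_on_call": {"action": 3, "special_ability": 2},
    "secondary_on_call": {"action": 3, "special_ability": 2},
    "incident_commander": {"action": 3},
    "games_master": {"play_now": 5},
}


def hand_is_valid(role, card_types):
    """Return True if the list of card types matches the role's allowance."""
    return Counter(card_types) == Counter(HAND_SIZES[role])
```

For example, `hand_is_valid("games_master", ["play_now"] * 5)` checks that the Games Master is holding exactly five Play Now cards.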
Action Cards:
These cards are used by a player to respond to the incident. Each one has a buff (+1) and a hit (-1), depending on your special ability cards. An example is the “check the logs” action: a player can play this card, but if they don’t hold the “Runbook” special ability card, they take a 30-second hit while they dig into the logs. Here are a few examples of action cards we use.
The Incident Commander also has action cards to play in the event that the incident is escalated. Here are some examples we use.
Special Ability Cards:
These cards give the player a bonus ability. For example, the “Runbook” card provides the player with rapid recovery paths, and the “IAC” card allows the player to deploy infrastructure using code rather than ClickOps. Here are some examples that we use.
Play Now Cards:
These cards belong to the Games Master and can be played at any time to help move the incident along or add some heat to it. Some examples are:
With a countdown of 5 minutes, you have to:
- Triage the incident using only your action cards and special ability cards.
- You may escalate to a Major Incident via the Incident Commander (if required)
- Resolve the incident.
- Go back to bed.
The Primary and Secondary On-Call players take it in turns to respond to the incident using their action cards, escalating it as a major incident to the Incident Commander if required. If the timer runs out, your shift has finished: you must hand over to the next team, and the clock restarts.
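The countdown and penalty rules can be sketched in a few lines of code, which is handy if you want a script to keep score. This is a minimal sketch under assumptions: the 5-minute countdown and the 30-second hit come from the rules above, but the `Shift` class, its method names, and the idea that a hit is deducted directly from the clock are choices made for this example.

```python
from dataclasses import dataclass

SHIFT_SECONDS = 5 * 60   # the 5-minute countdown
PENALTY_SECONDS = 30     # hit for playing an action without its matching ability


@dataclass
class Shift:
    """Hypothetical scorekeeper for one on-call shift."""

    remaining: int = SHIFT_SECONDS
    handovers: int = 0

    def tick(self, seconds):
        """Advance the clock; on expiry, hand over and restart the countdown."""
        self.remaining -= seconds
        if self.remaining <= 0:
            self.handovers += 1
            self.remaining = SHIFT_SECONDS

    def play_action(self, required_ability, abilities):
        """Apply the time penalty if the player lacks the required ability."""
        penalty = 0 if required_ability in abilities else PENALTY_SECONDS
        self.tick(penalty)
        return penalty
```

Playing “check the logs” without the “Runbook” card would then look like `Shift().play_action("Runbook", {"IAC"})`, which costs the player 30 seconds of their shift.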
Starting the game:
To kick off the game you will need to do some prep beforehand, ideally grabbing some past incidents with a mixture of major incidents and noisy alerts to keep it as realistic as possible. We use PagerDuty and preload the incidents with the information so that it is ready to push out during the game. Once you’re ready to go and have assigned all of the roles and everyone understands their responsibilities, you can start the game.
Trigger the first incident in PagerDuty, assign it to the person with the Primary On-Call role, and start the timer once they have acknowledged the page. To add a little bit of pressure, we have a 5-minute timer counting down that is visible to all players during the game.
The Primary On-Call will review the page and their action cards to determine what they can do, with the Games Master providing some context for the alert, such as “a deployment of X services took place 5 minutes ago” or “Multiple synthetics are also failing for service Y”. The primary and secondary take turns attempting to identify, triage and resolve the issue while talking through the decisions being made. At any time the Games Master may use a “Play Now” card to spice up the game or force a change in direction.
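One way to script the trigger, rather than clicking through the PagerDuty UI, is the PagerDuty Events API v2. The sketch below assumes you have created a practice service with an Events API v2 integration and that its escalation policy pages whoever is playing Primary On-Call; the function names and the example summary are made up for illustration, and this is not necessarily how our own game incidents are triggered.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 'trigger' event body."""
    return {
        "routing_key": routing_key,  # integration key of the practice service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event):
    """POST the event to PagerDuty; needs network access and a valid key."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

The Games Master could prepare one event per past incident with `build_trigger_event(...)` ahead of the session and call `send_event(...)` when it is time to page the player.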
The game postmortem
At the end of each game we run a mini postmortem to identify what went well, what didn’t go so well and what we can learn and carry over to future rounds. It’s also important to get immediate feedback on the team’s performance so that actions and improvements can be made.
Don’t forget to have fun!
There are lots of ways you can run a game session and put your own flavour on it, but here are a few things that help us.
- Have fun — You will get more engagement if you can keep some fun and humour in the games.
- Encourage a culture of learning and feedback — Any decision you make in the game is the right one for the context of the situation. Use the mini postmortem at the end to provide some immediate guidance for improvements and also to celebrate the team’s successes!
- Keep a regular cadence of sessions — The more times you run these sessions, the more practice your team gets and the more comfortable they will become with the processes and practices of being on-call.