Teach Incident Response with Games

Use incident response games and roll out confident SRE teams before systems go live.

Paul Kirk
Published in Slalom Build
9 min read · Dec 20, 2019

You build it, you own it

As more organizations adopt SRE and embrace the “you build it, you own it” operating model, new product engineering teams may face anxiety about operating their products in live software systems. Incident games are fun team-building exercises that provide key insights into how individuals and teams work together to solve simulated problems. Practice incident response and train confident SRE teams before systems go live.

We’re Gamers!

Games are a great way to reduce anxiety over operating live software systems. They provide fun and informal settings for observing how teams solve problems in stressful situations. Moreover, a strong gaming culture likely already exists within an engineering organization, so “playing” to this strength is an effective strategy for informally introducing SRE incident response principles to new product teams.

Two people playing Sony PS4 game console
Photo by JESHOOTS.COM on Unsplash

A Path to Practicing Real Incident Response

One outcome of running games is identifying the team members who are most likely to succeed in each role of the incident-commander model: Incident Commander (IC), Deputy, Scribe, Liaison and Subject Matter Expert (SME). Before practicing a Game Day in a live software environment, run fun mock incident games first and assess how well the team solves simulated problems. Mock games reveal team interactions and player communication in a shorter period of time. These observations shape the operating norms that work best for the team in a real-world incident.

Mock incident response games reduce misconceptions about supporting live software systems, such as “fear-of-the-pager”, and reinforce the notion that we’re in it together and we have each other’s backs. Including mock incident response games in an overall SRE training program establishes an early mindset in the product SDLC that reinforces a strong “we own it” culture and provides a safety net for new engineers joining the product team.

Use mock incident response game observations and qualify players for formal incident roles later. These initial assessments provide a baseline for evaluating how well your team might perform in real software-based Game Day scenarios.

Person working on blue and white paper on board
Photo by Alvaro Reyes on Unsplash

Getting Started

Find a dedicated person from your team who serves as your Game Master for the entire gaming session. The Game Master schedules the sessions, collects data about the game play including the effectiveness of shared communication tools, player and team communication behaviors, and makes recommendations for formal incident roles in a future Game Day.

The Game Master facilitates retrospectives with the team after each game iteration and reviews what went well, what went poorly and what needs to improve in the next round. Think of this as a lightning scrum retrospective: keep it under five minutes and avoid overanalyzing. The Game Master keeps the session fun, moves it forward and always brings pop and pizza for the players!

Picture of people sitting around pizza boxes and coke cans
Photo by Evelyn on Unsplash

Game Recommendation

Keep Talking & Nobody Explodes is an awesome mock incident response game with a simple premise: tell a bomb defuser how to prevent a bomb from exploding. The defuser cannot see the manual and the bomb experts cannot see the defuser’s screen. What makes this game cool is its short, five-minute duration, its use of randomly configured bomb modules, and its ability to include distracting and annoying sounds. These sounds may represent an impatient executive badgering the team for incident updates or a stressful reminder about the steady tick of lost revenue and users. This game is an excellent way of simulating the stresses of an incident in a low-stakes, fun gaming experience.

Organizing the Game

The structure of a company and the location of its product teams affect the organization of the game.

When product teams are located in different geographic regions, first organize a local game with players from co-located product teams. Although the dedicated product team won’t be there in person, there is value in watching how engineers play the game and interact. Plus, a local game is easier to organize and helps iron out the kinks before a larger organization rollout. Reap the learnings from these experiences and be a champion for running local incident games in other product build centers.

After running a local game, organize remote games and invite the dedicated product team to dial in. The Game Master observes how the team uses the communication tooling during the games. Ask players to install the communication client ahead of time and to become familiar with the tool’s operation, especially the mute button!

Three people sitting down and pointing at a silver laptop computer
Photo by John Schnobrich on Unsplash

Setup

Keep Talking & Nobody Explodes has a number of client options including mobile. For a local gaming session, install the client on one laptop and then share it with other players as they take turns being the defuser. Download the accompanying bomb defuser manual and print copies.

Avoid playing the game by yourself or studying the manual before the official game starts. Knowing what might happen defeats the purpose of the exercise and significantly reduces the effectiveness of the simulated incident response. Those who have installed the client should spend time reviewing the menus, controls, and configuration, but avoid starting a game without the players!

For a local gaming session, book a conference room or find a space where the players can get together. Anticipate 90 minutes for four to six players and schedule additional sessions for larger teams. Everyone wants a turn as the bomb defuser!
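
As a rough planning aid, the sizing guidance above can be turned into a small calculation. This is only a sketch: the function name `plan_sessions` and its parameters are hypothetical, and the defaults simply restate the article's estimate of 90 minutes for up to six players.

```python
import math

def plan_sessions(team_size, max_players=6, minutes_per_session=90):
    """Split a team into sessions of at most max_players and estimate total time.

    The defaults mirror the rule of thumb in the text; adjust them to taste.
    """
    if team_size < 2:
        raise ValueError("the game needs at least one defuser and one expert")
    sessions = math.ceil(team_size / max_players)
    # Spread players as evenly as possible so no session is left nearly empty.
    base, extra = divmod(team_size, sessions)
    sizes = [base + 1 if i < extra else base for i in range(sessions)]
    return sizes, sessions * minutes_per_session

sizes, total_minutes = plan_sessions(14)
print(sizes, total_minutes)  # [5, 5, 4] 270
```

A team of 14 would need three sessions under these assumptions, which is a useful number to have before booking rooms.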

grayscale photography of man in striped shirt setting up equipment for a concert
Photo by Adi Goldstein on Unsplash

Game Modes

There is no prescriptive way of playing Keep Talking & Nobody Explodes, but here’s a set of scenarios to run for an organization leaning towards an incident-commander response model:

  • Round Robin Operator Game Mode (local or remote) — in this mode, all players in the room (including dial-ins) take turns being a bomb defuser.
  • Incident Commander Game Mode (local or remote) — in this mode, the Game Master assigns the best Game Day roles based on observations made during the Round Robin Operator games. However, give players the flexibility to try out different Game Day roles if they want to.
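
For the Round Robin Operator mode, a Game Master might generate the turn order up front so every player gets exactly one turn as defuser. The sketch below is illustrative only; `defuser_rotation` and the player names are made up for the example.

```python
from typing import Iterator, List, Tuple

def defuser_rotation(players: List[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (defuser, experts) pairs so each player defuses exactly once."""
    for i, defuser in enumerate(players):
        # Everyone except the current defuser acts as a bomb expert.
        experts = players[:i] + players[i + 1:]
        yield defuser, experts

# Hypothetical players, including dial-ins.
for defuser, experts in defuser_rotation(["Ana", "Ben", "Chi"]):
    print(f"{defuser} defuses; experts: {', '.join(experts)}")
```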

Before running the Incident Commander Game Mode, ask all players to listen to a recorded incident response call as this provides an excellent reenactment of the various roles that participate in a SEV3 incident.

Running the Games

When the bomb experts are ready, the Game Master instructs the bomb defuser to start a session. At the end of each game, run a retrospective and try again. If players fail to defuse the bomb the first time, keep the same bomb defuser and experts together until the incident has been solved. This allows the Game Master to measure the effectiveness of communication improvements proposed by the team during the retrospective.
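
One lightweight way to track this loop is a simple log per incident: each attempt records whether the bomb was defused and which improvement the team agreed to try next. This is a minimal sketch, not a prescribed format; the `Attempt` and `Incident` classes and their fields (including `seconds_used`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attempt:
    defused: bool
    seconds_used: int        # time elapsed when the round ended (assumed metric)
    retro_action: str = ""   # improvement the team agreed on in the retrospective

@dataclass
class Incident:
    defuser: str
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, defused: bool, seconds_used: int, retro_action: str = "") -> None:
        self.attempts.append(Attempt(defused, seconds_used, retro_action))

    @property
    def solved(self) -> bool:
        # The incident is solved once the most recent attempt defused the bomb.
        return bool(self.attempts) and self.attempts[-1].defused

    def attempts_to_solve(self) -> Optional[int]:
        return len(self.attempts) if self.solved else None

incident = Incident(defuser="Ana")
incident.record(False, 300, "Experts confirm each instruction before moving on")
incident.record(True, 240)
print(incident.attempts_to_solve())  # 2
```

Comparing the retrospective actions against the number of attempts gives the Game Master a simple, if crude, signal of whether the proposed communication changes helped.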

While running the Incident Commander Game Mode, the Game Master also takes notes on how well the players perform in their assigned Game Day roles.

A group of men standing on the starting blocks of a swim race
Photo by Arisa Chattasa on Unsplash

Empowering the Team

Mock incident games provide insights into team problem solving. Try to help teams understand where they excel and fail, and encourage the team to define consistent operating norms that work best for them. Highlight situations where the team makes steady improvement with problem solving and celebrate success as the complexity of the game increases. Use the following evaluations for helping teams define and harden their live incident response norms:

  • Communication Tool: What characteristics or usage of the communication tool helped or hindered the mock incident response?
  • Players: What individual behaviors and communication styles helped or hindered the game play?
  • Roles: What are the strengths and traits to look for when evaluating players for future live incident response roles?
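
The Game Master's notes for these three evaluations can be kept in a simple structure that sorts each observation into "helped" or "hindered". The sketch below is one possible shape; the category keys, the `Observation` class, and the `summarize` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Keys mirror the three evaluations listed above.
CATEGORIES = ("communication_tool", "players", "roles")

@dataclass
class Observation:
    category: str   # one of CATEGORIES
    note: str
    helped: bool    # True if it helped the response, False if it hindered

def summarize(observations: List[Observation]) -> Dict[str, Dict[str, List[str]]]:
    """Group notes by evaluation category and by helped/hindered."""
    summary = {c: {"helped": [], "hindered": []} for c in CATEGORIES}
    for obs in observations:
        if obs.category not in summary:
            raise ValueError(f"unknown category: {obs.category}")
        key = "helped" if obs.helped else "hindered"
        summary[obs.category][key].append(obs.note)
    return summary

notes = [
    Observation("communication_tool", "Two players forgot to unmute", helped=False),
    Observation("players", "Experts repeated instructions back", helped=True),
]
print(summarize(notes)["communication_tool"]["hindered"])
```

A summary like this can feed directly into the retrospective and, later, into the role recommendations.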

Communication Tool Evaluation

The outcome of the Communication Tool Evaluation is a set of team communication norms that everyone agrees are the most effective way of using the tool during a live incident. However, the team may feel it is important to evaluate other communication tools before committing.

  • Do players operate the tool competently?
  • Do players demonstrate timely and appropriate use of the mute button?
  • Do players minimize background noise and distractions throughout the game?
  • Is player video sharing helpful or distracting?
  • Is player chat messaging helpful or distracting?
  • Is a contingency communication plan in place when experiencing connectivity issues?
  • Is player language clear and comprehensible?

A man sitting down and yelling into a rotary phone
Photo by Icons8 team on Unsplash

Player Evaluation

The outcome of the Player Evaluation is a set of individual communication norms that everyone on the team agrees are the most effective communication style during a live incident.

  • Do players interrupt or talk over each other?
  • Are some players more assertive than others?
  • Do players show visible signs of anxiety or stress such as clenched fists and teeth, finger and foot tapping, rubbing face, closed eyes, and excessive sighing?
  • Do players appear annoyed, impatient, frustrated and use profanity during the game?
  • Do players stray from their lanes or fail to abide by the game rules?

Picture of a foosball table
Photo by Bruno Aguirre on Unsplash

Role Evaluation

The outcome of the Role Evaluation is a player report with recommendations for further formal incident role training.

Incident Commander and Deputies

  • Demonstrated ability to coordinate the response and direct players during the incident
  • Demonstrated ability to keep players in their lanes
  • Demonstrated ability to ask for appropriately timed updates from the SMEs
  • Demonstrated ability to remain calm
  • Demonstrated ability to react quickly to changing situations during the game without exacerbating stress

Scribes

  • Demonstrated ability to document key moments and milestones of the game in proper order and time
  • Demonstrated ability to fill in gaps about the incident during the game’s retrospective
  • Demonstrated ability to clarify facts and events appropriately during the course of the game

SMEs

  • Demonstrated ability to remain calm during the game incident
  • Demonstrated ability to clarify, rephrase and provide on-point contextual updates during the game

A co-pilot and captain in the cockpit of a commercial plane facing the runway
Photo by Jon Flobrant on Unsplash

Training for Real

The tabletop game Risk is a fun way of trying out your military strategy but hardly qualifies you for a real military career, since there is no penalty for failing in the game. Similarly, playing Keep Talking & Nobody Explodes is a nice way of introducing mock incident response in a relaxed environment, but real incident response is serious business when brand, downtime, and revenue are on the line. A mock incident response game also won’t simulate grogginess from a 2 a.m. wake-up call. However, as part of an overall incident response training program, mock incident response games provide supplemental training for live on-call product teams. Design applications and systems with chaos engineering in mind and measure the effectiveness of the team as it faces uncertainty in Game Day situations. Practice and be ready for the unknown.

Games provide valuable practice time similar to sports teams who spend time reviewing tape, running drills, and installing adjustments before the big game. Investing in games demonstrates an organization’s empathetic approach for addressing the anxiety and stress that on-call engineers face during live incidents. Use the outcomes from games and build confident response teams who embrace failure rather than fear it.

The best time to learn about fire is when you’re on fire.

— Jen Hammond, New Relic engineering manager

Paul Kirk is a product software veteran from Seattle, WA. He is a music lover, vinyl enthusiast, sci-fi reader, and a huge fan of UW football.