The SRE Incident Response game

Photo by Clint Bustrillos on Unsplash

Shall we play a game?…

Sustainable on-call duty is critical for any incident response: it provides the human filter for the reliability and availability of the services under the watchful care of the engineering team. The key word is sustainable. Sleep peppered with PagerDuty alerts and complex, adrenaline-pumping incidents can take their toll and lead to burnout. An on-call rotation that leaves engineers feeling underprepared, stressed and ready to quit is something we want to avoid, not only because it’s harmful to your engineers but also because it undermines the availability of your services. So let’s set the scene.

What you need:

The Players

Games Master — You are responsible for guiding the scenario. This includes triggering the incident in PagerDuty, describing the scenario and providing context, as well as responding to questions from the on-call engineers.
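For the Games Master, triggering the practice incident can be scripted rather than clicked through. The following is a minimal sketch, assuming you use the PagerDuty Events API v2 and have an integration key for a service reserved for game sessions. The `[GAME]` prefix, the `sre-incident-game` source name, the example scenario and the placeholder key are all hypothetical choices made for illustration, not part of the original game.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint that accepts trigger events.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary,
                        source="sre-incident-game", severity="critical"):
    """Build an Events API v2 'trigger' body for a practice incident.

    The "[GAME]" prefix and the default source are assumptions, used here
    so that nobody mistakes the page for a real outage.
    """
    return {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": f"[GAME] {summary}",
            "source": source,
            "severity": severity,     # critical, error, warning or info
        },
    }


def send_event(event, url=EVENTS_API_URL):
    """POST the event to PagerDuty and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical key and scenario; replace with your own before sending.
    event = build_trigger_event("YOUR_INTEGRATION_KEY",
                                "Checkout API returning 500s")
    print(json.dumps(event, indent=2))
    # send_event(event)  # uncomment to actually page the on-call engineer
```

Keeping the payload builder separate from the network call lets you review the scenario text before anyone gets paged.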

  • Communicate to stakeholders
  • Resolve the Major Incident
  • Fix the incident if possible
  • Escalate to an Incident Commander if required
  • Aid in collecting information for future forensics

The Cards

Action Cards:

The goal

With a countdown of 5 minutes, you have to:

  • Escalate to a Major Incident via the Incident Commander (if required).
  • Resolve the incident.
  • Go back to bed.
  1. Encourage a culture of learning and feedback — Any decision made in the game is the right one for the context of the situation. Use the mini-postmortem at the end to provide immediate guidance for improvements and to celebrate the team’s successes!
  2. Keep a regular cadence of sessions — The more often you run these sessions, the more practice your team gets and the more comfortable they will become with the processes and practices of being on-call.
