Incident Response Processes: Tabletops

Sean Clemmer
Published in In the weeds
Aug 21, 2019

Background

After a painful series of outages in January of this year, we at Greenhouse found ourselves a little embarrassed but also determined to avoid the situation in the future. Conducting a large, cross-functional review, we identified major flaws with our Incident Response strategy. On the technical side, there were both missing and misconfigured alerting rules, which meant our responders were not properly notified of issues. Perhaps more importantly, though, there was a general lack of clarity and confidence around how incidents are managed: When is it appropriate to declare an incident? How does one go about doing that? What even is an incident?

While we were able to address most of those technical flaws within the next few weeks, it has taken months to design and implement solutions for us humans. There is plenty of advice to be found online as organizations begin to share their reliability stories, and indeed Incident Response here at Greenhouse is heavily influenced by industry leaders like PagerDuty and Google. In that spirit, we hope this document provides practical, actionable resources and real-world examples for anyone interested in developing reliability practices.

Today, we describe our journey in implementing one such established practice, something we call the Tabletop.

Development

In surveying the landscape of Incident Response internally, we noticed a number of related problems: First, teams outside Engineering often did not know how to trigger a page or may not have been comfortable enough to pull the trigger. Teams within Engineering also maintained their own documentation around on-call rotations, leading to subtle differences in implementation and expectations. Training was applied inconsistently across teams, and many tenured engineers never actually received formal training. In short, we failed to design a coherent system for Incident Response, and we failed to share this vision with everyone involved.

So we rewrote our internal documentation, leaning heavily on prior art at PagerDuty. We created a unified hub for Incident Response materials at Greenhouse, organized for quick reference. We designed separate guides, references, and checklists for each role and stage of Incident Response, which makes it easy to navigate whether you’re an Incident Commander in the middle of an incident or a new hire just poking around.

While accessible, articulate materials are necessary, alone they are not sufficient to close knowledge gaps. This information should be absorbed and mastered. To address this, we decided to create a new kind of training exercise and revise our processes to ensure everyone receives proper support in their journey. We call our new exercise the Tabletop.

The Tabletop is a practice in the same vein as the Wheel of Misfortune or Walk the Plank exercises found at places like Google, where engineers apply incident response skills by reacting to an imagined or staged scenario. No servers are harmed in the making of our Tabletops; they are completely virtual. In practice, it works a bit like a game of Dungeons and Dragons: Players assume a role and navigate through a world designed and controlled by the Dungeon Master. We’ve changed a few of the titles, but the game remains the same!

Tabletops

Dramatis Personæ

You’ll need a few people to put on a Tabletop:

  • Moderator
  • Notetaker
  • Players

The Moderator conducts the Tabletop, and as such should be familiar not only with Incident Response processes but also with Tabletops themselves. Moderators come prepared with a handful of scenarios, ready to drop in and either play or direct various roles. Moderators describe the scene, respond to in-game events, and drive discussion.

The Notetaker documents actions taken during the exercise for later review. Notetakers also help the Moderator maintain clear communications and adhere to a consistent narrative. As a consequence, Notetakers should also be familiar with both Incident Response generally and Tabletops specifically.

Any number of Players are required to actually implement scenes. Players choose (or are assigned) any number of real-life roles, from Customer to Incident Commander to CEO. Players describe actions they would like to take and receive updates from the Moderator. Players may change roles between scenes. While Players need not be familiar with Tabletops, they should come prepared to take on a role in Incident Response.

Schema

Each meeting begins with a short introduction by the Moderator on Tabletop mechanics. Depending on the crowd, the scenarios, and desired goals, we recommend an hour to execute around three scenes. Our meetings typically look something like this:

  • 5m to arrive, intro
  • 5m for warmup scenario
  • 2m to review, discuss
  • 10m for first scenario
  • 3m to review, discuss
  • 20m for second scenario
  • 5m to review, discuss

This leaves about ten minutes of slack in the hour while also limiting the number and scope of exercises to avoid fatigue. We follow each scenario with a short review and discussion: Was the incident managed well? Was there anything we could have done differently to achieve a better outcome or a faster resolution?
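The arithmetic behind that agenda is easy to check. Here is a minimal Python sketch; the `agenda` list simply restates the schedule above, the 60-minute budget is the recommended hour, and the variable names are illustrative only:

```python
# Agenda items as (minutes, activity), copied from the schedule above.
agenda = [
    (5, "arrive, intro"),
    (5, "warmup scenario"),
    (2, "review, discuss"),
    (10, "first scenario"),
    (3, "review, discuss"),
    (20, "second scenario"),
    (5, "review, discuss"),
]

MEETING_MINUTES = 60  # the recommended hour

scheduled = sum(minutes for minutes, _ in agenda)
slack = MEETING_MINUTES - scheduled

print(f"scheduled: {scheduled}m, slack: {slack}m")  # scheduled: 50m, slack: 10m
```

If you swap in longer scenarios, rerunning the sum is a quick way to see whether the review slots still fit.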

Moderating

It is imperative that the Moderator is well-prepared before arriving. To that end, we maintain a Scenario Bank internally: For each scenario we provide a short sketch, which outlines important causes, actions, and outcomes, and offers alternative avenues of exploration. We also include a rough size estimate and a list of roles involved to aid in planning. An example:

Name:
  Night, Night, Foobar
Size:
  Small <10m
Description:
  Critical alert saying service Foobar is down during US night.
  Regional failure in Foobar not reflected accurately in DNS.
Roles:
  Incident Commander
  Application On-Call
  Infrastructure On-Call
  Support
Sketch:
  Support able to reproduce issue.
  Application On-Call unable to reproduce issue.
  Metrics show Foobar at ~25% error rate
  Metrics show Foobar errors localized to region West
  Logs show errors related to backend database
  No loss of data, but requests are denied
  Need to page Infrastructure On-Call
  Infrastructure discovers inaccurate DNS
  Infrastructure manually triggers DNS failover
Avenues:
  Multiple regional failures
  Infrastructure doesn't answer page
  Infrastructure sinks time into fixing database
Discussion:
  Appropriate to mitigate, fix tomorrow
  Is Incident Commander necessary here?
  When is it appropriate to status page?
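
If you would rather keep scenarios as structured data than free text, an entry like this maps naturally onto a small record type. The following is only a sketch under that assumption: the `Scenario` class and its `missing_roles` helper are hypothetical names, not part of our actual Scenario Bank, and the field values are copied from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One Scenario Bank entry; field names mirror the example above."""

    name: str
    size: str
    description: list[str]
    roles: list[str]
    sketch: list[str]
    avenues: list[str] = field(default_factory=list)
    discussion: list[str] = field(default_factory=list)

    def missing_roles(self, available: set[str]) -> list[str]:
        """Hypothetical planning helper: roles nobody is available to play."""
        return [role for role in self.roles if role not in available]


night_night_foobar = Scenario(
    name="Night, Night, Foobar",
    size="Small <10m",
    description=[
        "Critical alert saying service Foobar is down during US night.",
        "Regional failure in Foobar not reflected accurately in DNS.",
    ],
    roles=[
        "Incident Commander",
        "Application On-Call",
        "Infrastructure On-Call",
        "Support",
    ],
    sketch=[
        "Support able to reproduce issue.",
        "Application On-Call unable to reproduce issue.",
        "Metrics show Foobar at ~25% error rate",
        "Metrics show Foobar errors localized to region West",
        "Logs show errors related to backend database",
        "No loss of data, but requests are denied",
        "Need to page Infrastructure On-Call",
        "Infrastructure discovers inaccurate DNS",
        "Infrastructure manually triggers DNS failover",
    ],
    avenues=[
        "Multiple regional failures",
        "Infrastructure doesn't answer page",
        "Infrastructure sinks time into fixing database",
    ],
    discussion=[
        "Appropriate to mitigate, fix tomorrow",
        "Is Incident Commander necessary here?",
        "When is it appropriate to status page?",
    ],
)

# Which roles still need a Player if only these two are covered so far?
print(night_night_foobar.missing_roles({"Incident Commander", "Support"}))
```

Keeping the roles as a list makes it simple to check, before the meeting, whether enough people will be in the room to run a given scenario.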

As we run Tabletops periodically, new scenarios are added, and old scenarios may be updated. Scenarios are typically drawn from actual incidents, so we can review the differences between the actual and simulated responses during our discussions.

With a scenario in hand, the Moderator can start the game. First, roles are assigned to participants. We often call for volunteers, but obviously you’re free to use whatever strategy you like. We have also found it best to assign everyone in the room a role, however minor, to keep them engaged.
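
One way to make sure nobody is left without a role is to hand out the scenario's roles first and give everyone else a minor one. The sketch below assumes participants are listed in the order they volunteered; the `assign_roles` function, the sample names, and the choice of "Customer" as the filler role are all illustrative, not a description of how we actually do it.

```python
from itertools import cycle


def assign_roles(participants, scenario_roles, minor_roles=("Customer",)):
    """Give every participant a role: scenario roles first, then minor ones."""
    if len(participants) < len(scenario_roles):
        raise ValueError("not enough participants to cover the scenario roles")
    if not minor_roles:
        raise ValueError("need at least one minor role for extra participants")

    fillers = cycle(minor_roles)  # reuse minor roles as often as needed
    assignments = {}
    for index, person in enumerate(participants):
        if index < len(scenario_roles):
            assignments[person] = scenario_roles[index]
        else:
            assignments[person] = next(fillers)
    return assignments


roles = [
    "Incident Commander",
    "Application On-Call",
    "Infrastructure On-Call",
    "Support",
]
# Sample names only; "Extra Player" stands in for anyone beyond the core roles.
people = ["John", "Ringo", "Mary", "Peter", "Extra Player"]

assignments = assign_roles(people, roles)
for person, role in assignments.items():
    print(f"{role}: {person}")
```

In practice the Moderator will usually override a purely mechanical assignment, for example to put a new hire into the role they will hold on-call.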

With roles assigned, the Moderator introduces the scenario and describes the inciting event. Players try their best to verbalize their intended actions, which must then be acknowledged by the Moderator to have any effect in the game. As Players, especially new Players, may not be aware of all potential actions available to them, the Moderator may provide alternate suggestions or point to existing resources like runbooks. We have also been experimenting with “character sheets” to help guide Players. Moderators are also free to accept less precision: A Player doesn’t necessarily need to execute a runbook step-by-step; they might execute it as one big action.

Notetaking

Notetaking requires less preparation, but Notetakers are certainly no less engaged in the exercise than the Moderator. We begin each Tabletop by sharing a document, which contains a short preamble with meeting details like attendees and recording link. Each scenario gets its own section where the Notetaker records minutes, trying to capture the high-level semantics of what happens rather than exact phrasing. With this bird’s-eye view of the Tabletop, the Notetaker is often able to provide useful insights during discussions.

Tabletop recordings are made public so that anyone can review. The accompanying notes provide useful highlights. For example:

2020-01-01 Tabletop
Moderator: George
Notetaker: Paul
Incident 1: Night, Night, Foobar
Incident Commander: John
Application On-Call: Ringo
Infrastructure On-Call: Mary
Support: Peter
Mod: Application On-Call paged: service Foobar is down
App: Acknowledge page. Grab laptop and login to observability services
...snip...
IC: Calling this Mitigated. Updating status page. Revisit tomorrow AM.
Mod: Exercise complete.
Discussion:
Mary: Consider notifying Support sooner
Peter: Difficult late at night. May need PagerDuty
TODO(George): Strategy to notify support after hours
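
Because the notes follow a loose convention, follow-up items are easy to pull out afterwards. As a hedged illustration, the sketch below scans the discussion excerpt above for `TODO(owner): task` lines. That pattern is inferred from this one sample, and the `TODO_PATTERN` and `action_items` names are hypothetical.

```python
import re

# Discussion excerpt copied from the sample notes above.
notes = """\
Discussion:
Mary: Consider notifying Support sooner
Peter: Difficult late at night. May need PagerDuty
TODO(George): Strategy to notify support after hours
"""

# Matches lines shaped like "TODO(owner): task".
TODO_PATTERN = re.compile(
    r"^TODO\((?P<owner>[^)]+)\):\s*(?P<task>.+)$",
    re.MULTILINE,
)

action_items = [(m["owner"], m["task"]) for m in TODO_PATTERN.finditer(notes)]

for owner, task in action_items:
    print(f"{owner}: {task}")  # George: Strategy to notify support after hours
```

A list like this can then be copied into whatever tracker the team already uses, so discussion outcomes do not stay buried in the notes.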

With access to the Scenario Bank, notes, and recordings, many participants feel comfortable enough to assume the Moderator or Notetaker role after only one or two sessions. While we started with only a handful of potential Moderators, we are now able to draw from a large pool of candidates, so no one individual is overwhelmed by the load.

Conclusion

Of course, Tabletops are also limited by their nature as “virtual” or “imaginary” exercises. Services don’t actually die, and the pager doesn’t actually fire. It is not real, so there is no real urgency. The mechanics can feel awkward: Players must “call their shot” and wait for the Moderator to respond and incorporate actions into the game. It is not always obvious what actions may be available to a Player.

In practice, most Players adapt quickly to the structure of the game and find it increasingly easy to maneuver as they internalize the incident response process. To compensate for the lack of real urgency, we also combine Tabletops with more realistic experiences, such as staging incidents or shadowing On-Call.

Compared to those exercises, though, Tabletops are more flexible and require fewer resources: You only need a few scenarios (already prepared), a bit of time, and a handful of willing participants. Tabletops can be executed locally, with everyone in the same room, or completely distributed, with everyone joining a shared conference. Scenarios evolve naturally to reflect teams’ experiences and sensibilities. In fact, we may be in for a name change, as some of our engineers have taken to calling the Moderator Dungeon Master and Notetaker Page Master!

While it has only been a few months in practice, and though the process is still evolving, we have not had another string of incidents like we saw this past January. Tabletop participants have responded with enthusiasm, and the increase in confidence is readily apparent. Exercises like this help us provide our team members with a comprehensive introduction to our reliability engineering practices. Greenhouse engineers now feel more prepared to tackle incidents of varying scope, and teams outside R&D feel more assured and involved.

Interested in projects like this? We’re hiring.
