Tabletop Exercises for Engineering Teams

I recently had the opportunity to lead a week-long set of tabletop exercises, and I wanted to share why I think tabletops are a great way to teach how a complex system works.

What are Tabletop Exercises?

I was first introduced to tabletop exercises through the Information Security and Emergency Preparedness domains, where tabletops are frequently used to simulate a real-world event (e.g., a cyber attack or natural disaster) and to force participants to act out how they would respond.

To put this in terms of software systems, a tabletop exercise is one where a failure of some sort is manually injected into a system, and the participants must then run through the process of resolving it (i.e., identify, triage, fix, and conduct a post-mortem).

Here are some examples of simple failures that could serve as good tabletop exercises (a sketch of injecting one of them follows the list):

  • NTP breaks on your distributed container orchestration cluster
  • your primary message queue runs out of disk and stops accepting new writes
  • a change in your firewall rules prevents an application from talking to its primary datastore
  • an application update results in corrupting data for 30% of your customers
  • a private key for your frontend web TLS cert is compromised
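
To make one of these concrete, here is a minimal sketch of how the firewall scenario above might be injected. It assumes a Linux host with root access and iptables; DB_HOST and DB_PORT are placeholders for wherever your application's primary datastore actually lives.

    # Hypothetical injection for the firewall scenario above: drop outbound
    # traffic from the app host to its datastore. Assumes Linux, root access,
    # and iptables; DB_HOST and DB_PORT are placeholders.
    import subprocess
    import sys

    DB_HOST = "10.0.0.42"  # placeholder address of the primary datastore
    DB_PORT = "5432"       # e.g., PostgreSQL

    def inject():
        # Add a rule that silently drops new connections to the datastore.
        subprocess.run(
            ["iptables", "-A", "OUTPUT", "-d", DB_HOST,
             "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"],
            check=True,
        )

    def rollback():
        # Remove the same rule once the scenario and post-mortem are done.
        subprocess.run(
            ["iptables", "-D", "OUTPUT", "-d", DB_HOST,
             "-p", "tcp", "--dport", DB_PORT, "-j", "DROP"],
            check=True,
        )

    if __name__ == "__main__":
        rollback() if "--rollback" in sys.argv else inject()

The rollback step matters as much as the injection: you want the environment back in a known-good state before the next scenario starts.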

Why Spend Time on Tabletops?

Tabletops are first and foremost a teaching tool. Regardless of the system you are building, you need to educate your team on how it works. Tabletops allow you to do that in an active way by playing out realistic failure scenarios in that system. The mental model people form from actually fixing a system is likely to be much closer to reality than the mental model formed from passively reading documentation. Having everyone in the same room, actively engaging with a failing system, is an excellent way to teach participants how the system actually works.

At the same time, tabletops allow you and your team to identify real gaps in your systems and processes. Ask yourself when you want to find out that your wonderfully documented, and untested, database restore process doesn’t actually work: during a controlled exercise, or in production at 2am?

Finally, tabletops provide a way to reduce implicit communication barriers in a team. This is especially important if a team is just starting to work together, as they may not be comfortable yelling out what they are investigating or what they are seeing in the system. However, over-communicating is critical to successfully triaging complex systems, and running practice simulations helps the team build good communication habits.

Some tips for running a successful tabletop:

  • Make sure everyone is in the same room. Even if your team is fully remote and distributed (as mine is), it is important that you all get together for the tabletop exercises. Tabletops are for improving team communication as well as teaching how the system works, and it is hard to do either over a Webex. Also, post-mortems normally involve a lot of brainstorming about how to fix the gaps found during the incident, and brainstorming is more efficient in person.
  • Dedicate a continuous block of time. I find that 3–4 days works best, as this gives you enough time to practice multiple scenarios, but is not so long that people lose energy.
  • Conduct scenarios in a realistic environment. It is important that the scenarios you enact happen in an environment as close to prod as possible so that the lessons learned actually carry over to production. Ensure that all monitoring and alerting is set up to work in the tabletop environment, and encourage the team to use the same tools they will have available in a real-world incident: if they normally use Slack to share code snippets, they should do that during the scenario.
  • Use the post-mortem to close mental gaps. If you are the person running the scenarios, it is very hard not to jump in during the scenario to offer advice on how to fix something. You must resist this urge and hold your feedback until after the scenario is completed, because allowing the team to fumble around in the dark for a while is the best way for them to learn. However, be sure to cover any missed symptoms or wrong conclusions in the post-mortem, as you don’t want the team walking away with the wrong mental model of the system.
  • Record your findings and action items. There will be a lot of actionable work items coming out of a tabletop, and you should be walking away with a list of new user stories to implement. These stories will normally focus on new monitoring and alerting, new automation, and new runbooks. Essentially anything you identified as a gap during a post-mortem should end up as a story for the team to implement.

Who Should be Involved?

This really depends on the layout of your organization, but I think the following breakdown of roles is useful:

  • Proctors: The proctors will run the scenarios by creating the simulated “disasters” and guiding the participants through the triaging of the incident. Proctors should not play an active role in incident triaging, but should provide guidance when necessary.
  • Participants: The participants will not know about the scenarios beforehand; they will be responsible for identifying the “disaster” that the proctors enact and triaging it.

You want your ‘proctors’ to be the people who understand the system best; ideally they should be the ones who designed and built it (e.g., senior engineers, systems architects).

The ‘participants’ should be the people who will be involved in the day-to-day operation of the system in production. This group could include the traditional Ops (or DevOps) team, QA engineers, or the engineers building the features.

Crafting Good Scenarios

Crafting failure scenarios that teach how the system works in a realistic way is not easy, and this is probably why people do not run tabletop exercises as often as they should.

Some tips:

  • Break the system in the most realistic way possible. For example, if you are crafting a scenario that involves a disk outage in your message queue, don’t just fill the disk with dummy files from dd and /dev/random (dummy files appearing on disks probably won’t happen that often in prod). Instead, actually overload the message queue with real messages (see the sketch after this list).
  • If you can’t break the system in realistic ways, run a whitebox scenario. It is okay if you can’t think of a realistic way to break something; it is not always easy, and we don’t always have unlimited time to craft scenarios. In these cases, just tell the participants that the scenario is a whitebox scenario and explain the fictional disaster. This way the team can still test the runbooks for the particular disaster without wasting cycles trying to identify something they are not likely to see in production. For example, if you don’t have time to alter application code to simulate DB corruption in a realistic way (running DROP DATABASE from a MySQL shell is not realistic), just have the team drill the DB restore runbook and verify it works.
  • Play the part of the customer. Although it is not ideal, sometimes our first indication that something is wrong comes from a customer. So if you are running a scenario and you know you lack monitoring for it, play the part of the customer and tell the team what outage you are experiencing (e.g., “I’m customer_x and I am unable to log in to the system”).
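
As a concrete example of the “break it realistically” tip above, here is a minimal sketch of overloading a message queue with real-shaped traffic rather than dummy files. It assumes a RabbitMQ broker and the pika client; the queue name, message shape, and message count are placeholders for whatever your system actually produces.

    # Flood the broker with realistic, persistent messages until its disk
    # alarm trips. Assumes RabbitMQ and the pika client; "orders" and the
    # message body are placeholders for your real traffic.
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="orders", durable=True)

    # Tune the count and payload size to your broker's disk headroom.
    for i in range(10_000_000):
        channel.basic_publish(
            exchange="",
            routing_key="orders",
            body=json.dumps({"order_id": i, "sku": "ABC-123", "qty": 1}),
            properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
        )

    connection.close()

Because the pressure comes from real publishes, the symptoms the team sees (disk alarms, publisher back-pressure, blocked connections) are much closer to what they would see in production than a disk mysteriously filling up with random files.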

Most importantly, you want to make sure you have these scenarios scripted and tested before going into the actual exercises. The last thing you want is to be fumbling around wondering why your bash script to introduce the failure is not actually causing any failure, all while a room of engineers stares blankly at their dashboards.
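
One way to keep a scenario scripted and tested is a small rehearsal harness: inject the failure, check that the symptom is actually observable, and roll everything back. The sketch below is only a shape to fill in; inject_failure, failure_is_visible, and rollback are hypothetical placeholders for your scenario-specific steps, not part of any particular tool.

    # Rehearse a scenario script ahead of time: prove the injection produces a
    # visible symptom *before* running it in front of the team. The three
    # functions are hypothetical placeholders you fill in per scenario.
    import sys
    import time

    def inject_failure():
        """Apply the break (e.g., the iptables rule from the earlier sketch)."""
        ...

    def failure_is_visible():
        """Return True if the symptom the team should notice is present,
        e.g., an alert is firing or a health check is failing."""
        return False  # placeholder

    def rollback():
        """Undo the break so the environment is clean for the real exercise."""
        ...

    if __name__ == "__main__":
        inject_failure()
        time.sleep(30)  # give monitoring a chance to notice
        ok = failure_is_visible()
        rollback()
        if not ok:
            sys.exit("Injection ran but produced no visible symptom; fix the scenario first.")
        print("Scenario rehearsal passed.")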

We Can’t Imagine all Failure Scenarios

One risk of conducting simulated exercises on any system is that they can lure participants into a false sense of security about how things are going to break. Let’s face it: we are human, and our brains are not going to think of every possible failure scenario, especially because the set of possible failures grows exponentially as our systems become more and more complex. Even if we could imagine them all, there would not be enough time to practice them.

This is totally okay, though, because the point of tabletops is not to have participants memorize scenarios and solutions; the goal is to teach them how to debug failures so they can apply those skills to whatever production incident the universe happens to throw at them at 3am.

Just remember to emphasize to the participants that the tabletop scenarios are, in fact, simulations: while they hopefully represent the types of incidents that will happen in production, production will always surprise you.

Tabletops Are Only the Start

Tabletops are a great first step in introducing an organization to handling failures. I think they are also a great introduction to things like Chaos Engineering, but they should be seen as just the beginning. Systems, and the people running those systems, change frequently, so it is important to run tabletops a few times a year.

As your organization matures, you may also want to look at running things like Chaos Monkey in real customer-facing environments to continuously exercise the runbooks and alerting structures developed during the tabletops. This way you keep testing your organization’s ability to respond to failure so those skills do not atrophy. The last thing you want is for an extended period of production uptime to make people lazy or forgetful, so that when a real incident does occur the impact is greater than it should have been.

Also, if you are interested in understanding how to build production-ready systems, I highly recommend reading Release It!, and for books focused more on complex system failures across various domains, I’ve enjoyed The Logic of Failure and Normal Accidents.