Break Your Software, or, How to Run a Gameday
I’m the last person anyone would have expected to write a blog post on how to plan and run a gameday. I was on the Product Research team for two and a half years, where we just built prototypes all day. Since none of our projects made it close to production, I didn’t need to worry about whether my services were super flappy. Before I was at New Relic, I was a junior engineer at a “small, fast-paced startup.” That’s code for “we didn’t always follow best practices when it came to keeping software reliable.” When we decided to start building a Logging product, reliability started to matter, so my manager asked me to take charge of planning and running our team’s gamedays. I made a lot of mistakes in my first gameday, fewer mistakes in my next, and now I’m getting the hang of these things. I figured it was a good time to put my lessons in writing in the hope that you, dear reader, can learn from my mistakes.
What is a gameday?
Basically, “gameday” is what we at New Relic call “breaking things on purpose so you can practice fixing them.” (Other companies do this too, I’m sure, but whenever I describe gamedays to my friends, they all know the practice by a different name.) Examples include introducing latency into database-server connections, throwing excessive amounts of data at a service, disabling an intermediary service, and even fat-fingering some commands.
We block off two hours and follow a schedule of scenarios to run through. Before our scheduled gameday time, we warn the company via email or Slack so that anyone who sees something go wrong will know that we’re just practicing and won’t panic. Each scenario has three roles assigned: Incident Commander, Driver, and Outcome Recorder. The Incident Commander plays the part of the person on call who is paged for the incident. The Driver carries out the steps that cause the incident. The Outcome Recorder takes notes on what happened and on what the team could do to lessen the impact if the same thing goes wrong in the future.
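If it helps to see that structure written down, here is a minimal sketch of a scenario record and the heads-up message. Everything in it (the Scenario class, the announcement function, the example names and wording) is hypothetical and invented for illustration; it is not a New Relic tool.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One gameday scenario and the three roles assigned to it (hypothetical structure)."""
    name: str
    incident_commander: str  # plays the on-call person being paged
    driver: str              # carries out the steps that cause the incident
    outcome_recorder: str    # takes notes and captures follow-ups
    notes: list = field(default_factory=list)

    def record(self, note: str) -> None:
        """Append one observation made by the Outcome Recorder."""
        self.notes.append(note)


def announcement(scenarios, start: str, hours: int = 2) -> str:
    """Build the warning to send by email or Slack before the gameday starts."""
    names = ", ".join(s.name for s in scenarios)
    return (
        f"Heads up: gameday at {start} for about {hours} hours. "
        f"We are breaking things on purpose ({names}), so no need to panic."
    )
```

Sending the message is left out on purpose, since how you post to email or Slack depends entirely on your own setup.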
Why do gamedays?
We gameday because it’s better to practice fixing things in a controlled, supportive environment. Imagine having to fix something you don’t understand for the first time when it’s 2 a.m. and angry tweets are speeding by. Just like ski teachers teach people how to fall without breaking themselves before they teach people how to do fancy tricks, it’s helpful for software engineers to learn how to fix their services without the stress of a real incident, ideally before there is a real incident. In other words, we’re crawling before we’re running.
How do you plan a gameday?
First, you need to choose what scenarios, or incidents, you’re going to run through in the gameday. My team can usually get through at most three scenarios in one gameday, but it’s better to prepare more in case you have extra time; you can always use the leftover scenarios in future gamedays. Choose scenarios that are likely to actually go wrong or that you want to experiment with. For example, if you are worried that your systems won’t be able to handle heavy load, and you know that you will soon get more load than you’ve had before, a good scenario would be to bombard your system with lots of data. This way, you’ll be able to learn how your systems will react and practice recovering from any potential issues. You should also choose scenarios that are easily repeatable, so that you can run through a scenario multiple times if you have to (for example, if new members join the team). Last, make sure that you have alerts and runbooks ready for your scenarios, so the team can evaluate the effectiveness of the existing alerts and runbooks.
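For the heavy-load scenario, a repeatable driver script could look something like the sketch below. It assumes you supply your own send function (hypothetical here) that pushes one payload at a non-production system and returns True on success; the bombard name and the result format are also just illustrative.

```python
from typing import Callable, Dict


def bombard(send: Callable[[bytes], bool], total_messages: int,
            payload_size: int = 1024) -> Dict[str, int]:
    """Send the same payload total_messages times and count the outcomes.

    `send` is a placeholder for however your system ingests data. It should
    return True on success; a False return or an exception counts as a failure.
    """
    payload = b"x" * payload_size
    results = {"sent": 0, "failed": 0}
    for _ in range(total_messages):
        try:
            ok = send(payload)
        except Exception:
            ok = False
        results["sent" if ok else "failed"] += 1
    return results
```

Because the sender is passed in, the same script can be rerun unchanged whenever the team wants to repeat the scenario, for example with a stub sender when rehearsing the steps.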
Next, assign roles for each scenario. As I mentioned above, there are three (and a half) roles in each scenario: Incident Commander, Driver, and Outcome Recorder. The half role is Tech Lead, whom the Incident Commander can pull in to help with the technical tasks for resolving the incident. I like to assign Incident Commander to someone with less Incident Commander experience, and Outcome Recorder to someone with more incident response experience. That way, knowledge spreads more evenly across the team.
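That assignment rule is simple enough to sketch in a few lines. The version below is a simplification: it assumes a single number per person (say, how many incidents they have handled) stands in for both kinds of experience, and the assign_roles function and role keys are hypothetical.

```python
def assign_roles(experience: dict) -> dict:
    """Pick roles for one scenario from a {person: incidents_handled} mapping.

    Assumption: one count per person approximates their experience. The least
    experienced person becomes Incident Commander, the most experienced becomes
    Outcome Recorder, and the next least experienced person drives.
    """
    if len(experience) < 3:
        raise ValueError("need at least three people to fill the three roles")
    ranked = sorted(experience, key=experience.get)  # least experienced first
    return {
        "incident_commander": ranked[0],
        "driver": ranked[1],
        "outcome_recorder": ranked[-1],
    }
```

In practice you would probably also rotate people between scenarios so that nobody plays the same role all afternoon; the sketch leaves that out.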
Next, it’s important to specify the steps (in as much detail as possible) that the Driver will need to follow to create the incidents. Make sure you’ve run through the steps yourself, so you know they’re correct (this is important, and skipping it was the cause of some of my earlier mistakes). For example, if your scenario includes disabling an intermediary service so you can make sure the team will be alerted, make sure the steps include which service to disable, how to disable the service safely, and which environment to disable the service in (hint: it should never be production!).
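One way to keep the “never production” rule from depending on anyone’s memory is to write the steps down as data and check them before the gameday. This is only a sketch: the DriverStep fields, the validate_steps function, and the environment names in ALLOWED_ENVIRONMENTS are all assumptions, so substitute whatever your environments are actually called.

```python
from dataclasses import dataclass

# Assumed environment names; replace with the ones your team really uses.
ALLOWED_ENVIRONMENTS = {"staging", "dev"}


@dataclass(frozen=True)
class DriverStep:
    """One instruction for the Driver (hypothetical structure)."""
    service: str      # which service to touch
    action: str       # how to disable or disturb it safely
    environment: str  # where to do it


def validate_steps(steps):
    """Return the steps unchanged, or raise if any step targets a disallowed environment."""
    for number, step in enumerate(steps, start=1):
        if step.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(
                f"step {number} targets {step.environment!r}; "
                f"allowed environments are {sorted(ALLOWED_ENVIRONMENTS)}"
            )
    return steps
```

A check like this does not replace running through the steps yourself; it only catches one specific mistake.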
Last but definitely not least, make sure all the scenarios are going to work. If your scenario involves bombarding your system with huge amounts of data, make sure you’re able to send huge amounts of data to your system. If your scenario involves hitting an endpoint, make sure that endpoint exists and that it’s available. You don’t want to be wasting your time and the team’s time during the gameday on debugging and troubleshooting the pre-incident tasks.
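A small pre-flight script can do some of this checking for you. The sketch below uses only Python’s standard library; the endpoint_is_up and preflight names are hypothetical, and treating “available” as “answers with a 2xx status within a few seconds” is an assumption you may want to change.

```python
import urllib.request
from typing import Callable, Dict, List


def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers connection failures, timeouts, and HTTP error statuses;
        # ValueError covers malformed URLs.
        return False


def preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check and return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures
```

You might call it with something like preflight({"ingest endpoint": lambda: endpoint_is_up(your_url)}) the day before, where your_url is the endpoint your scenario depends on, and fix anything in the returned list before the team sits down.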
Hopefully you’re now sold on the importance and value of gamedays, and this blog post helps you run your next one.