(What, Why, When, Where, Who, and most importantly… hoW)
To the majority of people, software is magic.
Those who work in software know that it’s actually a lot of elbow grease and duct tape holding everything together. There’s always an opportunity for something to go wrong, and when it does, you need to be ready to recognize the problem and fix it without causing more pain.
If you manage or are part of a software development team, walk through these scenarios and see if any apply to you:
- Your team is responsible for many services, some of which you don’t have much experience with.
- Your team recently inherited and is now responsible for a new service.
- You recently joined a new team and don’t know anything yet.
If any of these apply to you, or apply to anyone you know, you may find it beneficial to get ahead of the inevitable production disasters for your service and run a Disaster Simulation.
What (is a Disaster Simulation)
Simulate a “disaster” in your service so you can learn from it without causing actual harm. A few points…
- A “disaster” can mean different things for different teams and services. Think of a problem that can occur for your service where there will need to be manual intervention from a team member. Examples may include: a MySQL server falling over, Elasticsearch returning a red cluster status, a new bug that’s causing customer pain, etc.
- This is an activity within the topic of Chaos Engineering; one that’s focused more on team preparedness and less on service resiliency.
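To make “simulating a disaster” concrete, here’s a minimal Python sketch of a fault-injection switch you might flip on a staging dependency. The names and the `FaultInjector` class are hypothetical, invented for this example, not taken from any real chaos-engineering library:

```python
import random

class FaultInjector:
    """Deliberately fail calls to a dependency (hypothetical sketch)."""

    def __init__(self, failure_rate=0.0):
        self.failure_rate = failure_rate  # 0.0 = healthy, 1.0 = total outage

    def call(self, func, *args, **kwargs):
        # Simulate the dependency being down for some fraction of requests.
        if random.random() < self.failure_rate:
            raise ConnectionError("injected failure: dependency unavailable")
        return func(*args, **kwargs)

# "Healthy" staging behavior
healthy = FaultInjector(failure_rate=0.0)
print(healthy.call(lambda x: x * 2, 21))  # 42

# Flip the switch to start the disaster simulation
disaster = FaultInjector(failure_rate=1.0)
try:
    disaster.call(lambda x: x * 2, 21)
except ConnectionError as e:
    print(e)  # injected failure: dependency unavailable
```

The point isn’t the mechanism; it’s that the disaster is a deliberate, controlled switch the simulator can turn on and off.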
Why (run a Disaster Simulation)
1. Practice, practice, practice
- Allows others to gain experience and context in solving disaster scenarios in a safe space.
- Decreases the ramp-up time needed to recognize problems.
- Helps prevent panicking and potentially causing more pain when a real disaster occurs.
2. Forces a review of your service’s runbooks, free of the implicit biases of those who wrote them or who already have context on the service.
3. Prompts discussion and review of what to expect from the service when it’s in a disaster state.
If you’re thinking, “our team already has experience and knows how, when, and where to execute the runbooks,” then that’s awesome! In that case, your team may not find running a disaster simulation worth the time.
When (to run a Disaster Simulation)
When the team feels they are missing out on context and experience.
You may want to set a regular cadence to evaluate what service should be next up for a disaster simulation.
Where (to run a Disaster Simulation)
A staging environment. You want a safe space where mistakes are expected and won’t cause real harm.
Who (is involved)
Simulator: Those with experience with the service and its runbooks.
Driver: Those with minimal to no experience/context with the service.
How (to run a Disaster Simulation)
1. Preparing the Simulation
- Ensure runbooks are up to date and in the correct, expected place.
- Prepare the “disaster” in a staging environment, being as close to expected production behavior as possible.
- Make sure to note down how you’re creating this disaster so you or someone else can reproduce it again at a later time.
- Clearly define the expectations for the driver.
- Ensure you’ve practiced the steps you expect the driver to go through, to make sure there aren’t any dead ends that prevent the issue from being resolved in your staging environment.
- Lean back and relax before what will be a stressful (but hopefully educational!) experience.
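For the “note down how you’re creating this disaster” step above, one lightweight option is to record the disaster as data rather than as memory. Here’s a hedged Python sketch; the recipe contents, field names, and file layout are all made up for illustration:

```python
import json

# Hypothetical "disaster recipe": record exactly how the disaster is
# created (and cleaned up) so anyone on the team can reproduce it later.
recipe = {
    "name": "elasticsearch-red-cluster",
    "environment": "staging",
    "steps": [
        "stop one data node to leave primary shards unassigned",
        "wait for cluster status to turn red",
    ],
    "cleanup": [
        "restart the stopped data node",
        "confirm cluster status returns to green",
    ],
}

def save_recipe(recipe, path):
    # Persist the recipe next to your runbooks so it's easy to find.
    with open(path, "w") as f:
        json.dump(recipe, f, indent=2)

def load_recipe(path):
    # Anyone re-running the simulation starts by loading the recipe.
    with open(path) as f:
        return json.load(f)
```

Even a plain text file works; what matters is that the next simulator doesn’t have to reverse-engineer how the disaster was staged.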
2. Running the Simulation
- Cause the “Disaster”
- If necessary, you may need to explicitly point the driver towards the start of the problem (since you may not see alerts/exceptions in a staging environment).
- Tell the driver what the expectations are.
- Take notes: What is easy for the driver? What is difficult? How can the runbooks be improved?
- Guide the driver during sections if there are any explicit differences between the staging and production environments. (It’s not fair to the driver if they get stuck on a part of the problem that doesn’t accurately reflect what happens in production.)
- Try not to solve the problem yourself.
- If something is difficult or not clear, you can help guide them through that portion, but make sure to note down what is difficult and why.
- Resolve the problem with the runbooks! (Have fun!)
- Treat this as if it were a production issue. Post commands you’re running in a designated disaster simulation messaging thread and ask for help when necessary.
- Point out situations and runbook details that are difficult or unclear.
3. Reviewing the Simulation
- Review the notes together.
- Have a mini retrospective (for the runbooks and the service itself, not the driver). Discuss what went well, what can be improved, and create action items from them.
- Examples of action items: clarify details in the runbooks; run another disaster simulation with different variables; add logging and metrics where necessary so it’s easier to debug problems.
One caveat: it can be burdensome to figure out what kind of disaster to simulate. Coming up with an entirely new problem scenario is extremely difficult, especially when you haven’t experienced it before. Use past incidents as examples, ask others for help, and search online for the types of problems that may occur in your service.
Thanks for reading! Hopefully you find running a Disaster Simulation as beneficial as it has been for me and my team. If you’re just trying this out, good luck, and let us know how it goes!