Faking fires: Get better incident management with practice
Fire drills are a familiar concept to anyone who has ever worked in an office. Periodically the fire alarm will start to blare and a few team members will grab some high-vis jackets and help everyone file out of the building to a designated meeting spot.
The reason we do fire drills is to be prepared in the event of a fire and to practice everyone's response to the alarm.
The same is true for firefighters, who train for the fires themselves.
They have roles that need to be assumed and tasks that need to be assigned to help battle the blaze, all during a high-stress situation in which people want to know what is happening, how it happened and how much damage was done.
To optimise their response, they regularly practice how they respond to a fire, how they fight it and how they investigate its cause.
Practice makes perfect
Practice is an important concept in human psychology. We practice to gain confidence in certain processes or skills whether that be learning a musical instrument, driving a car, learning a new language, responding to a fire alarm or handling the outage of a software platform.
The more deliberate practice we have with a certain activity, the more confidence we have when we need to perform that action.
When running software in production it is only a matter of time before something goes wrong and someone is called out to look into it.
This can be a stressful event: monitoring systems are alerting, clients might be complaining and internal stakeholders are asking what is happening. Because these incidents are interruptive and disruptive, we try to minimise their frequency and duration, as well as the stress of handling them.
It is for that reason that we deliberately practice our incident handling and response procedures at Kudos. We call these Incident Drills, and they are based on the Wheel of Misfortune role-playing exercises run by Google's SREs.
Running an Incident Drill
We perform an Incident Drill once a month. It involves the entire engineering team and plays out like a classic tabletop role-playing game.
One person acts as the game master, or Organiser. They know the details of the incident and will typically have them prepared before the drill.
The incident could be devised from testing or taken from a previous real incident. The information is recorded in an Incident Playbook, which looks like a post mortem document without any of the retrospective comments or actions.
The Incident Playbook has sections such as Trigger, Impact, Resolution, Timeline, Supporting Evidence and Root Cause. The Organiser refers to this document when running the incident and when answering any questions that arise during it.
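As a rough illustration, an Incident Playbook might be skeletoned like this. The section names come from above; the layout and descriptions are an assumption, not our exact template:

```markdown
# Incident Playbook: <incident name>

## Trigger
What kicks the incident off, e.g. the alert that fires.

## Impact
Which services and clients are affected, and how badly.

## Resolution
The steps that bring the system back to a healthy state.

## Timeline
Timestamped sequence of events, from trigger to resolution.

## Supporting Evidence
Log excerpts, dashboard snapshots and metrics the Organiser can
hand to responders when they look in the right place.

## Root Cause
The underlying cause, revealed only once the team digs it out.
```

Keeping the same shape as a post mortem means the drill record can be turned into one with little extra work afterwards.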
The Organiser starts by laying out the scenario to a member of the team using the Trigger. The responder triages the incident and determines whether they need to invoke a Major Incident response.
In a Major Incident response, the initial responder becomes the Incident Commander and delegates the other roles to members of the team. These roles include Subject Matter Experts, who investigate the incident and perform deep dives into the problem; an Incident Logger, who keeps a record of the incident in a collaborative document; and a Client Liaison, who is responsible for communicating the state of the incident to both internal and external stakeholders.
The Incident Commander is then the source of truth for the incident and manages how the responders proceed. They should not be performing any of the investigation directly and should maintain a top-level overview of the incident. They need to stay level-headed, keep the team moving towards a resolution and challenge any assumptions made during the incident.
In a situation where the Incident Commander is also the most relevant Subject Matter Expert, they should delegate the Incident Commander role to someone else so they can concentrate on debugging.
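That hand-off rule can be sketched in a few lines of code. This is purely illustrative (the `Incident` class and its methods are hypothetical, not part of any real Kudos tooling); it just captures the idea that delegating command frees the former Incident Commander to work as a Subject Matter Expert:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: these role names come from the drill
# described above, but the class itself is a made-up example.
ROLES = ("Incident Commander", "Subject Matter Expert",
         "Incident Logger", "Client Liaison")

@dataclass
class Incident:
    roles: dict = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person

    def delegate_command(self, new_commander: str) -> str:
        """Hand over command so the old commander can debug as an SME."""
        old = self.roles.get("Incident Commander")
        self.roles["Incident Commander"] = new_commander
        if old is not None:
            self.roles["Subject Matter Expert"] = old
        return old
```

In practice this bookkeeping lives in the collaborative incident document rather than in code; the point is that command is a single, explicit role that moves between people, never something two people hold at once.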
Throughout the incident, the Organiser provides details of the state of clusters, servers and services, as well as anything found in logs or monitoring dashboards. The Organiser also plays any stakeholders, such as clients or internal teams.
Once the incident is resolved, the team reviews the incident record and works it into a post mortem document.
One of the main goals of the Incident Drill is to learn and improve, so a blameless post mortem is held and the team reflects on how the incident was handled, how it occurred and any actions that could prevent it recurring.
This is all recorded in the post mortem document, and the actions are turned into tickets in Pivotal Tracker and fed into the sprints to help build resilience.
It is crucial that the post mortem is blameless, as this helps to promote openness and learning throughout the team. If one engineer is able to break an entire data centre with a single command, then more technical controls are needed to prevent a single engineer from making that mistake.
Unfortunately we are all human and humans make mistakes. The goal is to try and minimise the impact of a mistake.
When designing new services and infrastructure, we make sure to build reliability and resilience into them. This means a heavy focus on automation and self-healing applications, which consequently makes it very tricky to find scenarios in which the team would need to invoke a Major Incident response.
However, we have carried out a number of Incident Drills now, each of them covering different types of failure scenarios.
Each time we run one of these drills, we find ways to improve the reliability of the platform or put technical controls in place to prevent a recurrence of that incident.
Recently we performed an Incident Drill covering the loss of a Kubernetes cluster. We played through the scenario and were able to establish things like the impact, a recovery plan and the checks needed to ensure the cluster came back online correctly.
About a week after we drilled the Kubernetes outage, we had one in production. The team were quickly able to get the cluster back up and running, verify the impact and confirm that the services on the cluster were healthy.
Events like this solidify the reason we perform these drills: they help the team and the company understand the importance of incident readiness.
I think that drilling things like incident handling and communication is vitally important to how you run software in production, much like firefighters drilling their response to fires to build confidence, minimise panic and ensure that procedures are carried out with practiced hands.
Creating a safe and controlled environment for handling these kinds of incidents stimulates thinking about resilience and reduces the stress and panic of a real outage.
With the whole team involved, the drills also help to spread knowledge of the debugging tools and techniques that prove useful.
If all this sounds interesting to you, why not consider joining Kudos? We also have a primer on what you can expect your first day to be like.