What do we do when the magical lightning rock stops working?
Software is a marvel of human ingenuity. It’s a miracle that we can shove electricity into rocks and have them perform unimaginable feats of communication, productivity, and analysis. It shouldn’t be a surprise to us when that magical lightning rock doesn’t quite do exactly what we told it to. When it doesn’t, someone has to fix it.
The problem I’ve been facing is the number of stakeholders who take an interest when the magical lightning rocks we use fail. At inMotionNow, moments of failure brought differing opinions and inconsistent demands from a variety of directions. I set out to define a few key points in our incident management policy.
- What is an incident?
- How does an incident differ from a product defect?
- When should we write a post-incident report?
- What information should be in the post-incident report?
The first thing I needed to do to get these questions answered was to gather the opinions of the various stakeholders involved. I asked colleagues across a variety of roles to take a 20-minute survey answering those exact questions. Despite the wide range of specialties and interests, the responses I got were surprisingly similar. The respondents held these job titles:
- Chief Operations Officer
- Director of Application Development
- Director of Customer Success
- Director of Product Management
- Development Manager
- DevOps Team Lead
- Principal Software Developer
- Product Support Specialist
- Quality Assurance Team Lead
- Scrum Master
- Site Reliability Engineer
What is an incident?
While my friends with those job titles answered the survey, I did a bit of internet research to find out what other companies called an incident. I pulled quotes from industry leaders such as John Allspaw, from standards like Google’s SRE Book, and from Tim Craig’s talk on Taming Chaos. Ultimately, though, the definition that resonated with me most was from the Atlassian Incident Handbook, a read I suggest for anyone who is involved in incident management.
“We define an incident as an event that causes disruption to or a reduction in the quality of service which requires an emergency response”
- Atlassian Incident Handbook
My colleagues’ definitions hit on a few important points too:
- An incident is something that occurs when something goes wrong.
- It’s something that negatively impacts the experience of one or more users.
- It interrupts usual operation and results in a loss of ability to use our platform.
- A degradation or a change in the stability, security, or availability of a service.
That’s a lot of definition and, honestly, none of it is wrong. For my stakeholders I decided to distill those points into one statement. It ended up surprisingly similar to Atlassian’s definition, with our signature skew towards a user-focused mindset.
An incident is an event that results in the loss of a user’s ability to effectively use any part of the platform and requires emergency response to resolve.
How does an incident differ from a product defect?
This was the most difficult question to answer according to my survey respondents. It was tricky partly because neither incident nor defect had a written definition for us. It’s not complete chaos. The villagers don’t run around in panic until the shamans fix the miracle magic lightning rock. Our defect handling process is a topic for another post though.
While the distinction is nebulous for us today, the entire team could agree that there was a difference between these two things. One of them needs to be fixed as soon as it’s learned about. The other will be fixed during our normal sprint cycle. The importance of the distinction comes from the definition above including the phrase “requires emergency response to resolve.” If you’re not willing to drive to my house and bang on my door at 3AM till I get up and fix the problem, can you really classify it as an emergency?
This question yielded an equally varied set of responses.
- Defects aren’t always seen by a user.
- Incidents don’t have workarounds.
- Defects are the application not working as intended.
- Incidents are caused by defects.
- Incidents are events. Defects are behavior.
- Defects are failures that come from code, not from load or upstream outages.
- Incidents relate to the environment hosting the software.
- Incidents are services performing too poorly to allow the reasonable completion of a task.
- A software bug is a known or unknown defect that can lead to an incident.
- All sandwiches are not hot dogs.
- You’ve blown my mind with this question.
Some of the responses were useful while others were the result of individuals using the wrong format to express their (wrong) opinions about whether a hot dog is a sandwich. There were two responses that really stood out to me: “Incidents are caused by defects” and “Incidents are events. Defects are behavior.”
Incidents are things that happen. They are a point in time. They aren’t even always reliably reproducible. They’re ambiguous. Incidents are fixed by changing some infrastructure or some config setting. Incidents are the magic lightning rock cracking in half. Defects are specific behavior that the application exhibits under specific conditions. They’re known. They’re fixed by changing the code that governs the bad behavior. Defects are the child throwing the magic lightning rock.
This metaphor’s lost its spark by this point.
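Pulling the definition and the 3AM test together, the triage rule can be sketched in a few lines of Python. This is only an illustration, not a tool we actually run: the `Problem` class, its field names, and the `classify` function are all hypothetical, and treating everything that isn’t an incident as a defect is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """A reported problem. All field names here are hypothetical."""
    description: str
    blocks_users: bool              # users can't effectively use part of the platform
    needs_emergency_response: bool  # worth banging on someone's door at 3AM


def classify(problem: Problem) -> str:
    """Apply the incident definition: user impact AND emergency response."""
    if problem.blocks_users and problem.needs_emergency_response:
        return "incident"
    # Assumption: anything else is handled as a defect in the normal sprint cycle.
    return "defect"


if __name__ == "__main__":
    outage = Problem("Nobody can log in", True, True)
    typo = Problem("Misspelled label on the settings page", False, False)
    print(classify(outage), classify(typo))
```

The sketch deliberately leaves out the fuzzier cases the survey raised, such as an incident that is caused by a defect.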
When should we write a post-incident report?
This one was pretty straightforward. Everyone seemed to agree that we should write a report every time there’s an incident. There was some disagreement on what we should do with recurrences of an incident that have the same lead-up and cause but have yet to see a long-term fix. That is still a question that’s yet to be answered by our group.
The second question that this brought up was how soon after an incident a post-incident report should be delivered. Some stakeholders wanted their report within minutes, others within 24 hours, and others within 5–7 business days. The immediacy of the report ended up depending greatly on the stakeholder’s role, and the information they needed varied by time frame as well.
What information should be in the post-incident report?
As the inevitable writer of these reports, this was the question I feared the most. Incident stakeholders would of course want all the information they could get about any given incident.
I separated the asks into three categories: Immediate, Short-Term, and Analysis. I proposed a different SLA to my stakeholders for each category.
The Immediate report should be completed within an hour of incident resolution.
- Start time
- End time
- Category (Degradation, Outage, etc.)
- Actions taken to resolve
- End-user experience
- First responders (internal or external, who?)
- Are components still affected?
The Short-Term report should be completed within 24 hours of incident resolution.
- Management-friendly summary (essentially a tl;dr of the report)
- How many users did this affect?
- How many users reported it?
- What areas of the application were impacted?
- Short-term solution proposals (can we get something in the next sprint to fix this?)
The Analysis report should be completed within 7 business days of incident resolution.
- What key indicators led up to the incident (event timeline)?
- Preemptive measures or monitors put in place to catch this in the future before it happens, if any.
- Was there any data lost?
- Technical explanation of problem and causes
- Any related error logs
- Long-term solution proposals
- Forecast of when this might happen again
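To make the three SLAs concrete, here is a minimal Python sketch that turns an incident’s resolution time into due dates for each report. The function names are hypothetical, and it assumes a business day is simply Monday through Friday with no holiday calendar, which is something we haven’t actually defined.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays.

    Assumption: holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def report_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Map each report category to its due time after incident resolution."""
    return {
        "Immediate": resolved_at + timedelta(hours=1),
        "Short-Term": resolved_at + timedelta(hours=24),
        "Analysis": add_business_days(resolved_at, 7),
    }


if __name__ == "__main__":
    resolved = datetime(2024, 3, 1, 15, 30)  # a Friday afternoon
    for name, due in report_deadlines(resolved).items():
        print(f"{name}: due {due:%a %Y-%m-%d %H:%M}")
```

Note that the 24-hour clock here runs through weekends while the 7-day clock does not; whether that matches what stakeholders expect is one of the open questions.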
This is where we are. Not unexpectedly, I still have a lot of questions. As with everything at inMotionNow though, we will evolve our process as we discover which parts are and aren’t working. There’s more to learn for both me and my stakeholders about these magical lightning rocks, but at least we can start to agree on how to communicate about it going forward.