This is part of a series of runsheets on particular things you might want to consider implementing in the DevOps space. We’re going to make this skinny, and we’re going to make sure that everything is as practical as possible.
What Are Blameless PIRs?
Blameless Post Incident Reviews (PIRs) are based on the 5-Whys interrogative technique, which explores the cause-and-effect relationships underlying a problem. The technique points towards systemic issues and shifts the focus away from human error.
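A hypothetical chain might run: the home page was inaccessible. Why? The instance ran out of memory. Why? A deploy doubled the cache footprint. Why? Nothing in the release process checks memory headroom. Notice that the chain ends at a process problem, not at a person.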
“A bad system will beat a good person every time.” — W. Edwards Deming
Why Run Blameless PIRs?
Post Incident Reviews are a healthy way to reevaluate the series of events that led to an incident. I credit Aaron Wigley from REA Group for helping me understand the value of this: not just as a generator of preventative actions following an incident, but as a way of fostering a culture where people are not afraid of making mistakes.
When Do I Run Blameless PIRs?
It’s up to you. Some people advocate running a PIR for every single incident, but I like keeping processes as lean as possible. Since the main artifact of a PIR is cultural, you want to run one when it counts: in situations where cultural agreements tend to collapse, and where you want to ensure the culture remains intact.
Here’s the rubric we use: we run PIRs for P1s and P2s.
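Severity definitions vary by organisation; as a purely illustrative scale:
- P1: full outage, customers cannot use the product → run a PIR
- P2: major degradation for a subset of customers → run a PIR
- P3: minor degradation with a known workaround → PIR optional
- P4: cosmetic issue, no customer impact → no PIR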
How Do I Run Blameless PIRs?
You’ll need:
- A whiteboard and a space. Make sure it’s a space where people are comfortable talking out loud.
- A timeline of events. What happened, and when did it happen?
- The personnel who were responsible for, consulted on, and accountable for the incident.
A.) Set ground rules.
- Base assumptions. Everyone does the best job they can with the knowledge and time available. Accept no compromises on this.
- No finger-pointing. Omit names if possible, and only talk about events.
- Human error is not a root cause. If you reach a node that points to human error, keep unpacking until you reach a systemic or process problem.
- Facts only; conjecture is discouraged and should be marked if taken into consideration.
B.) Decide on Impact
Decide on a single statement that encapsulates the business impact. And I stress business impact again: we’re not here to solve technical problems — we’re here to provide services that serve the business.
So instead of saying “The identity server was down” try saying “Customers could not access their profile page”. Instead of saying “The instance ran out of memory” say “The home page was inaccessible”.
C.) Establish a Timeline
Reference Slack/Zendesk, or any facility that recorded actions before, during, and after the incident. Make a note of four timestamps:
- Start of the incident: when business impact first occurred.
- Time to detect: when the issue was first detected or acknowledged by personnel responsible for the service.
- Time to resolve: when the impact first started to diminish.
- Time to stable: when subsequent impact-avoidance measures were put into place.
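If it helps to make the relationships between these four timestamps concrete, here is a minimal Python sketch; the class and field names are my own invention, not any incident tooling’s API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """The four timestamps pulled from Slack/Zendesk/incident channels."""
    started: datetime   # start of incident: business impact first occurred
    detected: datetime  # first detected or acknowledged by responsible personnel
    resolved: datetime  # impact first started to diminish
    stable: datetime    # impact-avoidance measures in place

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.started

# A hypothetical incident, for illustration only:
t = IncidentTimeline(
    started=datetime(2021, 3, 4, 9, 0),
    detected=datetime(2021, 3, 4, 9, 47),
    resolved=datetime(2021, 3, 4, 11, 15),
    stable=datetime(2021, 3, 5, 10, 0),
)
print(t.time_to_detect)   # 0:47:00
print(t.time_to_resolve)  # 2:15:00
```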
D.) Root Cause Tree
Start a tree by writing out the business impact at the top. Ask why it happened; write out the cause, then ask why that cause occurred. Roughly, the flow looks like this (reusing the hypothetical example from earlier; the leaf cause is invented for illustration):
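```
Customers could not access their profile page     (business impact)
└─ Why? The identity server was down.
   └─ Why? The instance ran out of memory.
      └─ Why? Nothing alerts on memory headroom.  ← actionable, systemic cause
```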
Should a cause have multiple originating causes, fork the tree and draw them out as well. Do this again and again, until you reach an actionable cause.
E.) Time to Detect and Time to Resolve
From the business impact, write out the Time to Detect and Time to Resolve as separate subtrees. Make a note of the amount of time between the start of the incident and the TTD/TTR. Are you happy with how long it took to detect and resolve?
As with the impact root cause, ask why again and again until you reach an actionable item.
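To continue the hypothetical example: Time to Detect was 47 minutes. Why so long? No alert fired when memory ran out. Why? Monitoring only watched CPU, not memory. The actionable item: add memory alerting to the service.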
F.) Wrapping up
Take the actions one by one, and assign owners. Owners don’t have to be directly responsible for corrective measures — they just need to make sure that the work occurs.
Set a check-in date, taking into account the risks and priority associated with each action. Write up a report and share it with the larger community.
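As an illustration (everything here is hypothetical), a wrapped-up action might read:
- Action: add memory alerting to the identity service instances.
- Owner: the platform team lead.
- Check-in: two weeks after the PIR.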