DevOps Runsheets: Blameless Post Incident Reviews

John Contad
Dec 4, 2018 · 4 min read

This is part of a series of runsheets on particular things you might want to consider implementing in the DevOps space. We’re going to make this skinny, and we’re going to make sure that everything is as practical as possible.

What Are Blameless PIRs?

Blameless Post Incident Reviews (PIRs) are based on the 5 Whys interrogative technique, which explores the cause-and-effect relationships underlying a problem. They point towards systemic issues and de-emphasize human error.

“A bad system beats a good person at any time.” - W. Edwards Deming

Why Run Blameless PIRs?

Post Incident Reviews are a healthy way to reevaluate the series of events that led to an incident taking place. I credit Aaron Wigley from REA Group for making me understand the value of this — not just as a generator of preventative actions following an incident, but as a way of fomenting a culture where people are not afraid of making mistakes.

When do I run blameless PIRs?

It’s up to you. Some people advocate running a PIR for every single incident, but I like keeping processes as lean as possible. Because the main artifact of a PIR is cultural, you want to run it when it counts: in situations where cultural agreements tend to collapse, and where you want to make sure the culture stays intact.

Here’s a rubric that we use. We run PIRs for P1 and P2 incidents.

Dirty rubrics over high-fidelity, formalized processes.

How do I run blameless PIRs?

You need:

  1. A whiteboard and a space. Make sure it’s a space where people are comfortable talking out loud.

A.) Set ground rules.

  1. Base assumptions. Everyone does the best job they can, with the knowledge and time they have available. Accept no compromises.

B.) Decide on Impact

Decide on a single statement that encapsulates the business impact. And I stress business impact again: we’re not here to solve technical problems — we’re here to provide services that serve the business.

So instead of saying “The identity server was down” try saying “Customers could not access their profile page”. Instead of saying “The instance ran out of memory” say “The home page was inaccessible”.

C.) Timeline

Establish a timeline. Reference Slack/Zendesk, or any facility recording actions before, during, and after the incident. Make notes of four timestamps:

  1. Start of the incident: when business impact first occurred.
  2. Time to detect: when the issue was first detected or acknowledged by personnel responsible for the service.
  3. Time to resolve: when the impact first started to diminish.
  4. Time to stable: when subsequent impact avoidance measures were put into place.
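If you want to keep the timeline in a form you can compute on, here is a minimal Python sketch. The class name, field names, and timestamps are all assumptions made up for illustration; only the four timestamps themselves come from the step above. Each duration is measured from the start of the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """The four timestamps from the timeline step (names are illustrative)."""
    started: datetime    # business impact first occurred
    detected: datetime   # first detected or acknowledged by the service owners
    resolved: datetime   # impact first started to diminish
    stable: datetime     # impact avoidance measures were put into place

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.started

    def time_to_stable(self) -> timedelta:
        return self.stable - self.started


# Made-up timestamps, purely for illustration.
timeline = IncidentTimeline(
    started=datetime(2018, 12, 1, 9, 0),
    detected=datetime(2018, 12, 1, 9, 20),
    resolved=datetime(2018, 12, 1, 10, 5),
    stable=datetime(2018, 12, 1, 14, 0),
)

print("TTD:", timeline.time_to_detect())   # TTD: 0:20:00
print("TTR:", timeline.time_to_resolve())  # TTR: 1:05:00
```

In practice you would fill these in from whatever recorded the incident (Slack, Zendesk, and so on) rather than typing them by hand.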

D.) Root Cause Tree

Start a tree by writing out the business impact at the top. Ask why it happened; write out the cause, then ask why that cause occurred.

[Figure: root cause tree flow. Credit: Journey to Better]

Should a cause have multiple originating causes, fork out and draw them out as well. Do it again and again, until you reach an actionable cause.
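The whiteboard is the real tool here, but the tree is easy to capture afterwards. Below is a rough Python sketch of the structure, not a prescribed format: the `Cause` class and `actionable_causes` helper are hypothetical names, and every cause below "The instance ran out of memory" is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One node in the tree: a statement plus the answers to "why?"."""
    statement: str
    actionable: bool = False
    why: List["Cause"] = field(default_factory=list)

    def because(self, statement: str, actionable: bool = False) -> "Cause":
        """Record one answer to "why did this happen?" and return it."""
        child = Cause(statement, actionable)
        self.why.append(child)
        return child


def actionable_causes(node: Cause) -> List[str]:
    """Walk the tree depth-first and collect the actionable causes."""
    found = [node.statement] if node.actionable else []
    for child in node.why:
        found.extend(actionable_causes(child))
    return found


# The business impact sits at the top of the tree.
root = Cause("The home page was inaccessible")
oom = root.because("The instance ran out of memory")

# A cause with multiple originating causes forks (both branches are invented).
oom.because("No alert fires when memory usage climbs", actionable=True)
leak = oom.because("A release introduced a memory leak")
leak.because("Releases are not load tested", actionable=True)

for statement in actionable_causes(root):
    print("-", statement)
```

The actionable causes at the ends of the branches are what you carry into the wrap-up step.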

E.) TTD/TTR

From the business impact, write out the Time to Detect and Time to Resolve as separate subtrees. Make a note of the amount of time between the start of the incident and the TTD/TTR. Are you happy with how long it took to detect and resolve?

As with the impact root cause tree, ask why again and again until you reach an actionable item.

F.) Wrapping up

Take the actions one by one, and assign owners. Owners don’t have to be directly responsible for corrective measures — they just need to make sure that the work occurs.

Set a check-in date, taking into account the risks and priority associated with each action. Write up a report, and share it with the wider community.
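If the actions end up in a script or a tracker rather than on a wiki page, a sketch like the following is enough to see what is due for a check-in. The `Action` fields, the `due_for_check_in` helper, the owners, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Action:
    description: str
    owner: str       # makes sure the work occurs; need not do it personally
    check_in: date   # chosen from the risk and priority of the action
    done: bool = False


def due_for_check_in(actions: List[Action], today: date) -> List[Action]:
    """Return open actions whose check-in date has arrived, earliest first."""
    due = [a for a in actions if not a.done and a.check_in <= today]
    return sorted(due, key=lambda a: a.check_in)


# Invented actions and owners, purely for illustration.
actions = [
    Action("Add a memory usage alert", owner="alice", check_in=date(2018, 12, 11)),
    Action("Load test releases before deploy", owner="bob", check_in=date(2019, 1, 15)),
]

for action in due_for_check_in(actions, today=date(2018, 12, 12)):
    print(f"{action.check_in} {action.owner}: {action.description}")
```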

Transparency matters.
