From accident to investment: How to run better blameless postmortems
We held our seventh annual Code for America Summit in June, and we were fortunate enough to have John Allspaw, Founder of Adaptive Capacity Labs and former CTO of Etsy, join us as a speaker.
You may have already seen his main stage talk, where he introduced the concept of the blameless postmortem: knowing that software systems get increasingly complex over time, and acknowledging that complex systems fail, we must have a way of responding to those failures that leads to learning and continuous improvement. Blameless postmortems are one part of such an approach, which emphasizes forward-looking accountability and enables us to turn accidents into real investments in the future.
I had a chance to join John for the breakout session he led following his main stage talk, and it was a great opportunity to learn more about putting this approach into practice. In fact, I left with immediately actionable takeaways that I could bring back to our team at Code for America.
You see, I’ve been a fan of the blameless postmortem since I first came across John’s influential writing on the topic on the Etsy engineering blog back in 2012. That led me to Sidney Dekker’s books on human factors and safety, including Just Culture and The Field Guide to Understanding ‘Human Error’. Having spent the previous decade or so working on rather complex, high-volume systems (and experiencing failure firsthand, more times than I care to remember), I found that the idea resonated with me: some of the most valuable learning opportunities exist in the wake of failure, and those opportunities are often squandered. These occasions are so often reduced to an individual admitting they made a mistake, swearing it won’t happen again, and maybe adding a few action items that allow people to walk away satisfied that it really won’t happen again (spoiler alert: it will).
So, I swore I would help my team avoid that trap, and started implementing blameless postmortems. I didn’t have any training in postmortem facilitation, nor did I have a blueprint for how to actually run an effective one beyond my own experience and what I had read. Still, I came up with a simple template, wrote up my own guide for facilitators, and set up a repository and an email list. I got everyone to commit to doing postmortems in this fashion whenever we had a failure, and sharing them with the entire team for transparency. So far so good.
And it was good. Simply putting a stake in the ground and showing your genuine desire to make postmortems a blameless learning activity, and to empower people to own the discussion and the remediation after failures, is a great first step. Many of the postmortems we conducted at my company then (and now at Code for America, where I also embedded the practice from day one) felt truly blameless and resulted in real learning.
But not always.
Facilitating a blameless postmortem takes practice and skill, so it’s not surprising that sometimes it’s difficult to guide the conversation to be productive. But after learning more from John’s breakout session, and diving into Etsy’s excellent facilitation guide, I realized that there were some things about the way I had implemented the approach that were actively working against my goal of creating an environment for learning in the wake of failure. And there were a few key things I needed to grasp fully in order to change that.
Descriptions, Not Explanations
There’s an incredible urge to ask the question “Why?” during an incident review — and plenty of people out there who would tell you to ask it four more times — but it’s actually an urge we should resist in this context. That path leads to speculation, to judgment clouded by hindsight bias, to blame, and to everyone’s favorite remediation item: “Next time, do what you should.”
Instead, we should focus on getting as rich a description as possible of the events surrounding the incident from every participant in the postmortem. The facilitator should encourage everyone to share the details of what they did, how they felt, and what they were thinking—including, perhaps most importantly, the things they take for granted and wouldn’t normally think were worth mentioning. These details often help the other folks in the room get a glimpse of what it’s like to play another role on the team, and that is precisely the kind of learning that makes these sessions so important.
We had already been using a timeline to guide our discussions at Code for America, and that’s a good approach, but we hadn’t been explicit enough about focusing on descriptions over explanations. Going forward, we’re actively steering people away from explanations and focusing on mining for descriptive details around the key junctures in the incident timeline, the moments when decisions were made and actions were taken.
Remediation Is Not the Goal
I used to emphasize learning when describing blameless postmortems, but in practice I would still focus on identifying action items. John’s session at Summit helped me to realize that we would learn more if we were explicit about learning as the sole goal of the meeting, and equally explicit that producing action items is a non-goal.
Taking a page once again from Etsy’s facilitation guide, we modified our process. We now have a place to capture possible remediation items in our postmortem template so we don’t lose ideas as they come up, but we make it clear that we no longer expect to leave the meeting with specific and actionable remediation items that the team has committed to. Rather, we capture generative ideas for improving the system that come up during discussion, and we leave the exercise of turning those into tracked tickets to a follow-up session that is led by the postmortem participants after they’ve had some time to digest the discussion.
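For teams that keep their postmortem records as structured data rather than a document, the shape described above might look something like the sketch below. This is only an illustration under my own assumptions: every class, field, and method name here is hypothetical, and none of it is taken from our actual template or from Etsy’s guide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TimelineEntry:
    """One key juncture: a decision made or an action taken."""
    when: str         # free-form timestamp, e.g. "14:02"
    who: str          # role or name of the participant
    description: str  # what they did, saw, felt, or were thinking


@dataclass
class Postmortem:
    """Hypothetical record of a single postmortem meeting."""
    title: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    # Ideas only: captured so they are not lost, NOT commitments.
    # Turning them into tracked tickets is left to a follow-up session.
    remediation_ideas: List[str] = field(default_factory=list)

    def add_juncture(self, when: str, who: str, description: str) -> None:
        """Record a descriptive detail at a point in the timeline."""
        self.timeline.append(TimelineEntry(when, who, description))

    def capture_idea(self, idea: str) -> None:
        """Jot down a possible remediation without committing to it."""
        self.remediation_ideas.append(idea)


pm = Postmortem(title="Example incident")
pm.add_juncture(
    "14:02",
    "on-call engineer",
    "Saw the alert and assumed it was a repeat of last week's noise.",
)
pm.capture_idea("Revisit alert wording")
```

The point of keeping `remediation_ideas` as a plain list of strings, with no owner or due date, is to make it hard to mistake an idea raised in discussion for a committed action item.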
Frame the Meeting, Every Time
No matter how many times you have done this in your organization, it’s important to re-introduce the purpose and guidelines of the meeting every time you start one.
Taking a couple of minutes at the outset to lay out the norms for the meeting (clarifying the goal of learning, the non-goal of remediation, the need for description over explanation, and the intention of being utterly blameless) is always time well spent. It sets the tone for the meeting and helps people open up and share. Etsy’s facilitation guide has your back again: it includes an excellent sample introduction that you can adopt and make your own.
Postmortem Facilitation Is a Practice
Just like pair programming, iteration planning, or having effective 1:1s, postmortem facilitation is a practice to be studied, evolved, and shared. When I first implemented blameless postmortems, I put a lot of effort into preparing for the meetings and trying to ask the right questions. But I didn’t sustain that level of investment, nor did I invest in training and motivating others. As a result, we didn’t benefit from a continuously evolving and improving facilitation practice.
As a facilitator, it’s critical that you take the time to focus, prepare, and bring your best self to the meeting every single time. And since the goal is for your entire organization to learn from failure, it’s not something you can do alone. Nor can you simply delegate to others who haven’t had the necessary exposure and training. As with any other practice, training and repetition are key. On my team, we’re starting to have more engineers play the role of shadow facilitator, with the intention of growing and maturing our facilitation practice over time.
We’ve only had a handful of postmortems since Summit, but I can already say they have taken on a different tone and feel much closer to the spirit in which they were started. I can’t thank John and the Etsy team behind the facilitation guide enough for helping our team at Code for America level up our blameless postmortem practice.
I’m the CTO of Code for America. We’re making government services work, starting with people who need them most. Join us.