How much are you learning from your postmortems?
Young startups often follow a familiar narrative: in the pursuit of product-market fit, engineers march to the drumbeat of “move fast and break things”. The company values speed of execution and agility over everything else. However, as systems become more complex (for example, through multiple product pivots), failures happen. Eventually, as the business gains significance and prominence, incidents and breaches become increasingly painful and costly. The company looks to its larger peers for guidance, and finds Google’s SRE book, or chances upon some of John Allspaw’s writing.
Following the book’s recommendations, a process is put in place to trigger a postmortem for every production incident. After a few of those, the development teams inadvertently learn about the ETTO principle: since we have bounded time, we have to trade off between efficiency and thoroughness.
I have noticed two common shortcomings with postmortems:
- a reluctance to name individuals and study their mental models, resulting in shallow analysis, and
- a drift toward bureaucracy and short-term patches, especially when the real, durable solutions are expensive to implement.
What Were They Thinking?!
In practice, blameless postmortems are not really blameless, but they can be sanction-less. This safety is an important and necessary start that allows investigators to uncover all of the pertinent details, but on its own it is insufficient to bring about the desired change.
Most of the time, blame is unavoidable. People built the software and the tools that contributed to the incident, even if individuals did not directly contribute to it. Often, in an effort to be “blameless”, well-intentioned authors of postmortems leave out key details, avoid naming names, and overuse the passive voice. Ironically, this behaviour obstructs learning and invites gossip and blame.
I see most organizations end up doing a “blame dance” of sorts where we know we’re not supposed to say whose fault it was, but we’re still thinking it, which manifests itself in passive ways. — Introduction, Learning from Incidents in Software
In addition to naming key actors, investigators must also delve into their pressures and influences, and understand the context in which they operated. Consider the following example from the Postmortem Culture chapter of Google’s SRE book:
An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I’m sure our future on-callers will thank us! — Postmortem Culture, Site Reliability Engineering
I can guarantee that “rewriting the entire backend system” did not happen as a result of this postmortem. In fact, this action item is an excuse, and suggests that the postmortem excluded the vital questions that would have helped the groups involved improve.
To move beyond blame, it is important to focus on learning by investigating contributing factors and debugging the organization in order to find durable fixes. We can move toward actionable and lasting change by asking an additional, deeper why. By applying systems thinking, we can ask the following questions and extract the causal loops:
- What assumptions did X make when designing and implementing this component? Are those assumptions still correct?
- What should be done when assumptions that were true a year ago are now false? How can we detect this change?
- What influenced Y to take that action when they were paged?
- Why did Z decide to defer the implementation of the guardrail that could have prevented this issue? What factors influenced the prioritization process?
- Why haven’t we (the organization) prioritized the work to rewrite the entire backend system? What is the obstacle?
Notice that we are not applying judgement — we are not deciding if their actions were right or wrong. Instead, we are recognizing that their cumulative choices and decisions resulted in the incident, and asking the additional why. Moving beyond the superficial causes of the issue, these questions prompt us to analyze the organizational behaviors that generated the situations which caused the latent flaws in the first place.
When writing a postmortem, remember to ask the deeper why, “is there anything in our systems, structures, and processes that increases the likelihood of error?” or simply, “what were they thinking?”
Zing! Now you got to write a postmortem!
According to the SRE book, companies should trigger postmortems based on well-defined criteria. This ensures that there is sufficient review coverage of every incident, and establishing clear criteria also signals an intent to be transparent and fair.
However, these trigger conditions can fall out of date, as processes often do when companies scale and evolve rapidly. They may no longer be optimized to surface the most important learning opportunities. Without meticulous care and regular calibration, the incident review process predictably degenerates into busywork, producing a steady stream of postmortems that diffuses our collective attention and energies.
Additionally, the growing number of follow-up tasks can overwhelm the organization, and even subtly bias it toward quick duct-tape fixes and manual processes that solve only the immediate problem. Examples include:
- requiring additional approvals before deploying a new data science model to production; or
- requiring two operators for any database-related change in production.
Manual gate-keeping, ironically, imposes additional overhead on precisely the teams and individuals that the organization is relying on to deliver the durable long-term improvements it sorely needs. This myopic urge to “get the postmortem over with” can become demoralizing.
Instead, to maximize learning opportunities (a.k.a. ROI) from each investigation and better focus our energies, the triggers for postmortems must be regularly calibrated against the prevailing engineering standards. Consider the following examples:
- after some detailed analysis, the organization has come to accept a known defect for the year, while waiting for a migration project to complete; or
- after weighing market opportunities against development costs, the company knowingly makes a deliberate trade-off to develop a proof of concept or enter an emerging market.
If these are known risks that the organization has understood and accepted in pursuit of growth, then it is counter-productive to generate postmortems for incidents that are caused by them. While it might be useful to maintain a reminder and commitment to build sustainable solutions, postmortems are hardly the best way to do that — at some point, you just want a counter, or a running log to record incidents with common contributing factors.
There are many ways to do this, such as:
- using a bug tracker, recording incidents either as comments on the ticket that describes the root cause, or as separate tickets, each describing the details of one incident and linking to the ticket that has the root cause; or
- appending to an incident log kept on the first postmortem that describes the root cause stemming from the accepted risk, or on the document that memorialized the trade-off decision.
That way, even without triggering a postmortem, we can still record all of the context and data necessary for engineering managers to make prioritization decisions, and we can present a business case for funding the sustainable, durable, long-term solution.
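As a rough illustration of such a running log, here is a minimal Python sketch. Everything in it is hypothetical: the `AcceptedRisk` and `IncidentLog` names, the `RISK-1` ticket id, the dates, and the impact figures are made up for the example, and a real team would more likely keep this data in its existing bug tracker than in code.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AcceptedRisk:
    """A trade-off the organization knowingly accepted (hypothetical schema)."""
    ticket: str      # ticket or document that describes the root cause
    summary: str
    review_by: date  # when the acceptance should be revisited


@dataclass
class IncidentLog:
    """Running log of incidents, grouped by the accepted risk behind them."""
    risks: dict = field(default_factory=dict)
    entries: dict = field(default_factory=dict)

    def accept(self, risk: AcceptedRisk) -> None:
        # Register a risk so later incidents can be logged against it.
        self.risks[risk.ticket] = risk
        self.entries.setdefault(risk.ticket, [])

    def record(self, ticket: str, day: date, minutes_of_impact: int, note: str) -> None:
        # Only incidents caused by a known, accepted risk belong in the log.
        if ticket not in self.risks:
            raise KeyError(f"{ticket} is not an accepted risk")
        self.entries[ticket].append((day, minutes_of_impact, note))

    def business_case(self, ticket: str) -> dict:
        # Summarize the recurring cost to support a prioritization decision.
        rows = self.entries.get(ticket, [])
        return {
            "ticket": ticket,
            "incidents": len(rows),
            "total_minutes": sum(minutes for _, minutes, _ in rows),
        }


# Made-up example data.
log = IncidentLog()
log.accept(AcceptedRisk("RISK-1", "Known defect until the migration completes", date(2025, 12, 31)))
log.record("RISK-1", date(2025, 3, 2), 18, "defect triggered during batch import")
log.record("RISK-1", date(2025, 4, 9), 42, "defect triggered again")
case = log.business_case("RISK-1")
print(case)  # {'ticket': 'RISK-1', 'incidents': 2, 'total_minutes': 60}
```

The point of the sketch is that each recurrence costs one line of bookkeeping rather than a full review, while the summary still accumulates the evidence needed to argue for the durable fix.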
Otherwise, generating postmortems is just toil and paperwork, and it devalues the culture of learning that we have painstakingly built. Detailed investigations and incident reviews should be reserved for incidents that provide learning opportunities. To take this even further, one could also consider prioritizing incidents based on the surprise factor, and may even find time to investigate near misses (which in some cases could teach us even more).
Ask: did we expect this to happen?
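One way to picture that question as a trigger rule is the small Python sketch below. It is an assumption-laden illustration, not a recommended policy: the `Incident` fields, the `Review` outcomes, and the rule itself are hypothetical, and real trigger criteria would need regular calibration as described above.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    LOG_ONLY = "append to the running log"
    POSTMORTEM = "run a detailed incident review"


@dataclass
class Incident:
    title: str
    accepted_risk: bool  # caused by a trade-off the organization knowingly made?
    expected: bool       # "did we expect this to happen?"


def triage(incident: Incident) -> Review:
    # A known, accepted risk behaving as predicted teaches us little new.
    if incident.accepted_risk and incident.expected:
        return Review.LOG_ONLY
    # Anything surprising, including a near miss, is a learning opportunity.
    return Review.POSTMORTEM


known = Incident("Known defect fired again", accepted_risk=True, expected=True)
surprise = Incident("Failover nearly lost writes", accepted_risk=False, expected=False)
print(triage(known).name)     # LOG_ONLY
print(triage(surprise).name)  # POSTMORTEM
```

Note that an accepted risk that fails in an unexpected way still falls through to a full review, since the surprise suggests the original risk assessment no longer holds.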
The ETTO fallacy is that people are required to be both efficient and thorough at the same time — or rather to be thorough when with hindsight it was wrong to be efficient! — The ETTO principle
Conducting incident reviews is an important aspect of building learning organizations. It is even more important to have the courage to ask the direct questions and get to the real issues, instead of filing away endless postmortems and “avoiding blame”. Even with recognized long-term defects, there is never a single person to blame, because the organization understood the risks and made the decision, picking a position on the ETTO sliding scale. This is the cost of doing business.
Ultimately, our shared goal is to build resilient systems: systems that respond, monitor, learn, and anticipate. In our case, the system is frequently the entire organization — its software, hardware, people, and processes.
Thanks to Jacob Scott for reviewing a draft of this post.
- Nora Jones, 2019. Introduction, Learning from Incidents. https://www.learningfromincidents.io/blog/learning-from-incidents-in-software
- J. Paul Reed, 2019. “Blameless” postmortems don’t work. Here’s what does, Tech Beacon. https://techbeacon.com/app-dev-testing/blameless-postmortems-dont-work-heres-what-does
- Sweta Ackerman, 2018. Post-mortems to the rescue, Increment. https://increment.com/documentation/post-mortems-to-the-rescue/
- Erik Hollnagel, 2016. The ETTO Principle — Efficiency-Thoroughness Trade-Off. https://erikhollnagel.com/ideas/etto-principle/index.html
- Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, 2016. Site Reliability Engineering: How Google Runs Production Systems. https://landing.google.com/sre/sre-book/toc/index.html
- Marilyn Paul. Moving from Blame to Accountability, The Systems Thinker. https://thesystemsthinker.com/moving-from-blame-to-accountability/
- John Allspaw, 2012. Blameless PostMortems and a Just Culture, Code As Craft. https://codeascraft.com/2012/05/22/blameless-postmortems/