Improving Incident Learning Part 4

Andrew Hatch
Published in SEEK blog · Aug 11, 2020

“The illiterate of the 21st century will not be those who cannot read and write, but those who cannot learn, unlearn, and relearn.” — Alvin Toffler

In the final part of this blog series we discuss how to adopt a different approach to learning from incidents, focusing in particular on creating a more supportive and psychologically safe environment. Such an environment builds a much greater awareness that the systems we work with today are far more complex than we perceive them to be, and that we cannot know all of their possible integrations and dependencies without considerable and costly engineering work. With that awareness, we are better placed to focus our effort on consistently doing the good things right, rather than obsessing over avoiding failure at all costs.

In this final blog post we’ll answer the following questions:

  • What do good post-incident reviews look like?
  • What are the positive outcomes that have emerged from running better post-incident reviews?
  • How can we continue to sustain our adaptability as we build more complex systems?
Photo by Markus Winkler on Unsplash

Having better post-incident reviews

There are many online resources that you can use to help you conduct better post-incident reviews, most notably the Etsy Debriefing Guide. This guide is free and provides a highly comprehensive, well-researched way to conduct better incident reviews — supported by reference material gained from multiple industries and academia. As a starting point this is an excellent resource and worth adapting to any organisation.

The remainder of this section will focus on what you must create — the environment for learning.

Creating a safe environment

Photo by Clarissa Watson on Unsplash

An environment that is psychologically safe and not set up for the attribution of blame will always yield richer and more valuable information than one that adopts punitive measures. The emphasis must be on learning, and on framing incidents for what they really are: an unplanned investment in improving the overall resiliency of your systems.

However, despite our best intentions, an incident post-mortem can still be stressful for some people. The level of stress depends on a number of perceived factors, such as:

  • Will it reflect poorly on their perceived worth as an employee?
  • Will it give a negative perception of their intelligence or professionalism?
  • Is it a threat to their annual bonus or remuneration prospects?

Many of these feelings can originate from experiences in previous workplaces, life, relationships, schooling, or even from working in your organisation. Predicting how people will respond to a post-mortem is not easy, but if strong emotions are evident, consider interviewing people one on one prior to the incident review rather than in a group session, where groupthink and the Hawthorne effect will limit what they are willing to share.

Tip: Start a post-mortem on a positive note by thanking the participants for finding an exception case in the system and providing an opportunity to improve our work and learn something new.

Put yourself in the operators' shoes

It’s important to understand the pressures faced by the people intimately involved in dealing with the incident, not only during the incident itself but also before and afterwards. Knowing these pressures, and being able to empathise with incident responders, builds a broader understanding of why they took the actions they did and of the cognitive reasoning behind the joint work of stabilising systems. Doing this makes the incident post-mortem feel less like a box-ticking exercise and more like a genuine group dialogue between people.

Tip: We’ve learned that it is beneficial for post-incident facilitators to have strong backgrounds in the technology stacks and platforms we use and/or the systems involved. They’re more intimately aware of the complexities and nuances of the work done by the incident responders, and of the symbolism and technical terminology and acronyms used in the discussion, a common situation in software engineering.

Don’t lead conversations

An important trap to avoid is driving discussions in a way that biases the conversation towards your own view of the incident. Keep in mind that you are there to facilitate the discussion, not to drive it. Seek to understand the environmental factors and inputs, the feedback loops that led the operators to make the decisions they made at the time, and the trade-offs made under pressure. Ask open-ended questions that broaden the conversation to the external pressures and forces acting on the teams; in other words, lift the discussion beyond an exercise in isolating a few contributing factors and look for other cues such as:

  • De-prioritisation of quality improvement tasks in favour of building more product features
  • Key personnel on leave
  • Executive or senior management deadline pressure
  • And many more.

I have been involved in many incident reviews that list quality trade-offs and unhelpful external pressure as contributing factors to the impact and duration of incidents.

Positive outcomes from better incident review processes

Photo by Clay Banks on Unsplash

The greatest value that comes from better incident review processes is learning and knowledge.

So why is this so important?

Because it is only through continued learning and the building of knowledge that we can make more informed decisions and focus our efforts in the right areas. In the case of incident learning, it also creates the space for us to engineer greater resiliency and robustness into our systems by broadening awareness of system complexity across our organisations.

Here are some more benefits from our experience:

  • Greater prevalence of teams coming together to build common solutions to improve systems that have become brittle and neglected over time
  • Greater enthusiasm to sacrifice “Sacred Cow” systems: a colloquial label for systems that evoke fear in the minds of all who try to change them, due to the cascading failures they can potentially trigger and the wrath of business stakeholders that follows
  • Platform teams placing greater emphasis on common tooling to solve real problems for the product engineering teams with a stronger focus on security and reliability
  • Post-mortem participants being more willing to volunteer information when incidents occur, no matter how politically unpopular such factors may be, because it is safe to do so
  • The understanding that incidents are inevitable and that our systems exhibit all the characteristics typical of complex systems, which means that trying to manage them using linear, reductionist thinking is futile.

Sustaining our abilities to adapt

Doing good post-mortems, capturing valuable information and shortening feedback loops back to engineering teams does not, on its own, mean you are going to build the highest level of resiliency into your systems, although it will certainly help.

Proactive learning

Your systems continue to grow through updates, enhancements and rebuilds, all of which increase, concentrate or distribute complexity at a constant rate of change. Learning from incidents becomes one part of a greater effort to improve software engineering practices, enabling an engineering culture that values its ability to “proactively” learn from failure and not simply “respond” to it. Advanced practices such as Chaos Engineering are now becoming fundamental for organisations to adopt in order to cope with complex software systems as they grow.

Chaos Engineering: System Resiliency in Practice is a recent publication that documents a number of use cases showing how organisations have adopted these practices, and is a worthy investment of time and consideration.
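To make the idea of proactively learning from failure more concrete, here is a minimal sketch of a chaos experiment in Python. The service URL, the SLO threshold and the fault-injection hooks are all hypothetical placeholders; the point is the structure of the experiment: verify the steady state, inject a fault, test the hypothesis that the system stays healthy, then roll the fault back.

```python
# A minimal chaos-experiment sketch (hypothetical service, SLO and fault hooks).
# Structure: verify steady state -> inject fault -> test hypothesis -> roll back.
import time

import requests

HEALTH_URL = "https://orders.example.internal/health"  # hypothetical endpoint
LATENCY_SLO_SECONDS = 0.5                              # hypothetical SLO


def steady_state_ok() -> bool:
    """The service responds successfully and within its latency SLO."""
    start = time.monotonic()
    try:
        response = requests.get(HEALTH_URL, timeout=2)
    except requests.RequestException:
        return False
    return response.status_code == 200 and (time.monotonic() - start) < LATENCY_SLO_SECONDS


def inject_fault() -> None:
    """Placeholder for a real fault injection, e.g. terminating an instance,
    adding network latency, or failing a downstream dependency."""


def remove_fault() -> None:
    """Placeholder for rolling the injected fault back."""


if __name__ == "__main__":
    if not steady_state_ok():
        raise SystemExit("Abort: system is not healthy before the experiment")
    inject_fault()
    try:
        # Hypothesis under test: the system stays within its SLO despite the fault.
        if steady_state_ok():
            print("Hypothesis held: the system tolerated the fault")
        else:
            print("Hypothesis failed: a learning opportunity before a real incident finds it")
    finally:
        remove_fault()
```

In practice the fault injection and steady-state checks would come from your own tooling or a dedicated chaos platform, and experiments like this are best run in non-production environments first.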

Resilient practices at the edge

Photo by Josh Rangel on Unsplash

As we have learnt throughout this blog series and previous ones, attempting to silo software engineering practices is an anti-pattern for modern organisations. Or to put it another way: much like we realised long ago that we did not need a team called DevOps, we likewise do not need a team called Incident Learners!

Outcomes from incident learning are much more valuable when they become solid feedback loops to your engineering teams, supporting the building of more resilient software at the edge, i.e. where the work is done.

Organisations realised long ago that simply establishing a security team won’t prevent security incidents. Instead, it is more effective to build security in at the edge when designing and building systems, leaning on specialist security engineers during the development process rather than waiting for security to be engineered in after the code is deployed. In organisations with AWS accounts numbering in the tens or hundreds, this adaptation of security practices is more important than ever: it only takes a few mistakes in storage or compute configuration policies to expose sensitive data to the internet.
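As a small illustration of checking storage configuration at the edge, the sketch below uses boto3 (assuming AWS credentials and a region are already configured) to flag S3 buckets that do not have every public-access block enabled. It is an illustrative example rather than a complete guardrail; in practice a check like this would typically run in CI or be enforced through account-level policies.

```python
# Sketch: flag S3 buckets without a full public-access block.
# Assumes boto3 is installed and AWS credentials/region are configured.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        config = s3.get_public_access_block(Bucket=name)["PublicAccessBlockConfiguration"]
        fully_blocked = all(config.values())  # all four block settings must be true
    except ClientError as error:
        if error.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
            fully_blocked = False  # no public-access block configured at all
        else:
            raise
    if not fully_blocked:
        print(f"WARNING: bucket '{name}' does not block all public access")
```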

The ability to learn from incidents is no different from these modern approaches to security: it needs to become a broadly adopted and measurable practice, a continuous feedback loop into software development performed at the edge, and an important capability of software engineering roles.

The Adaptive Capacity Labs publication Markers of Progress in Incident Analysis is a great resource for measuring how well you are performing as an organisation.

Challenges for technology leadership

“Our fear of complex systems that drives us to dissect the complex system into sub-systems, leading to diverting management attention to chase local optima which are not in-line with the global objective” — Dr Eliyahu Goldratt

Adapting many of the practices discussed in this blog series could rightly be seen as a sizeable challenge for some software engineering organisations, and it is reasonable to expect that much of it will be rationalised as belonging only to the “domain of problems” that big technology companies deal with. But the unrelenting pace at which software can be developed continues to accelerate, increasing complexity and becoming a force multiplier for incidents in the process. Not having a strategy to increase proactive learning from failure is a dangerous situation to find yourself in.

Sticking with mantras such as “if it ain’t broke, don’t fix it” or “that’s just how we have always done it” will not sustain your ability to adapt to complexity as it grows. Nor will it prevent your systems’ inevitable drift into failure as deviations from imagined, steady-state behaviour become normalised over time.

Much like climate change, we cannot rely on our ability to “architect or design our way out of a crisis”, nor can we approach every problem with a reductionist, break-fix mindset.

Hopefully this blog series has provided some context and ideas to help improve resiliency in your organisation. There is a wealth of freely available information on the internet and numerous publications to draw on. But don’t just adopt the practices of other organisations; seek first to understand your unique challenges and adapt your practices to suit.
