Improving Incident Learning Part 2

Andrew Hatch
Published in SEEK blog
Mar 25, 2020 · 7 min read

This is part 2 of a companion piece to a talk that was presented throughout 2019 and 2020: Learning from Incidents at SEEK.

To avoid a repeat of what became known as The Great DNS Outage of 2017 (and the record number of incidents in the months leading up to it), a new and more detailed incident management process was established. This new process was designed to address the perceived lack of control over the rapid rate of system change, the unchecked instability of system performance, and the ever-increasing outages.

One of the first gaps this new process sought to address was the lack of detailed information on past incidents. Without that historical detail, it was impossible to form a view of the frequency and severity of incidents over time. So, to gain better insights, we established post-mortems for all high-severity incidents. The primary aim of the post-mortem was to capture as much detail from an incident as possible, with a strong focus on lagging indicators. We were also careful to make it a blameless process, focused purely on facts. This helped ensure participants felt safe to disclose details, were proactive about reporting system failures when they occurred, and did not wait to be asked whether something was going wrong when it inevitably did.

Photo by Emily Morter on Unsplash

The root cause may not be the sole cause

Post-mortem facilitators were encouraged to use the 5 Whys investigation technique to identify the root cause of the problem. During the process, participants were also encouraged to identify a suite of localised action items for remediation, to prevent the incident from occurring again. Once gathered, the post-mortem and action-item data was written up and stored in separate document and incident-register systems.
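For readers unfamiliar with it, a 5 Whys chain walks backwards from the symptom, one answer at a time, until a single "root cause" remains, and that cause is what the record ultimately captures. The sketch below is purely illustrative, not SEEK's actual tooling or schema; the incident ID, severity label and example answers are all invented. It simply shows how this style of record, by design, ends at one cause.

```
from dataclasses import dataclass, field
from typing import List

@dataclass
class FiveWhysPostMortem:
    incident_id: str
    severity: str
    whys: List[str] = field(default_factory=list)    # each answer feeds the next "why?"
    root_cause: str = ""                              # the single cause the chain ends on
    action_items: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        """Record one link in the causal chain (typically five in total)."""
        self.whys.append(answer)
        self.root_cause = answer  # the last answer becomes "the" root cause

# Hypothetical example: the chain terminates on a single cause, and anything
# outside it (alerting gaps, coupled services, time pressure) is never recorded.
pm = FiveWhysPostMortem(incident_id="INC-042", severity="SEV1")
for why in [
    "The search API returned 5xx errors",
    "A downstream cache node was saturated",
    "A deploy doubled the cache key cardinality",
    "The change was not load-tested",
    "There is no load-test step in the deploy pipeline",
]:
    pm.ask_why(why)
pm.action_items.append("Add a load-test step to the deploy pipeline")
```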

Once a month, the incident and post-mortem data was manually extracted and collated to produce detailed reports for senior management. These reports illustrated how incident frequency and severity were changing over time.
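The collation involved was essentially a monthly roll-up of counts by severity. Below is a minimal sketch of that kind of aggregation using pandas; the column names, severity labels and dates are assumptions for illustration, not the actual structure of the incident register we used.

```
import pandas as pd

# Invented sample data: column names and severity labels are illustrative only.
incidents = pd.DataFrame(
    {
        "opened_at": pd.to_datetime(
            ["2018-01-04", "2018-01-19", "2018-02-02", "2018-03-11"]
        ),
        "severity": ["SEV1", "SEV2", "SEV1", "SEV2"],
    }
)

# Count incidents per month and severity: the lagging-indicator view
# that the monthly management reports were built around.
monthly = (
    incidents
    .groupby([incidents["opened_at"].dt.to_period("M"), "severity"])
    .size()
    .unstack(fill_value=0)
)
print(monthly)
```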

Looking back, there were some challenges with this initial investigative approach.

Firstly, by focusing on the root cause, facilitators would hone in on only the one factor believed to be the sole cause of the incident. Other contributing factors were not explored in depth, meaning valuable learnings went uncovered, undocumented and unshared. Secondly, the systems used for storing post-mortem and action-item data were not integrated with the systems the engineering teams used to manage their day-to-day work priorities. This made correlating and referencing incident data in real time difficult, and it was quickly forgotten.

Thirdly, because the incident and severity reports focused only on how the new processes were regaining control over system instability, and not on what was being done well, teams did not see them as a valuable resource for building more resilient systems.

The weekly incident review — we didn’t make it too comfortable

Every week on a Friday afternoon, operators involved in incidents that impacted customers were asked to attend a meeting to replay the post-mortem information. There they would answer further questions, receive more action items specific to their localised areas of control, and request closure of action items already remediated.

These meetings were chaired by senior management who had been tasked with improving system reliability and stability. For many engineers the thought of attending a meeting like this, after already experiencing the psychological distress of a major incident only a week before, was not a pleasant one.

These feelings were made worse when attendees were told upon entering the meeting that it was “not intended to be too comfortable”. In other words, we were prejudging people before they arrived, believing they simply weren’t doing what we expected of them, and therefore assuming that the root cause of most incidents was, really, human error.

The Hawthorne Effect was strong in these meetings, and the mood of people entering and leaving was quite predictable.

Psychological safety — you’re doing it wrong

The human error conundrum and hindsight bias

One way we fall into the trap of seeing problems as human error is when we are not intimately familiar with the complexity of the work being done. As complexity within the system grows, the gap widens between how we imagine work is being done and how it is actually being done.

In our case, the increasingly rapid rate of system change, as teams moved faster with greater autonomy, meant the technology landscape and the way people worked were shifting quickly too. The ability to “put yourself in the operator’s shoes” was becoming harder as time went on.

This absence of an “empathy for complexity” was why many of the questions management asked in incident reviews, already coloured by the presumption of human error, were prone to falling into the trap of hindsight bias.

How to Hindsight Bias

Furthermore, questioning often resulted in long-winded technical discourses that focused more on the minutiae of the post-mortem and less on the broader, more systemic issues. Non-technical people began to drift out of these meetings (the overly technical conversations were not inclusive for them), further isolating the engineers and making incidents seem like a “tech problem”.

But incidents did decrease…

Despite the challenges with post-mortem data collection and incident review questioning, an interesting phenomenon started to occur. The monthly reporting data showed that incidents were broadly starting to trend downwards. In fact, there were several months during 2018 when no high-severity incidents were recorded at all.

Goodwill messages were sent out every month in which no high-severity incidents occurred. Everyone was broadly commended for keeping systems operational and avoiding negative impacts on customers, and the processes put in place were championed and maintained.

Or so we thought….

The reality was that high-severity incidents were still happening.

Unless other teams happened to ask in public Slack channels whether something was not working as expected, or customers rang customer service to complain, incidents weren’t being voluntarily reported. Due to the notoriety of the weekly incident review meetings, some engineers didn’t see them as a safe environment and began avoiding incident reporting and covering up the true impact of incidents.

Obsessing over the root cause in post-mortems also meant the data was not thorough enough, remediation actions could resemble band-aid solutions, and incidents kept happening regularly because learnings were not being shared.

A retrospective on the weekly incident review process surfaced many issues with the initial approach, and it was revamped to create a more supportive environment. Over time, this simple change led to a renewed focus on system health metrics for all teams, which could be discussed easily in wall-walks; greater emphasis on improving and decommissioning old systems; and a process that was, overall, more open and supportive.

A valuable lesson had been learnt: create a supportive environment with a strong emphasis on psychological safety in incident post-mortems, and you will learn a lot more from your on-call staff about how your systems are performing.

But still, high severity incidents kept happening.

So if incidents were still happening, and neither the stick nor the carrot approach was preventing them… then… why?

There were two main reasons:

Firstly, as an organisation we were not learning enough from our incidents: we were not looking for leading indicators, and we were not doing enough to highlight and socialise what we were doing well. We were too focused on the quantitative and not enough on the qualitative.

Secondly, we were not questioning the theory behind the approaches being taken; it was just what people had always done to manage incidents. Get the incident data, find the root cause, publish the stats for emphasis, direct engineers to fix the problem, move on. One band-aid at a time.

This conditioned thought process was why identifying the root cause of an incident became so seductive. It was the “ah-ha!” or “case closed” moment, something that appealed to our natural, instinctive tendency to simplify complex and difficult problems using linear, reductionist methods. As technologists we are conditioned to find the least-cost path through complexity, so it is understandable that we see the cause and effect of an incident as a 1:1 relationship.

5 Whys to the Root Cause… but what else are we missing here?

This is why localised action items resembled band-aid solutions — they did not address the broader, more systemic issues in our complex system. It was like trying to find the cause of signal faults randomly affecting an entire railway network by only looking at one stretch of track. A remediation-action-item game of Whac-a-Mole.

However, if we can train ourselves to see incidents as part of the normal functioning of complex, dynamic systems, then we start to think differently about how these systems fail. Moreover, we develop better methods to sustain our adaptation to them, and we can begin to see the relationship between system and operator as a sociotechnical system.

Our next post will present a theory on how we ended up thinking and approaching our incidents in the way that we did, before diving into how a change in perspective led to a much more proactive way of coping with them.
