Improving Incident Learning Part 1

Andrew Hatch
Published in SEEK blog
Mar 19, 2020 · 5 min read

The more complex our systems become, the more they surprise us by behaving in ways we don’t expect — that is what several years of incident data and decades of safety research conducted in other industries tell us.

This is a companion piece to a talk on Learning from Incidents at SEEK, presented throughout 2019 and 2020. It is equal parts war stories, historical perspectives and inspiration drawn from years of research by experts in the field of safety.

In this series we will cover:

  1. The fire-fighting days of incident management, where learning was retained only in the minds of incident responders;
  2. Why early attempts at preventing incidents using Taylorist-derived, work-as-imagined control methods did not work well, and a theory on how we came to use them in the first place;
  3. A new perspective that treats the system as equal parts human and technology, and creates an environment of psychological safety and support when facilitating incident reviews, enabling greater learning and deeper insights without reprisals or reprimands; and
  4. Resilience Engineering for the future, a new vision for complex technology systems and the people that build them.

The fire-fighting days of incident management

For a number of years we have been continually building and refining how we manage and report on incidents. Our original focus, born of a desire to control and prevent incidents, eventually led to greater efforts to improve visibility and how we communicate, report on and facilitate them. The goal is to have better incidents, not just prevent them.

Over time this has led to a gradual shift towards learning and, more importantly, using those learnings to identify and improve resiliency where levels of risk homeostasis (either anticipated or identified through patterns of function and behaviour) are higher than we would like them to be. Incidents are now seen as a normal part of system function, an outcome of living with and building complex systems, and should be treated as an unplanned investment in identifying where effort can be focused to increase the resiliency of systems.

However, there was a time, quite a few years ago, when incidents ruled the lives of our operations teams. They were managed and resolved centrally, and in isolation, such that most people in the business had no idea they were occurring with such frequency. Burnout and low morale were common in operations teams, and the nature of their work became so reactive and unplanned that any learnings from incidents existed only in the team's collective memory.

DevOps @ SEEK circa 2014/5, Zombie Ops Firefighters from hell

Our version of incident management was a library of hastily assembled scripts, manually documented processes, checklists and a long (mostly untracked) legacy of patches, workarounds and customisations that turned every system into a pampered pet needing constant feeding and watering. The impact on our platform engineering team has been well documented; the impact on our customer and application support teams has not, but they fared no better.

Moving to DevOps as a Culture in 2015 vastly changed how we managed incidents. As teams took control of the operational support of their systems, incidents were resolved faster because the engineers who wrote the code were the people responding to them. Our tooling and monitoring landscape changed to meet the demanding needs of these teams as they accelerated their ability to get new products to market and needed more flexibility in managing on-call duties.

Fundamentally two things were starting to happen when it came to learning from incidents:

  • Incident knowledge became much more localised to the teams, meaning specialised islands with deep incident experience started to flourish;
  • The sum of all incident knowledge became highly distributed, but learnings were largely undocumented, meaning broad or systemic patterns were still not being recognised.

Decentralising teams had a number of early advantages when incidents occurred, namely that incident resolution happened faster because support responsibilities were managed locally rather than outsourced. This increase in localised deployment and support knowledge enabled products to be built faster and deployed more often. But the inherent complexity within our existing systems was growing too. Soon it became impossible to determine the edges of the resiliency envelope or to track the growing rate of change and its dependencies.

A large increase in distributed integration points emerged, all featuring a mix of tight and loose couplings with each other, depending on how they had been built and what localised operational decisions teams were taking to keep them running. Components in the system were changing frequently, sometimes just to compensate for latency or degradation (either real or assumed), leading to emergent behaviours that had not occurred before.

January 2017 — When the canary started to choke

With complexity growing, incidents started to escalate as the compounding challenges of building and integrating more systems at scale began to overwhelm previously provisioned AWS infrastructure and services. Ultimately we had a combined total of 18 hours of outages and over 10,000 impacted customers in a single calendar month.

Our incident black swan event was about to happen… and it didn’t take long.

May 2017 — Then the canary died


When the WannaCry ransomware made headline news in May 2017, we immediately initiated an emergency round of patching and upgrading our servers. Part of this upgrade coincided with updating the DNS servers in our data centre to Windows Server 2016. Unknown to us (and to Microsoft), this first edition of the operating system shipped with a DNS lookup defect that was not immediately apparent for Windows servers on the same network, because Windows caches DNS entries locally.

This failure broke forwarding lookups into AWS and anything else relying on these DNS servers, exposing a major flaw in our AWS VPC networking: we had set almost all of the DHCP option sets on our VPCs to route DNS queries back to the data centre DNS servers via the Direct Connect link. When those servers failed, all DNS lookups failed. Across the board. Think of it like your cell phone being unable to dial any number even though it tells you it is connected to the network.
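
For readers less familiar with VPC DHCP option sets, here is a minimal sketch (not our actual tooling) of how you could audit which DNS servers each VPC hands out to its instances and spot the ones that still depend on on-premises resolvers. The on-premises resolver addresses are hypothetical, used only for illustration.

```python
# Minimal sketch: list each VPC's DHCP option set and the DNS servers it
# hands out, flagging VPCs that still depend on on-premises resolvers
# reached over Direct Connect.
import boto3

ec2 = boto3.client("ec2")

# Hypothetical on-premises DNS resolver addresses, for illustration only.
ON_PREM_DNS = {"10.0.0.2", "10.0.0.3"}

for vpc in ec2.describe_vpcs()["Vpcs"]:
    opts = ec2.describe_dhcp_options(
        DhcpOptionsIds=[vpc["DhcpOptionsId"]]
    )["DhcpOptions"][0]
    dns_servers = [
        value["Value"]
        for cfg in opts["DhcpConfigurations"]
        if cfg["Key"] == "domain-name-servers"
        for value in cfg["Values"]
    ]
    flag = " <-- depends on on-prem DNS" if ON_PREM_DNS & set(dns_servers) else ""
    print(f"{vpc['VpcId']}: {dns_servers}{flag}")
```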

The end result was 12 hours of continuous outage, tens of thousands of customers impacted, call centre inbound lines going into meltdown, and a mild but (mostly) controlled panic gripping the incident war room as we frantically configured hosts files and DHCP options to use AWS Route 53 until the sites and services started working again.
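
As a rough sketch of the kind of remediation described above (not our exact runbook), switching a VPC back to the Amazon-provided resolver means creating a new DHCP option set and associating it with the VPC; the VPC ID below is a placeholder.

```python
# Minimal remediation sketch (hypothetical VPC ID): point a VPC back at the
# Amazon-provided resolver instead of the on-premises DNS servers.
import boto3

ec2 = boto3.client("ec2")

# Create a DHCP option set that uses AmazonProvidedDNS (the Route 53 Resolver
# at the VPC's base+2 address) rather than forwarding to the data centre.
new_opts_id = ec2.create_dhcp_options(
    DhcpConfigurations=[
        {"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]},
    ]
)["DhcpOptions"]["DhcpOptionsId"]

# Swap the affected VPC over to the new option set.
ec2.associate_dhcp_options(
    DhcpOptionsId=new_opts_id,
    VpcId="vpc-0123456789abcdef0",  # placeholder ID
)
```

Instances only pick up new resolvers when their DHCP lease renews (or they are restarted), which helps explain why hosts files were also being edited as a stopgap while the change rolled out.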

The senior management reaction was rather predictable, given the recent history of incidents leading up to this…

And the directive from senior management was pretty clear: get control of the situation, because you are not in control.

In the next post we’ll talk about what those early control measures looked like and why they failed, and suggest a theory on why people chose to enforce them.
