How Complex Web Systems Fail — Part 1

https://flic.kr/p/zWeRv

There’s this one paper that keeps popping up on my radar. I think it’s about time I give it the attention it deserves. I’m talking about How Complex Systems Fail by Richard Cook. This seminal paper, published in 2000, covers 18 sharp observations on the nature of failure in complex medical systems. The nice thing about these observations is that most of them hold true for complex systems in general, including our beloved web systems.

Distributed web-based systems are inherently complex. They’re composed of many moving parts — web servers, databases, load balancers, CDNs, routers, and a lot more — working together to form an intricate whole.

In this article, which is part 1 of 2, I’ll go through the first half of Cook’s observations, one by one, and try to translate them into the context of web systems. (In part 2, I’ll cover the other half.)

1. Complex systems are intrinsically hazardous systems

This is certainly true for safety-critical systems in industries like medicine, transportation, or construction where errors can mean the difference between life and death. While most web systems fortunately don’t put our lives at risk, the general response to failures is the same: creating defense mechanisms against potential hazards inherent in those systems. Which brings us to the next point…

2. Complex systems are heavily and successfully defended against failure

We put countermeasures in place — backup systems, monitoring, DDoS protection, runbooks, GameDay exercises, etc. — because we dread the consequences of failure, such as service outages and data loss. These measures are supposed to “provide a series of shields that normally divert operations away from accidents”. And luckily, they’re successful most of the time.

3. Catastrophe requires multiple failures — single point failures are not enough

Cook writes:

Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure.

Most failure trajectories are successfully blocked by the aforementioned defenses or by the system operators themselves.
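To make "necessary but only jointly sufficient" concrete, here is a minimal Python sketch. The three latent failures and the rule that an outage needs all of them are assumptions made up for illustration; they don't come from Cook's paper or any real incident.

```python
# Hypothetical latent failures, each one harmless on its own.
LATENT_FAILURES = ("stale_backup", "alert_muted", "runbook_outdated")


def catastrophe(active: frozenset) -> bool:
    """Toy model: an outage happens only when every latent failure is active."""
    return all(failure in active for failure in LATENT_FAILURES)


def necessity_report() -> dict:
    """For each failure, record (causes outage alone, outage still happens without it)."""
    everything = frozenset(LATENT_FAILURES)
    report = {}
    for failure in LATENT_FAILURES:
        alone = catastrophe(frozenset({failure}))
        without = catastrophe(everything - {failure})
        report[failure] = (alone, without)
    return report


if __name__ == "__main__":
    for name, (alone, without) in necessity_report().items():
        print(f"{name}: sufficient alone={alone}, outage without it={without}")
```

In this toy model no single failure is sufficient, and removing any one of them prevents the outage, so none of them can be singled out as the cause.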

Later in this article, you’ll learn why there’s no such thing as a single root cause.

4. Complex systems contain changing mixtures of failures latent within them

High complexity ensures there are multiple flaws — bugs — present at any given moment. Operators have to deal with ever-changing failures due to “changing technology, work organization, and efforts to eradicate failures”. Anyone who’s worked on a larger software project knows this is true. At some point, someone did something and it had an unintended consequence.

According to Cook, we don’t — and can’t — fix all latent bugs because of “economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident”. We’re prone to think of these individual defects as “minor factors during operations”. However, as we just learned, several of these supposedly minor factors can lead to catastrophe.

5. Complex systems run in degraded mode

A consequence of the preceding observation is that “complex systems run as broken systems”. Most of the time, they continue to work thanks to redundancies — database replicas, server auto scaling, etc. — and thanks to knowledgeable operators who fix problems as they arise.
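As a rough illustration of running in degraded mode, the sketch below reads from a list of database replicas and keeps serving as long as one of them answers. The `Replica` class, the `read_with_failover` helper, and the replica names are hypothetical, not taken from any real database client.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for a database replica."""
    name: str
    healthy: bool = True

    def read(self, key: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{key}@{self.name}"


def read_with_failover(replicas, key):
    """Return (value, names of replicas that failed before one answered)."""
    failed = []
    for replica in replicas:
        try:
            return replica.read(key), failed
        except ConnectionError:
            # The system is now degraded, but it can still serve the request.
            failed.append(replica.name)
    raise RuntimeError(f"all replicas failed: {failed}")


if __name__ == "__main__":
    replicas = [Replica("db-1", healthy=False), Replica("db-2")]
    value, failed = read_with_failover(replicas, "user:42")
    print(value, "served despite failures on", failed)
```

The caller still gets an answer while one replica is broken, which is exactly why such degradation tends to go unnoticed unless the list of failed replicas is monitored.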

But at some point systems will fail. It’s inevitable.

A postmortem might find that “the system has a history of prior ‘proto-accidents’ that nearly generated catastrophe” and that operators should have recognized the degradation in system performance before it was too late. However, that’s an oversimplified view. We need to realize, instead, that “system operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously”. Attribution is not that simple, as you’ll see in a minute.

6. Catastrophe is always just around the corner

“Disaster can occur at any time and in nearly any place. The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.”

Just because there are no problems now doesn’t mean it’s going to stay that way. Sooner or later, any complex system will fail. That’s why operators should never get too comfortable.

As I wrote in the past, complacency is the enemy of resilience. The longer you wait for disaster to strike in production — merely hoping that everything will be okay — the less likely you are to handle emergencies well, both at a technical and organizational level.

7. Post-accident attribution to a root cause is fundamentally wrong

In complex systems, such as web systems, there is no root cause. Instead, accidents require multiple contributors, each necessary but only jointly sufficient. In the words of Cook:

Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible.

One of the reasons we tend to look for a single, simple cause of an outcome is that the failure is too complex to keep in our heads. Thus we oversimplify without really understanding the failure’s nature and then “blame specific, localized forces or events for outcomes”.

8. Hindsight biases post-accident assessments of human performance

The key point made here:

Hindsight bias remains the primary obstacle to accident investigation, especially when expert human performance is involved.

Wikipedia has a good explanation of hindsight bias:

Hindsight bias, also known as the knew-it-all-along effect […] is the inclination, after an event has occurred, to see the event as having been predictable, despite there having been little or no objective basis for predicting it.

Which tells us that it’s impossible to accurately assess human performance after an accident, e.g. when doing a postmortem. Still, many companies continue to blame people for mistakes when they should really blame — and fix — their broken processes.

9. Human operators have dual roles: as producers and as defenders against failure

Operators actually have not one but two roles, each with its own demands. On the one hand, they operate the system so that it can do what it’s supposed to do. On the other hand, they defend the system against failures. According to Cook, this poses the following problem:

Outsiders rarely acknowledge the duality of this role. In non-accident filled times, the production role is emphasized. After accidents, the defense against failure role is emphasized. At either time, the outsider’s view misapprehends the operator’s constant, simultaneous engagement with both roles.

This duality reminds me, in a way, of today’s Site Reliability Engineers, who are responsible for keeping services available and fast enough, and who also develop the software and systems behind those services. This dual responsibility is, in fact, at the heart of SRE. I’m glad our industry has started to embrace this idea.


That’s the end of part 1. Of course, I encourage you to read the original treatise to get the whole picture. I found a lot of value in it — and so might you.

P.S. This article first appeared on my Production Ready mailing list.
