How Complex Web Systems Fail — Part 2
In his influential paper How Complex Systems Fail, Richard Cook shares 18 brilliant observations on the nature of failure in complex systems. Part 1 of this article was my attempt to translate the first nine of his observations into the context of web systems, i.e., the distributed systems behind modern web applications. In this second and final part, I’m going to complete the picture and cover the other half of Cook’s paper. So let’s get started with observation #10!
10. All practitioner actions are gambles
Cook notes that all actions we take in response to an accident are just gambles. There are things we believe we know (e.g., because we built the system in such-and-such a way), but conversely, there are also things we don’t know (and even ones we don’t know we don’t know). The overall complexity of our web systems always poses unknowns. We can’t eliminate uncertainty — the guessing of what might be wrong and what might fix it.
As we learned in part 1 of this article, it’s impossible to correctly assess human performance after an accident due to cognitive errors like hindsight bias (see observation #8). A similar but distinct phenomenon is outcome bias, well illustrated by Cook:
That practitioner actions are gambles appears clear after accidents; in general, post hoc analysis regards these gambles as poor ones. But the converse: that successful outcomes are also the result of gambles; is not widely appreciated.
11. Actions at the sharp end resolve all ambiguity
More often than not, companies don’t have a clear direction when it comes to “the relationship between production targets, efficient use of resources, economy and costs of operations, and acceptable risks of low and high consequence accidents”, as Cook states. I would even go so far as to say that, in the absence of hard numbers, decisions are made following someone’s gut feeling.
This ambiguity is resolved by actions at the sharp end of the system, successful or not. After a disaster has struck in production, we’ll know, for example:
- Management’s response to failure
- What went well, what went wrong
- If we need to hire another Site Reliability Engineer
- Whether we should invest in employee training or better equipment
In other words, we’re forced to think and decide.
Once again, we need to be cautious of hindsight bias and its friends, and never “ignore the other driving forces, especially production pressure” after an accident has occurred.
12. Human practitioners are the adaptable element of complex systems
It’s people who keep web systems up and running by incrementally improving them — adapting them to new circumstances — so that they can survive in production.
The paper lists the following adaptations as examples:
- Restructuring the system in order to reduce exposure of vulnerable parts to failure.
- Concentrating critical resources in areas of expected high demand.
- Providing pathways for retreat or recovery from expected and unexpected faults.
- Establishing means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.
It’s surprisingly straightforward to translate this list into best practices in the field of web operations: decoupling of system components, capacity planning, graceful error handling, periodic backups, monitoring, code instrumentation, canary releases, and so on.
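One of the adaptations above — providing pathways for recovery from expected faults — often takes the shape of graceful error handling with retries. Here’s a minimal sketch of that idea, assuming a hypothetical `fetch_user` call to a flaky downstream service (the decorator and its parameters are illustrative, not part of any particular library):

```python
import random
import time


def retry(attempts=3, base_delay=0.1):
    """Retry a flaky operation with exponential backoff and jitter."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # recovery pathway exhausted; surface the fault
                    # back off before retrying; jitter avoids thundering herds
                    time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
        return wrapper
    return decorator


@retry(attempts=3)
def fetch_user(user_id):
    # stand-in for a call to a flaky downstream service
    return {"id": user_id}
```

Production-grade versions add timeouts, retry budgets, and circuit breakers on top of this skeleton — all of them adaptations in Cook’s sense, made by people who expect parts of the system to fail.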
13. Human expertise in complex systems is constantly changing
Complex systems require substantial human expertise in their operation and management. This expertise changes in character as technology changes but it also changes because of the need to replace experts who leave.
Furthermore, Cook writes that a complex system will always “contain practitioners and trainees with varying degrees of expertise”. Problems arise when knowledge isn’t spread equally in the team(s) responsible for the production stack.
In my experience, pair programming is a very effective way to share knowledge (yes, even in web operations). This is especially true when a legacy system is involved and your pairing partner happens to know more about it than they like to admit…
14. Change introduces new forms of failure
As a matter of fact, even deliberate changes to web systems will often have unintended negative consequences. There’s a high rate of change and often a variety of processes leading to those changes. This makes it hard — if not impossible — to fully understand how all the bits and pieces resonate with each other under different conditions. Put another way, web systems are largely intractable, which is a major reason why outages are both unavoidable and unpredictable.
Cook adds to this what I consider one of the most useful insights I gained from his paper, worth quoting in length:
The low rate of overt accidents in reliable systems may encourage changes, especially the use of new technology, to decrease the number of low consequence but high frequency failures. These changes maybe actually create opportunities for new, low frequency but high consequence failures. [Because they] occur at a low rate, multiple system changes may occur before an accident, making it hard to see the contribution of technology to the failure.
15. Views of “cause” limit the effectiveness of defenses against future events
The statement that “post-accident remedies for human error are usually predicated on obstructing activities that can cause accidents” reminds me, more than anything else, of airport security theater, which also does little to prevent further accidents.
Cook urges us to not increase the coupling of our web systems in a knee-jerk reaction to failure:
Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.
16. Safety is a characteristic of systems and not of their components
Chaos theory tells us that small causes — involving human action or not — can have large effects. Everything is connected. The paper says:
Safety is an emergent property of systems; it does not reside in a person, device or department of an organization or system. Safety cannot be purchased or manufactured; it is not a feature that is separate from the other components of the system.
Consider this example: Learning to embrace failure — a prerequisite for reliability — requires a fundamental shift in the mindset of managers and employees, if not whole companies. We can’t build reliable web systems by merely improving the codebase.
17. People continuously create safety
“Failure free operations”, as we learned today, “are the result of activities of people who work to keep the system within the boundaries of tolerable performance [on a moment by moment basis]”.
Most of these activities are well-known processes, probably documented in a runbook, such as reverting a bad deployment. Sometimes, however, it requires “novel combinations or de novo creations of new approaches” to repair a broken system. In my experience, the latter is particularly the case with irreversible failures, where you can’t simply undo the action that caused them.
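A well-known process like “revert a bad deployment” can even be automated along these lines — a sketch only, where `deploy`, `health_check`, and `rollback` are hypothetical callables standing in for your real deployment tooling:

```python
def deploy_with_rollback(deploy, health_check, rollback):
    """Deploy, verify health, and undo automatically on failure.

    This only works for reversible failures: rollback() must be able
    to undo what deploy() did. Irreversible failures (lost data,
    leaked credentials) need the "de novo" repair work Cook describes.
    """
    deploy()
    if health_check():
        return "deployed"
    # reversible failure: undo the action that caused it
    rollback()
    return "rolled back"
```

The interesting boundary is exactly the one the paragraph above draws: the moment `rollback()` can’t restore the previous state, the runbook ends and improvisation begins.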
18. Failure free operations require experience with failure
This last point is a topic near and dear to my heart.
When failure is the exception rather than the rule, we risk becoming complacent.
Recognizing hazard and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure.
For me, the perfect embodiment of this idea is Chaos Engineering, a discipline based on the realization that proactively triggering failures in a system — through intentional actions at the sharp end — is the best way to prepare for disaster.
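At its core, chaos engineering means injecting faults at a controlled rate so the surrounding system’s error handling gets exercised before a real outage does it for you. A toy sketch of the idea, with an illustrative `failure_rate` parameter (not modeled on any real chaos tool’s API):

```python
import random


def chaos(func, failure_rate=0.1, rng=random.random):
    """Wrap a function so it sometimes fails on purpose.

    failure_rate is the probability of an injected fault per call;
    rng is injectable so experiments stay reproducible in tests.
    """
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: injected fault")
        return func(*args, **kwargs)
    return wrapper
```

In practice, tools like Chaos Monkey inject failure at the infrastructure level — terminating instances, adding network latency — rather than in application code, but the principle is the same: deliberate, bounded contact with failure.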
If you want to learn more about Chaos Engineering, here are some links for you to check out:
What better way to end this article than to leave you with Cook’s response to part 1, in which he provides additional context and links to his Velocity talks. Highly recommended for further study!
Update: Cook also wrote a thorough response to part 2 with more insights on observation #18.
P.S. This article first appeared on my Production Ready mailing list.