Moving Beyond Newtonian Reductionism in the Management of Large-Scale Distributed Systems, Part 1
Author: Daniel Marcus
This post, the first of a two-part series, explores the impact of complexity in large-scale, distributed systems. Adobe Experience Platform’s Daniel Marcus asserts that traditional reductionist approaches are insufficient for understanding and predicting the failure of such systems, sometimes with tragic consequences. In Part 2, we will provide an approach to mitigate this problem, in part, by using the very features of complexity that incubate failure to create the conditions under which reliability can flourish.
This post explores the impact of complexity on the reliability of the large-scale systems we build and operate today. I’m going to assert that the growth of complexity in these systems has outstripped our ability to understand and predict their failure using traditional reductionist approaches. I’m going to walk through some examples in which these reductionist methods are catastrophically inadequate, and I’m going to offer a prescription for how we can stack the deck in our favor to achieve reliability margins that will support our business objectives. But first, I want to set the table.
The inspiration for this article arose from my attendance at SRECon19 Americas in Brooklyn earlier this year. It’s a great conference with a tremendous diversity of participants: young, old, male, female, managers, individual contributors, systems geeks, and software engineers. Everybody’s there to share war stories, argue with each other, and talk smack about running large-scale systems. There’s a lot of positive energy and enthusiasm, and a great sense of community.
At every one of these conferences, there are always a few presentations that get everybody talking. At SRECon19, one of these talks was “How Did Things Go Right? Learning More from Incidents” by Ryan Kitchens at Netflix.(1) I’m doing this very interesting talk a bit of a disservice by taking a couple of Kitchens’ assertions out of context. But they were, I think, deliberately provocative, intended to generate reflection and discussion.
“5 Whys is a competitive disadvantage.” — Ryan Kitchens, SRECon19 Americas
Holy crap! That’s heresy, isn’t it? Most of you know the “5 Whys”. It’s a process that is deeply embedded in the Ops canon. Why did the customer experience latency? Because the load balancer cluster was breathing hard. Why was the load balancer cluster not performant? Because nodes kept failing health checks and dropping out of the pool, leaving the remaining cluster underpowered. Why were health checks failing? And so on. With the 5 Whys, you drill down and in, down and in, with each question becoming more narrowly focused until you get to that nugget of darkness that brought you down. The problem, though, is that at each level of drill-down, you might be implicitly ruling out a host of contributing factors.
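To make that last point concrete, here is a minimal sketch in Python. The contributing-factor graph is entirely hypothetical (only the load balancer and health check links come from the example above; the other factors are invented for illustration). It compares a single drill-down chain with the full set of factors reachable from the same symptom.

```python
# Hypothetical contributing-factor graph: each effect maps to the factors behind it.
# Only the load balancer / health check links come from the text; the rest is invented.
CAUSES = {
    "customer latency": ["load balancer overloaded", "retry storm from clients"],
    "load balancer overloaded": ["nodes failing health checks", "traffic spike"],
    "nodes failing health checks": ["aggressive health check timeout", "slow dependency"],
}

def five_whys(effect, causes, depth=5):
    """Follow only the first listed cause at each level, producing a single chain."""
    chain = [effect]
    while len(chain) <= depth and chain[-1] in causes:
        chain.append(causes[chain[-1]][0])
    return chain

def all_factors(effect, causes):
    """Collect every factor reachable from the effect, not just one chain."""
    seen, stack = set(), [effect]
    while stack:
        node = stack.pop()
        for factor in causes.get(node, []):
            if factor not in seen:
                seen.add(factor)
                stack.append(factor)
    return seen

chain = five_whys("customer latency", CAUSES)
missed = all_factors("customer latency", CAUSES) - set(chain)
print("chain: ", " <- ".join(chain))
print("missed:", sorted(missed))
```

In this toy graph the chain ends at a single "root cause" while three other factors never get asked about. Real incidents are messier than a dictionary, of course, but the shape of the blind spot is the same.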
“There is no root cause.” — Ryan Kitchens, SRECon19 Americas
No root cause? What?! What does that even mean? There’s always a root cause. You just have to find it! And if you can’t, you’re not looking hard enough. But hold the phone. I think most of us have been in the unenviable position of either explaining to our managers why we can’t find an elusive root cause or, as managers, leaning on our team to find one. Just find it, man! It’s gotta be there.
So, you look at logs until your eyes bleed, cross-correlate every incident and maintenance on God’s green earth, and you just can’t find anything. Fortunately, this doesn’t happen often — but it does happen.
There was a Pipeline incident late last year in which the API cluster spontaneously stopped communicating with the Kafka brokers and then started again about 90 minutes later. There had been a network engineering maintenance to which we were, of course, eager to assign blame. But that turned out to be a red herring. After about a week of flogging Splunk and seeking correlations with upstream and downstream dependencies that got increasingly far-fetched, we just gave up. We had to move on. So yeah, sometimes that nugget of darkness will defy your reductionist drill-down.
As you might imagine given these assertions, Ryan’s talk generated quite a lot of hallway buzz during the conference. (By the way, I want you to know that I did the artwork for this article entirely by myself. I didn’t get any help from anyone…)
As my manager is fond of saying, let’s unpack that a little bit. In fact, let’s unpack it a lot.
I personally think that the Cartesian coordinate system is one of mankind’s greatest inventions, right up there with the printing press, the pull request, Python (… and maybe Monty Python). One thing that’s cool about it is that it’s an abstraction that enables other abstractions: calculus, linear algebra, modern analysis, regularity theory, and so on. If you take these two or three dimensions and abstract them out to “n” dimensions, you will get (among thousands of other applications) the scaffolding for machine learning. For example, the basis for some recommendation algorithms is Euclidean distance minimization in an n-dimensional vector space.
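As a rough illustration of that last sentence, here is a minimal Python sketch of a nearest-neighbor recommendation based on Euclidean distance. The item names and their four-dimensional vectors are made up for the example; a real recommender would learn such vectors from data and would have far more dimensions.

```python
import math

# Hypothetical item embeddings: each item is a point in a 4-dimensional space.
ITEMS = {
    "space_documentary": [0.9, 0.1, 0.0, 0.3],
    "rocket_engineering": [0.8, 0.2, 0.1, 0.4],
    "baking_show": [0.0, 0.9, 0.7, 0.1],
    "sketch_comedy": [0.1, 0.3, 0.9, 0.8],
}

def recommend(user_vector, items, k=2):
    """Return the k item names closest to user_vector by Euclidean distance."""
    ranked = sorted(items, key=lambda name: math.dist(user_vector, items[name]))
    return ranked[:k]

# A hypothetical user whose tastes sit close to the first item.
user = [0.9, 0.1, 0.0, 0.35]
print(recommend(user, ITEMS))
```

The same `math.dist` call works whether the vectors have two coordinates or two thousand, which is the point: the grid generalizes.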
More than anything else though, this grid encodes a worldview that comes to us across 300 years, since the time of Newton and Descartes. This worldview holds that the universe is rational, and predictable via cause and effect. You solve problems by breaking them down into smaller and smaller problems in a particular sequence. The world is linear. That is to say, smooth inputs yield smooth outputs.
A metaphor for this worldview is Laplace’s Demon. According to Laplace, in a completely deterministic universe, causal bonds link past, present and future. Laplace further argued that complete specification of the state of the universe at any given time allows one to perfectly predict the future and retrace the past. Laplace’s Demon is a hypothetical entity that possesses such complete knowledge.
Or maybe it’s the Three-Eyed Raven. (If you somehow missed the global phenomenon of Game of Thrones, I apologize. I’d be happy to explain it to you if you have about eight hours and beer. But, I can almost guarantee that you won’t like the ending.)
Brandon Stark notwithstanding, as software engineers, as systems engineers, we can’t all be Laplace’s Demon.(3) We have to be smarter than Laplace’s Demon.
This is a map of city-to-city internet connections in the United States as of 2016. I don’t think there’s a smooth topological map that gets you from that nice Cartesian grid to this.
Real technology systems are complex.
They’re more about the relationships between parts than the parts themselves. They are sensitive to initial conditions. This is the idea of the Butterfly Effect — a butterfly flaps its wings in Brazil and spins up a hurricane in Florida. Of course, that’s a ridiculous extreme. But, the main idea is that small changes in inputs can give rise to unexpected, nonlinear responses.
It’s just one line of code, really! It’s just a config file change. We don’t even need a CMR! And then you spend the next five days in meetings that you’d really rather not be in. You’ve all been there. I know you have. I know I have.
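If you want to watch sensitivity to initial conditions happen on your own laptop, here is a small Python sketch. It uses the logistic map, a standard textbook example of chaotic behavior. It is a stand-in chosen for illustration, not a model of any production system, and the starting values and parameter r = 4.0 are my own assumptions.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate x -> r * x * (1 - x) and return the list of visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000)
b = logistic_trajectory(0.200001)  # starting point differs by one part in a million

# Print how far apart the two trajectories are at a few checkpoints.
for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:2d}: |a - b| = {abs(a[step] - b[step]):.6f}")
```

The two runs start out indistinguishable and end up unrelated. That tiny difference in the sixth decimal place is the one-line config change.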
They exhibit emergent properties.
What are emergent properties? These are large-scale attributes that arise from many small-scale interactions. Wetness is an emergent property that arises from weak interactions between water molecules. Our neurons have basically two states, firing and not firing. And yet, in aggregate, this miraculous thing called consciousness results: an emergent property.
Complex systems are open to their environment. Externalities affect their state, and sometimes bidirectional feedback loops are created in which the system impacts the environment, which in turn further impacts the state of the system.
Complex systems typically reside far from equilibrium (in no small part due to those pesky nonlinear, bidirectional feedback loops).
Complex systems are often in competition for scarce resources — money, staff, time, executive attention.
So what does all this mean?
Understanding the last incident may not help prevent the next one.(5) Remediation may contribute to future incidents. In fact, it probably will. Why? Remediation changes the state of the system and thus renders it vulnerable to emergent behaviors. You scale a cluster horizontally, you’ve changed the number of agents in the system. You scale it vertically, you’ve changed the weight of an individual agent. The state of the system changes and its relationship with other systems and its environment changes, allowing new behaviors to emerge.
If you follow this down the rabbit hole, the place you get to is that not only are incidents inevitable, but they are the inevitable result of success. Your product launch goes well and you greenlight new features. New features beget new behaviors, some of them pathological and emergent and difficult or impossible to predict.
Drift happens. What do I mean by that? Complex systems are comprised of agents and constraints. In complex technology systems, the primary constraints tend to be cost, workload, and safety.
So your system lives in the region bounded by these constraints, in a kind of unstable equilibrium, and tends to drift towards or away from the constraint boundaries as circumstances change (workload, cost, staffing levels, regulatory requirements) or as the internal state of the system changes (scale, features). And it can drift right over the edge into failure.
I loved the Road Runner cartoons. I’m reminded of being a kid, sitting in front of the TV in my jammies on Saturday morning with a big bowl of Cap’n Crunch in my lap. But in fact, drift is deadly serious.
On January 28, 1986, the space shuttle Challenger exploded 73 seconds into its flight, killing all seven crew members. The cause was attributed to the failure of an o-ring seal between elements of the solid-fuel rocket boosters, allowing pressurized flame within the motor to escape confinement, resulting in the destruction of the vehicle.
On April 20, 2010, an explosion on the Deepwater Horizon oil rig killed 11 crew members, igniting a fire that sank the rig two days later. The explosion also left crude oil gushing from the seabed until it was finally capped nearly three months later on July 15. We learned later that the blowout was attributed to the failure of a component called a “blowout preventer”.(8)
You can imagine the conversations. “Why wasn’t the blowout prevented?” “Well, sir, it appears that the blowout preventer failed.” Cartesian reductionism at its finest.
This rather horrifying picture is almost certainly photoshopped (by the way, I think it’s kind of awesome that, as an Adobe employee, I work at a company whose products have become verbs). But, the situation it represents was very real. On January 31, 2000, Alaska Airlines Flight 261, en route from Puerto Vallarta to Seattle, suffered a loss of pitch control resulting from the failure of the horizontal stabilizer trim system.
Specifically, the jackscrew assembly nut threads in the trim system were stripped due to insufficient lubrication. As the pilots fought to control the aircraft, it rolled, and they flew it in this inverted state for some time before they ultimately lost control and it crashed into the Pacific, killing everybody aboard.
In each of these incidents, the post mortem investigations drilled down to the failure of an individual component — the o-ring, the blowout preventer, the jackscrew assembly. The failed component, however, is only a small part of the story. It’s not unimportant, but too much focus on the failed component is a dodge — it’s a seductive obfuscation that inhibits understanding of the real causes of failure.
Look at the Challenger disaster. There was a tremendous multi-agency effort, a Presidential Commission, to understand what happened. They drilled down into the o-ring failure, the test data, the management decision to launch that day in spite of record cold temperatures that impacted the o-ring elasticity. Entire forests were decimated for the paper the reports were written on. And yet, 17 years later, on February 1, 2003, the shuttle Columbia disintegrated on re-entry due to failure of the thermal protection system, killing everyone aboard. This is a completely different root cause. But, the systemic and organizational defects that incubated the failure were identical. Nothing had been learned.
The story behind each of these failures is a story of drift.
Engines of Drift
There are five concepts that together comprise the engines of drift(11), and they have a broad intersection with the properties of complex adaptive systems stated above.
Scarcity and competition
No organization operates in a vacuum. These are open systems in continual transaction with what happens around them. The space shuttle program was in hot competition with the International Space Station for funding. At the same time, the planned shuttle mission frequency increased from two missions a year to 14 missions a year over a two-year period, badly stressing the entire support ecology. The Deepwater Horizon at the time of the explosion was 43 days behind schedule, at a cost of $500,000/day just to lease the rig, not to mention salaries and other support costs.
Decrementalism (small steps)
Due to a variety of organizational factors, the lubrication interval for the jackscrew assembly in the Alaska 261 disaster increased bit by bit over a period of years until it was well outside the original design specifications. It was still considered to be within a normal operating range because the frame of reference slowly changed. It’s like being nibbled to death by ducks. You drop the frog in cold water and crank up the heat so slowly it doesn’t know it’s being boiled alive. This is the normalization of deviance.
For example, we might observe out-of-memory errors that mysteriously migrate from host to host in a cluster. We’re unable to diagnose it but we surmise that we can fix the problem by doing a rolling restart. So, we do that. Then it shows up again like a bad penny after a couple of days, and so we apply the same fix. And about a day later it shows up again. Under time pressure (maybe an imminent release) we write a cron job that automates the rolling restart on a daily basis. Problem solved, right? Of course not. The actual problem is obfuscated, and in the fog of war, this hack becomes fossilized as part of normal operations. The pathological behavior has been normalized, and the poison remains in the system until it manifests in some other interesting manner. But this could never actually happen, right?
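Here is a minimal Python sketch of how a slowly shifting frame hides drift. All of the numbers are hypothetical (they are not the actual Alaska 261 maintenance intervals), and the two thresholds are assumptions picked for illustration. Each proposed change is judged twice: once against the value currently in use, and once against the original design baseline.

```python
def review_changes(baseline, proposals, max_step=0.25, max_total=0.50):
    """Compare each proposed interval against the previous value and the baseline.

    Returns a list of (proposal, passes_local_check, passes_baseline_check).
    """
    results = []
    current = baseline
    for proposed in proposals:
        step_change = (proposed - current) / current      # versus what we run today
        total_change = (proposed - baseline) / baseline   # versus the design spec
        results.append((proposed, step_change <= max_step, total_change <= max_total))
        current = proposed  # the frame shifts: this becomes the new "normal"
    return results

# Hypothetical maintenance interval in hours, extended roughly 20% at a time.
results = review_changes(100, [120, 144, 172, 206])
for proposed, local_ok, baseline_ok in results:
    print(proposed, "local check:", local_ok, "baseline check:", baseline_ok)
```

Every step passes the local check, because each one is only a modest change from yesterday. Only the comparison against the original baseline shows that the last two proposals are far outside where the system started.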
Sensitive dependence on initial conditions
There’s that butterfly again. Small local decisions. Change the lubrication interval. Ignore anomalous data on o-ring degradation. It’s just one line of code. It’s just a config file change! We don’t even need a CMR!
Unruly technology
Unruly technologies introduce and sustain uncertainties about how, when, and why things fail. Ocean oil drilling technology is extremely unruly. In order to optimize rig location, sometimes you have to drill not just down but sideways under the ocean floor. This is bleeding-edge technology and difficult or impossible to test in the wild. We face a similar problem every day in coming up with testing strategies for the massive scale at which Adobe Experience Platform needs to operate.
Contribution of the protective structure(14)
This is a weird one. Because of the nature of their continual transaction with their environment, complex systems tend to collude with the protective structures that are intended to prevent them from failing. The web of relationships (governing bodies, working groups, manufacturer inputs) can block or redirect signals that the system is at risk. This certainly happened with the Challenger disaster. In the case of Alaska 261, there were multiple, lengthy processes by which maintenance guidelines were produced, validated, and published, and these processes memorialized the increasingly bogus maintenance intervals and allowed the system to drift right off the edge.
So, in spite of our best efforts — sometimes even because of our best efforts…shit’s gonna break.
What can we do about this? Are we at the mercy of complexity? What about the 5 Whys? And, what about Newton? In my next post, we’ll address these questions with an example of how we used the four key ingredients of high-reliability organizations to limit the impact of complexity on the operation of a large-scale production system: Adobe Experience Platform Pipeline.
1. Kitchens, Ryan. 2019. How Did Things Go Right? Learning More from Incidents. SRECon Americas, March 25–27, 2019.
2. Dekker, Sidney. 2011. Drift Into Failure: From Hunting Broken Components to Understanding Complex Systems. Boca Raton, Florida: CRC Press. 234 pp.
3. @aaronblowhiak. Why do we expect every engineer and operator to be Laplace’s Demon? Twitter, 11 Oct 2018.
4. Dekker, Sidney. 2011. Ibid.
5. Fischhoff, Baruch. 1975. Hindsight is not foresight: The effect of outcome knowledge on judgment under uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 1 (3), pp. 288–303.
6. Rasmussen, Jens. 1997. Risk management in a dynamic society: A modeling problem. Safety Science, 27 (2/3), pp. 183–213.
7. Feynman, R.P. and R. Leighton. 1988. “What do you care what other people think?”: Further adventures of a curious character. New York, New York: Norton. 288 pp.
8. Lustgarten, Abraham. 2012. Run To Failure: BP and the Making of the Deepwater Horizon Disaster. New York, New York: Norton. 391 pp.
9. Dekker, Sidney. 2011. Ibid.
10. Feynman, R.P. and R. Leighton. 1988. Ibid.
11. Dekker, Sidney. 2011. Ibid.
12. Vaughan, Diane. 1996. The Challenger Launch Decision: Risky Technology, Culture and Deviance at NASA. Chicago, Illinois: University of Chicago Press. 620 pp.
13. Dekker, Sidney. 2011. Ibid.
14. Dekker, Sidney. 2011. Ibid.