Moving Beyond Newtonian Reductionism in the Management of Large-Scale Distributed Systems, Part 2
In Part 1 of our series on complexity in large-scale, distributed systems, we called into question the utility of traditional reductionist approaches for understanding and predicting their failure, sometimes with tragic consequences. In this post, Adobe Experience Platform’s Daniel Marcus provides an approach to mitigate this problem, in part, by using the very features of complexity that incubate failure to create the conditions under which reliability can flourish.
In my last post, I discussed the efficacy of Newtonian reductionism in the face of the variance and uncertainty imposed by the unfortunate fact that in the real world, complex, adaptive, nonlinear systems hold sway.
Well, Newton isn’t obsolete. In fact, let’s be fair: that worldview, that linear reductionism, is responsible for most of what we call modern civilization. The 5 Whys still have to be part of our toolkit. The root cause isn’t fake news. But these techniques are a beginning, not the end game. They are table stakes. And, by themselves, they’re not enough.
Four ingredients of high-reliability organizations
According to an article in the Naval War College Review(1), there are four ingredients of high-reliability organizations:
- Leadership commitment: It’s gotta come from the top.
- Redundancy: Systems, networks, environments, skills. It’s how we’re going to build a four-9’s platform out of three-9’s elements.
- Decentralization, culture, continuity: Delegate the command structure for reliability across the teams. Bake it into day-to-day operations.
- Organizational learning: Experiment. Drill. Learn from mistakes and re-inject those learnings into the organization.
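The redundancy arithmetic is worth making explicit. Assuming element failures are independent (a big assumption, and precisely what the other three ingredients help protect), the availability of redundant three-9’s elements can be sketched like this — illustrative numbers, not Adobe SLAs:

```python
def combined_availability(single_element: float, replicas: int) -> float:
    """Availability of a system that works as long as at least one of
    `replicas` independent elements is up: 1 - P(all replicas down)."""
    return 1 - (1 - single_element) ** replicas

three_nines = 0.999  # one element: roughly 8.8 hours of downtime per year

# Two independent three-9's elements: 1 - (0.001)^2 = 0.999999 (six 9's),
# comfortably past a four-9's target -- *if* failures really are independent.
print(combined_availability(three_nines, 2))
```

The catch, of course, is independence: correlated failure modes (shared config, shared network, shared assumptions) erode this math, which is why redundancy has to span systems, networks, environments, and skills rather than just hosts.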
So let’s take that to the street. What can we do, concretely, to implement these principles and buffer ourselves against the impact of complexity?
It turns out that there is a lot we can do, and at Adobe, we’re doing a lot of it:
- Build observable systems, systems that introspect, that are transparent, that talk to us.
- Test in prod or near-prod. Conduct experiments to tease out failure modes via chaos engineering.
- If incidents are inevitable, let’s get really, really good at managing them. Drill, measure, and track metrics.
- Make communications around incidents bulletproof.
- Design applications that fail gracefully along functional subsystem boundaries.
- Build systems that manifest business strategy. What do I mean by that? Say part of your business strategy is to improve customer retention. I mean, it has to be, right? And part of your retention program is to harden your systems, to increase availability. Part of your hardening program might be to limit incidents that arise from configuration drift. A tool like Chef is based on a “mutable infrastructure” paradigm(2); that is, it runs updates in place on your existing hosts. Over time each host develops a unique change history, and eventually you have a cluster full of snowflakes, vulnerable to subtle defects that arise from config deltas across the cluster and between test and prod infrastructures. With a tool like Terraform, by contrast, changes are deployments of a new image. This “immutable infrastructure” paradigm(3) eliminates an entire class of problems and a huge potential for drift. Now, there’s no such thing as a free lunch. The learning curve for Terraform is brutal and unforgiving, as anyone who has inadvertently blown away an entire environment can attest. But that is a training issue, more benign and local than configuration drift.
- Manage change — make it incremental and reversible where possible.
- Build great learning organizations. Institutionalize retrospectives and blameless post-mortems. Inject lessons learned back into our DNA so that we don’t make the same mistakes over and over again.
- Cross-train — across disciplines, across teams. This not only creates redundancy in subject matter expertise but also supports diversity of thought and opinion.
- Quality needs to permeate everything we do, and it should be standardized across solutions. An example of this is Adobe’s Operational Readiness program: the combined efforts of Adobe’s site reliability and product engineering teams to ensure that, prior to release, applications have everything they need to function properly and operational risks are minimized.
- Un-silo. We are trained, maybe deep in our ancestral DNA, to be tribal and local — to build and protect our fiefdoms. Here in Adobe Experience Platform, that’s the kiss of death. As noted earlier, failures emerge in environments characterized by scarcity and competition. Technical and social antipatterns self-reinforce within silos. So collaborate, cooperate, and work across organizational boundaries. We need to build a deep bench with the strength to support both ephemeral cross-team efforts to fight local fires and more durable constructs, such as topical centers of excellence that span solution boundaries, like Adobe’s Chaos Engineering program.
- Pay attention to the squishy human stuff. Practice openness and transparency. Practice active listening. Build leadership skills, and practice self-examination. As managers, aggressively defend work-life balance on our teams. Support diversity in all its manifestations — certainly race, culture, faith, gender, sexual orientation — but also diversity of thought, opinion, and approach to problem-solving. Diversity is not just a checkmark on your hiring pipeline; it’s a competitive advantage. Be respectful, be kind, and treat others the way you’d like to be treated. This not only makes our workplaces more pleasant but contributes directly to the resilience of our teams and our technologies.
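To make the chaos-engineering item concrete, here is a minimal sketch of failure injection, the core move of a chaos experiment: wrap a dependency so it fails some fraction of the time, then verify that user-facing behavior still holds. The wrapper and names are hypothetical, not Adobe’s actual chaos tooling:

```python
import random

def inject_failures(dependency, failure_rate: float, rng: random.Random):
    """Return a wrapped version of `dependency` that raises ConnectionError
    with probability `failure_rate`, simulating an unreliable downstream."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure (chaos experiment)")
        return dependency(*args, **kwargs)
    return wrapped

# A chaos experiment then runs normal traffic against the system with the
# flaky dependency swapped in, and checks that SLOs still hold.
flaky_lookup = inject_failures(lambda key: key.upper(), 0.2, random.Random(42))
```

The experiment itself is the interesting part: you are probing for emergent failure modes that no component-level test would surface.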
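And here is what “fail gracefully along functional subsystem boundaries” can look like in miniature: when a noncritical subsystem (say, personalization) fails, serve a degraded but useful response instead of failing the whole request. The function names are illustrative, not an actual Platform API:

```python
import logging

def recommendations_for(user_id, personalizer, popular_items):
    """Serve personalized results when the personalization subsystem is up;
    degrade to a static 'popular items' list when it isn't, so the failure
    stays contained within its own subsystem boundary."""
    try:
        return personalizer(user_id)
    except Exception:
        logging.warning("personalizer down; degrading to popular items")
        return popular_items()
```

The page still renders; the blast radius of the subsystem failure is a slightly less relevant list, not an outage.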
You might say, well, these are all things that we do in the course of doing our jobs. Best practices. Industry standards. But I assert that they are more than that — they insulate against the variance that arises by virtue of complexity in the systems we build and operate.
In the same way that failure is an emergent property of complex systems, reliability, too, is an emergent property(4). And it’s largely through these metatechnical tenets and actions that we create the conditions in which it can flourish. It’s worth repeating…
Reliability is an emergent property.
It’s a bit of an Aikido play, isn’t it? We take the vulnerability due to the characteristic of complex systems that failure modes are emergent properties and turn it on its ear by using this notion to incubate conditions to enhance reliability.
Here, I will give you a concrete example of how, using the ingredients I described above, we limited the impact of complexity on the operation of a large-scale production system. Step with me into the way-back machine to the halcyon days of summer 2018. Summer evokes thoughts of baseball (maybe cricket), BBQ, hiking… and, for me and a lot of my colleagues, Adobe Experience Platform Pipeline.
Adobe Experience Platform Pipeline is rapidly becoming the standard message bus for Adobe, facilitating the flow of data between Adobe services, both within and between data centers. It is by far the largest deployment of Kafka in Adobe — Kafka at a global scale:
- 12 data centers on four continents
- Three infrastructures (AWS, Azure, and bare metal)
- 60 billion messages per day and growing
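For a sense of what that daily volume means in steady-state terms, a quick back-of-envelope (averages only; real traffic is bursty):

```python
messages_per_day = 60_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

average_rate = messages_per_day / seconds_per_day
print(f"{average_rate:,.0f} messages/sec on average")  # ~694,444 messages/sec
```

Peak rates run well above that average, which is what capacity planning actually has to accommodate.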
Adobe Experience Platform Pipeline is truly a mission-critical service at Adobe. And in summer 2018 we were hurting: chronic instability, multiple CSOs, terrible alert signal-to-noise, unhappy customers, unhappy team. Bad architecture smells, serious technical debt. For example, we had services coresident on Kafka brokers that had no business sharing a host. It was like Thanksgiving with all your in-laws sharing a single room at a Motel 6. What could possibly go wrong?
We put together a very aggressive “get well” plan. Here are the main technical points: topic deletion, infrastructure refactoring, timeout settings, software upgrades, heap settings, improved alert signal-to-noise, and new dashboards so we had a better line of sight into Pipeline internals. Multiple fronts, all essential, and we executed well.
Had we just left it at that, I am firmly convinced that our success would have been limited and fleeting. The technical get-well program was necessary but not sufficient.
So what else did we do?
Let’s double-click on DevOps. In the old days, you had product engineering teams tasked with cranking out features under the duress of time constraints, and ops teams tasked with maintaining stability. These are obviously antithetical goals — pushing the envelope vs. preserving the status quo. Dev would toss a tarball over the fence for Ops to deploy, and if anything went south, the finger-pointing started. The DevOps movement was born, in part, to mitigate this social antipattern. Certainly, DevOps is about tools, measurement, and automation. But it’s also very much a cultural sea change, a mindset.
I believe the secret sauce for successful DevOps is empathy. Understanding the other person’s point of view. Walking a mile in their shoes. This sets the foundation for approaching engineering problems from the standpoint not of siloed dev and ops cultures, but as a single team.
In Adobe Experience Platform Pipeline, my counterpart in product engineering and I made a commitment to each other, to our management, to our respective teams, to build a DevOps culture in Pipeline, to move forward as a single team. Here’s what we did:
- We stopped blaming each other for stuff.
- We talked to each other more.
- We clarified mutual expectations.
- We restructured on-call.
- We refactored an unwieldy delivery organization into semi-autonomous squads with separate backlogs and sprints.
We’re still not perfect, far from it. In fact, we have a lot of work ahead of us. But the cultural change — the focus on these metatechnical actions as well as the technical — had a profound impact. The following figure represents the only actual data in this post. In 2018, we had 12 Sev1 and Sev2 incidents. So far, in 2019 we’ve had only three incidents.
So … let’s wrap this up
- Reliability is an emergent property.
- Cartesian reductionism is inadequate to the task of understanding and managing complex, adaptive technology systems in the wild.
- Hindsight != foresight. Understanding the mechanistic cause of the last incident will not, on its own, help you prevent the next one.
- Root cause analysis is useful, but it is the beginning of remediation, not the end game. It is table stakes.
- Analysis needs to focus on the relationships between components more than on the components themselves.
- The dynamic nature of the interactions between complex systems and their environments and constraints renders them susceptible to drift.
- Silos are an organizational antipattern.
- By using the very features of complexity that incubate failure, we can turn the tables and create the conditions under which reliability will flourish.
1. Rochlin, G.I., LaPorte, T.R., and Roberts, K.H. 1987. The self-designing high reliability organization: Aircraft carrier flight operations at sea. Naval War College Review, 40(4), pp. 76–90.
2. Brikman, Yevgeniy. 2017. Terraform: Up and Running. Sebastopol, California: O’Reilly Media. 368 pp.
3. Brikman, Yevgeniy. 2017. Ibid.
4. Dekker, Sidney. 2011. Ibid. Dekker distinguishes between safety and reliability in his text, but in our problem domain, I feel that this is a distinction without a difference and prefer reliability as the more accurate descriptor.