Learning from DevOps: Why complex systems necessitate new ITSM thinking

“Distributed systems have an infinite list of almost impossible failure scenarios”

So said Charity Majors at the 2019 Configuration Management Camp conference in Belgium. Her presentation is well worth watching in full.

The talk highlighted the difficulty of performing traditional testing, deployment and monitoring when building and deploying software into modern distributed systems. Charity gave several examples of complex issues she had encountered, all of which would have been impossible to discern via conventional monitoring approaches. Here are three of them:

“All twenty app services have 10% of nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. It clears up before we can debug it, every time”
“We run a platform, and it’s hard to distinguish between problems that users are inflicting on themselves, and problems in our own code, since they all manifest as the same errors or timeouts”.
“I have 20 microservices and three datastores across three regions, and everything seems to be getting a little slower over the past 2 weeks…but nothing has changed that we know of. Latency is usually back to the historical norm on Tuesdays”

Monitoring, argues Majors, may be just fine for straightforward tiered systems, such as a standard stack of an application running on an OS instance, using a SQL database, with a UX delivered to users via a web server. But in each of the cases above, what could have been monitored to predict or prevent the issue? Modern applications, Majors contends, are underpinned by a much broader, rapidly changing, dynamic architecture: our enterprise services have become complex distributed systems.

These cases illustrate a broader issue, beyond monitoring and observation. Most organisations have built their support structures around siloed teams and ticket reassignment. When a customer reports symptoms like those described above, service agents are expected to create a ticket and assign it to the appropriate team. In these cases, however, the correct destination may be far from obvious.

Furthermore, as the investigation develops, it is likely that multiple perspectives will be required. In a reassignment-based work structure, those tickets are likely to be passed from team to team, and queue to queue, perhaps until the situation reaches a point where a noisy escalation brings people around a single table. Complexity requires a more sophisticated approach than “pass the parcel”.

The nature of complex systems is arguably much better understood in the DevOps community than in ITSM (and perhaps the wider IT industry in general). DevOps is characterised by a reinvented approach to software delivery which was fundamentally driven by the challenges of complexity. Its shift towards smaller, more frequent changes was a direct response to the difficulty (and unreliability) of delivering large, infrequent changes in complex modern systems. If we need to apply new thinking to building and growing complex systems, then it seems obvious that this also applies to supporting them.

In 1998, Dr Richard Cook (@ri_cook) of the University of Chicago published a short paper entitled “How Complex Systems Fail”. The paper explores the characteristics of complex systems and the circumstances which cause them to fail. Although it wasn’t written with IT systems in mind (Dr Cook was a medical doctor), its points are highly relevant:

  • Complex systems contain mixtures of latent issues: A system’s complexity means that it is impossible for it not to contain multiple flaws.
  • Most flaws are insufficient to cause significant issues. They are regarded as minor factors during operations.
  • The flaws change constantly, due to evolving technology and organisational factors, and even as a result of efforts to resolve existing flaws.
  • Complex systems run as broken: Redundancies in the system, and the ongoing expertise and effort of humans, ensure that the system continues to function, sometimes in degraded mode.
  • Issues have multiple causes, not a single root-cause. Each single flaw is insufficient to cause a major issue. It is the linking of multiple faults that creates the circumstances required for a significant failure.
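Cook’s point that single flaws are masked by redundancy, while linked flaws combine into failure, can be captured in a toy model. This is a purely illustrative sketch; the flaw names and the two-replica redundancy scheme are invented for the example:

```python
# Toy model of Cook's "complex systems run as broken": a service with
# two redundant replicas. Each latent flaw degrades only one replica,
# so any single flaw is tolerated; failure needs a combination.
def service_up(flaws: set) -> bool:
    replica_a_ok = "bad-config" not in flaws
    replica_b_ok = "stale-failover" not in flaws
    # Redundancy masks any single flaw; only linked flaws take both down.
    return replica_a_ok or replica_b_ok

assert service_up(set())                                  # healthy
assert service_up({"bad-config"})                         # running "as broken"
assert service_up({"stale-failover"})                     # running "as broken"
assert not service_up({"bad-config", "stale-failover"})   # linked flaws: outage
```

The point of the sketch is that no single flaw is a “root cause”: each is individually survivable, and only the conjunction produces the significant failure.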

If the DevOps community is coming to realise that such systems cannot be monitored in the conventional way, then surely it’s equally important for ITSM to consider the impact of growing system complexity on our established working practices?

To do so, we need to understand how issues in complex systems must be addressed. Dave Snowden’s Cynefin framework, rooted in complexity theory, explores these challenges in depth, and is an excellent basis for considering the impact of complexity on support.

A complex failure, Snowden would argue, lacks a single, consistent path from a cause to an effect. Multiple factors are likely to be causal. Additionally, the evidence available might support conflicting theories about what is causing the issue. To troubleshoot in this circumstance, Cynefin defines a number of steps to take:

  • Identify multiple hypotheses for what might be happening.
  • In parallel, test each “coherent” (i.e. plausible) hypothesis using small, safe-to-fail experiments.
  • Observe the impact of the experiments.
  • Where positive outcomes are observed, attempt to amplify them. With negative outcomes, attempt to dampen their effect.
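As a rough illustration of those steps, here is a minimal Python sketch of probing coherent hypotheses in parallel, then sorting them into amplify and dampen lists. The hypothesis names, probe functions and outcome values are entirely hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical safe-to-fail probes, one per coherent hypothesis.
# Each returns a signal in [-1.0, 1.0]: positive = improvement observed.
def probe_cache_ttl():      return 0.4   # stand-in result
def probe_gc_pressure():    return -0.2  # stand-in result
def probe_region_latency(): return 0.1   # stand-in result

experiments = {
    "cache-ttl": probe_cache_ttl,
    "gc-pressure": probe_gc_pressure,
    "region-latency": probe_region_latency,
}

# Run every probe in parallel rather than one at a time, so no single
# early correlation anchors the whole investigation.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(fn) for name, fn in experiments.items()}
    outcomes = {name: f.result() for name, f in futures.items()}

# Amplify hypotheses showing positive outcomes; dampen the rest.
amplify = [name for name, signal in outcomes.items() if signal > 0]
dampen = [name for name, signal in outcomes.items() if signal <= 0]
```

The essential feature is the parallelism: in a tiered, queue-based structure, each “probe” would instead run sequentially in a different team’s backlog.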

It is quite clear that these steps would be difficult, if not impossible, in a support organisation bound by a linear, tiered approach to work distribution. In such a structure, multiple teams can be engaged, but work flows discretely from one to another in a linear sequence. The Cynefin guidance for managing complex situations, conversely, is distinctly non-linear and cyclical. Parallel work is emphasised as vital in the safe-to-fail experimentation phase, because the human mind tends to be drawn inexorably towards the first observed correlation.

We can’t, therefore, easily adapt the linear, tiered support model to such an approach. A Swarming approach, on the other hand, is a much better fit, as this illustration shows:

A hypothetical example of swarms forming and reducing as Cynefin’s Complex response guidance is followed

Interestingly (particularly bearing in mind Dr Richard Cook’s profession), when I tweeted the above graphic after presenting it at the SRVision conference in Utrecht, it was noticed by a group of medical doctors. This led to a discussion of how well the approach might also fit complex medical situations:

“When reading through a healthcare lens this fits brilliantly with the concept of breaking down silos and running small QI tests/cycles” — @christymboyce
“Unfortunately, in health and care, patients get bounced from silo to silo way too often without getting the care they need and without their information having been transferred with any fidelity if at all” — @gatewaymedic

I’ve argued before that a deconstructive analysis and reimagining of IT service management is critical to ensuring its relevance and value into the future. The work of Charity Majors, Richard Cook and Dave Snowden highlights complexity as one of the major forcing factors behind this rethinking. Complexity is increasing, complex systems fail in complex ways, and complex failures need dynamic and adaptive responses.