Combining Cynefin with Swarming for better Incident Management
Cynefin is an intriguing framework. With its basis in complexity theory, it is built on the principle that different situations require significantly different kinds of response. Its creator, Dave Snowden, succinctly describes it as “making sense of complexity in order to act”.
In recent years, Cynefin has been gaining significant attention in the DevOps community. It’s also gaining traction with ITSM professionals, particularly those considering the future evolution (or replacement) of established practice frameworks such as ITIL.
This cross-community interest in Cynefin is reminiscent of a similarly shared interest in Swarming, a philosophy which rejects established siloed or multi-tier team hierarchies, in favour of flexible, adapting workgroups. I wrote about Swarming in a previous article.
It is not just this shared interest that Cynefin and Swarming have in common. In this article, I will argue that there is a close fit between the two philosophies, and will show how Swarming-based technical support might be organised and optimised specifically around the principles of Cynefin. In doing so, we can realise Cynefin in a support environment.
Cynefin: A quick overview
So what is Cynefin? Firstly, let’s deal with the pronunciation! Cynefin is a Welsh word, without an exact equivalent translation into English (its nearest equivalent is often said to be “habitat”, though Snowden states that it “signifies the multiple factors in our environment and our experience that influence us in ways we can never understand”). It is pronounced “kuh-nev-in”.
Cynefin categorises any given problem or decision scenario into one of five domains, as shown in this illustration:
An excellent overview of these domains, by Snowden himself, can be found in the first ten minutes of this 2015 episode of the Boss Level Podcast. In short:
- Obvious and Complicated are the “order domains”, in which everything has a repeating relationship between cause and effect. In each of these order domains, as Snowden himself puts it (6m50s):
“The same thing will happen in the same way twice. The difference is in Obvious everyone can see what it is whereas in Complicated you have to do analysis to find it. In Complicated you have good practice. In Obvious you have best practice, and that’s a really important distinction”.
- In the Complex domain, understanding the problem requires experimentation and investigation. Complex systems fail in complex ways. Statements can only be made about the present, not the future, and the overall solution only becomes apparent once it is discovered. Over time, it may be possible to establish sufficient knowledge and constraints to move the situation from Complex to Complicated.
- Chaotic scenarios are dramatic and unconstrained. The priority is containment and damage reduction, and any initial solution is focused primarily on these objectives. At this point, the goal is typically to move the issue to one of the other domains.
- The fifth domain is Disorder. This is the state of being unaware of which domain applies to the current situation. The priority at this point is to find the facts needed to make the move to another domain.
Swarming: A quick overview and example
Swarming, in the customer support context, typically refers to use of networked collaboration in place of the more traditional multi-tiered, escalation-driven support structure (which is typically formed around a “Level 1” service desk, with “Level 2” escalation support and “Level 3” teams with specific single areas of expertise, as a final escalation option).
There is not a single, definitive structure for Swarming. Swarming is, by definition, dynamic. Enterprises who are adopting Swarming for customer or technical support tend to adopt several different types of swarm, to address different scenario types. However, these structures may be loose, with teams themselves encouraged to be adaptive, focused on results rather than rigid rules.
In my previous Swarming article, referenced above, I described the three swarming structures used at BMC for different situations. This is a summary:
“Severity 1” Swarm: Used for a tiny percentage of issues, which are major enough to present a very significant business impact. This swarm is analogous to (and generally indistinguishable from) the “war room” model of traditional support structures. An assigned support leader pulls whoever is required into a workgroup, and they progress the issue until it is fixed.
Dispatch Swarm, sometimes alternatively called Triage Swarm: A small group of support agents, typically with mixed levels of experience, focused on issues as they arrive. They meet at frequent intervals through each day, monitoring the incoming flow. This swarm has a dual role: primarily as “cherry pickers” of issues which can be resolved quickly, and secondarily as data quality validators for issues which will be assigned onwards to product-line support teams.
Backlog Swarm: Self-organising swarms containing a cross-functional set of technical experts, which are typically convened by product support team members. They meet on a periodic basis specifically to tackle tricky issues which might otherwise linger in their queues, or bounce from team to team in search of resolution.
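The Dispatch Swarm's dual role can be illustrated with a small routing sketch. This is only an illustration under stated assumptions: the `Issue` fields, the 15-minute cut-off and the `dispatch` function are hypothetical names invented for this example, and do not describe BMC's actual tooling or rules.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    # All fields are hypothetical; a real ticketing system will differ.
    issue_id: str
    product_line: str
    estimated_minutes: int  # rough effort estimate made by the swarm
    missing_fields: list[str] = field(default_factory=list)


# Assumed cut-off for an issue the swarm can "cherry pick" and resolve itself.
QUICK_FIX_MINUTES = 15


def dispatch(issue: Issue) -> str:
    """Return a routing decision for one incoming issue."""
    # Primary role: cherry-pick issues that can be resolved quickly.
    if issue.estimated_minutes <= QUICK_FIX_MINUTES:
        return "resolve in dispatch swarm"
    # Secondary role: validate data quality before onward assignment.
    if issue.missing_fields:
        return f"complete {', '.join(issue.missing_fields)} before assignment"
    return f"assign to {issue.product_line} support team"
```

In practice the judgement about what counts as "quick" would be made by the people in the swarm, not by a fixed threshold; the constant here simply stands in for that judgement.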
Applying Swarming to deliver Cynefin
The swarming structure defined above is one example. But can we take Cynefin’s practices and organise swarming around them? The answer seems to be yes.
In the Disorder domain, the nature of the issue is completely unknown. Cynefin proposes no specific actions other than to find the appropriate domain into which the issue should be placed. This might simply be done by one individual triaging the issue as it arises.
For simple issues in the Obvious domain, Cynefin’s advice is to Sense, Categorize and Respond. The solution would normally be defined and understood already, and simple to identify. Identification of the type of issue can drive self-service resolution, or enable a Service Desk agent to work from a straightforward template or knowledge document. Ideally, resolutions to Obvious issues which have occurred for the first time should be recorded as knowledge articles.
In the Complicated domain, an issue is not straightforward to identify and resolve, but there is still a clear relationship between cause and effect. Hence, there should still be a recordable and repeatable resolution. Rather than being a matter of simple categorization, analysis is required (Sense, Analyze, Respond).
If we can assume that the ability of an individual to perform this analysis improves with knowledge and experience, there is a good case for adopting something akin to the “Dispatch Swarm” described above, pairing experienced agents with less experienced agents to ensure that skills propagate. Capturing detailed knowledge information for issues occurring for the first time ensures further organisational learning, and faster response the next time the same combination of cause and effect occurs.
For issues falling into the Complex domain (Probe, Sense, Respond), solutions are emergent rather than predetermined. Solving these issues requires experimentation and iteration. Multiple points of data may need to be collected, some of which may subsequently be discarded.
As a result, complex issues are very well suited to a Swarming approach — much more than to the “traditional” tiered support model. While the latter relies on reassignments between fixed teams with single roles, swarming makes the “team” dynamic, enabling it to shape to the current state of the investigation, even dividing for a time to enable parallel investigations.
This example shows a swarming exercise led by an issue owner acting as Swarm Lead, aided by an assistant.
After initial fact-finding, the swarm expands slightly to bring in people who can build hypotheses as to what is causing the issue. This enables two separate investigative routes to be taken in parallel, with the swarm dividing into sub-groups (the presence of the assistant allows the Lead’s group to maintain dialogue with the other subgroup).
After eliminating non-productive theories, and discarding swarm members who are no longer needed, a final swarm lineup converges to enable resolving actions to be applied. Further iterations might occur after this point.
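The expand, divide and converge pattern described above can be modelled as simple changes to a swarm's membership. The sketch below is purely illustrative: the `Swarm` class, its method names and the member names are all hypothetical, and placing the assistant in the second sub-group as a liaison is just one possible reading of the example.

```python
from dataclasses import dataclass, field


@dataclass
class Swarm:
    """A working group whose membership changes as the investigation evolves."""
    members: set[str] = field(default_factory=set)

    def expand(self, *people: str) -> None:
        """Bring additional people into the swarm."""
        self.members.update(people)

    def release(self, *people: str) -> None:
        """Let go of members who are no longer needed."""
        self.members.difference_update(people)

    def split(self, *people: str) -> "Swarm":
        """Move the named members into a new sub-swarm and return it."""
        moving = self.members.intersection(people)
        self.members -= moving
        return Swarm(moving)

    def merge(self, other: "Swarm") -> None:
        """Fold a sub-swarm back into this one."""
        self.members |= other.members
        other.members = set()


# Hypothetical walk-through of the stages described in the text.
swarm = Swarm({"lead", "assistant"})                       # initial fact-finding
swarm.expand("db_expert", "network_expert", "app_expert")  # hypothesis builders join
side_group = swarm.split("assistant", "network_expert")    # parallel investigation
swarm.merge(side_group)                                    # regroup with findings
swarm.release("network_expert")                            # theory eliminated
```

After the final step the swarm holds only the people needed to apply the resolving actions, and the same operations can be repeated if further iterations are required.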
Support organisations confronted with the Chaotic domain need to seek to transform the situation from chaos to complexity, but the biggest priority is to stop the bleeding. One of the most important roles for the leader is to establish order.
A chaotic situation may require a Swarm leader to establish several different sub-Swarms, focused both on dealing with the acute situation, and discovery of sufficient information to move the issue to the complex domain, such as in this example:
It feels self-evident that Swarming is a good match to the principles of Cynefin. In particular, it offers an approach which aligns much better to the non-deterministic Cynefin domains, Complex and Chaotic. These domains describe situations which shift, and which require adaptive and iterative approaches to which a rigid, siloed assignment model does not readily adapt.
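As a rough summary of the domain-by-domain discussion above, the pairing of each Cynefin domain with a practice and a support approach can be written down as a simple lookup. This is a sketch only: the `Domain`, `Response` and `plan_response` names are hypothetical, the wording of each support model is my own paraphrase, and the Act, Sense, Respond sequence for Chaotic comes from Cynefin's usual formulation; it is not spelled out in the text above.

```python
from enum import Enum
from typing import NamedTuple, Optional, Tuple


class Domain(Enum):
    OBVIOUS = "Obvious"
    COMPLICATED = "Complicated"
    COMPLEX = "Complex"
    CHAOTIC = "Chaotic"
    DISORDER = "Disorder"


class Response(NamedTuple):
    practice: Optional[Tuple[str, str, str]]  # None where no sequence applies
    support_model: str


# Paraphrased from the discussion above; not an official Cynefin artefact.
PLAYBOOK = {
    Domain.DISORDER: Response(
        None, "one individual triages the issue into a domain"),
    Domain.OBVIOUS: Response(
        ("Sense", "Categorize", "Respond"),
        "self-service, or a Service Desk agent working from a template"),
    Domain.COMPLICATED: Response(
        ("Sense", "Analyze", "Respond"),
        "Dispatch-style swarm pairing experienced and newer agents"),
    Domain.COMPLEX: Response(
        ("Probe", "Sense", "Respond"),
        "dynamic swarm that expands, divides and converges"),
    # Assumption: Act, Sense, Respond is Cynefin's usual Chaotic sequence.
    Domain.CHAOTIC: Response(
        ("Act", "Sense", "Respond"),
        "Swarm leader establishes sub-swarms for containment and discovery"),
}


def plan_response(domain: Domain) -> Response:
    """Look up the suggested practice and support model for a domain."""
    return PLAYBOOK[domain]
```

A table like this is deliberately static; the harder and more valuable work is deciding which domain an issue belongs to, and noticing when it moves from one domain to another.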
Cynefin’s framework arguably also reveals clear opportunities for continuous organisational improvement, through the enhancement of Swarming techniques, and the honing of their delivery, according to Cynefin’s practices and transitions. For example, an improvement-focused organisation might seek to:
- Streamline responses to Obvious issues through checklists and self-service.
- Actively set up the team rotation process for Complicated issue responses to ensure the best possible knowledge sharing and development of subject matter expertise.
- Measure the outcomes of different sub-swarms in the probing practices for Complex issues, to drive continually better investigation and teaming decisions.
Cynefin is well thought out, and continues to develop as it gains interest in the DevOps community. For enterprise support, Swarming seems to be a very credible way to deliver it.