ITSM, DevOps, and why three-tier support should be replaced with Swarming.

Introduction

DevOps is coming to IT enterprises, whether they are ready for it or not. In this article, I aim to do something that might be considered brave: I will argue that the current organizational structure of the vast majority of IT support organisations is fundamentally flawed.

More importantly, those flaws will make it difficult or impossible for those enterprises to integrate their emerging DevOps practices with their existing structures for technical customer support.

I will propose, instead, that the emergent practice of “Swarming” is an enterprise-ready methodology that is ideally placed to bring technical support into the DevOps era.

Background: The Three-Tier Support Orthodoxy

We need to start with a short overview of the management structure which underpins the vast majority of large enterprise IT support functions.

The classic organisational structure for IT Service Management is the three-tier support hierarchy:

  • Level 1: A frontline Service Desk, directly fielding incoming customer communication (typically by answering phone calls).
    Most Service Desks are set up to provide a moderate level of generalised technical support, with the aims of presenting a consistent level of customer support to users, while resolving a significant number of incoming issues at the first-line.
  • Level 2: A second tier of support, often closely associated with the Service Desk, but with deeper general or specialist skills.
    Level 2 support agents may, for example, have additional training in support of common operating systems (such as Microsoft Windows) or hardware, and hence be able to resolve more complex issues affecting common technologies.
  • Level 3: Specialist support teams focused on specific technologies and applications. For companies which develop software in-house, it is typical to find specific Level 3 support teams assigned to individual applications or services.

Deconstruction of the three-tier structure requires a brief analysis of the business motives for it. It is almost ubiquitous in enterprise IT Service Management, and there are a number of business benefits driving this, which include the following:

  • Customers are presented with a single communication channel to the IT support organisation, regardless of the nature of their issue.
  • The general technical support skills needed to work in Tier 1 and Tier 2 support are easily found in the workforce. This also makes outsourcing of one or both of these layers straightforward, and as a result this is commonly seen.
  • Specialist technical resources can be insulated from direct contact, ensuring that only properly triaged issues reach them.

The journey of a customer’s case through this structure may start and end at the first line (in fact, in many organizations, customers have the opportunity to resolve their issue through automated self-service, often described as “Level zero”).

There are inevitably many issues, however, which are not resolvable by Level 1 support. These progress to Levels 2 and 3 through a process of escalation:

Level 2 support agents typically handle fewer cases than their Level 1 counterparts, but these tend to be more complex, with a longer average effort on the part of the agent.

Tickets which make their way to Level 3 (either from a secondary escalation from Level 2, or directly from Level 1) typically account for a small volume of the overall incoming caseload, but they are also the most complex issues, requiring the most specialist skills, and the most time to resolve them.

There have been frequent attempts to benchmark the comparative cost of resolving a ticket at each level of support. One 2014 study, for instance, assesses the average cost of a Level 1 resolution as $22, with Level 2 resolutions costing $62, and Level 3 resolutions $85 (other studies have calculated the Level 3 resolution cost to be several multiples higher than this number).
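To see how per-level figures like these combine into an overall cost, here is a minimal sketch of a blended cost-per-ticket calculation. The per-level costs are the ones quoted from the 2014 study; the resolution shares (70% / 20% / 10%) are purely hypothetical values chosen for illustration, and the sketch assumes each figure represents the full cost of a ticket resolved at that level.

```python
# Per-level resolution costs in USD, as quoted from the 2014 study.
COST_PER_LEVEL = {"L1": 22, "L2": 62, "L3": 85}

# HYPOTHETICAL share of tickets resolved at each level (not from the study).
ASSUMED_SHARE = {"L1": 0.70, "L2": 0.20, "L3": 0.10}


def blended_cost(costs, shares):
    """Return the volume-weighted average cost per ticket."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(costs[level] * shares[level] for level in costs)


if __name__ == "__main__":
    cost = blended_cost(COST_PER_LEVEL, ASSUMED_SHARE)
    print(f"Blended cost per ticket: ${cost:.2f}")
```

Under these assumed shares the blended figure comes out at roughly $36 per ticket. Changing the shares shows how sensitive the overall cost is to the proportion of cases that reach Level 3.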

Why Three-Tier is a problem, especially for DevOps

Questioning such a ubiquitous structure is difficult. However, the Swarming movement aims to do just that, on the basis of some significant addressable disadvantages to tiered support. Many of these disadvantages have particular consequences for DevOps:

  • Tiered Support creates multiple queues.
    While Level 1 support tends to be reactive and realtime, any case that cannot be resolved at this level immediately enters a queue. Its nature changes, turning it from a current activity into a backlog item.
    As such, Levels 2 and 3 are stores of Work in Progress, a very problematic concept in the Lean philosophy which underpins the DevOps movement. Successful adoption of Lean practices like DevOps fundamentally requires assertive steps to reduce systematic Work in Progress. This alone is a major barrier to DevOps practitioners’ adoption of IT service management practices.
  • Tiered Support blocks the route to the correct resolver.
    DevOps aims to promote increased ownership and autonomy. Developers are encouraged to take responsibility for the support of their own code. The highest-performing DevOps organizations are achieving significantly faster resolutions on this basis (24 times faster, according to the 2016 State of DevOps report). However, this all comes to nothing if the ticket still crawls through several triage queues on the way to that expert. As one support manager at BMC put it to me, when discussing the company’s adoption of Swarming for customer support, “why were we putting our best people at the back of the process?”
  • Tiered Support leads to cases “bouncing”.
    In tiered support, the method of moving a case between tiers is simply to change the team which is assigned to it. This step is typically carried out unilaterally by the assigning team, with no involvement from the receiving team. The first time the new assignee sees the case ticket is when it arrives in their queue.
    Unfortunately, it is very common to see the ticket bouncing right back, either because the more specialised team requires further information to proceed, or because the assignment to them was completely incorrect. 
    DevOps is fundamentally built on collaboration between operational and development professionals. The vertical and horizontal silos inherent to Tiered Support, and the passive handovers of work between them, are the antithesis of this inter-disciplinary collaboration.
  • Tiered Support does not solve the problem of Subject Matter Experts becoming overwhelmed.
    While one positive outcome of multi-tiered support is the prevention of easily-solved tickets finding their way to teams overqualified to work on them, it does not protect key specialists from high volumes of difficult cases. IT Service Management is plagued by “heroes”. These are typically very clever people who, at face value, appear to be incredibly valuable contributors to the success of the organization, repeatedly producing miracle fixes to tough issues. In reality, the hero is an overworked, burnout-prone single point of failure, deliberately or inadvertently acting as the custodian of knowledge that the organization badly needs to be disseminated more widely. Tiered support, being a linear and siloed structure, does nothing to prevent the cult of the Hero. Arguably, it simply reinforces it.
    As enterprises shift to DevOps, we are already seeing the perpetuation of this scenario, with key DevOps team members appearing at the end of the escalation chain for high-volume spikes of tickets. The damage in a DevOps scenario is arguably worse: key developers are taken away from innovation to deal with a firehose of escalated (and already delayed) support issues.

Introducing Swarming as an alternative

“Collaborative communities can reach across the usual disciplinary and organisational silos that inhibit cooperation, learning, and progress”
(Don Tapscott and Anthony D. Williams, in “Wikinomics”)

“Swarming” appeared late in the last decade as a proposal for a new framework for technical support organisation. It explicitly rejects the three-tier orthodoxy, in favour of a model of networked collaboration:

SOURCE: Consortium for Service Innovation — http://www.serviceinnovation.org/intelligent-swarming/

A key pioneer for IT support was Cisco, who set out their new “Model for Distributed Collaboration and Decision Making” in a 2008 white paper, “Digital Swarming”. The concept was subsequently adopted by the Consortium for Service Innovation, and developed into a vision entitled “Intelligent Swarming”. Some of its core principles, in direct opposition to the orthodoxy, are that:

  • There should be no tiered support groups.
  • There should be no escalations from one group to another.
  • The case should move directly to the person most likely to be able to resolve it.
  • The person who takes the case is the one who sees it through to resolution.

Swarming in practice: an example structure for DevOps

There is not a single definitive structure for Swarming, particularly as it is a relatively new concept that has not yet been widely adopted. However, the example illustrated below (based on customer support swarming methods in place at BMC) is typical, and has generated significant improvements (as presented at the UK’s Servicedesk and IT Support Show in 2015).

Swarming starts as soon as any issue is not immediately resolvable at the point of customer contact. A rapid initial triage results in the distribution of case tickets to one of two “Swarms”:

Initial triage in a Swarm structure

Each “Swarm” is actually a small team, focused in near-realtime on the incoming flow of customer cases:

“Severity 1” Swarm

  • Three agents working on a scheduled weekly rotation.
  • Primary focus: Provide immediate response, and resolve as soon as possible.

A Severity 1 swarm is focused on the most critical issues. They coordinate the response to an acute situation, bringing appropriate people into the effort to resolve severe cases as quickly as possible. This process is not in itself different to the Major Incident processes typically employed in traditional tiered support. However, the other type of swarm used at this stage, to which the much larger volume of tickets goes, is different:

Dispatch Swarm

  • Meet every 60–90 minutes
  • Regional, product-line focused
  • Primary focus: “Cherry pickers”. What new tickets can be resolved immediately?
  • Secondary: Validation of tickets before assignment to product line support teams.

Dispatch Swarming addresses a key shortcoming of tiered support: many cases can be solved extremely quickly by the right expert, but they get lost in the backlog. Hence, a five-minute resolution may actually take days.

The Dispatch Swarm is encouraged to “cherry pick”, disregarding anything that cannot be resolved very quickly. In doing so, they are able to dramatically shorten the time spent achieving resolution for a significant subset of escalated cases.
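The routing described so far can be sketched in a few lines of code. This is only an illustration of the flow, not a description of any tool used at BMC: the `Ticket` fields, the 15-minute cut-off, and the "returned for more information" outcome for tickets that fail validation are all hypothetical assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    SEVERITY_1_SWARM = "severity 1 swarm"
    DISPATCH_SWARM = "dispatch swarm"


class DispatchOutcome(Enum):
    RESOLVED_BY_SWARM = "resolved by dispatch swarm"
    ASSIGNED_TO_PRODUCT_LINE = "assigned to product line team"
    RETURNED_FOR_INFO = "returned for more information"  # assumed outcome


@dataclass
class Ticket:
    summary: str
    severity: int            # 1 = most critical (hypothetical scale)
    est_minutes_to_fix: int  # the swarm's own quick estimate (hypothetical)
    is_valid: bool = True    # enough detail to be worked (hypothetical)


QUICK_FIX_MINUTES = 15  # HYPOTHETICAL cut-off for "very quickly"


def initial_triage(ticket: Ticket) -> Destination:
    """Route an unresolved ticket to one of the two first-stage swarms."""
    if ticket.severity == 1:
        return Destination.SEVERITY_1_SWARM
    return Destination.DISPATCH_SWARM


def dispatch_review(ticket: Ticket) -> DispatchOutcome:
    """Cherry-pick quick wins first, then validate the rest before assignment."""
    if ticket.est_minutes_to_fix <= QUICK_FIX_MINUTES:
        return DispatchOutcome.RESOLVED_BY_SWARM
    if not ticket.is_valid:
        return DispatchOutcome.RETURNED_FOR_INFO
    return DispatchOutcome.ASSIGNED_TO_PRODUCT_LINE


if __name__ == "__main__":
    ticket = Ticket("Licence key rejected after upgrade", severity=3, est_minutes_to_fix=5)
    print(initial_triage(ticket).value, "->", dispatch_review(ticket).value)
```

The point of the sketch is the ordering inside `dispatch_review`: the quick-win check comes first, so a five-minute fix never waits in a product-line queue.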

There are significant secondary benefits, too. The inclusion of inexperienced frontline support staff in these Swarms gives exposure to knowledge that would otherwise only start to be gained after eventual promotion to more specialist teams. Conversely, third-tier support agents are brought closer to the customer.

Dispatch Swarming achieves rapid resolution of a significant minority of cases (at BMC, typically this is in the order of 30%), but the remaining cases will end up in the queues of more conventional-looking Product Line support teams. Here, many tickets will be quite straightforward for regular team members to work. However, another subset (perhaps 30% again) may be challenging enough to warrant attention from the best support people available, regardless of team structure.

This is where a third type of Swarm is used: the Backlog Swarm.

Backlog Swarm

  • Meet regularly, typically daily.
  • Primary focus: Address challenging tickets brought to them by product-line support teams.
  • Secondary: Replace the role of individual subject matter experts.

Backlog Swarms bring together groups of skilled and experienced technical people, crossing boundaries such as geography and department, with the objective of focusing on the most difficult cases. Cases are referred to them by local engineering and support teams, who are no longer permitted to directly engage individual subject matter experts. They must, instead, always refer those cases to the appropriate Backlog Swarm.
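The proportions mentioned for the Dispatch and Backlog Swarms can be turned into a rough worked example. The sketch below assumes a hypothetical batch of 1,000 escalated cases and applies the approximate 30% figures quoted earlier; the second 30% is only a "perhaps" in the text, so the result should be read as indicative rather than measured.

```python
def swarm_funnel(escalated, dispatch_rate=0.30, backlog_rate=0.30):
    """Split escalated cases across the three resolution paths.

    dispatch_rate: share resolved immediately by the Dispatch Swarm.
    backlog_rate:  share of the REMAINING cases referred to a Backlog Swarm.
    """
    dispatch_resolved = round(escalated * dispatch_rate)
    to_product_line = escalated - dispatch_resolved
    to_backlog_swarm = round(to_product_line * backlog_rate)
    regular_team = to_product_line - to_backlog_swarm
    return {
        "dispatch_swarm": dispatch_resolved,
        "backlog_swarm": to_backlog_swarm,
        "regular_product_line": regular_team,
    }


if __name__ == "__main__":
    # 1,000 is a hypothetical volume chosen to keep the numbers round.
    for path, count in swarm_funnel(1000).items():
        print(f"{path}: {count}")
```

With those inputs, 300 cases are closed by the Dispatch Swarm, 210 go to a Backlog Swarm, and 490 are worked by regular product-line team members. In other words, about half of the escalated cases are handled by a swarm rather than by a conventional queue.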

The Case for Swarming, as Enterprises adopt DevOps

The failings of tiered support in traditional enterprises are magnified in a DevOps scenario. The system actively creates Work in Progress backlogs. It restricts autonomy and agility. It is fundamentally siloed. These issues are the antithesis of the DevOps philosophy, and this is becoming a major industry challenge as large, traditional Enterprises seek to build DevOps capabilities.

It is already possible to observe the negative outcomes resulting from this.

  • DevOps encourages the teams building software to take ownership of supporting it, sometimes colloquially summed up as “you wrote it, you fix it”. However, in an enterprise-scale support organisation, the tiered support structure is typically the primary path along which these cases arrive. As we have seen, the layers of separation between the front-line and the DevOps team can result in tickets arriving late, and poorly qualified.
  • “Throw it over the fence” integrations between ITSM ticketing tools, and the software development lifecycle tools used by DevOps teams, result in a lack of situational awareness for users of each.
  • The attempt to force a rigid vertical and horizontal silo structure creates barriers to the cross-collaboration that is key to good DevOps practice.

Swarming, conversely, is built on many of the same principles that underpin the success of DevOps:

  • Dynamic cross-functional collaboration, bringing different skills together into combined teams.
  • Flexible team organization, rather than rigid, hierarchical structures.
  • Individual autonomy, rather than dogmatic process (a key example being the opportunist “cherry picking” of the Dispatch Swarm).
  • A focus on the avoidance of build-up of backlogged Work in Progress.
  • Cross-pollination of skills and experience.

Conclusion

“The enterprise space doesn’t move slowly because they’re stupid or they hate technology. It’s because they have users.”
(Luke Kanies, founder and then CEO, Puppet Labs. Configuration Management Camp, Belgium, 2015)

DevOps has grown rapidly as a fundamental challenge to an established orthodoxy, bringing together the previously siloed roles of development and operations, and aggressively targeting long-established inefficiencies and dogma. It has also largely (if not entirely) grown in a new generation of organisations, often with the benefit of limited legacy and technical debt.

Importantly, it has done so tremendously successfully:

Source: 2016 State of DevOps report

Now, however, DevOps is reaching traditional enterprises, where it will inevitably meet new challenges, just as those organisations will frequently struggle to adapt to it. That they need to adapt is hard to deny. It is not just a matter of improvement; it is a matter of survival. Change, in the form of “creative destruction”, is a constant existential threat to large enterprises. Only 12% of the companies in the 1955 Fortune 500 were still on the list in 2014.

As a result, IT organisations must adopt fresh thinking, and challenge existing orthodoxies wherever possible.

The Swarming movement has begun to deconstruct and attack the prevalence of the tiered support model, but the progress in enterprise IT Service Management has been slow, limited to a few forward-thinking organisations. However, Swarming’s similarities to key elements of DevOps thinking are undeniable, and the real-world organisational problems addressed by it are amplified into fundamental challenges for DevOps adoption.

There is therefore a compelling and urgent need to rethink the tiered support model, with a methodology that harnesses and enables the benefits of DevOps, while doing so on an Enterprise support scale. Swarming could be the answer.