Metrics and measurement: Some lessons for DevOps from ITSM

Measurement is a big deal in agile and DevOps, and metrics are a common discussion topic on social media and at conferences. Measurement is hugely important for an organisation seeking to understand, and iteratively reduce, its inefficiencies and risks.

However, as any of us who have spent years in IT service organizations can attest: Measurement is easy, but meaningful measurement is hard.

And believe me, ITSM deserves its reputation for sometimes getting bogged down in metrics. In the new era of infrastructure, we talk about the importance of servers being cattle, not pets. But metrics can become pets too, and that’s often a problem.

Let’s look at some fundamentals. Broadly, you can measure two things: activities and outcomes. The outcomes are why we measure. The best measurement framework seeks to balance a set of each of these in support of those outcomes, but that balance is not always easy to find.

Measuring outcomes, of course, is a good thing: Outcomes are what matter most to your customers. Outcomes are where your value is delivered. Outcomes matter. But measuring outcomes alone is insufficient, as those measurements alone can’t help you to achieve your goal. You may know your destination, but without tracking your progress towards it, it remains a dot on the horizon.

Measuring activities gives you a means to see how any given step towards an outcome is going. Activity measures, however, are fraught with potential pitfalls:

Firstly, activity measures often end up being gamed. A quick response target, for example, may focus a respondent on the initial tasks associated with a piece of work, over and above its actual completion. Is it really more valuable to “tick the box” on each issue in a queue if the time would be better spent progressing specific issues to resolution?

Secondly, when the outcome requires a stream of activities, you can often hit every target on the way and still fail overall. Delivery of code into production might involve a sequence of measurable tasks, perhaps iterating an unknown number of times for each instance. Every actor in the process might successfully meet their point activity target, but an overall delivery target could still fail.

In ITSM this challenge manifests itself in the balance between team-level “operational level agreements” and the overall “service level agreement”. An issue, for example, may need to be solved in two days to meet customer commitments. Any given team may be allowed twelve hours to complete its activities, but if we get more than four re-assignments, an overall failure can happen even if each team remains on track.
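The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the names (`Assignment`, `evaluate`, the team labels) are hypothetical, "two days" is assumed to mean 48 elapsed hours, and each team is assumed to work on the issue one after another with no overlap.

```python
from dataclasses import dataclass

# Figures from the example in the text; treating "two days" as 48 elapsed
# hours is an assumption made for this sketch.
SLA_HOURS = 48   # overall commitment to the customer
OLA_HOURS = 12   # time each team is allowed per assignment


@dataclass
class Assignment:
    team: str           # hypothetical team label
    hours_taken: float  # time this team held the issue

    def meets_ola(self) -> bool:
        return self.hours_taken <= OLA_HOURS


def evaluate(assignments):
    """Return (every_team_met_ola, total_hours, sla_met) for one issue."""
    total = sum(a.hours_taken for a in assignments)
    all_ola = all(a.meets_ola() for a in assignments)
    return all_ola, total, total <= SLA_HOURS


# Six teams touch the issue (five re-assignments), each inside its own OLA.
history = [Assignment(f"team-{i}", 10.0) for i in range(1, 7)]
all_ola_met, total_hours, sla_met = evaluate(history)
print(all_ola_met, total_hours, sla_met)  # True 60.0 False
```

Every team's individual measure is green, yet the issue takes 60 hours against a 48-hour commitment. Only the summed, end-to-end figure exposes the failure.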

Thirdly, measuring an activity can entrench it. The measurement itself becomes an obstacle to change. Unfortunately, it can be difficult to change an activity if implementing that change impacts a metric that people care about.

In customer service industries like ITSM, we’re good at building norms into “best practice” metrics which reinforce behaviours. First-time fix is a good example of a metric which has entrenched itself into the service culture. It is a simple measure of the proportion of issues which get resolved without further escalation from the first point of contact.

Even this, however, is problematic.

After all, if my way of standing out as an individual is to fix more things first time than my colleagues, why would I spend time sharing that knowledge when I could instead boost my own scores by spending that time applying it? Clumsy target setting can disincentivise the agent from sharing their knowledge: perhaps their own measured fix rate drops, and they erode their differentiating edge against their co-workers. Is the agent with the highest first-time fix rate really the most intuitive and insightful troubleshooter, or could they simply be hoarding a stockpile of personal slam-dunk fixes?

Finally, perhaps the most serious issue I have encountered with metrics in my ITSM career is the impact they can have on innovation and change. Measurements get baked into the culture and structure of an organization. They get linked to performance reviews, bonuses, and the success or failure of projects and teams. It can be hard to let go of them.

I have seen companies, for example, struggle to implement improvements for customers because specific metrics would look worse after the change. One striking example: I was once told by a portfolio manager working for a large outsourcing service provider that there was a disinclination to offer customers modern self-service technologies, because they were paid a significant sum per service desk first-time fix. Customers solving their own problem would not contribute to this count. The metric became a limitation.

Even in more regular situations, a change to a practice can make existing metrics redundant. This is of particular significance if the trend in this measure is important. New measures tend to revert any monitored trend back to “day one”. Our introduction of swarming at BMC, in place of more conventional 3-tier support, required the abandonment or reset of a number of measured long-term trends. Executives often care about those trends, particularly when they are targeted on them. To change, you may need their backing.

In short: if we’re trying to improve, we usually need to measure. But measurements themselves can easily become a barrier to improvement. Metrics themselves need to be agile, or they become an impediment to agility.