On the trickiness of measuring performance of engineering teams — Part 2 of 2: Five Key Metrics for Software Delivery and Operational Performance / Why we have to dig even deeper / Summary
In the previous post (part 1 of 2), I discussed the importance of having a good reason to measure performance, the attributes of good performance metrics, and why sprint velocity does not qualify as a performance metric.
Four magic numbers
So if sprint velocity does not help us measure performance, what does? The DevOps Research and Assessment (DORA) program has some answers. DORA is the “longest running academically rigorous research investigation into the capabilities and practices that predict software delivery performance and productivity”. The research was initiated by Dr. Nicole Forsgren, Jez Humble and Gene Kim and resulted in the book “Accelerate” by the same authors. In early 2019 DORA was acquired by Google Cloud, which now continues the research (you can download the latest State of DevOps report here).
Based on data collected from thousands of DevOps practitioners and companies across the globe, DORA originally identified four key metrics that indicate software delivery performance. In their latest research they have added a fifth one and extended the notion of performance to delivery and operational performance. Here are the five metrics in all their beauty:
Let’s have a quick look at what each metric means:
- Deployment Frequency — the frequency of deploying code changes into production
- Lead Time for code changes — the time it takes from code committed to code successfully running in production
- Time to Restore — the time it takes to restore “normal” service after an incident occurs or a defect impacts users
- Change Fail — the percentage of production releases that result in degraded service and require remediation (rollback, fix forward, patch, etc.)
- Availability — the percentage of time a primary application or service is available for its users
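To make the definitions above concrete, here is a minimal sketch of how the five metrics could be computed from deployment and incident records. The record shapes, field names and sample values are all assumptions for illustration; real data would come from your CI/CD and incident tooling.

```python
from datetime import datetime, timedelta

# Hypothetical records; field names are assumptions for illustration.
deployments = [
    {"deployed_at": datetime(2023, 5, 1, 10), "committed_at": datetime(2023, 4, 30, 9), "failed": False},
    {"deployed_at": datetime(2023, 5, 2, 15), "committed_at": datetime(2023, 5, 2, 8), "failed": True},
    {"deployed_at": datetime(2023, 5, 4, 11), "committed_at": datetime(2023, 5, 3, 16), "failed": False},
]
incidents = [
    {"started_at": datetime(2023, 5, 2, 15, 30), "resolved_at": datetime(2023, 5, 2, 17, 0)},
]
period = timedelta(days=7)  # the observation window

# Deployment Frequency: deployments per day over the observed period
deployment_frequency = len(deployments) / period.days

# Lead Time for Changes: median time from commit to running in production
lead_times = sorted(d["deployed_at"] - d["committed_at"] for d in deployments)
median_lead_time = lead_times[len(lead_times) // 2]

# Change Fail: share of deployments that needed remediation
change_fail_pct = 100 * sum(d["failed"] for d in deployments) / len(deployments)

# Time to Restore: mean duration of incidents
downtime = sum((i["resolved_at"] - i["started_at"] for i in incidents), timedelta())
time_to_restore = downtime / len(incidents)

# Availability: share of the period without downtime
availability_pct = 100 * (1 - downtime / period)
```

With the sample data above this yields roughly 0.43 deployments per day, a median lead time of 19 hours, a 33% change fail rate, 90 minutes to restore and about 99.1% availability. The point is not the formulas themselves but that each metric reduces to a simple aggregation once the underlying events are captured.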
According to DORA, these five metrics are the best indicators for system-level outcomes that predict software delivery and operational performance, which, in turn, predicts the ability of an organisation to achieve its commercial and non-commercial goals. Certainly sounds like something worth keeping an eye on, doesn’t it?
Now, if you want to monitor these metrics, the first thing to check is whether you have the relevant data. The good news is, if you work in a company that does not completely ignore DevOps, chances are that you do. You probably use version control and track all code changes. You probably track production deployments, whether you do them manually or automatically as part of your continuous delivery pipeline. You probably monitor service availability and document incidents in some way, or the tools you use do it for you.
That doesn’t mean that setting up monitoring for these metrics will be easy. Relevant data can be scattered across many tools, from Grafana dashboards to Jira tickets. Some of the data may be of poor quality, e.g. incomplete or erroneous. The way the data is gathered may not be standardized across teams, making it difficult to create reusable solutions.
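One common first step is normalising records from different tools into a single schema before computing anything. The sketch below is purely illustrative: both input shapes are invented, and real payloads from a CI system or ticket tracker will look different.

```python
from datetime import datetime

def from_ci_pipeline(event):
    # Hypothetical webhook payload from a CI system (invented shape)
    return {
        "deployed_at": datetime.fromisoformat(event["finished"]),
        "failed": event["status"] != "success",
    }

def from_release_ticket(ticket):
    # Hypothetical manually tracked release ticket (invented shape)
    return {
        "deployed_at": datetime.fromisoformat(ticket["fields"]["released"]),
        "failed": bool(ticket["fields"].get("rollback")),
    }

# Both sources end up in one uniform list of deployment records
records = [
    from_ci_pipeline({"finished": "2023-05-01T10:00:00", "status": "success"}),
    from_release_ticket({"fields": {"released": "2023-05-02T15:00:00", "rollback": True}}),
]
```

The adapter-per-source pattern keeps the metric calculations independent of where the data happens to live, which is exactly the reusability problem the lack of standardization creates.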
In other words, a lot of work will be involved in making it possible to monitor these metrics. So we should make sure that this is time well invested, and that these metrics share the characteristics of good performance metrics discussed in the previous post.
Limited benefits of measuring the five key metrics
In fact, the DORA metrics focus on global outcomes, and they are leading indicators for software delivery performance. They can provide insights without much additional context, although they are far from being non-contextual. However, when it comes to being actionable, there is a catch. “Let’s reduce time to restore!” — a great plan, but where to start? “Let’s deploy to production more often!” — sure thing, but our regression testing is manual and takes a full day, and we have dependencies on two other teams for this release…
If improving the five DORA metrics were easy, all companies would be elite performers. Development speed and operational performance do not come for free. They are the result of a multitude of good architectural, engineering, product, process, business and organisational decisions, and of execution on all levels. Measuring the DORA metrics will give companies insights into their level of performance, help them see whether they are improving over time, and show whether the sum of their improvements is yielding good results. This is valuable in itself. However, it will not tell teams which decisions they should rethink or where exactly they need to improve in terms of execution.
Looking at the DORA research we see that, while the DORA metrics are leading indicators for organisational performance, they are, in fact, lagging indicators for concrete capabilities in engineering, architecture, product management, processes and organisational culture. To identify areas of improvement, we need to look at these more granular constructs and derive concrete objectives that we can act upon.
For example, we can go through a list of technical practices (see image above) that drive continuous delivery and thus predict better software delivery performance, less rework, lower deployment pain and less burnout, and see which of them are in place and which of them we could consider implementing. While some practices may require high-level architectural decisions (e.g. shifting left on security, improving data architecture), others may well be tackled within the scope of a team (e.g. introducing trunk-based development, improving code maintainability, increasing test automation coverage).
Identifying suitable metrics for each practice would go way beyond the scope of this post. I will limit myself to mentioning static code analysis tools as a means to measure test coverage or code maintainability. Relying too heavily on static code analysis tools has its dangers, but that’s true for any metric and a topic for a different discussion anyway.
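To illustrate what static analysis means in practice, here is a deliberately crude sketch that counts branch points per function as a rough complexity signal, one ingredient of maintainability. Real tools are far more sophisticated; the node selection and scoring here are my own simplifications, not any tool's actual algorithm.

```python
import ast

# Node types treated as branch points in this toy analysis (an assumption,
# not a standard complexity definition)
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def branch_counts(source: str) -> dict:
    """Count branch-point nodes inside each top-level function."""
    tree = ast.parse(source)
    counts = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            counts[node.name] = sum(
                isinstance(n, BRANCH_NODES) for n in ast.walk(node)
            )
    return counts

sample = """
def simple(x):
    return x + 1

def branchy(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x -= 1
    return x
"""
print(branch_counts(sample))  # → {'simple': 0, 'branchy': 3}
```

Even a toy metric like this shows the general shape: static analysis walks the syntax tree and aggregates structural signals, which is why such numbers are cheap to compute but easy to over-interpret.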
Checklists, maturity models and gut feeling
An easy and helpful means to measure the adoption of good practices in a team is the capability checklist. Such lists can be great leading indicators for desired results, simply by surfacing which practices are in place and which are still missing. Certainly it is important that any suggested capabilities are well researched, and that possible improvements are reviewed against the backdrop of specific teams and organisations. One particularly well-researched capability checklist I came across when researching this topic is the DevOps Checklist by Steve Pereira.
It is also worth mentioning self-checks and maturity models in this context. As Henrik Kniberg notes, such models “can help boost and focus your improvement efforts. But they can also totally screw up your culture if used inappropriately”. Used correctly, namely as a means to learn and not to judge, the Squad Health Check model developed by Kniberg makes it possible to surface areas of necessary action and trigger discussions that lead to concrete improvements.
Like all survey-based “health checks”, the exercise is self-diagnostic and therefore prone to error, as it relies largely on people’s subjective perception in the absence of factual data. That said, where we do not have, or perhaps do not need, precise measurements, gut feeling can be a strong corrective factor, especially when combined with some quantified data. Many areas that predict organisational performance, such as having fun at work, strategic alignment or a clear mission, can hardly be measured other than by collecting subjective perceptions.
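Aggregating such subjective votes is straightforward. The sketch below assumes a Squad Health Check-style traffic-light survey; the scoring scheme (green = 2, yellow = 1, red = 0), the area names and the flagging threshold are all assumptions I chose for illustration, not part of Kniberg's model.

```python
from statistics import mean

# Hypothetical traffic-light votes per health-check area
responses = {
    "delivering value": ["green", "green", "yellow", "red"],
    "fun":              ["green", "yellow", "yellow", "yellow"],
    "mission":          ["red", "red", "yellow", "green"],
}
score = {"green": 2, "yellow": 1, "red": 0}  # assumed scoring scheme

averages = {area: mean(score[v] for v in votes)
            for area, votes in responses.items()}

# Flag areas below a chosen threshold as candidates for discussion,
# worst first
flagged = [area for area, avg in sorted(averages.items(), key=lambda kv: kv[1])
           if avg < 1.0]
print(flagged)  # → ['mission']
```

The output is deliberately a discussion starter, not a verdict: a flagged area tells the team where to look, in line with using such models to learn rather than to judge.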
So, what stands at the end of this excursion into the world of measuring performance?
- It is important to remember why we are measuring performance — namely, to learn what we can improve, and to test the results of improvement experiments. Measuring performance can be costly, so ask yourself how the results will help you learn and improve before you invest time and effort to measure anything
- Good performance metrics are actionable, leading indicators that focus on global outcomes over local optimisations and provide value without much additional context
- Sprint velocity is not a (good) performance metric. It’s volatile, contextual and does not predict performance
- Lead Time, Deployment Frequency, Change Fail, Time to Restore and Availability predict software delivery and operational performance. They are, however, difficult to act on because they abstract away more concrete and granular capabilities
- To identify possible improvements, we need to dig deeper and look at more specific data. Capability checklists and maturity models can be helpful to uncover areas of improvement, and we also should not dismiss subjective self-assessment, especially in combination with quantified data
This got rather long. If you’ve made it here, congratulations and thank you very much for your attention. I would love to hear what you think, whether you can relate to this post, whether there is something you disagree with and whether you can take away something useful for you and your teams. So just leave a comment or reach out via LinkedIn or Twitter.