DORA Metrics on a Shoestring

Gerald Schmidt
Go City Engineering
May 9, 2022
Screenshot of a dashboard in four rows corresponding to deployment frequency, cycle time, time to recovery and failure rate.
DORA metrics dashboard

When platform teams ask themselves how well they are doing, they could do worse than measure how well the teams using their platform are performing against the “four key metrics”: deployment frequency, lead time for changes, mean time to recovery and change failure rate. This may seem counter-intuitive, as it means that the platform team’s measure of success depends on work done in other teams. I would argue, however, that this is precisely what the platform team’s enabling role described in Team Topologies calls for. This approach incentivises building a great “curated experience for internal teams using the platform” (to quote Matthew Skelton and Manuel Pais). It disincentivises silo-formation and any desire to assume a gatekeeper role where self-service is possible and secure. Platform and engineering teams succeed together or not at all.

Soon after the authors of Accelerate popularised the four key metrics — also known as the DORA (DevOps Research & Assessment) metrics — engineering teams were faced with the task of actually measuring them. Clearly it would not do to send out an annual questionnaire. It is easy to forget that this is how the original dataset that underpinned Accelerate came into being. What is needed is an automated approach.

A number of companies have stepped in to offer help. sleuth.io is an example of a commercial DORA metrics offering. As it is not a tool I have used, I am not able to comment on it. The mere fact that such a service exists does, however, suggest that this is a problem worth solving.

Another notable entrant, CircleCI, introduced an API dedicated to the key metrics across a given organisation’s pipelines. Lead time, for CircleCI, is the elapsed time between pipeline start and pipeline end; in reality this is a measure of cycle time, not lead time. A failed deployment is one whose pipeline does not flag success on completion. Mean time to recovery, finally, is the time that elapses between an unsuccessful deployment and the next successful one.

This is not entirely satisfactory, especially when it comes to recovery from errors. In a majority of cases, a failed deployment to production should change nothing at all: the new workload never reaches the “ready” state, and the previous version of the workload keeps serving traffic.

The key metrics, it turns out, benefit from multiple data sources.

John Lewis have described a fully automated solution to the problem built from two data sources (GitLab and ServiceNow), a service, a scheduled job, a queue, a cloud function, a data warehouse and Grafana. The ServiceNow integration supplies much-needed information the CircleCI API has no knowledge of: whether services are actually up. This is an architecture I am sure many teams would love to deploy to address this challenge. I know our platform team would. This post is for the teams that marvel at John Lewis’s “DORA Technical Landscape” and then ask: “can we have roughly that for less?”

Focus on reuse

The starting point for our implementation will look familiar to many small to medium-sized platform teams: there’s a managed CI/CD solution (CircleCI in our case); there is one Kubernetes cluster per staging environment; and there is a self-hosted Prometheus and Grafana stack deployed in-cluster. The task is to make do with the information available from the Kubernetes control plane and the storage Prometheus has been configured to use. In short, we want to reuse as many existing components of our stack as possible. Here is a 10,000-foot view, focusing solely on the production environment:

This flow chart begins with the CI/CD tool CircleCI deploying and labelling a deployment. The DORA Metrics controller then monitors the update stream for this deployment and exposes metrics that are scraped by Prometheus and, finally, displayed by Grafana.
Overview

Of these components, only the controller is new, and its requirements are modest: it is permitted up to a tenth of a virtual CPU and 64 MB memory, but it reserves half that.

How does it work? The CI/CD tool runs a full deployment, checks whether the deployment has completed successfully, and then labels the deployment object to flag success or failure as well as cycle time.

A typical deployment carries the following annotations and labels:

metadata:
  annotations:
    dora-controller/cycle-time: "677"
    dora-controller/report-before: "1647263717"
    dora-controller/success: "true"
  labels:
    dora-controller/enabled: "true"

Storing this information in deployment annotations and labels has the advantage that we avoid tight coupling. Any CI/CD tool is suitable. GitLab, ConcourseCI, CircleCI and any other CI/CD tool will know how long a deployment took and whether it succeeded or not. Add these annotations and labels, and three of the four key metrics are within our grasp.
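
To make this concrete, here is a minimal sketch in Go (using client-go) of what such a post-deploy step amounts to. It is an illustration only: the real pipeline step could just as well shell out to kubectl annotate and kubectl label, and the one-minute “report-before” window is an assumption based on the staleness guard discussed below.

package main

// A hypothetical post-deploy step that records the DORA annotations and
// labels on the deployment it has just rolled out.

import (
    "context"
    "fmt"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Hypothetical inputs; in CI these would come from the pipeline context.
    namespace, name := os.Args[1], os.Args[2]
    cycleTimeSeconds, success := os.Args[3], os.Args[4]

    config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // Give the controller a minute to pick the report up before it is
    // treated as stale (an assumed window, not a documented default).
    reportBefore := time.Now().Add(time.Minute).Unix()

    patch := fmt.Sprintf(`{
      "metadata": {
        "annotations": {
          "dora-controller/cycle-time": %q,
          "dora-controller/report-before": "%d",
          "dora-controller/success": %q
        },
        "labels": {"dora-controller/enabled": "true"}
      }
    }`, cycleTimeSeconds, reportBefore, success)

    _, err = client.AppsV1().Deployments(namespace).Patch(context.TODO(),
        name, types.MergePatchType, []byte(patch), metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
}

An invocation might look like “labeldeploy production checkout-service 677 true”, with every value here being hypothetical.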

This information will go stale after a while (the “report-before” property guards against this), but most of the information we need is now readily available. One notable exception is time to recovery, and this is something we will come back to.

A meaningful yet affordable interpretation of the key metrics

No two systems devised to measure the four key metrics interpret them in quite the same way. The contrast between the CircleCI and John Lewis implementations exemplifies this. Before looking at implementation details, it is worth stating what we are hoping to measure.

Deployment frequency

This metric, a measure less of engineering throughput than of batch size, helps us distinguish between teams relying on long-running feature branches and teams that have brought batch size down to a level that encourages thorough review and multiple daily deployments. It may be the most visible of the four metrics in corporate settings because it lends itself to being an input to objectives and key results (OKRs).

Here the deployment frequency is simply the number of successful deployments to a given environment.

Lead time for changes

Definitions of lead time range from the very broad (“from the time a product or feature appears on our board to successful completion”) to the sharply delimited, such as CircleCI’s “from pipeline start to successful completion”. I vividly remember one delivery lead counting up days on a whiteboard in prison wall notation.

We chose an adaptation of the CircleCI approach, which as we saw earlier is essentially an admission that we can only measure cycle time automatically and with confidence. We begin with the time it takes the CI/CD tool to run a deployment end-to-end (this is the signal the CircleCI implementation uses) before subtracting any time spent in approval steps.
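
As a sketch of the arithmetic (with entirely hypothetical inputs; in practice both values come from the pipeline itself):

package main

import (
    "fmt"
    "time"
)

// cycleTimeSeconds is the end-to-end pipeline duration minus any time
// spent waiting in approval steps.
func cycleTimeSeconds(pipelineStart time.Time, approvalWait time.Duration) int64 {
    return int64((time.Since(pipelineStart) - approvalWait).Seconds())
}

func main() {
    start := time.Now().Add(-14 * time.Minute) // pipeline started 14 minutes ago
    waiting := 3 * time.Minute                 // of which 3 minutes were a manual approval
    fmt.Println(cycleTimeSeconds(start, waiting)) // roughly 660 seconds
}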

We could have gone on to count elapsed time from a feature branch’s initial commit rather than the merge commit that triggered the release we’re dealing with. Why didn’t we attempt this? One problem is that while our measurement gains intrinsic meaning (how long did it take us to deploy this feature?), the values obtained are harder to interpret. Should we count weekend days? Bank holidays? What about feature branches that lie dormant for a day or two while the team races to resolve an urgent production issue? To arrive at a useful lead time measure we would either need to gather a lot more data (e.g. track time spent on tickets) or go back to tally marks on a whiteboard.

Mean time to recovery

There were a number of options available here, but we decided to focus on service downtime. The clock starts when a deployment drops to zero workloads in “ready” state at a time when one or more are meant to be running. It stops when the deployment returns to full strength. It seemed excessive to count a single replica going missing as downtime (there could be several others happily serving traffic), but equally the bar for “recovery” had to be higher than just a single healthy replica.
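
Expressed against the deployment status, the two conditions look roughly like this (a sketch; the helper names are hypothetical):

package dora

import appsv1 "k8s.io/api/apps/v1"

// isDown: no replica is ready even though the deployment is meant to be
// running at least one. This is when the recovery clock starts.
func isDown(d *appsv1.Deployment) bool {
    want := int32(1)
    if d.Spec.Replicas != nil {
        want = *d.Spec.Replicas
    }
    return want > 0 && d.Status.ReadyReplicas == 0
}

// atFullStrength: every desired replica is ready again. This is when the
// recovery clock stops.
func atFullStrength(d *appsv1.Deployment) bool {
    want := int32(1)
    if d.Spec.Replicas != nil {
        want = *d.Spec.Replicas
    }
    return d.Status.ReadyReplicas >= want
}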

Change failure rate

This rate is simply the number of production deployments that end in failure over a given time interval.

This is admittedly a weak point in the current setup. It is trivial to report the number of failed production rollouts per day, but this rate is not as meaningful as it could be because builds that fail usually fail early on: during testing, building and deployment to staging environments. For the vast majority of failed builds, the production rollout never happens, and thus the CI/CD tool has no means of annotating the deployment.

One could argue, however, that teams also do not get credit for deployments that never reach the production environment. In that sense, a low failure rate paired with frequent deployments does say something about the engineering maturity of the team.

One event loop, two workflows

The controller observes the update stream of deployments that carry a label “dora-controller/enabled” set to “true”. Each time a deployment in scope updates — that is, when its status changes in any way, not just on restart — the controller examines its latest state and triggers two separate workflows.
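
In Go, assuming client-go shared informers (a sketch of the shape of the loop, not necessarily how the controller is structured internally), this boils down to a filtered watch plus an update handler:

package main

import (
    "time"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // Only deployments that have opted in are in scope.
    factory := informers.NewSharedInformerFactoryWithOptions(
        client, 30*time.Second,
        informers.WithTweakListOptions(func(o *metav1.ListOptions) {
            o.LabelSelector = "dora-controller/enabled=true"
        }))

    factory.Apps().V1().Deployments().Informer().AddEventHandler(
        cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, obj interface{}) {
                d := obj.(*appsv1.Deployment)
                handleDeploymentStatus(d) // deployment frequency, cycle time, failures
                handleUptime(d)           // downtime and time to recovery
            },
        })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop
}

// Placeholders for the two workflows sketched in the next two sections.
func handleDeploymentStatus(d *appsv1.Deployment) {}
func handleUptime(d *appsv1.Deployment)           {}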

Deployment status

This flowchart illustrates the deployment workflow. The controller is notified of a change and checks that the labels aren’t stale. Fresh labels are then parsed for deployment success and the relevant counters and cycle time gauge updated.
Deployment status workflow

The deployment status workflow underpins deployment frequency, cycle time and change failure rate. Note the initial step of excluding stale labels and annotations. The deployment object offers a convenient carrier for DORA information, but its lifecycle can span weeks or months while these annotations are meaningful only for a minute or less. The “dora-controller/report-before” annotation allows the controller to disregard annotations for the many changes deployment objects undergo that are unrelated to cycle time and deployment success.
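
Here is a sketch of this workflow, using the Prometheus client library. The metric and label names match the ones listed in the dashboard section below; the “seen” map is an assumption, since the post does not spell out how the controller avoids counting the same report twice.

package dora

import (
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    appsv1 "k8s.io/api/apps/v1"
)

var (
    metricLabels = []string{"deployment", "namespace"}

    successfulDeployments = promauto.NewCounterVec(
        prometheus.CounterOpts{Name: "dora_successful_deployments_total"}, metricLabels)
    failedDeployments = promauto.NewCounterVec(
        prometheus.CounterOpts{Name: "dora_failed_deployments_total"}, metricLabels)
    cycleTime = promauto.NewGaugeVec(
        prometheus.GaugeOpts{Name: "dora_cycle_time_seconds"}, metricLabels)

    seen = map[string]bool{} // count each report exactly once (assumed mechanism)
)

func handleDeploymentStatus(d *appsv1.Deployment) {
    a := d.Annotations

    // Ignore missing or stale reports: the CI/CD tool stamps each report
    // with a Unix timestamp after which it should no longer be counted.
    reportBefore, err := strconv.ParseInt(a["dora-controller/report-before"], 10, 64)
    if err != nil || time.Now().Unix() > reportBefore {
        return
    }

    key := d.Namespace + "/" + d.Name + "/" + a["dora-controller/report-before"]
    if seen[key] {
        return // this deployment run has already been recorded
    }
    seen[key] = true

    if a["dora-controller/success"] == "true" {
        successfulDeployments.WithLabelValues(d.Name, d.Namespace).Inc()
        if secs, err := strconv.ParseFloat(a["dora-controller/cycle-time"], 64); err == nil {
            cycleTime.WithLabelValues(d.Name, d.Namespace).Set(secs)
        }
    } else {
        failedDeployments.WithLabelValues(d.Name, d.Namespace).Inc()
    }
}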

Uptime status

Updates signalling downtime or steps on the path to recovery, by contrast, are never out of scope. Controllers watching deployments are among the first components to be alerted to replicas becoming unavailable (and hopefully soon regaining availability). This rapid feedback loop facilitates the following downtime detection workflow:

This flowchart illustrates the uptime workflow. Whenever an update is received, the controller checks if the deployment is at full strength or whether it is down (zero running replicas). If it is down, the downtime flag is set and the downtime timer triggered; if it is up, the controller checks if the downtime flag is set and records the downtime period.
Downtime detection workflow

Only the metric “dora_time_to_recovery_seconds” feeds into one of the key metrics, but the counter “dora_downtime_total” unlocks the workload failure rate (as a companion metric to the change failure rate).
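
A sketch of this workflow follows, with the caveat that the bookkeeping shown here (a map of outage start times per deployment) is an assumption about how the downtime flag and timer from the flowchart might be implemented:

package dora

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    appsv1 "k8s.io/api/apps/v1"
)

var (
    downtimeTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{Name: "dora_downtime_total"}, []string{"deployment", "namespace"})
    timeToRecovery = promauto.NewGaugeVec(
        prometheus.GaugeOpts{Name: "dora_time_to_recovery_seconds"}, []string{"deployment", "namespace"})

    downSince = map[string]time.Time{} // "namespace/name" -> start of the current outage
)

func handleUptime(d *appsv1.Deployment) {
    key := d.Namespace + "/" + d.Name

    want := int32(1)
    if d.Spec.Replicas != nil {
        want = *d.Spec.Replicas
    }
    ready := d.Status.ReadyReplicas

    switch {
    case want > 0 && ready == 0:
        // Downtime: no replica is ready although at least one should be.
        // Start the clock once per outage and count the incident.
        if _, down := downSince[key]; !down {
            downSince[key] = time.Now()
            downtimeTotal.WithLabelValues(d.Name, d.Namespace).Inc()
        }
    case ready >= want:
        // Back to full strength: record how long the outage lasted.
        if start, down := downSince[key]; down {
            timeToRecovery.WithLabelValues(d.Name, d.Namespace).Set(time.Since(start).Seconds())
            delete(downSince, key)
        }
    }
}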

The dashboard

The arrangement of the dashboard for our metrics will differ significantly from team to team. The controller exposes the following metrics:

  • “dora_successful_deployments_total” (counter)
  • “dora_failed_deployments_total” (counter)
  • “dora_downtime_total” (counter)
  • “dora_cycle_time_seconds” (gauge)
  • “dora_time_to_recovery_seconds” (gauge)

Each metric carries two custom labels: “deployment” and “namespace”.

One convenient arrangement for deployment frequency, for example, is a Time Series graph charting successful and failed deployments per day, as well as the trend line of successful deployments over two weeks. Next to it could be placed a Stat graph showing the change week-on-week. Finally, the team could encode the goal it has set for itself by adding green, amber and red banding to a Gauge graph.

This screenshot shows deployment frequency across three panels: a time series graph depicting successful and failed deployments as well as a longer-term trend line; a week-on-week view and a gauge that evaluates the current monthly average frequency.
Deployment frequency

One straightforward representation of cycle time is a pie chart showing each deployment’s cycle time in order, accompanied by a week-on-week view and a Gauge calibrated to the team’s preference. In this example, the cycle time has increased by more than two minutes week-on-week. It falls short of the sub-ten-minute mark that corresponds to green for this metric.

This screenshot represents cycle time in three panels: a pie chart giving deployment names, a week-on-week trend and a gauge showing the four-week average.
Cycle time

The template for time to recovery looks very similar: there is a pie chart ordering deployments in relation to one another, an indication of the weekly trend and a gauge with red, amber and green banding:

This screen depicts time to recovery by deployment, as a week-on-week trend and as a four-week “mean time to recovery” metric.
Time to recovery

The next row represents the failure rate in the form of a pie chart showing the relative weight of successful and failed deployments, a week-on-week trend and another red-amber-green gauge showing the number of failures in the preceding four-week period.

This screenshot depicts a row of three metrics for the failure rate. It shows only two failed deployments for a four-week period.
Failure rate

Reflection

Looking back, the most difficult decision was to forgo dedicated persistence for these metrics. Depending on the storage we make available to Prometheus, our visibility may stretch no further than four or even two weeks back. This is where John Lewis’s “DORA Technical Landscape” with its cloud-based data warehouse shines.

It is clearly desirable to trace the organisation’s performance against the DORA metrics month-on-month or even quarter-on-quarter. As it is, the ingress tier is likely to have filled much of our available metrics storage after a couple of weeks. One solution would be to install Thanos and persist metrics to object storage.

The strength of the current implementation is that it allows us to see where we are on our DevOps maturity journey at little cost. If what we learn proves valuable, we have every reason to ensure historical data is retained for half a year at least.

Source

If you would like to deploy the controller, you are very welcome to clone the repo github.com/gocityengineering/dora-metrics. The sample Grafana dashboard is available in the folder dashboard. If you happen to be a CircleCI user, you will find that our orb containing the “deployment-metrics” command is in the public domain too.
