Unlocking Deployment Metrics

Elasticsearch, CI/CD and a whole lotta lovin’

Chris Cooney
Sainsbury’s Tech Engineering
5 min read · Aug 20, 2019

When I ask teams whether they’re inspecting and adapting, they say yes. They tell me all about their retrospective formats, their regular stand-ups and how they visualise their actions. This is great to see; it shows the mindset is there.

But something is often missing…

I don’t see teams measuring their cycle times and failure rates very often. It’s understandable. Features need writing, tests need fixing and code needs pushing. Teams don’t get the time to invest in these metrics. It’s seen as gold plating.

We had the opportunity to rectify this. This article will describe how we surfaced these metrics and the conversations that started to happen when we got our eyes on them.

The ideas behind our ideas

Nicole Forsgren, Gene Kim and Jez Humble provided the foundations for our metrics in their book Accelerate. The four key metrics are explained wonderfully as part of the Thoughtworks Tech Radar, so I won’t go into much detail about them here. I will simply explain how we captured each.

How we run our builds

We use Jenkins. It’s a little clunky, but with some upfront effort, we’ve got it working beautifully. We’ve made use of the global shared libraries so when we add a new feature to a pipeline, everyone gets it for free. No duplicated effort, no waste. (More on this in a future blog post).

We do have to maintain a Jenkins server, which frankly isn’t that hard, but the benefits have been excellent.

And where we stored this data

Elasticsearch was our weapon of choice. It’s great at pulling large volumes of data over time, and the baked-in statistical analysis is outstanding. All of the metrics we capture are derived from messages that are pushed out of our CI/CD pipelines. Those messages are fired at the Elasticsearch API and into their own index.

We are big fans of the AWS-provided Elasticsearch service. It comes prepackaged with Kibana, we can resize it on the fly, and it runs like a dream.

Capturing developer cycle time

This metric is defined as the time from “code committed” to running in production. It was quite a straightforward metric to capture. As part of our CI/CD pipeline, when deploying to production, we configured it to pull out the commit time of the current Git revision using a git command.
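Something along these lines does the job; this is a minimal sketch of the step rather than our exact code (%ct prints the committer date of HEAD as a Unix timestamp):

```groovy
// Ask git for the committer timestamp (Unix epoch, seconds) of the revision being deployed.
def commitEpoch = sh(
    script: 'git show -s --format=%ct HEAD',
    returnStdout: true
).trim().toLong()
```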

Next, we needed the time the deployment completed. By also baking this into the CI/CD pipeline, we knew exactly when it was successful. Ergo, we had time from commit to production. WIN! In the four key metrics, this is described as lead time. We elected to call it cycle time because lead time was already being used as a descriptor for tasks making it from the backlog to production.
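Stitching the two timestamps together and firing the resulting event at Elasticsearch happens in the same shared-library step. A minimal sketch, assuming the Jenkins HTTP Request plugin is installed and using illustrative field, index and host names:

```groovy
import groovy.json.JsonOutput

// The deployment has just finished, so "now" is the deploy time.
def deployEpoch = System.currentTimeMillis().intdiv(1000)

// Illustrative field names; the real document is sketched later in the post.
def event = [
    service           : env.JOB_NAME,
    commit_time       : commitEpoch,
    deploy_time       : deployEpoch,
    cycle_time_seconds: deployEpoch - commitEpoch
]

// Fire the event at the Elasticsearch API, into its own index.
httpRequest(
    url        : 'https://our-elasticsearch.example.com/deployments/_doc',
    httpMode   : 'POST',
    contentType: 'APPLICATION_JSON',
    requestBody: JsonOutput.toJson(event)
)
```

We then chart this cycle time in Kibana as a set of percentiles.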

Why did we choose these percentiles?

The 50th percentile is the median. This tells us how a middle-of-the-road deployment fares. The 84th percentile covers 84% of our deployments, so it captures the most likely cases.

By capturing the 50th, 84th, 95th and 99th percentiles, we get a complete picture of the variance in our data. These percentiles roughly correspond to standard deviations in a normal distribution (the 84th percentile sits about one standard deviation above the mean). This isn’t a stats blog, so I’ll just point you this way. By picking these milestones, we’ll be able to see when our data is beginning to settle (i.e. the average gets closer to the median). We’ll also be able to see delays in deployments immediately, rather than waiting for an average to be skewed.

If we were to simply take an average, we would hide this variance. By splitting this up into percentiles, we can see how we do most of the time (pretty good, half the time the code is out in less than an hour), but we also understand the problems in our system.
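Under the hood, Kibana works these out with Elasticsearch’s percentiles aggregation. The search body looks roughly like this (the index and field names are the illustrative ones from the pipeline sketch above):

```json
{
  "size": 0,
  "aggs": {
    "cycle_time": {
      "percentiles": {
        "field": "cycle_time_seconds",
        "percents": [50, 84, 95, 99]
      }
    }
  }
}
```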

Failed Deployments

As part of this event, we also know whether the build passed or failed. If the deployment failed, we can switch a flag in our event to indicate this (a sketch of the mechanics follows below). We went for a donut because it looks cooler than a pie chart, and the binary nature of this data meant that a proportion-based representation would be the easiest to understand. If the red gets bigger, it’s time to worry.
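The flag itself is little more than a try/catch around the deploy step. A sketch, with deployToProduction and pushDeploymentEvent standing in for our real shared-library steps:

```groovy
// Sketch: flip a flag on the deployment event if the deploy step throws.
def deploymentFailed = false
try {
    deployToProduction()            // hypothetical shared-library step
} catch (err) {
    deploymentFailed = true
    throw err                       // still fail the build
} finally {
    event.deployment_failed = deploymentFailed
    pushDeploymentEvent(event)      // hypothetical helper that POSTs the event to Elasticsearch
}
```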

Keep it green, keep it green!

Deployment Duration

This was a pipeline quality metric. We have shared our CI/CD code and most of the teams are running on common pipelines. This has some benefits and some costs. A slow pipeline doesn’t impact just one team; it impacts five. This is obviously undesirable. Again, this was easily derived from the pipeline: it knows when it starts and it knows when it stops.
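Something along these lines is all it takes (currentBuild.startTimeInMillis is provided by Jenkins; the event field name is illustrative):

```groovy
// Duration is just "now" minus the build's own start time, in seconds.
event.deploy_duration_seconds =
    (System.currentTimeMillis() - currentBuild.startTimeInMillis).intdiv(1000)
```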

We were happy with three minutes, but if it spikes, we’ll know.

We are quite confident in our pipeline’s speed and felt that percentiles would be somewhat overzealous. If the average begins to creep up, we may break this out into further data, but for now we’re happy with this.

A document in Elasticsearch

We elected to push this information into Elasticsearch as one single document — a deployment event. This would contain everything about a specific time that some code made it into production.
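The field names and values below are illustrative rather than our exact schema, but a deployment event looks something like this:

```json
{
  "service": "checkout-api",
  "commit_time": 1566292442,
  "deploy_time": 1566295367,
  "cycle_time_seconds": 2925,
  "deploy_duration_seconds": 182,
  "deployment_failed": false
}
```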

Once the pipeline pushes the information into Elasticsearch, we render it in Kibana using the built-in visualisations.

So what happened?

We’ve always been fans of small batch sizes and low lead times. They’re essential to maintaining flow. Our median cycle time was one hour, which shows that a good portion of our code was going out quickly.

The problem is the variance. Meetings, WiFi issues, sickness, holidays and many other things can conspire to slow down work that is otherwise ready to go into production. This is almost unavoidable. Almost. This data has changed the conversation. Previously, we might have focused on faster deployments, or quicker application startup time. This would have been incorrect.

The greatest constraint on our system is the delay from code commit to production release. We’ve homed in on a tangible and solvable problem. Next time, we’ll discuss the steps we’re taking to improve our working practices and create a stable pipeline of homogeneous(ish) work.
