DevOps KPI in Practice — Chapter 2 — Change Lead Time and Volume

Now it’s time to understand the Change Lead Time and Change Volume KPIs.

Fábio José
8 min read · Aug 3, 2018

Continuing our series on DevOps KPI in Practice, I have the great pleasure of presenting Chapter 2. Here we will learn the practical aspects of implementing the Change Lead Time and Change Volume KPIs.

We had a lot of theory in the first chapter. Now I will make just a quick intro, and then we will dive deep into practice.

Now it’s time to work with Change Lead Time and Change Volume.

KPI in Practice

The KPIs in this chapter intend to answer the following questions:

  • Change Lead Time — how much time passes from filing a feature until deploying it to production?
  • Change Volume — how many stories/features and lines of code do we push to production per deployment?

To make this concrete, we break the work down into four subsections: the solution, the metric ingestion, the meaning of the metrics for each KPI, and the presentation in Grafana.

The Solution

Version 2 of our solution to implement the DevOps KPI in Practice

Note: some details from Version 1 were omitted; the detailed data flow is described next.

  1. The developers push to the remote repository all the commits they have made, with the right issue numbers in the commit messages.
  2. Gitlab receives this push and executes a webhook to Jira. This webhook contains information about all the commits.
  3. Nifi receives the same webhook data that Gitlab sent to Jira, then processes the commit messages and extracts the issue numbers (see the sketch after this list).
  4. For each issue number, Nifi validates the details in Jira.
  5. And, for each commit, Nifi gets the statistics from Gitlab.
  6. Nifi joins the webhook data, the Gitlab statistics, and the Jira data, producing a new JSON document and putting it into Redis. The data will sleep there for a while.
  7. After some time, maybe hours or days, the deployment is performed. Then, Jenkins pushes the build information to InfluxDB.
  8. Periodically, Nifi gets the build information from InfluxDB, joins it with the data produced in step 6, and produces a JSON document with hashes, build information, changes, insertions, and deletions.
  9. Nifi gets the change-set from the Jenkins API, searching by project name and build number, and groups the hashes that are owned by the project.
  10. Nifi uses MongoDB to accumulate the metrics produced by every build, until one of them deploys a version to production.
  11. The accumulated metrics are transformed into the line protocol and sent to InfluxDB.
  12. InfluxDB serves the metrics to Grafana.
  13. The user reads the metrics, interprets the KPIs, and makes wise decisions.
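
To make step 3 concrete, here is a minimal Python sketch of extracting issue numbers from the push webhook. The payload fields (commits, id, message) follow Gitlab’s push event format; the regex assumes Jira-style keys like PROJ-123, so treat both as assumptions to adapt to your setup.

```python
import re

# Jira issue keys usually look like "PROJ-123"; this pattern is an
# assumption, adjust it to your own project key conventions.
ISSUE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def extract_issues(push_webhook: dict) -> dict:
    """Map each commit hash to the issue numbers found in its message."""
    issues_by_hash = {}
    for commit in push_webhook.get("commits", []):
        issues_by_hash[commit["id"]] = ISSUE_PATTERN.findall(commit["message"])
    return issues_by_hash

# Example with one commit referencing two issues.
payload = {"commits": [{"id": "a1b2c3", "message": "PROJ-42 PROJ-43 fix login"}]}
print(extract_issues(payload))  # {'a1b2c3': ['PROJ-42', 'PROJ-43']}
```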

Gitlab: our git repository manager; it provides a rich REST API to query the metrics that we will use to create the KPIs.

JIRA: the most widely used project management and bug tracking tool, where the scrum masters create the user stories and the developers take the details to implement the features. It’s neither free software nor open source, but it is the best.

MongoDB: a very popular NoSQL database that we use to accumulate the metrics during the feature development lifecycle; then we extract them all and compute the KPIs.

The solution for this chapter is a little more complex, so for a better understanding, follow the sequence diagram.

Chapter 2 — Solution sequence diagram
  1. The developer pushes the commits that she or he has made. These commits have Jira issue numbers.
  2. For each push action, Gitlab performs a webhook to Jira. This just maintains the default behavior; it is irrelevant to our solution.
  3. Gitlab performs a webhook to Jenkins. This starts the pipeline execution and will produce more data for us.
  4. And one more webhook goes to Nifi, starting the first part of the metrics correlation.
  5. Nifi splits the webhook data by hash, producing a list where the commit is the master field.
  6. It extracts the issue numbers from the commit messages and produces a new list where the issue number is the master field.
  7. For each issue number, it tries to get the details by querying the Jira API.
  8. When the issue is not found, it pushes a metric to InfluxDB to mark it as a dirty commit.
  9. When the issue is found, it stores the issue start timestamp in Redis, to be used in the steps ahead.
  10. Now, back in the commit hash context, for each hash it gets the commit statistics from the Gitlab API.
  11. It joins the issue count and the commit statistics, producing a new JSON.
  12. It stores the new JSON in Redis with the commit hash as the key.
  13. At some point in time, Jenkins sends to InfluxDB the information for every pipeline execution, whether it ended in success or with an error.
  14. Every ten seconds, Nifi polls InfluxDB for pipeline execution information.
  15. It gets the change-set from the Jenkins API, based on the build number and project name.
  16. It splits the change-set by commit hash.
  17. It gets the commit hash statistics from Redis (produced in step 12).
  18. It joins the pipeline data, the change-set, and the commit statistics, creating a JSON with the project name as the master field.
  19. It saves to MongoDB the JSON created in the previous step.
  20. If the environment field value equals PRD and the status is SUCCESS, it executes the alternative flow, steps 21 to 29. Otherwise, it jumps to the end step.
  21. Gets from MongoDB the accumulated metrics, based on the project name.
  22. Deletes from MongoDB the accumulated data, based on the project name.
  23. Sums the statistics.
  24. Pushes the issue count and the statistics sum to InfluxDB.
  25. Splits the accumulated metrics by issue number.
  26. Gets the issue start timestamp from Redis, based on the issue number.
  27. Gets the deployment timestamp to use as the issue end timestamp.
  28. Computes the total time: end minus start (see the sketch after this list).
  29. Pushes the total time to InfluxDB.
  30. End.
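
Steps 26 to 28 boil down to a simple subtraction. Here is a minimal Python sketch, assuming the start timestamps were stored in Redis under the issue number (step 9) as epoch seconds; the key layout and units are assumptions.

```python
import time
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def lead_time_seconds(issue_number: str, deploy_ts: float) -> float:
    """Read the issue start timestamp from Redis and compute
    the total time: end (deployment) minus start (issue filed)."""
    start_ts = float(r.get(issue_number))  # stored in step 9
    return deploy_ts - start_ts

# Example: a deployment happening right now for a hypothetical issue.
total = lead_time_seconds("PROJ-42", time.time())
print(f"lead time: {total / 3600:.1f} hours")
```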

The Metric Ingestion

The ingestion happens when the accumulated data is ready to go from Nifi to InfluxDB. By now, the use of the commit hash as the common piece of information in every step should be clear. The hash is consistent: it is generated for every commit the developer makes and is stored in the remote git repository after the developer’s push.
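
As a reference for what the ingestion looks like, here is a minimal sketch of pushing one point in the line protocol to InfluxDB 1.x over HTTP. The measurement, tag, and field names are illustrative assumptions, not the exact ones used in the Nifi flow.

```python
import time
import requests  # assumes the requests library is installed

INFLUX_URL = "http://localhost:8086/write"  # InfluxDB 1.x write endpoint
DATABASE = "devops_kpi"                     # hypothetical database name

def push_point(project: str, stories: int, locs: int) -> None:
    """Build one line-protocol point (measurement,tags fields timestamp)
    and POST it to InfluxDB."""
    ts = int(time.time() * 1e9)  # line protocol timestamps are nanoseconds
    line = f"change_volume,project={project} stories={stories}i,locs={locs}i {ts}"
    requests.post(INFLUX_URL, params={"db": DATABASE}, data=line).raise_for_status()

push_point("billing-api", stories=3, locs=250)
```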

For these KPIs, all the magic happens in Nifi; because of this, the data flow is huge and has 78 processors. It’s useless to post a screenshot here, so instead here is a screencast.

You can get the data flow template here, import it into your Nifi instance, and see the details of how it works.

The Metrics' Meaning

The key common value is the commit hash. To produce high-quality metrics, the developers play a fundamental role: they must put the issue number in every commit message. However, everybody knows that sometimes someone forgets, so to track how that is going I propose two additional metrics (with a classification sketch after the list):

  • Dirty commits: commits with invalid or unknown issue numbers.
  • Orphan commits: commits without issue numbers at all.
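
Here is a minimal sketch of how a commit could be classified, assuming Jira’s REST API (which returns HTTP 404 for unknown issues) and the same key regex as before; the base URL and function name are hypothetical.

```python
import re
import requests

ISSUE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
JIRA_URL = "https://jira.example.com"  # hypothetical Jira base URL

def classify_commit(message: str) -> str:
    """Return 'orphan', 'dirty', or 'clean' for a commit message."""
    issues = ISSUE_PATTERN.findall(message)
    if not issues:
        return "orphan"  # no issue number at all
    for key in issues:
        # Jira answers 404 when the issue is invalid or unknown.
        if requests.get(f"{JIRA_URL}/rest/api/2/issue/{key}").status_code == 404:
            return "dirty"  # issue number present but not found in Jira
    return "clean"

print(classify_commit("fix typo"))             # orphan
print(classify_commit("PROJ-42 add feature"))  # dirty or clean, per Jira
```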

Change Lead Time

How much time the team spends coding, testing and deploying to production, counting from the issue filing timestamp to the deployment timestamp. This KPI is crucial for understanding whether our DevOps initiative is really having a positive effect and whether our automation is in place.

For instance: if our tests are mostly performed as manual human tasks, our lead time will inevitably be huge.

The time measurement starts when the issue is filed in our issue management tool and ends when the deployment happens in a production environment. With the values for all issues, we take the average, the median, the maximum and the minimum lead time.
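
For example, with Python’s standard statistics module (the lead times below are made up):

```python
from statistics import mean, median

# Hypothetical lead times, in hours, one per issue deployed in the period.
lead_times = [12.0, 30.5, 48.0, 8.5, 72.0]

print(f"AVG:    {mean(lead_times):.1f} h")    # 34.2 h
print(f"MEDIAN: {median(lead_times):.1f} h")  # 30.5 h
print(f"MIN:    {min(lead_times):.1f} h")     # 8.5 h
print(f"MAX:    {max(lead_times):.1f} h")     # 72.0 h
```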

Threshold: there is no default threshold, but keep the value low. One way to establish the threshold is to measure the current lead time before starting the DevOps adoption and use it as a baseline. The target is always to reduce this number.

Tendency: like other KPIs, the values will be high at the beginning, but they reduce over time as the adoption gains more adepts and they become disciplined.

Impact: a short time frame between filing a feature and deploying it to production means this: most of our effort is concentrated on developing the business, and less is spent executing side tasks.

Change Volume

A change is everything we code (application, infrastructure, or configuration) that has a feature filed in the issue management tool. The tool can be for project management, kanban, agile management, etc. Change Volume is the average number of features and lines of code that we develop and take to production in every deployment.
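
As a tiny worked example of this average (all numbers invented):

```python
from statistics import mean

# Hypothetical per-deployment totals: (stories deployed, lines of code changed).
deployments = [(4, 320), (2, 110), (6, 540)]

print(f"Stories per deployment: {mean(d[0] for d in deployments):.1f}")  # 4.0
print(f"LoCs per deployment:    {mean(d[1] for d in deployments):.1f}")  # 323.3
```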

Threshold: high values are bad and low values are good. High values mean that we aren’t performing frequent, small releases. But how much is high and how much is low depends on you and your company.

Tendency: at the beginning, the average volume will be huge, but it tends to reduce as we break releases down into small feature sets.

Impact: with a small average volume, our capacity to implement features and take them to production becomes fast and consistent, achieving the business goals without losing time on non-profitable manual tasks.

The Presentation

Using the amazing tool called Grafana, we can show the metrics.

Change Volume — sample dashboard

Note that we have the lines of code (LoCs) and the user stories (Stories) pushed to production for each deployment. These metrics comprise the Change Volume.

The time frame in this context is This Week.

  • Total Volume: the sum of LoCs and Stories in the time frame
  • LoCs — AVG: the average of LoCs in the time frame
  • Stories — Min Count: the minimum count of Stories in the time frame
  • Stories — Max Count: the maximum count of Stories in the time frame
  • Stories — AVG: the average of Stories in the time frame
  • Dirty Commit: the graph of dirty commits along the time frame
  • Orphan Commit: the graph of orphan commits along the time frame

A live demo of Change Volume is available here.

Change Lead Time — sample dashboard

The time frame in this example is 1 hour, because we are using a faucet to generate a lot of metrics and see things working. But I recommend using the Last Two Weeks time frame.

To understand the lead time we have the following metrics:

  • AVG: the average lead time in the time frame
  • MIN: the minimum lead time
  • MAX: the maximum lead time
  • MEDIAN: the middle value of the lead times

Missed some details? See a live demo of the Change Lead Time dashboard here.

Final Words

I hope you got the practical idea behind lead time and change volume, and the importance of measurement to know how well (or badly :) the adoption is going.

If you have any doubts, post them in the comments!

That’s it! Thank you for reading!
