Using Code Review Metrics as Performance Indicators

dm03514
ValueStream by Operational Analytics, Inc.
7 min read · Jan 19, 2020

Tracking engineering performance from code review (pull request) metrics needs to be done carefully. Focusing solely on pull requests for performance metrics leads to a myopic view of engineering performance. Is there anything that pull requests can indicate about engineering performance? Which metrics (if any) should be used, and how should they be used?

What Can Pull Requests Indicate?

To understand what pull requests can indicate about performance, it’s important to first understand how pull requests are regularly used and the role they play in software delivery. Performance, in this context, refers to the ability of a company to turn software from an idea into code running in production (called Delivery). Delivery will be measured from the time someone starts working on a task (the issue goes to “In Progress”) until the time the task is successfully delivered (deployed) to production. Some common uses of pull requests are:

  • Collaboration — Getting teams from different projects to share knowledge (often tribal) and understanding of services
  • Promoting best practices / Mentoring — Ensuring that individuals who are knowledgeable about company-, language-, or service-specific best practices are looped into the software construction process.
  • Risk Mitigation/Management — Companies without sufficient automation to mitigate risk (test coverage, deployment automation: single-changeset deploys, rollbacks, canarying, blue-green deployments) often end up pushing risk mitigation to the pull request process. This causes pull requests to function as a manual QA step, with the intention of manually identifying defects in code.
  • Documentation — Pull requests can serve as a form of documentation and changelog that provides context around what changes were occurring and potential concerns or issues with those changes.

In modern Software Development Life Cycles (SDLC), pull requests usually occur sometime after code is developed and before it is deployed (simplified).

Pull requests are also executed synchronously, with many pull request tools offering support for policy gates. This means that pull request approval blocks the delivery process. For example, if a pull request takes 48 hours to approve, it delays delivery by 48 hours.

The “cost” of pull requests within the context of software delivery is increased latency. What can we actually say about pull requests? Pull requests are:

  • A meaningful and important part of development where organizational norms and risk mitigation take place
  • Synchronous, so time spent in review holds up value delivery

This means that pull request duration should be optimized: because pull requests are a synchronous step in delivering value to a customer, shorter pull requests block the overarching development/deployment process for less time.

Pitfalls of Pull Requests as a Performance Indicator

The biggest issue is when pull requests are used as the only source of performance indicators: a single stage of the software creation process is examined to look for bottlenecks.

Many of these platforms focus on events within the context of pull requests, such as:

  • Time spent from pull request ready to merge
  • Follow-up times on comments
  • Time spent in different pull request states (ready, waiting, in review, merged, etc.)

Focusing solely on pull request metrics can lead to poor results and local optimums. Accurately identifying bottlenecks requires more context into the software construction lifecycle than code reviews are able to provide. This is a first principle of performance tuning: understand the current operation of the system. Halving waiting time may seem like a good investment… if it is actually a process bottleneck. Assume that the full SDLC takes ~10 days, and the “Review” stage takes 2 of those days (20%). A code review tool is surfacing that follow-on comments take ~4 hours. The percentage of total delivery time that follow-on comments take is:

4 hours of follow-up comments / (10 days * 8 working hours a day = 80 hours) = 0.05, i.e. 5% of delivery time is spent in follow-up comments. Cutting that time from 4 hours to 3 hours will save ~1.25% of the total delivery process; instead of 80 hours the process will take 79 hours. There’s probably a better place to spend time (test automation? deployment?) that has a larger, easier-to-scale impact on delivery performance. Even if follow-on comments are the largest process bottleneck, that can only be determined by looking at the process as a whole.
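
Worked out as a quick back-of-the-envelope calculation (a minimal Python sketch; the 10-day SDLC, 8-hour workday, and 4-hour comment time are the assumed numbers from the example above, not measurements):

    # Back-of-the-envelope: what fraction of delivery time is follow-up comments?
    # Numbers are the assumptions from the example above, not measured data.
    sdlc_days = 10                            # full idea-to-production time
    hours_per_day = 8                         # working hours per day
    total_hours = sdlc_days * hours_per_day   # 80 hours

    follow_up_hours = 4                       # time spent waiting on follow-up comments
    print(follow_up_hours / total_hours)      # 0.05 -> 5% of delivery time

    # Cutting follow-up time from 4 hours to 3 hours:
    savings = (4 - 3) / total_hours
    print(savings)                            # 0.0125 -> ~1.25% faster delivery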

Just because a metric is available doesn’t mean it’s useful. The most popular code review metrics platforms offer many metrics, including the number of pull requests created, the number of pushes, comments, and many other events. The problem is that these tell very little about performance or delivery. What’s the difference between a pull request that needs a single commit vs. one that needs multiple commits? What if the single-commit PR took 4 days but the multi-commit PR took 1 day?

Metrics need clear and explicit definitions. Many of the current platforms offer metrics like “impact”, “churn”, “complexity”, and “risk”. It’s critical to clearly understand how these are calculated and how they tie back to the larger delivery process (SDLC) as a whole.

Pull requests are often a single step within a larger development process. Like all profiling and performance engineering, understanding the global effect is necessary in order to understand the impact of process changes at any individual stage.

Choosing the Correct Indicators

“What you measure is what you’ll get.” (H. Thomas Johnson.) Careful consideration must be paid to which metrics are used to indicate performance, because those are the metrics that will naturally be optimized for. Which metrics are necessary to determine the overarching cost of pull requests to organizational performance? Given the goal of pull requests and where they occur in the delivery pipeline, what are the important indicators in terms of delivery performance? This strategy relies on first principles of system observability, which Google calls the Four Golden Signals:

  • Rate — Answers: “How often are pull requests happening?”
  • Duration — Answers: “How long are pull requests taking?”
  • Merge ratio (error ratio) — Answers: “What percentage of pull requests are actually being incorporated into the main branch?”
  • # of in-progress pull requests (queue depth) — Answers: “How many pull requests are in progress (outstanding)?” This metric can be derived from the rate and duration using Little’s Law, as sketched below.
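
To make that relationship concrete, here is a minimal Python sketch of Little’s Law (L = λ × W) applied to pull requests; the rate and duration values are made-up example numbers, not data from any real project:

    # Little's Law: avg # in progress (L) = arrival rate (lambda) * avg duration (W).
    # The numbers below are hypothetical, purely to illustrate the relationship.
    prs_opened_per_day = 10        # rate: how often pull requests are happening
    avg_pr_duration_days = 2       # duration: how long pull requests are taking

    avg_prs_in_progress = prs_opened_per_day * avg_pr_duration_days
    print(avg_prs_in_progress)     # 20 pull requests outstanding at any given time

The same identity works in reverse: given the number of outstanding pull requests and the rate, average duration falls out as L / λ.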

Many of the popular (and extremely expensive) code review metrics platforms offer many additional metrics: time to first commit, per-developer activity metrics, # of days worked, projects contributed to, and so on. These are really interesting metrics, and may even be useful for tracking organizational topology or complexity, and for organizational and code-level insights, but they are poor indicators of delivery performance or organizational efficiency. The timing metrics can be extremely helpful after pull requests have been empirically determined to be the delivery bottleneck. A couple of distinct groupings of metrics emerge:

  • Delivery metrics — How long pull requests are taking and how they fit into the overall delivery process. Easy to measure, foundational, based in reality: how often, how much, how long.
  • Code quality metrics — Churn, frequently committed areas of code, language-specific static analysis, etc. More ambiguous (what exactly is churn?). Useful for finding hot spots in code, risk factors, security scanning, and best practices. Extremely useful when tied back into the delivery timeline.
  • Organizational indicators — Topologies, knowledge clusters, comments on PRs, discussions, “culture”, etc. All of these are important to keep in mind, and many good managers should have a pulse on them without needing metrics. For a company that has solved all of its delivery problems, this is probably a good place to tackle for increased efficiency and to help make sure teams and knowledge are balanced. I hope these metrics are viewed as a higher level of the “hierarchy of needs”; I would be really concerned working for a company that used them as organizational performance indicators without first solving foundational automation tooling and delivery.

The categories above form their own pyramid, with delivery metrics as the foundation.

Metrics are only important within the context of delivery. Code quality only becomes an issue if it begins to affect delivery, either by slowing it down or by causing failures and rework. In the case of failures and rework, the major code review players have a hard time tying code back to the value being created for customers, so these numbers are often viewed in isolation, independent of delivery metrics.

Generating Metrics

The good news is that the foundational metrics are easy to collect and visualize without spending hundreds or thousands of dollars a month on a code quality platform. They are free and easily accessible through the service providers. These metrics also settle nicely at coarse intervals: daily, weekly, monthly, quarterly, etc. Feedback like this usually occurs during planning; per-hour (or more frequent) metrics aren’t actionable.

Both GitHub and GitLab provide pull request metrics for free through their APIs. We added support to ValueStream for gathering pull request metrics and aggregating them to a weekly view, so the critical pull request delivery performance metrics can be viewed as a Google Sheet.
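
To show how accessible the raw data is, here is a minimal Python sketch against GitHub’s public REST API that derives weekly rate, merge ratio, and average duration. The owner/repo path is a placeholder, and this is only an illustration of the API, not ValueStream’s implementation:

    # Minimal sketch: pull request delivery metrics from the GitHub REST API.
    # "owner/repo" is a placeholder; swap in a real repository to run it.
    from collections import defaultdict
    from datetime import datetime
    import requests

    resp = requests.get(
        "https://api.github.com/repos/owner/repo/pulls",
        params={"state": "closed", "per_page": 100},
    )
    resp.raise_for_status()

    weekly = defaultdict(lambda: {"merged": 0, "closed": 0, "hours": 0.0})
    for pr in resp.json():
        created = datetime.fromisoformat(pr["created_at"].rstrip("Z"))
        closed = datetime.fromisoformat(pr["closed_at"].rstrip("Z"))
        week = created.strftime("%Y-%W")
        weekly[week]["closed"] += 1
        if pr["merged_at"]:                      # only merged PRs count toward duration
            weekly[week]["merged"] += 1
            weekly[week]["hours"] += (closed - created).total_seconds() / 3600

    for week, m in sorted(weekly.items()):
        avg = m["hours"] / m["merged"] if m["merged"] else 0.0
        print(week,
              "rate:", m["closed"],
              "merge ratio:", round(m["merged"] / m["closed"], 2),
              "avg duration (h):", round(avg, 1))

Any spreadsheet or dashboard can sit on top of output like this; the point is that the foundational delivery signals require nothing more than the provider’s own API.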

Gathering and aggregating these metrics is now as easy as two commands using ValueStream. A later post will dive into using these metrics as a first stage of debugging pull request performance.

Conclusion

Understanding how pull requests affect delivery performance requires understanding the full delivery process. If the full process isn’t taken into account, performance tuning of pull requests risks creating a local optimum. It’s critically important to consider what’s being measured and where it’s being measured from.
