5 Minute DevOps: Waiting Kills Quality

Published in Defense Unicorns · Sep 18, 2023

Photo by cottonbro studio: https://www.pexels.com/photo/skeleton-covered-in-spider-web-5435178/

The waiting is the hardest part
Every day you get one more card
You take it on faith, you take it to the heart
The waiting is the hardest part

“The Waiting”, Tom Petty and the Heartbreakers, 1981

Tom Petty understands one of the biggest problems in software delivery. It also sounds like he uses Jira.

Waiting is one of the most common wastes in product development. Do we need more clarity about a feature? Wait for the Product Owner to respond to the email. Do we need a go / no-go decision on a change? Wait for the Change Control Board to bless it. Waiting harms one of our goals: reducing batch size.

Why do we care about small batches of work? To summarize Damon Edwards’s 2012 post “DevOps Lessons from Lean: Small Batches Improve Flow,” focusing on smaller batches:

Gives faster feedback on both quality and value
Reduces the risk of error or outage
Improves efficiency and lowers overhead

All of these are Good Things®. Removing process friction in the workflow helps us achieve these Good Things®. Having avoidable process friction is a Bad Thing® because that friction encourages us to do things less frequently and drives up batch size. That, in turn, reduces efficiency, increases costs, increases risk, and delays quality feedback.

Let’s explore three levels of process friction and how they impact outcomes.

Waiting is Hard

What’s your code review process? Commonly, teams use the worst possible code review method, asynchronous review. Let’s consider an example of that workflow.

This fragment of a value stream map represents the typical pattern I’ve measured on many team workflows over the years:

A developer spends five days developing a feature.
They submit their change for code review and, hopefully, alert the team that a review is available.
Four hours later, a teammate gets to a stopping place and starts reviewing the change.
After two hours of work, the code review is complete, and about 20% of the time, the reviewer requests changes.
When the developer notices the comments and gets to a stopping place, they address the comments and resubmit the change for review.

This cycle can repeat two, three, or more times. Sometimes, the feedback is a request for clarification that would have taken less than a minute to resolve in person. Instead, it can take hours for a question to flow through the process and be answered, often resulting in follow-up questions. It’s not uncommon for a five-day change to require another week of review.
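To make the arithmetic concrete, here is a minimal sketch that totals work time and wait time for the review loop in the list above. The 4-hour wait for a reviewer and the 2-hour review come from that list. The delay before the author notices comments and the rework time are hypothetical placeholders, so substitute your own measurements.

```python
from dataclasses import dataclass


@dataclass
class ReviewLoop:
    wait_for_reviewer_h: float  # from the example: 4 hours
    review_h: float             # from the example: 2 hours
    wait_for_author_h: float    # hypothetical: delay before the author sees comments
    rework_h: float             # hypothetical: time spent addressing comments


def review_totals(loop: ReviewLoop, change_requests: int) -> tuple[float, float]:
    """Return (work_hours, wait_hours) for a review with the given number of change requests."""
    passes = change_requests + 1  # every change request triggers another review pass
    work = passes * loop.review_h + change_requests * loop.rework_h
    wait = passes * loop.wait_for_reviewer_h + change_requests * loop.wait_for_author_h
    return work, wait


loop = ReviewLoop(wait_for_reviewer_h=4, review_h=2, wait_for_author_h=4, rework_h=1)
for requests in range(4):
    work, wait = review_totals(loop, requests)
    print(f"{requests} change requests: {work:.0f}h of work, {wait:.0f}h of waiting")
```

With these placeholder values, every additional round of comments adds more waiting than work, which is the pattern the value stream map exposes.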

These wait times are usually hidden because the developer will probably start another task while they are waiting. From the outside, it may appear that work is happening. However, in reality, all that’s happening is that we are increasing the batch size of undelivered and undeliverable work.

Because of the asynchronous communication delays, developers will submit code for review less often, which means:

Changes will be larger
Changes will be harder to review
Code review will be less effective with more retry loops
Code conflict will be more likely when these large changes are merged

This vicious cycle that drives up our batch size also extends quality feedback loops and reduces the quality of what’s delivered. Objectively, a Bad Thing®.
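One way to see why larger batches raise risk is a back-of-the-envelope model. It assumes each change independently carries the same small chance of introducing a defect, which is a simplification, and the 2% rate is purely illustrative rather than a measured value.

```python
def batch_defect_probability(changes_in_batch: int, defect_rate_per_change: float) -> float:
    """Chance that at least one change in the batch carries a defect,
    assuming changes fail independently (a simplifying assumption)."""
    return 1 - (1 - defect_rate_per_change) ** changes_in_batch


# Illustrative batch sizes and an illustrative 2% per-change defect rate.
for size in (1, 5, 20, 50):
    p = batch_defect_probability(size, defect_rate_per_change=0.02)
    print(f"batch of {size:>2}: {p:.0%} chance of at least one defect")
```

The model says nothing about how hard a defect is to locate, but a defect buried among fifty changes has more places to hide than one delivered on its own.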

More review to “improve the quality of code review” is self-defeating. It makes the vicious cycle worse by injecting more delay.

This example of a two-level review is the best case. It’s common for a senior developer to be a shared resource or loaded with meetings and only have small windows of time for code review.

But open source does it that way!

Many teams use fork-and-pull with async code review and say, “That’s the best practice because that’s what open source does.” Async review is only a good practice for open source because the external contributors’ time doesn’t cost us anything, and their contributions aren’t something we depend on to achieve business goals. Therefore, delivery cost and cost of delay don’t matter. Use the right tool for the right problem.

Be a Team!

How do we improve this? Simple. Make a few hours of change, find a teammate, review the code with them, work on the corrections together, and merge the change. Any questions or misunderstandings can be handled in real time. The only wait time involved is waiting for a teammate to be free. However, if everyone is making small changes and looking for a review, the wait time for review steadily decreases to minutes instead of hours. We get our smaller batches by identifying and removing waiting. We eliminate steps that add no value, automate anything based on rules, and collaborate on everything else.

Even better for shrinking batches is pair programming. There is no code review wait time because coding and review are simultaneous. Changes can be as small as “This test passes. Push it!” Quality feedback is only as slow as the pipeline’s automated acceptance tests.

More Distance Makes it Harder

Intra-team communication drag is relatively easy to fix because everyone on the team has the same goals, and the team only needs to change their working agreement. However, when inter-team dependencies exist, wait times can grow by orders of magnitude.

Change Advisory Boards (CABs) are an excellent example of this problem. With a CAB (or change board, change control board, etc.), each change must be approved by a group outside the team. They review the documentation about the change and why it needs to happen, ensure the code was reviewed, validate that there’s a way to back out the change, review test results, and so on. This legacy IT Service Management process is done in the name of “quality,” “security,” and “compliance.” Of course, none of the people approving the change have enough context to do more than make an emotional decision. Did they get a bad night’s sleep because of another team’s change? “Go do more testing and make sure!”

Where code review can add hours of wait time, a CAB can add days. Not only does it drive up the batch size, but because there is a specific window for change approval, it encourages people to rush to get changes into the current batch or be forced to wait for the next window. What does this do to quality?

We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.

Excerpt From Accelerate
Nicole Forsgren Ph.D., Jez Humble & Gene Kim

Learning the Wrong Lessons

A friend works for a company that decided to re-institute change boards after several high-impact incidents. Rather than using a postmortem to identify and correct the underlying causes, they decided that having a committee grill people on every change and adding days of delay would be a better approach.

Recently, there was another incident in my friend’s area, and they were tasked with determining the root cause. During the investigation, they discovered that the change that would have prevented the incident had been ready to deliver but was waiting on the change council process, which requires a minimum of 48 hours. Had that change been deployed, there would have been no incident.

CAB is a Bad Thing®, a song and dance meeting that makes us less safe by driving up batch size while providing a false sense of security to the unenlightened. The only identifiable value is distributing responsibility and blame, making it harder to point fingers when things fail. Of course, the impact of failure for large changes is higher, but at least no one will get fired. Everything a CAB tries to do can be done more effectively by automating all of the controls in the delivery pipeline and catching problems at the source.
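As a rough illustration of what automating those controls could look like, the sketch below encodes three of the checks a CAB typically performs (the code was reviewed, the tests passed, a rollback path exists) as a pipeline gate. The `Change` fields and the `control_violations` function are hypothetical names invented for this example, not part of any particular CI system.

```python
from dataclasses import dataclass


@dataclass
class Change:
    # Hypothetical evidence a pipeline could collect for each change.
    reviewed_by_peer: bool
    tests_passed: bool
    rollback_verified: bool


def control_violations(change: Change) -> list[str]:
    """Return the controls this change fails; an empty list means it may deploy."""
    checks = {
        "peer review missing": change.reviewed_by_peer,
        "automated tests failing": change.tests_passed,
        "no verified rollback": change.rollback_verified,
    }
    return [problem for problem, ok in checks.items() if not ok]


change = Change(reviewed_by_peer=True, tests_passed=True, rollback_verified=False)
violations = control_violations(change)
print("deploy" if not violations else f"blocked: {', '.join(violations)}")
```

Because a gate like this runs on every change, the feedback arrives in minutes and at the source rather than days later in a meeting.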

The Worst

We know smaller batches correlate with higher quality, fewer unused features, better architecture, and lower costs. So, naturally, teams delivering software for the government will usually be forced to do the opposite.

Most government programs work by building “complete solutions” and then applying for an Authorization to Operate (ATO) using the Risk Management Framework (RMF).

Everything looks simple when you boil it down to a graphic, but when I reviewed NIST’s “quick start guide” for RMF, I saw it was 30 pages long and focused on all the roles involved with every step, basically a 30-page RACI matrix. Often, those people are in two or more different organizations. So, each step requires handoffs, many of which are between different organizations with different priorities.

Ticket Driven Conversations

Another friend related an example that happens frequently. They submitted an ATO package for a review and received a question from a reviewer that, had it been a phone call, would have required a 10-minute conversation. Using the official approval process was still 10 minutes of work, but it required an additional 21 days of waiting for the reviewer to get the answer and respond. If that answer required more clarification, 21 more days. They said it’s common to bypass the system and send reviewers things to “pre-review” before using the official process for approval.
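The cost of that exchange can be expressed as flow efficiency, the share of elapsed time spent on actual work. This small sketch uses the 10 minutes and 21 days from the example above and assumes the 21 days are calendar days.

```python
from datetime import timedelta


def flow_efficiency(work: timedelta, wait: timedelta) -> float:
    """Fraction of total elapsed time that was spent doing the work."""
    return work / (work + wait)


# One reviewer question: 10 minutes of work, 21 days of waiting.
one_question = flow_efficiency(work=timedelta(minutes=10), wait=timedelta(days=21))
print(f"{one_question:.3%} of the elapsed time was work")
```

Under that assumption, roughly 0.03% of the elapsed time was work. Each follow-up question repeats the same ratio while adding another three weeks to the calendar.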

Collaborate!

The fix for this is to bring people together, both in time and in goals. The current goal, approving something that’s been built, is wrong. We should partner with external teams and organizations to approve the construction method, apply those controls to automated validation, and get continuous feedback that every code change meets those controls. This also means we can get continuous feedback that we are building the right thing instead of just an approved thing.

Just as important is that we should prioritize getting things done. If a process injects avoidable wait time, it’s a broken process and should be changed to remove it. Avoid asynchronous conversations for things that matter. Have a meeting, collaborate, discuss the issues, get conclusions, and finish things. Prioritize the end users and the goals, not the process.

Solution: Remove Risk Management Theater

The examples above are common methods that organizations use to manage risk. Unfortunately, those risk management implementations do exactly the opposite by reducing the efficiency of the process with avoidable wait times. This increases the transaction cost of change, which increases the size of change and the potential blast radius of change. We can reduce risk by looking at every step in the process and removing wait times.

Eliminate: If we don’t need a process, then remove it. Not only does this remove the wait time, but it saves busy work.
Automate: Remove people from repetitive work. People are slow, expensive, and cannot perform the same task the same way every time. Robots reduce wait times and lower the cost of change.
Collaborate: If it requires creativity rather than repetition, work together. Have real-time conversations and solve problems instead of playing ping-pong through a ticketing system. Stop starting work and start finishing it.

By identifying and removing wait times, we can start seeing all of the benefits of smaller batches of work. We can spend less time, effort, and money delivering things that matter instead of waiting to deliver things no one needs.