You Can’t Build More Nines

Clint Byrum
5 min read · Feb 14, 2025

Software teams are built to.. well.. build. We have design processes, RFC processes, change management processes.. lots of processes. All of them tend to be optimized for building.

But, inevitably after building enough complexity, we start to realize that our systems are not reliable enough. We start to measure uptime and, lo, there are not enough nines!

Of course, our first inclination is to build our way to more nines. Build CI/CD pipelines. Build canary deployments. Build a platform. Build synthetic testing.

It’s usually at this point that the disillusionment sets in. Why aren’t we getting more nines? We built stuff for that!

But what are we measuring when we measure uptime anyway?

We are, in effect, measuring how often we see what we want to see when we look. Is it up now? Yes. How about now? Yes. Did our users succeed mostly?

Well if that’s what we’re measuring, and we’re trying to build more nines, we have to ask: what is a nine composed of?

One might say it is composed of time slices in which we’re up, or successful events. So could we say, then, that if we build a system that is up, we built the nines?

Unfortunately, math would like a word. Those nines are a percentage, so we’re always subject to everything in the denominator.

Or, to paraphrase a common military adage: “the entropy gets a vote.” No matter how bulletproof you build the components of your system, the only way to make nines go up is to be ready to deal with the host of surprises that take them back down. By definition, a percentage is a zero-sum game. So, really, to add nines to your target, you have to subtract something else. You have to subtract the faults.
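To make the percentage math concrete, here is a minimal sketch in Python. The function names (`nines`, `fault_budget`) and the one-million-event example are my own illustration, not anything from a real monitoring tool.

```python
import math


def nines(good: int, total: int) -> float:
    """Number of nines for `good` successful events out of `total` observed."""
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("need 0 <= good <= total and total > 0")
    bad = total - good
    if bad == 0:
        return math.inf  # no faults observed: unbounded nines
    # 99.9% good means a bad fraction of 10^-3, which is three nines.
    return -math.log10(bad / total)


def fault_budget(total: int, target_nines: int) -> int:
    """Largest count of bad events that still meets the target."""
    return total // 10 ** target_nines


# Hypothetical numbers: one million observed events.
total = 1_000_000
print(round(nines(999_000, total), 2))
print(fault_budget(total, 3), fault_budget(total, 4))
```

Out of a million events, three nines leaves room for 1,000 bad ones and four nines leaves room for 100. Nothing you build adds to the good count without also adding to the total, so going from three to four nines means taking away 900 of those 1,000 faults.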

But, but, I’ve built systems to add nines!

You’ve probably built mostly two things: Fault avoidance, and redundancy.

Fault avoidance is the easy part. End-to-end tests that run pre-merge avoid some faults. Canary deploys avoid another class. Type checkers, linters, unit tests, all avoiding classes of faults. These will certainly increase the nines in the components of your system where they are deployed.

But again, this doesn’t do much to avoid the surprises of a complex system meeting the entropy of the real world. And since a very powerful, confidence-building CI/CD pipeline makes it easier to ship, you have also increased the velocity of changes entering your components. No matter how good your pre-merge and post-deploy automated testing and rollback system is, it will always be supporting the change process. And changes are a source of faults. So while having great fault-avoidance automation will certainly subtract faults, it will also add some new ones back in.
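A rough back-of-the-envelope sketch of that trade-off, in Python. Every number here is made up for illustration, and the model (faults scale linearly with the number of changes) is an assumption, not a measurement.

```python
def expected_faults(changes_per_week: float, fault_rate_per_change: float) -> float:
    """Expected faults per week if each change independently carries the same risk."""
    return changes_per_week * fault_rate_per_change


# Hypothetical "before": slow, manual releases with a high per-change fault rate.
before = expected_faults(changes_per_week=10, fault_rate_per_change=0.10)

# Hypothetical "after": a great pipeline cuts the per-change fault rate by
# more than two thirds, but the team now ships five times as often.
after = expected_faults(changes_per_week=50, fault_rate_per_change=0.03)

print(before, after, after > before)
```

With these invented numbers, each change got much safer and the weekly fault count still went up. Real teams will land on either side of that line; the point is only that the per-change rate and the change volume both get a say.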

So, you did the easy thing, but now you need to subtract more faults. Now you need to think about making the collections of components more redundant. After all, it’s predictable to have a broad class of problems like “ran out of computers,” “network suddenly stopped networking,” or “backhoe cut fiber connection.”

For this, you build global database replication, leader elections, sharding, tombstones, write forwarding, queuing, eventual consistency, etc. etc.

All of this fancy redundancy subtracts those big, obvious, predictable failures. So surely you’ll get the precious nines you’ve been longing for from this. Finally, you’ve done it. You’ve built the nines!

Except, you will also note some new faults. Global DB replication runs out of transaction log space. Leader elections take too long. Shards get flaky. Tombstones build up faster than they can be reaped. Metrics stop flowing. Logs are lost. Etc. etc. The denominator for good and bad events has grown by so many things that you might even make it worse before it gets better.
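Here is a toy model of how redundancy can cut both ways, sketched in Python. It assumes replicas fail independently and that the failover machinery (replication, elections, and so on) is a single component sitting in series with them. Both are strong simplifying assumptions, and the availability figures are hypothetical.

```python
def parallel(availability: float, replicas: int) -> float:
    """Availability of N independent replicas where any one can serve."""
    return 1.0 - (1.0 - availability) ** replicas


def with_failover(availability: float, replicas: int, failover: float) -> float:
    """Same replicas, but everything depends on the failover machinery too."""
    return failover * parallel(availability, replicas)


single = 0.999  # hypothetical: one node at three nines

print(parallel(single, 2))                # redundancy on paper
print(with_failover(single, 2, 0.9999))   # solid failover machinery
print(with_failover(single, 2, 0.998))    # flaky failover machinery
```

On paper, two replicas take the bad fraction from one in a thousand to one in a million. Put machinery that is itself only 99.8% available in front of them and the whole thing ends up below the single node you started with.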

So, it’s goat herding for me then?

Don’t give up here. More nines are achievable, and sustainable, obviously. Many of us have done it. But we didn’t do it just by building software. Whether we hit three or six nines, and whether or not we realized it at the time, we built something a bit more, something a bit harder to measure than uptime, redundancy, or fault avoidance:

We built our organization’s resilience.

This didn’t happen by accident though. Somebody committed to driving the faults down. Somebody gave cover for those down at the sharp end feeling the pain of those faults. It went best when it was our leaders.

While all this redundancy was rolling out, somebody had the time and space to draw a map of all the faults, and to tell everyone else about it.

As we were talking to our customers we probably listened to them, and made sure that everyone understood what they expected the system to do, and vice-versa, being clear about what we do and don’t promise.

When alerts went off, I hope we took them seriously, and that we made sure they were representative of a real signal, with real plans for what to do with that signal.

If we were lucky, we made sure people were prepared by sending them to incident command training, giving them time and space to practice, run game days, devise role-playing exercises, and complete disaster recovery testing.

And finally, after all that, it’s very likely that when entropy reminded us that you really can’t predict what it’s going to do, we assembled an incident response team that professionally and efficiently worked to a resolution. A team that wrote down weird things they saw, from odd log messages to frustrating interrupts from outside the response team. And we made sure that they learned from those stories, and built up our collective wisdom.

Most importantly though, I would posit that nobody gets to any real, sustainable reliability, any honest version of more nines, without making space for everyone to feel safe to fail, listening to their experiences, and promoting the pockets of resilience and safety that inevitably exist in every organization.

We didn’t build those nines. We built our organization’s resilience.

Written by Clint Byrum

I’ve been getting paid to play with computers for 25+ years. Now I read code and design distributed systems when I’m not swinging the word hammer.