The Cost of 100% Reliability
A common Site Reliability Engineering (SRE) estimate is that the more reliability you want, the more it costs, with a rule of thumb that each additional 9 of reliability (e.g., moving from 99% to 99.9%) costs 10 times (10x) more to achieve. But what contributes to that cost increase? A recent Amazon talk on extreme reliability included examples of just where those costs stack up, and in this article I look at those examples to understand why costs increase so dramatically.
The talk I’m referring to is https://www.youtube.com/watch?v=2L1S0zfnIzo, an AWS re:Invent 2019 talk titled "Beyond five 9s: Lessons from our highest available data planes" by Colm MacCarthaigh. Colm gives lots of very interesting information on the challenges of attaining 100% reliability; I’ll be focusing on the cost-related information. Note that for this article I’m using availability as the measure of reliability.
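To make "each additional 9" concrete when availability is the measure, here is a minimal sketch of the downtime each level allows per year. It assumes a 365-day year, and the function name `downtime_minutes_per_year` is purely illustrative.

```python
# Allowed downtime per year for a given number of nines of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(nines):
    """Allowed downtime for `nines` nines of availability, e.g. 3 -> 99.9%."""
    return MINUTES_PER_YEAR / 10 ** nines


for nines in range(2, 6):
    availability = 100 - 100 / 10 ** nines
    minutes = downtime_minutes_per_year(nines)
    print(f"{availability:.3f}% available -> {minutes:,.1f} minutes of downtime per year")
```

Each extra 9 cuts the allowed downtime by a factor of ten, from roughly 3.65 days per year at 99% to a little over 5 minutes per year at 99.999%.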
The first piece of cost-related information from Colm’s talk is that 100% reliability needs extensive testing, so much so that over 90% of development time is spent developing and running tests, simulations and models. Our industry states pretty consistently that the majority of projects spend a quarter to a third of their time on testing. So let’s take a simple example and see how this pans out.
Developing a Feature With Good Reliability vs 100% Reliability
Say a feature is estimated to take 20 days of development (including design). Add a further 10 days for testing (using the top-end industry estimate of one-third of the time), and you have a total cost of 30 days for the good reliability feature. For the 100% reliability feature, we need much more testing: around 200 days, going by the figure from Colm’s talk (200 of 220 days is just over 90%). That means a total of 30 days for adding a feature with good reliability becomes 220 days for 100% reliability, which is more than seven times the cost. These are just rough estimates, but they are conservative and indicate how development costs approach a 10x increase.
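The arithmetic in this example can be checked with a short sketch. The day counts are the ones used in the text; the constant and function names are hypothetical and only there for illustration.

```python
# Rough development-effort comparison using the day counts from the text.
DEV_DAYS = 20            # design + development estimate
GOOD_TEST_DAYS = 10      # testing at the top-end industry share of one third
EXTREME_TEST_DAYS = 200  # testing figure used in the text for 100% reliability


def effort(dev_days, test_days):
    """Return (total days, share of the total spent on testing)."""
    total = dev_days + test_days
    return total, test_days / total


good_total, good_share = effort(DEV_DAYS, GOOD_TEST_DAYS)
extreme_total, extreme_share = effort(DEV_DAYS, EXTREME_TEST_DAYS)
ratio = extreme_total / good_total

print(f"good reliability: {good_total} days, {good_share:.0%} testing")
print(f"100% reliability: {extreme_total} days, {extreme_share:.0%} testing")
print(f"cost ratio: {ratio:.1f}x")
```

With these inputs the testing share for the 100% reliable feature comes out at about 91%, in line with the "over 90%" figure from the talk, and the overall ratio is a little over 7x.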
The second piece of information from Colm’s talk relating to cost is the level of redundancy needed for a service to be 100% reliable. He gives many examples of different configurations that limit the blast radius, but reducing these to a minimum means that:
- The minimum starting point is always having the capacity to handle peak load. The additional cost depends on your application: some have peak loads that are many multiples of normal load, but across the industry peak load would on average be at least double the average load.
- You then need to double that capacity to provide A/B features. Normal A/B testing can be carried out on a small subset of traffic, but 100% reliability means you may need to turn off either one of the options with no traffic interruption, so at least 100% additional capacity is needed.
- You then need to add enough zones and regions that your application is never down. This will depend on your application, and some already need many regions, but at a minimum it implies at least 3 zones in each of at least 3 regions. Potentially you can combine some of the A/B capacity doubling with this zone and region redundancy if the regional capacity is needed only for redundancy (rather than for regional latency). But bear in mind that the norm for the industry is to need a minimum of two zones in two regions, so going from 2x2 to 3x3 is likely to more than double capacity again.
Running a Feature With Good Reliability vs 100% Reliability
Taken together, the additional redundancy above implies that the capacity cost of running a feature with 100% reliability is around 10x the cost of running one with good reliability, and more wherever peak load or regional needs exceed these minimums.
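As a rough sketch, the minimum multipliers from the list above can be combined as follows. This makes two simplifying assumptions that are mine rather than the talk's: the factors multiply independently, and capacity scales with the number of zone and region placements. All names are illustrative.

```python
# Combine the minimum capacity multipliers listed in the text.
# Assumptions (not from the talk): the factors multiply independently and
# capacity scales linearly with the number of zone/region placements.
PEAK_OVER_AVERAGE = 2.0  # peak load at least double the average load
AB_HEADROOM = 2.0        # either A/B variant must be able to carry all traffic

GOOD_ZONES, GOOD_REGIONS = 2, 2        # industry norm cited in the text
EXTREME_ZONES, EXTREME_REGIONS = 3, 3  # minimum cited for 100% reliability

placement_factor = (EXTREME_ZONES * EXTREME_REGIONS) / (GOOD_ZONES * GOOD_REGIONS)
capacity_multiplier = PEAK_OVER_AVERAGE * AB_HEADROOM * placement_factor

print(f"zone/region factor: {placement_factor}x")
print(f"combined capacity multiplier: {capacity_multiplier}x")
```

On these minimum figures the combined multiplier is 9x, which is the same order of magnitude as the 10x quoted here. Any application with a higher peak-to-average ratio or more regions pushes it above that.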
It’s frequently stated that 80% of the cost of a project is maintenance. Colm’s talk doesn’t directly reference maintenance, but the level of testing and monitoring needed for a 100% reliable feature implies that maintenance is likely to cost a similar order of magnitude (10x) more, since maintenance work has similar requirements to feature development.
Taken together, so far we have 100% reliability costing at least 10x the cost of good reliability. That doesn’t seem that much more expensive, especially when you look at the reliability cost table here, which suggests we should be looking at 1000x the cost compared to good reliability.
But there is an implicit cost I’ve missed here: the velocity of changes. The time taken to do all that testing is not just 10x more expensive, it’s also 10x longer, which means feature release is 10x slower. While this is not a direct cost in terms of money paid out, it is a direct cost in terms of not being able to push out features at any reasonable pace. All competitors will be pushing out features much faster, and for many businesses that cost would drive them out of business as their competitors leapfrog and accelerate past them. It’s hard to price this cost overhead, but an additional 10x would be a minimum estimate, bringing us to 100x the cost compared to good reliability.
And I’ve skipped yet another essential cost-relevant part of Colm’s talk: simplicity. To gain 100% reliability, you have to have a very simple system so that you can reason clearly about it (and ideally even formally prove it). This implies that most features will get rejected, which is yet another cost to the system, similar to the change velocity. On top of that, many of these features will need to go through the development cycle just to determine whether they can be added, need to be rejected, or need to be built a different way that doesn’t impact the core 100%-reliable flow. Again, it’s hard to price this cost overhead, but an additional 10x would be a reasonably conservative estimate, bringing us finally to 1000x the cost compared to good reliability.
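The way these estimates compound can be sketched in a few lines. The three 10x factors are the rough figures argued for above, and the labels and names are just illustrative.

```python
from math import prod

# The three rough 10x factors argued for in the text, applied in order.
COST_FACTORS = {
    "direct cost (development, capacity, maintenance)": 10,
    "reduced change velocity": 10,
    "simplicity and rejected features": 10,
}

running = 1
for name, factor in COST_FACTORS.items():
    running *= factor
    print(f"after {name}: {running}x")

total_multiplier = prod(COST_FACTORS.values())
print(f"total: {total_multiplier}x the cost of good reliability")
```

Because the factors multiply rather than add, three separate order-of-magnitude overheads are enough to reach 1000x.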
So a 10x cost for each additional 9 of reliability is, if anything, a conservative estimate. The costs add up from additional testing, additional redundancy, over-provisioning, reduced change velocity, reduced feature availability, and additional complexity in developing features that need to avoid the higher-reliability flow.
Here in the Reliability Engineering team at Expedia Group™, we need to look at both the cost and the efficiency of our applications to achieve their targeted reliability. So understanding where costs add up for higher reliability is useful when it comes to advising our development teams. Part of that advice is to have the SREs involved earlier, precisely so that this type of cost vs reliability trade-off can be considered when it makes the biggest impact: before teams have started to engineer for too high or too low a reliability.