MTBF in SaaS is useless

Dave Owczarek
Dec 25, 2021


MTBF seems like such an important concept. But how do you apply it to solve availability problems in a SaaS organization?

You don’t.

But let me back up just a bit. Organizations typically calculate their service availability number from the number of minutes of downtime during a time period. The underlying data for the availability calculation (minutes of downtime, minutes of uptime, and the duration of the time period in minutes), along with the number of outages, is often used to calculate MTTR (mean time to repair) and MTBF (mean time between failures). Simply put, MTBF is the average amount of time the system ran between failures, and MTTR is the average amount of time it took to restore the service when it failed. And while MTTR is a very useful metric, MTBF is useless.
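The arithmetic behind those numbers is simple enough to sketch. Here is a minimal example of the calculation described above, using made-up outage data for a hypothetical 30-day month:

```python
# Hypothetical month of data. The outage durations are illustrative,
# not from any real system.
period_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day month
outage_minutes = [12, 45, 8]           # three outages of varying length

downtime = sum(outage_minutes)
uptime = period_minutes - downtime
n_outages = len(outage_minutes)

availability = uptime / period_minutes
mttr = downtime / n_outages            # mean time to repair
mtbf = uptime / n_outages              # mean time between failures

print(f"availability: {availability:.4%}")
print(f"MTTR: {mttr:.1f} min, MTBF: {mtbf:.0f} min")
```

Note that everything here is backward-looking bookkeeping on the same three inputs; nothing in it models why or when the next failure will happen.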

The Reliability Function

The MTBF we want is from the reliability engineering world — a calculation of component failure rates used to derive an MTBF (or MTTF) for a system. That MTBF has an associated probability distribution (failure curve) based on characteristics of the components. There are failure curves for constant failure rates (CFR), Weibull distributions, and so on. The point is, your application has some kind of failure curve and you probably don’t know what it is or how to calculate it. And even if we could derive a theoretical MTBF, what would we do with it? How does it even relate to the MTBF we measure?

What are we doing?

The MTBF we use in the SaaS world is an ex post facto calculation of empirical performance based on uptime and downtime. It is not a reliability calculation based on an underlying failure curve. Because we state MTBF in units of time, it makes our brains think it is a forward-looking prediction. It is not.

Top five reasons why MTBF is useless:

  1. Every time you change code, you are also changing behavior with respect to the underlying failure curve. Quick question — how many code releases did you do during the time over which you calculated MTBF? Are you picking up what I’m laying down? The MTBF continually changes in response to changes in the software, and that’s before you get to run-time conditions.
  2. There is no notion of “spread” with MTBF. You can get to the same value through lots of different combinations of durations and intervals. Those provide crucial context — what does MTBF represent without that additional data?
  3. Your maximum MTBF is probably the length of the time period you are measuring unless you are performing some really clever calculations. So you make it through the first month of the year with no outages and your MTBF is 31 days. You make it through the second month of the year and your MTBF is 28 days, not 59 days. How cool is that?
  4. You are not measuring the failure rate, you are calculating average uptime. These are related but different things. You can have a failure that requires massive cleanup but has little impact on the customer experience. You can have a tiny blip cause a customer outage for hours. We typically calculate availability, MTBF, and MTTR from the customer impact times, not the times of the actual system failure and recovery. Put another way, failure and recovery windows don’t align directly with customer impact and recovery windows. Since we aren’t directly measuring failure, how can we infer anything about the mean time between failures? It’s really mean time between impact.
  5. MTBF measures uptime. How do you optimize uptime when you are up? Asking how you become “more up” is like Nigel Tufnel saying, “these go to 11.”
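Point 2 is easy to demonstrate. Here is a quick sketch, with made-up outage data, showing two months with very different outage patterns that nevertheless produce an identical MTBF:

```python
# Minutes in a 30-day month.
period = 43_200

# Two illustrative months with the same total downtime and outage count
# but very different customer experiences.
steady = [10, 10, 10, 10]    # four small, evenly spread outages
lumpy = [1, 1, 1, 37]        # three blips and one long outage

def mtbf(outages, period_minutes):
    """Empirical MTBF: average uptime per outage."""
    uptime = period_minutes - sum(outages)
    return uptime / len(outages)

print(mtbf(steady, period))  # identical values...
print(mtbf(lumpy, period))   # ...for very different months
```

The single MTBF number collapses both patterns into the same value, which is exactly the missing “spread” context.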

I’m somewhat serious about that last point. Optimizing uptime by minimizing failure tends to bring up fault-tolerant, redundant architectures and all the vertical scaling problems they bring. That means designing systems that don’t fail at all rather than systems that fail fast — an expensive approach best saved for unusual circumstances. Thankfully, using largely stateless, microservice-based applications, we can now build systems where failure is expected and managed aggressively. Things fail all the time, we get more uptime, and it’s very cost effective.

Can MTBF be useful?

Now, if your manager or an executive comes to you concerned about MTBF and asking you to report on it, it’s probably best not to respond by challenging its utility as a metric. I mean, this is a golden opportunity for dialog, right? Besides, you can slip it in somewhere at the end.

A full treatment of this topic will be the subject of a future post. In the meantime, this is the best I can come up with:

MTBF is based on the number and duration of outages. If we focus on minimizing both of those things, the MTBF will take care of itself.

Follow this with a 20+ page PowerPoint presentation (drawn from all the other reliability presentations you’ve done) and then humbly conclude, “we’ve learned it’s a lot easier to drive MTTR to zero than MTBF to infinity.”

See what I did there?


Dave Owczarek

Writing about a mix of engineering, photography, recording, music, and more.