Availability, MTTR, and MTBF for SaaS Defined

Dave Owczarek
10 min read · Jan 18, 2022


Basic, yes, but often misunderstood. Let’s get it right.

I’m writing this from the perspective of site reliability engineering (SRE) and technical operations in software as a service (SaaS) organizations. While working on material for articles on root cause analysis, I encountered a lot of confusion in the existing material on reliability. SaaS is just one field that uses terms like reliability, availability, and MTTR, and much of what is written covers other domains and is not consistent with a SaaS business. This information may seem basic, but I wanted to document availability, MTTR, and MTBF to clear up the confusion that I encountered. I will also refer back to this content in other articles on service levels, outages, and associated topics.

Availability, mean time to repair (MTTR), and mean time between failures (MTBF) are important concepts in service operations and in the practice of site reliability engineering. They can be used to gain insight into service performance, but they have certain limitations. In a SaaS organization, MTTR and MTBF calculations are based on the uptime and downtime numbers recorded for a time period. That raw data is also used to calculate the availability number, arguably the most common and important service level indicator (SLI). This is an important point, because in classical reliability engineering, the definition of downtime is based on how long it takes to repair a fault in the system. In site reliability engineering, downtime is based on how long customers are impacted. Faults and downtime are related concepts, but the distinction matters: faults are common, for example, and don't always lead to downtime.

Availability

Availability is a simple ratio of uptime to total time (uptime plus downtime). Here's the basic formula:
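
Availability = uptime / (uptime + downtime)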

Availability is calculated with uptime and downtime numbers, using the time that customers were impacted as the downtime number and the time that customers were not impacted as the uptime number. This calculation is based on a preset time period; we'll use a day for our example below. That means the uptime and downtime must add up to the length of the period, 24 hours in our case.

You will find plenty of models out there that exclude some amount of maintenance time from uptime. That is, the organization allows for the site to be down for some amount of maintenance per month, with advance notice (usually in the middle of the night relative to the local time zone). In those cases, the time period also includes this maintenance:
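
In its simplest form, and assuming the maintenance window counts as neither uptime nor downtime, that variant looks something like this:

Availability = uptime / (uptime + downtime), where uptime + downtime + scheduled maintenance = the full time period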

However, I won't do that here out of personal bias. Quite simply, there should no longer be any reason to take a site down for maintenance. If you do, it's an outage, not maintenance.

What is uptime and downtime?

The last thing we need to do is define what uptime and downtime are. Let’s start with downtime.

Downtime is whenever the service level indicator (SLI) drops below the service level objective (SLO).

Uptime is the opposite:

Uptime is whenever the service level indicator (SLI) performs at or above the service level objective (SLO).

Let’s use a concrete example. We’ll use a measurement of the success rate for loading the home page of a hypothetical web site as our SLI for availability. Then we will define the SLO as 99.9%. That means that in any period of time, we expect 99.9% of the attempts to display the home page to be successful. If we get more than 1 failure out of 1,000, we are not performing above our objective, and we experience downtime. Not every organization is going to define this as downtime, but the SLO is usually set so that dropping below it indicates deterioration in the customer experience.
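
As a rough sketch of what that check looks like in practice (the minute-by-minute counts and names here are illustrative, not from any particular monitoring tool), you might evaluate the SLI per minute and count any minute below the SLO as downtime:

# Illustrative sketch: evaluate a success-rate SLI against a 99.9% SLO,
# minute by minute, and count minutes below the SLO as downtime.

SLO = 0.999  # 99.9% home page load success

def sli(successes, attempts):
    # Success-rate SLI for one measurement window.
    return successes / attempts if attempts else 1.0

# One entry per minute: (successful loads, total attempts). Made-up numbers.
minutes = [
    (10_000, 10_000),  # 100% success: uptime
    (9_920, 10_000),   # 99.2% success: below the SLO, counts as downtime
    (9_999, 10_000),   # 99.99% success: above the SLO, counts as uptime
]

downtime_minutes = sum(1 for ok, total in minutes if sli(ok, total) < SLO)
uptime_minutes = len(minutes) - downtime_minutes
print(f"downtime: {downtime_minutes} min, uptime: {uptime_minutes} min")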

I am going to use the term “outage” to generically describe the state when we are performing below the SLO. The term itself usually connotes a complete failure, but in this context, it is for any time when the SLI is below the SLO. It is also commonly called an incident — I try to avoid this term because I think it’s better suited for information security and privacy issues rather than availability issues. So in my terms, an availability problem is an outage, and a security problem is an incident.

Example

Here's an example using the SLI and SLO above for a single day. In those 24 hours, there are two outages: one lasting half an hour and one lasting 5.5 hours. Here's how it looks on a graph:

Note that in addition to different durations (width of the notch), these outages also had different severities (depth of the notch). The first outage affected everything while the second affected just under 50% at its worst. I mention this because we are not looking at relative severity yet (also known as a partial outage), just SLO violations. From that perspective, downtime is the duration when the SLI is below the SLO, regardless of how far below it is. (Recording partial outage time is the subject of a future article.)

The availability calculation for the above chart is then:
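
downtime = 0.5 + 5.5 = 6 hours
uptime = 24 - 6 = 18 hours
Availability = 18 / (18 + 6) = 18 / 24 = 75%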

Some basic properties

Remember the indicator we used in the example, home page load success? It may not be the one in use in your organization. It's also possible that the severity of the outage is factored in and the downtime number represents a partial outage. There might also be compound calculations where several systems are combined to get an overall number. Leaving those more advanced issues aside, here are some basic properties of availability:

  • It is typically calculated using minutes, not hours
  • It is reported by month, quarter, or year rather than by day
  • The availability SLO is expressed using a number of nines (and fives): 99.9%, 99.95%, etc.
  • It is often reported to customers
  • It is also often guaranteed contractually, with a service level agreement (SLA) tied to it that defines refunds or other remedies for missing the service level

As mentioned above, downtime represents the time when customers were impacted, rather than the time the system was in a faulty state. There are plenty of circumstances where a system is in a state of fault, but the customer experience is preserved. This notion is also reinforced by the idea of error budgets. You may be in a state of fault, but if that fault doesn't push the SLI below the SLO, the system is considered healthy from the customer experience perspective and there is no penalty in the form of recorded downtime. Just to make things more confusing, the availability SLI is often generically called the "SLA" even though it may not have anything in common with contractual SLAs that your organization has with customers.
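
To put a number on the error budget idea: a 99.9% availability SLO over a 30-day month allows roughly 43 minutes of downtime (0.1% of 43,200 minutes) before the objective is missed.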

Uptime and downtime don't respect calendar reporting periods. In any given month, your application has already accumulated uptime before the month begins. That is not counted in the current month. Likewise, if there is an outage, there will be uptime after it. Both of those are partial segments of time, since they are not bounded by two failures within the month. We count it all even though we will almost never get an equal number of downtime and uptime segments. This was the case in our example, where we had three uptime segments and two downtime segments. This will introduce some artifacts, as we will see later. There is a much better way to count MTBF, but it's more complex, so we'll leave it for another day.

MTTR & MTBF

MTTR is the Mean Time To Repair. It is the average length of a downtime segment.

MTBF is the Mean Time Between Failures. It is the average length of an uptime segment.

For any given period of time:
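
MTTR = total downtime / number of downtime segments (outages)
MTBF = total uptime / number of uptime segments

As noted above, the partial uptime segments at the start and end of the period are included in that count.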

Coming back to our example above, the calculations for MTTR and MTBF are:
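
MTTR = (0.5 + 5.5) / 2 = 3 hours
MTBF = 18 / 3 = 6 hours (18 hours of uptime spread over three uptime segments)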

Availability from MTTR and MTBF

Availability can be calculated directly from MTTR and MTBF as well:
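
The usual form of that relationship is:

Availability = MTBF / (MTBF + MTTR)

With the period-based counting used here, where the number of uptime and downtime segments is rarely equal, this gives an approximation rather than an exact match to the uptime / (uptime + downtime) number; that is one of the artifacts mentioned earlier.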

I often hear MTTR conflated with TTR, as in, "what was the MTTR of that outage?" But the M in MTTR stands for arithmetic mean. If there is only one outage, you can't have an average; it's just a duration. (You can say TTR without the M, though, at the risk of another TLA.)

I mentioned above that availability does not respect the calendar, and that you can have uptime before and after the period you are measuring that will ultimately affect the measurement. The practical effect of this is that the maximum MTBF is the length of the time period: if you have no outages, your MTBF in a 30-day month is 30 days with this method, even if the system has been running for six months without a fault. You definitely want to look at it quarterly and annually to get better insight. The longer the time period, the less important the error introduced by the partial segments of time at the beginning and end of each period.
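
Here is a minimal sketch of this period-based bookkeeping (the function name and outage lists are illustrative, not from any particular tool); it reproduces the example day above and shows the MTBF cap when a month has no outages:

def period_metrics(outage_minutes, period_minutes):
    # Total downtime is the sum of outage durations; uptime is the rest of the period.
    downtime = sum(outage_minutes)
    uptime = period_minutes - downtime
    availability = uptime / period_minutes
    # MTTR: average downtime segment; undefined if there were no outages.
    mttr = downtime / len(outage_minutes) if outage_minutes else None
    # MTBF: average uptime segment. N outages split the period into N + 1
    # uptime segments (counting the partial ones at each end), so with zero
    # outages MTBF is simply the length of the reporting period.
    mtbf = uptime / (len(outage_minutes) + 1)
    return availability, mttr, mtbf

# The example day: a 30-minute outage and a 5.5-hour outage in 24 hours.
print(period_metrics([30, 330], 24 * 60))    # (0.75, 180.0, 360.0) -> 75%, 3 h, 6 h
# A 30-day month with no outages: MTBF is capped at the period length.
print(period_metrics([], 30 * 24 * 60))      # (1.0, None, 43200.0)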

Interpreting MTTR

It is important to note that these two quantities are calculations of historical performance based on customer impact. MTBF is a notably vacuous metric, because there’s not much you can do about it directly. MTTR is a different story. Time to repair tells you how long your service has been down when it has an outage. Put another way, it tells you the average time your customers have been affected. More importantly, when MTTR goes down, availability goes up. This is a simple and pragmatic way to look at it. Anything you can do to shorten the time of an outage helps your MTTR and availability numbers.

Even simple, obvious tricks can make a difference in MTTR. If you shorten the on-call response time goal from 15 minutes to 5 minutes, you potentially shave 10 minutes off every outage. Of course, it may not be that simple, but the idea here is that anything you can do to restore service faster, even if it’s unrelated to the root cause, decreases MTTR and increases availability.
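
To make that concrete with the example day above: shaving 10 minutes off each of the two outages cuts total downtime from 6 hours to 5 hours 40 minutes, drops MTTR from 3 hours to 2 hours 50 minutes, and lifts availability from 75% to roughly 76.4%.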

Time to repair is something that can be directly addressed in many ways — here are three obvious ones:

  1. Fixing all the root and contributing causes discovered during the root cause or post-incident analysis
  2. Adjusting monitoring systems to catch the condition more accurately, providing either more time to respond or an automated fix
  3. Evolving the architecture to provide faster failure capabilities or higher redundancy. (Higher redundancy is actually an argument based on MTBF, and probably the only one I’ll make.)

Interpreting MTBF

Because MTBF is described in units of time (minutes, hours), our brains want it to be an estimate of when we can expect failure to occur. But MTBF as we have calculated it here does not serve that function. In reliability engineering, MTBF can be used in this way, but it is derived from careful analysis of the reliability of all the interrelated parts of the system. It's also derived in the context of a reliability curve, a probability distribution that describes the behavior of the system with respect to component failure. It's not sufficient to say the MTBF has a value of 9 hours; it's 9 hours at a particular point on the reliability curve.

But none of that matters here, because that is not what we are doing. We are basing our calculations on recorded customer impact data, not the fault and repair times of all the underlying components. As such, honestly, MTBF is difficult to interpret directly. Certainly, for a fixed reporting period, when MTBF goes up, total downtime (and with it MTTR) generally comes down. However, increasing MTBF is much harder than decreasing MTTR.

The old-school way to increase MTBF was to build a large, vertically scaling system with multiple redundancies. It was expensive; the rule of thumb is that you pay 10 times more for every '9' you add to the SLO. But there are some cases where a vertically scaled element is necessary or even desired. So MTBF can be used to prompt the question: do I build this to high scale and high redundancy, do I build it as a horizontally scaled, fast-failing system, or do I go with some hybrid?

MTBF is also important from the perspective of the on-call teams. A low MTBF means lots of outages. Outages carry a lot of process overhead, are stressful, and are mostly toil. The MTBF in this context is a proxy for how good (or bad, in this case) the on-call experience is for the team supporting the application.

Bottom line

Availability is a fundamental SaaS concept. It drives the most visible performance reporting, is seen by customers, and often triggers contractual remedies when it falls short. Availability calculations get much more complex with multiple services and service dependencies. But for each element of the service being measured, there will be one or more SLI/SLO pairs at the core of the calculations.

MTTR is an extremely important number because it directly drives availability. If MTTR goes up, availability goes down, end of story. In practice, that means you minimize the length of each outage (which is what MTTR measures) and, separately, the number of outages. Those are tangible, measurable goals.

MTBF offers few insights that aren’t obvious to the people performing site reliability and service management tasks. MTBF can be used to explore architectural trade-offs such as the scaling model (vertical or horizontal), redundancy strategy, handling of persistent data, etc., particularly if one of these items has actually contributed to an outage. From a more humanistic perspective, if the MTBF is low, your staff is probably unhappy and stressed.
