Embracing Risk

A story of how Reliability Engineering is about deciding exactly how unreliable you want to be.

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

This is commentary on the first part of Chapter 3: Embracing Risk.


Embracing Risk

Written by Marc Alvidrez
Edited by Kavita Guliani
You might expect Google to try to build 100% reliable services — ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness — with features, service, and performance — is optimized.
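The smartphone claim is easy to check with a little arithmetic. This is a minimal sketch, assuming the device and the service fail independently and that the user needs both to work; the function name is illustrative and not from the book.

```python
from math import prod


def end_to_end_availability(*component_availabilities: float) -> float:
    """Availability of a chain in which every component must work.

    Assumes failures are independent, which is a simplification.
    """
    return prod(component_availabilities)


DEVICE = 0.99  # the "99% reliable smartphone" from the text

with_four_nines = end_to_end_availability(DEVICE, 0.9999)
with_five_nines = end_to_end_availability(DEVICE, 0.99999)

print(f"four nines behind the phone: {with_four_nines:.4%}")
print(f"five nines behind the phone: {with_five_nines:.4%}")
print(f"difference the user could see: {with_five_nines - with_four_nines:.4%}")
```

Under these assumptions, the extra nine changes what the user experiences by less than 0.01 percentage points, because the device dominates.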

For many users, short outages are indistinguishable from a WiFi router being slow. It’s only after a service has been down for a few minutes that anyone really notices.

A fun activity, after a generally reliable and highly popular service has had a short outage, is to jump on https://search.twitter.com/ and see whether anyone noticed! This is generally reserved for those who work at the company, because a 30-second global outage will usually go totally unnoticed, and the world will never know.

As you can imagine, your reliability targets will differ greatly depending on how many users depend on you and how badly they are affected when your service is unavailable.

Here we will discuss why and how to manage and measure risk, and in the next post, discuss how to establish reasonable targets.

Managing Risk

Unreliable systems can quickly erode users’ confidence, so we want to reduce the chance of system failure. However, experience shows that as we build systems, cost does not increase linearly as reliability increments — an incremental improvement in reliability may cost 100x more than the previous increment. The costliness has two dimensions:

The cost of redundant machine/compute resources

The cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space for us to store parity code blocks that provide a minimum data durability guarantee.

How many times have you seen an IT system with a single webserver, a single database server, or even a single load balancer? Lack of redundancy is both a cause of outages and a huge factor in the Mean Time To Recovery (MTTR).

It costs a lot to have redundant systems, but it costs a lot more to buy, deliver and configure replacement equipment when the system is already down and you’re losing money!

One of the ways that Google, and SRE at Google, make sure that services stay up all the time is by having redundant systems that are in constant use, so when one fails, the remainder pick up the slack immediately.

Additionally, we run planned maintenance continuously, on a schedule that means that at any time all or part of a datacenter could be offline with little notice.

(Important: We decide where our services run based on our maintenance schedule to make sure that we never have scheduled downtime for all or most of the serving capacity: we embrace risk — we don’t invite it!)

Because we embrace risk in this way, we have continuous small outages, but we never tolerate a ‘Single Point of Failure’ or accept the idea of any user-facing system being ‘Down for daily/weekly/monthly Maintenance’.

You don’t see a message on www.google.com saying that you can’t do a search right now because we’re doing scheduled maintenance. That’s just not how we do things.

The opportunity cost

The cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.

Inside Google SRE, we sometimes manage the opportunity cost issue by having the SREs do the work to build features that ‘diminish risk’. But not always: often this is shared by our dev team in a balanced way. We do this in constant communication with product management and leadership.

The pact is this: we won’t ask you to divert development resources unless we can show that the service is not meeting expectations, and as long as it is meeting them, SRE will carry the remaining risk.

In SRE, we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos. Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs. In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk taking.

Fun fact: when a system is too reliable, we organize planned outages of it, just so that its clients get used to the fact that it can break and engineer their services to degrade gracefully!

For example, during the review of one system, a concern came up: the system was working well, but it had some potential failure modes that would take a while to resolve. The system was internal-facing, and if it went down there were alternatives that were just as useful, just less convenient.

If a system gets too reliable, then the team who runs it feels like they need to keep it that reliable, even though there are potential failure modes that are very expensive to mitigate.

In order to make sure that we didn’t fall into the trap of having to do lots of extra development and maintenance to make the system more reliable than it needed to be, the advice was given: “Attempt to reach 99% reliability. If you fail, that’s okay, if you succeed, have planned downtime to get your reliability down to 99%.”

This advice to make sure that the system was down sufficiently often came from senior Google SRE leadership: Reliability Engineering is sometimes about choosing how unreliable to be.
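To make the “get your reliability down to 99%” advice concrete, here is a rough sketch of how much planned downtime would be left to schedule in a period. It assumes a time-based budget over a 30-day window; the function name and the example numbers are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def planned_downtime_to_schedule(
    target: float,
    unplanned_downtime_minutes: float,
    period_minutes: float = MINUTES_PER_30_DAYS,
) -> float:
    """Minutes of deliberate downtime that would use up the remaining budget.

    Returns 0 when unplanned outages have already spent the whole budget.
    """
    budget = (1.0 - target) * period_minutes
    return max(0.0, budget - unplanned_downtime_minutes)


# Hypothetical month: 100 minutes of real outages against a 99% target.
print(planned_downtime_to_schedule(0.99, 100))
# Hypothetical bad month: 500 minutes of real outages, nothing left to plan.
print(planned_downtime_to_schedule(0.99, 500))
```

A 99% target over 30 days allows roughly 432 minutes of downtime, so the first case leaves about 332 minutes to spend on planned outages.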

This is fundamentally different to planned downtime for maintenance. This is planned downtime to test that dependencies degrade gracefully. This philosophy can be applied to any system: websites, APIs, storage systems, networks.

As an alternative approach, see Chaos Monkey from Netflix, which attempts to deal with the same issue.


Measuring Service Risk

As standard practice at Google, we are often best served by identifying an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time. For service risk, it is not immediately clear how to reduce all of the potential factors into a single metric. Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and undesirable press coverage. Clearly, some of these factors are very hard to measure. To make this problem tractable and consistent across many types of systems we run, we focus on unplanned downtime.

Unplanned downtime is distinct from the planned maintenance discussed above. Deliberately planned downtime, however, comes out of the same failure budget.

With planned downtime for maintenance, you might have a goal such as “Load balancers will be serving traffic 99.999% of the time.” This means that you need enough load balancers serving users that you can occasionally take some out for maintenance without service disruption. It does not mean that all your load balancers are serving all the time.

For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order of magnitude improvement toward 100% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime (see Time-based availability).

I run a system that’s typically “three and a half nines”, with some subsystems that are “four nines”.

Running a “five nines” system requires a huge investment, and five nines is about as reliable as a serving system can ever expect to be. It’s both incredibly difficult and an amazing achievement. That’s part of why we named our SRE lounge at GCP Next ’17 “The Fifth Nine”.

Benjamin Lutch and Ben Treynor Sloss under the sign for The Fifth Nine at GCP Next 2017.

Time-based availability

availability = uptime / (uptime + downtime)

Using this formula over the period of a year, we can calculate the acceptable number of minutes of downtime to reach a given number of nines of availability. For example, a system with an availability target of 99.99% can be down for up to 52.56 minutes in a year and stay within its availability target; see Availability Table for a table.
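The 52.56-minute figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year; the function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(
    availability_target: float,
    period_minutes: float = MINUTES_PER_YEAR,
) -> float:
    """Downtime that still meets a time-based availability target."""
    return (1.0 - availability_target) * period_minutes


for target in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} allows {minutes:.2f} minutes of downtime per year")
```

Each extra nine divides the yearly allowance by ten: about 525.6 minutes at three nines, 52.56 at four, and a little over 5 at five.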

99.95% availability is 21 minutes a month. This is basically burned into my brain. My pager has a 5 minute response time requirement (“time to keyboard”), so if my product has 1 global outage a month, we have a budget of 16 minutes to diagnose and resolve it.

Fortunately, because of the way we account for failures, an outage that takes down only half the system burns budget half as fast, which doubles the time we have to recover. If 10% is down, we’ve got plenty of time to fix it and still meet expectations!

In reality, we fix everything as fast as we can, because that gives us more budget for the rest of the month.
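Here is a rough sketch of that monthly budget arithmetic. It assumes a 30-day month and assumes that downtime is weighted by the fraction of the system affected, which is one plausible reading of the accounting described above; the names are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_outage_minutes(target: float, fraction_down: float = 1.0) -> float:
    """Wall-clock minutes a single outage may last within a 30-day budget.

    fraction_down is the share of the system that is failing (0 < f <= 1).
    A partial outage is assumed to burn budget proportionally more slowly.
    """
    if not 0.0 < fraction_down <= 1.0:
        raise ValueError("fraction_down must be in (0, 1]")
    budget = (1.0 - target) * MINUTES_PER_30_DAYS
    return budget / fraction_down


RESPONSE_MINUTES = 5  # the pager "time to keyboard" from the text

for fraction in (1.0, 0.5, 0.1):
    total = allowed_outage_minutes(0.9995, fraction)
    print(
        f"{fraction:.0%} down: {total:.1f} minutes in total, "
        f"{total - RESPONSE_MINUTES:.1f} after the pager response"
    )
```

A full outage gets about 21.6 minutes, a half outage about 43.2, and a 10% outage about 216, which is why partial failures are so much more forgiving.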

At Google, however, a time-based metric for availability is usually not meaningful because we are looking across globally distributed services. Our approach to fault isolation makes it very likely that we are serving at least a subset of traffic for a given service somewhere in the world at any given time (i.e., we are at least partially “up” at all times). Therefore, instead of using metrics around uptime, we define availability in terms of the request success rate. Aggregate availability shows how this yield-based metric is calculated over a rolling window (i.e., proportion of successful requests over a one-day window).

I own systems that calculate uptime in this manner, as well as the query success system listed below. I will also illustrate how these principles can be combined.

Aggregate availability

availability = successful requests / total requests

For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.
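The 250-error figure can be checked directly. This sketch uses Python’s decimal module so that the boundary case is exact (plain floats would round 250 down to 249); the function names are illustrative.

```python
from decimal import Decimal


def allowed_errors(total_requests: int, target: str) -> int:
    """Largest number of failed requests that still meets the target.

    The target is passed as a string (e.g. "0.9999") to keep it exact.
    """
    return int(Decimal(total_requests) * (Decimal(1) - Decimal(target)))


def meets_target(successful: int, total: int, target: str) -> bool:
    """True if the request success rate is at or above the target."""
    return Decimal(successful) / Decimal(total) >= Decimal(target)


budget = allowed_errors(2_500_000, "0.9999")
print(budget)
print(meets_target(2_500_000 - budget, 2_500_000, "0.9999"))
print(meets_target(2_500_000 - budget - 1, 2_500_000, "0.9999"))
```

With exactly 250 errors the day still meets 99.99%; one more error and it does not.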

250 errors could be caused by anything: CRC errors on a network card, one-in-a-million code bugs, or a system losing power at exactly the wrong moment. When you deal with this kind of error budget, you really appreciate the need for retries and multiple failure domains.

A constant low rate of failures causes a “slow burn” of your availability. So it’s important to deal both with issues that are low-impact and continuous and with issues that cause your availability to dip suddenly.
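To see how much a “slow burn” costs, compare the background error rate with the error rate the target allows. This is a minimal sketch with made-up numbers; the function name is hypothetical.

```python
def budget_share_consumed(background_error_rate: float, target: float) -> float:
    """Fraction of the error budget eaten by a constant background error rate."""
    return background_error_rate / (1.0 - target)


# Hypothetical: 1 request in 20,000 fails all day, against a 99.99% target.
share = budget_share_consumed(1 / 20_000, 0.9999)
print(f"{share:.0%} of the budget is gone before any incident happens")
```

In this example the background failures alone use about half of the budget, so a sudden dip has only half as much room before the target is missed.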

In a typical application, not all requests are equal: failing a new user sign-up request is different from failing a request polling for new email in the background. In many cases, however, availability calculated as the request success rate over all requests is a reasonable approximation of unplanned downtime, as viewed from the end-user perspective.

Typically, high-value requests are broken out into a separate metric with tighter requirements. It’s best to make the thing you measure, and require to be reliable, an accurate representation of user pain.

It is entirely fine to have a reliability goal of 99.9% for correctly serving a webpage to a user, and then a 95% reliability goal for the little notification bell on the top right loading properly. If the page doesn’t load, that’s user-pain. If the user doesn’t get notified that another person followed them on medium.com within 10 seconds of it happening: no one will even notice.
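Splitting requests into classes with their own targets might look something like this. It is a sketch only: the class names, targets and traffic numbers are invented to match the page and notification bell example above.

```python
from dataclasses import dataclass


@dataclass
class RequestClassStats:
    """Success counts for one class of request, with its own target."""

    name: str
    target: float
    successes: int
    total: int

    @property
    def availability(self) -> float:
        return self.successes / self.total

    @property
    def meets_target(self) -> bool:
        return self.availability >= self.target


# Hypothetical traffic for one day.
classes = [
    RequestClassStats("page_load", target=0.999, successes=999_500, total=1_000_000),
    RequestClassStats("notification_bell", target=0.95, successes=96_000, total=100_000),
]

for c in classes:
    status = "ok" if c.meets_target else "MISSED"
    print(f"{c.name}: {c.availability:.2%} against {c.target:.2%} -> {status}")
```

The bell can fail 4% of the time here and still be fine, while the page load would miss its target at a fraction of that error rate.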

Quantifying unplanned downtime as a request success rate also makes this availability metric more amenable for use in systems that do not typically serve end users directly. Most nonserving systems (e.g., batch, pipeline, storage, and transactional systems) have a well-defined notion of successful and unsuccessful units of work. Indeed, while the systems discussed in this chapter are primarily consumer and infrastructure serving systems, many of the same principles also apply to nonserving systems with minimal modification.
For example, a batch process that extracts, transforms, and inserts the contents of one of our customer databases into a data warehouse to enable further analysis may be set to run periodically. Using a request success rate defined in terms of records successfully and unsuccessfully processed, we can calculate a useful availability metric despite the fact that the batch system does not run constantly.
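For the batch case, the same ratio can be computed over records instead of requests. This is a simplified sketch, assuming each run reports how many records it processed successfully and unsuccessfully; the data is invented.

```python
from typing import Iterable, Tuple


def batch_availability(runs: Iterable[Tuple[int, int]]) -> float:
    """Record-level success rate across batch runs.

    Each run is a (records_ok, records_failed) pair.
    """
    ok = 0
    failed = 0
    for records_ok, records_failed in runs:
        ok += records_ok
        failed += records_failed
    if ok + failed == 0:
        raise ValueError("no records were processed")
    return ok / (ok + failed)


# Hypothetical: three periodic runs of the same ETL job.
runs = [(9_990, 10), (10_000, 0), (9_980, 20)]
print(f"{batch_availability(runs):.2%}")
```

Because the metric is a ratio of records, it does not matter that the job only runs periodically.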

Measuring batch and pipeline systems is exceptionally hard to do, and the passage above understates this quite a bit. I won’t go into it here, but we will likely discuss it in Chapter 25: Data Processing Pipelines.

Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise. See Service Level Objectives for more details.

There are two more ways of monitoring systems I want to introduce here, because the above two are somewhat simplistic for some systems.

System uptime can be redefined: instead of “Time when the system is working”, use “Time when the aggregate availability of the system is above a threshold.”

For instance: imagine a system as mission critical as a caching service like memcache. Any error rate at all can have a huge impact, as a frontend request might do 50–100 memcache requests, and any failure will result in extremely slow database roundtrips. So you might call that system ‘up’ only when it’s returning 99.99% successes, and consider it ‘down’ when it drops to 99.98%. Then you might require that the memcache service be up 99.95% of the month.
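Combining the two metrics could be sketched like this: split time into windows, call a window ‘up’ only if its request success rate clears the threshold, then take the fraction of ‘up’ windows. The window size, the data and the function name are assumptions for illustration.

```python
from fractions import Fraction
from typing import Sequence, Tuple


def threshold_uptime(
    windows: Sequence[Tuple[int, int]],
    threshold: str = "0.9999",
) -> float:
    """Fraction of windows whose success rate is at or above the threshold.

    Each window is a (successful_requests, total_requests) pair.
    Windows with no traffic are skipped rather than counted as down.
    """
    limit = Fraction(threshold)  # exact, avoids float rounding at the boundary
    measured = [(ok, total) for ok, total in windows if total > 0]
    if not measured:
        raise ValueError("no windows with traffic")
    up = sum(1 for ok, total in measured if Fraction(ok, total) >= limit)
    return up / len(measured)


# Hypothetical one-minute windows for a cache service.
windows = [(10_000, 10_000), (9_999, 10_000), (9_998, 10_000), (10_000, 10_000)]
print(threshold_uptime(windows))
```

The third window is at 99.98% and therefore counts as ‘down’, so this sample is ‘up’ for three windows out of four.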

Additionally, some systems, especially systems that carry no state whatsoever (i.e. every request is treated equally) can be probed for uptime or reliability instead of counting the number of successful queries. This is an approach that’s very useful for a network switch or a load balancer, where invalid inputs should be responded to with errors, but synthetic probes should succeed every time. You can very easily set a reliability threshold of “99.99% of probes succeed.”
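A probe-based measurement can be sketched as follows. The probe here is a stand-in: a real one would send a synthetic request to the switch or load balancer, but this fake simply fails on a fixed schedule so the example runs anywhere. All names are hypothetical.

```python
from typing import Callable


def probe_success_ratio(probe: Callable[[], bool], attempts: int) -> float:
    """Run a probe repeatedly and return the fraction that succeeded.

    A probe that raises an exception is counted as a failure.
    """
    successes = 0
    for _ in range(attempts):
        try:
            if probe():
                successes += 1
        except Exception:
            pass  # treat errors as failed probes
    return successes / attempts


def make_fake_probe(fail_every: int) -> Callable[[], bool]:
    """Build a fake probe that fails on every fail_every-th call."""
    calls = 0

    def probe() -> bool:
        nonlocal calls
        calls += 1
        return calls % fail_every != 0

    return probe


ratio = probe_success_ratio(make_fake_probe(fail_every=20_000), attempts=100_000)
print(f"{ratio:.3%} of probes succeeded; meets 99.99%: {ratio >= 0.9999}")
```

In practice you would spread the probes over time and over several vantage points, since a burst of probes from one place says little about the rest of the month.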


Coming next

The second half of Chapter 3: Embracing Risk, titled “Risk Tolerance of Services”, where we discuss how to set reliability goals, partner with product, appreciate the customer, and embrace risk.