Embracing Risk

A story of how Reliability Engineering is about deciding exactly how unreliable you want to be.

Stephen Thorne
Mar 23, 2017 · 11 min read

I am a Site Reliability Engineer at Google, annotating the SRE book in a series of posts. The opinions stated here are my own, not those of my company.

This is commentary on the first part of Chapter 3: Embracing Risk.


Embracing Risk

For many users, short outages are indistinguishable from a WiFi router being slow. It’s only after a service has been down for a few minutes that anyone really notices.

A fun activity, when a generally reliable and highly popular service has had a short outage, is to jump on https://search.twitter.com/ and see if anyone noticed! It’s an activity mostly reserved for those who work at the company, because a 30-second global outage will usually go completely unnoticed, and the world will never know.

You can imagine that, depending on how many users depend on your service and how badly they are affected when it is unavailable, your reliability targets will be very different.

Here we will discuss why and how to manage and measure risk, and in the next post, discuss how to establish reasonable targets.

Managing Risk

The cost of redundant machine/compute resources

How many times have you seen an IT system with a single web server, a single database server, or even a single load balancer? Lack of redundancy is both a cause of outages and a huge factor in Mean Time To Recovery (MTTR).

It costs a lot to have redundant systems, but it costs a lot more to buy, deliver, and configure replacement hardware when the system is already down and you’re losing money!

One of the ways that Google, and SRE at Google, make sure that services stay up all the time is by having redundant systems that are in constant use, so when one fails, the remainder pick up the slack immediately.

Additionally, we run planned maintenance windows continuously, on a schedule which means that at any time all or part of a datacenter could be offline with little notice.

(Important: We decide where our services run based on our maintenance schedule to make sure that we never have scheduled downtime for all or most of the serving capacity: we embrace risk — we don’t invite it!)

Because we embrace risk in this way, we have continuous small outages, but we never tolerate a ‘Single Point of Failure’ or accept the idea of any user-facing system being ‘Down for daily/weekly/monthly Maintenance’.

You don’t see a message on www.google.com saying that you can’t do a search right now because we’re doing scheduled maintenance. That’s just not how we do things.

The opportunity cost

Inside Google SRE, we sometimes manage the opportunity cost by having the SREs do the work to build features that ‘diminish risk’. But not always: often this work is shared with our dev team in a balanced way, and we do it in constant communication with product management and leadership.

The pact is: we won’t ask you to divert development resources unless we can show you’re not meeting reliability expectations, but as long as you are, SRE will carry the remaining risk.

Fun fact: when a system is too reliable, we organize planned outages of it, just so the clients of that system get used to the fact that it can break, and engineer their services to degrade gracefully!

For example, during the review of one system, there was a concern that although the system was working well, there were some potential failure modes that would take a while to resolve. The system was internal-facing, and if it went down there were alternatives that were just as useful, only less convenient.

If a system gets too reliable, then the team who runs it feels like they need to keep it that reliable, even though there are potential failure modes that are very expensive to mitigate.

In order to make sure that we didn’t fall into the trap of having to do lots of extra development and maintenance to make the system more reliable than it needed to be, the advice was given: “Attempt to reach 99% reliability. If you fail, that’s okay, if you succeed, have planned downtime to get your reliability down to 99%.”
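
As a back-of-the-envelope sketch of what “have planned downtime to get your reliability down to 99%” means in practice, here is a small Python example. The formula is my own simplification, assuming a 30-day month; it isn’t an official tool.

    # Hypothetical sketch: how much downtime to schedule to bring an
    # over-reliable system back down to its target, assuming a 30-day month.

    MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

    def downtime_to_schedule(measured_availability, target_availability=0.99):
        """Minutes of planned downtime needed to land on the target."""
        allowed = MINUTES_PER_MONTH * (1 - target_availability)
        already_used = MINUTES_PER_MONTH * (1 - measured_availability)
        return max(0.0, allowed - already_used)

    # A system running at 99.95% against a 99% target has roughly
    # 432 - 21.6 = ~410 minutes of budget left to burn this month.
    print(downtime_to_schedule(0.9995))  # ~410.4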

This advice to make sure that the system was down sufficiently often came from senior Google SRE leadership: Reliability Engineering is sometimes about choosing how unreliable to be.

This is fundamentally different to planned downtime for maintenance. This is planned downtime to test that dependencies degrade gracefully. This philosophy can be applied to any system: websites, APIs, storage systems, networks.

As an alternative approach, see Chaos Monkey from Netflix, which attempts to deal with the same issue.


Measuring Service Risk

Unplanned downtime is quite distinct from the planned maintenance discussed above, but planned downtime would come out of this same failure budget.

With planned downtime for maintenance, you might have a goal such as “Load balancers will be serving traffic 99.999% of the time.” This means that you have to have enough load balancers serving users to take some out for maintenance occasionally without service disruption. Not that all your load balancers are serving all the time.
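
Here is a minimal sketch of that kind of capacity check. The numbers and the function are hypothetical, not how Google actually plans load balancer capacity.

    # Hypothetical numbers: checking that a load balancer fleet can still
    # absorb peak traffic while some machines are out for planned maintenance.

    def fleet_can_absorb_peak(num_lbs, capacity_per_lb_qps, peak_demand_qps,
                              max_in_maintenance=1):
        """True if the remaining load balancers can serve the peak alone."""
        remaining = num_lbs - max_in_maintenance
        return remaining * capacity_per_lb_qps >= peak_demand_qps

    # Four load balancers at 10,000 qps each can cover a 25,000 qps peak
    # even with one of them down for maintenance.
    print(fleet_can_absorb_peak(4, 10_000, 25_000))  # True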

I run a system that’s typically “three and a half nines”. With some subsystems that are “four nines”.

Running a “five nines” system requires a huge investment, and is about as reliable as a serving system can ever expect to achieve. It’s both incredibly difficult and an amazing achievement. That’s part of why we named our SRE lounge at GCP Next ’17 “The Fifth Nine”.

[Photo: Benjamin Lutch and Ben Treynor Sloss under the sign for The Fifth Nine at GCP Next 2017.]

Time-based availability

availability = uptime / (uptime + downtime)

99.95% availability is 21 minutes a month. This is basically burned into my brain. My pager has a 5 minute response time requirement (“time to keyboard”), so if my product has 1 global outage a month, we have a budget of 16 minutes to diagnose and resolve it.

Fortunately, the way we account for failures, if half the system is down, then that doubles our recovery budget. If 10% is down, we’ve got plenty of time to fix it and still meet expectations!

In reality, we fix everything as fast as we can, because that gives us more budget for the rest of the month.
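
To make the arithmetic above concrete, here is a minimal Python sketch. The numbers mirror the ones in this post, and the partial-outage weighting is my simplified reading rather than an official formula.

    # A sketch of the downtime-budget arithmetic above.

    MINUTES_PER_MONTH = 30 * 24 * 60              # 43,200 minutes

    def downtime_budget_minutes(availability_target):
        return MINUTES_PER_MONTH * (1 - availability_target)

    def budget_spent(outage_minutes, fraction_of_system_down=1.0):
        """Weight partial outages by the fraction of the system affected."""
        return outage_minutes * fraction_of_system_down

    budget = downtime_budget_minutes(0.9995)                  # ~21.6 minutes
    spent = budget_spent(outage_minutes=16, fraction_of_system_down=0.5)
    print(budget, spent, budget - spent)                      # ~21.6 8.0 ~13.6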

I own systems that calculate uptime in this manner, as well as systems measured by query success as described below. I will also illustrate how these principles can be combined.

Aggregate availability

availability = successful requests / total requests

250 errors could be caused by anything, from CRC errors on a network card, to one-in-a-million code bugs, to a system losing power at exactly the wrong moment. When you deal with this kind of error budget, you really appreciate the need for retries and multiple failure domains.
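
For context, the 250 comes from the book’s worked example: a system serving 2.5 million requests a day with a 99.99% daily availability target can serve up to 250 errors and still hit that target. A minimal sketch of that arithmetic:

    # Aggregate availability = successful requests / total requests.
    # The 250 comes from the book's example: 2.5 million requests a day
    # against a 99.99% daily availability target.

    def error_budget(total_requests, availability_target):
        return total_requests * (1 - availability_target)

    def aggregate_availability(successful_requests, total_requests):
        return successful_requests / total_requests

    print(error_budget(2_500_000, 0.9999))               # ~250 errors allowed
    print(aggregate_availability(2_499_750, 2_500_000))  # 0.9999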

A constant low rate of failures causes a “slow burn” of your availability. So it’s important to both deal with issues that are low impact and continuous as well as issues that cause your availability to dip suddenly.

Typically high-value requests are broken out into a separate metric, with tighter requirements. It’s best for the thing you’re measuring, and requiring to be reliable, to be an accurate representation of user-pain.

It is entirely fine to have a reliability goal of 99.9% for correctly serving a webpage to a user, and then a 95% reliability goal for the little notification bell on the top right loading properly. If the page doesn’t load, that’s user-pain. If the user doesn’t get notified that another person followed them on medium.com within 10 seconds of it happening: no one will even notice.
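
Here is a sketch of what breaking requests out by class might look like. The class names and targets are illustrative only, not a real metric scheme.

    # Hypothetical sketch: judging different request classes against
    # different availability targets.

    REQUEST_CLASS_TARGETS = {
        "page_load": 0.999,         # user-visible pain if this fails
        "notification_bell": 0.95,  # nobody notices a short delay
    }

    def classes_meeting_target(counts):
        """counts maps class name -> (successful, total) request counts."""
        return {
            cls: (ok / total) >= REQUEST_CLASS_TARGETS[cls]
            for cls, (ok, total) in counts.items()
        }

    print(classes_meeting_target({
        "page_load": (999_200, 1_000_000),          # 99.92% -> meets 99.9%
        "notification_bell": (948_000, 1_000_000),  # 94.8%  -> misses 95%
    }))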

Measuring batch and pipeline systems is exceptionally hard to do. This has been quite understated here. I won’t go into it here, but we will likely discuss it in Chapter 25: Data Processing Pipelines.

There are two more ways of monitoring systems I want to introduce here, because the above two are somewhat simplistic for some systems.

Instead of defining system uptime as “time when the system is working”, it can be defined as “time when the aggregate availability of the system is above a threshold.”

For instance: imagine a system as mission critical as a caching service like memcache. Any error rate at all can mean huge impact, as a frontend request might do 50–100 memcache requests, and any failure will result in extremely slow database roundtrips. So you might call that system only ‘up’ when it’s returning 99.99% successes. And when it goes to 99.98%, consider it ‘down’. Then you might require that the memcache service be up 99.95% of the month.
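
A minimal sketch of that combined definition, using the thresholds from the memcache example; the per-minute granularity is my own assumption.

    # A minute counts as 'up' only when its aggregate availability clears a
    # threshold, and the monthly uptime goal is then expressed over minutes.

    UP_THRESHOLD = 0.9999       # a minute is 'up' if >= 99.99% of requests succeed
    MONTHLY_UPTIME_GOAL = 0.9995

    def uptime_fraction(per_minute_counts):
        """per_minute_counts is an iterable of (successful, total) per minute."""
        up_minutes = [ok / total >= UP_THRESHOLD for ok, total in per_minute_counts]
        return sum(up_minutes) / len(up_minutes)

    # Two good minutes and one bad one: 66.7% uptime, far below the goal.
    sample = [(99_999, 100_000), (100_000, 100_000), (99_950, 100_000)]
    print(uptime_fraction(sample) >= MONTHLY_UPTIME_GOAL)  # False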

Additionally, some systems, especially systems that carry no state whatsoever (i.e. every request is treated equally), can be probed for uptime or reliability instead of counting the number of successful queries. This is an approach that’s very useful for a network switch or a load balancer, where invalid inputs should be responded to with errors, but synthetic probes should succeed every time. You can very easily set a reliability threshold of “99.99% of probes succeed.”
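
A toy sketch of what a synthetic prober might look like. The URL, probe count, and interval are placeholders; a real prober would run continuously and export metrics rather than return a single number.

    import time
    import urllib.request

    def probe_success_rate(url, probes=100, interval_seconds=1, timeout=2):
        successes = 0
        for _ in range(probes):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as response:
                    if response.status == 200:
                        successes += 1
            except Exception:
                pass  # any failure (timeout, 5xx, reset) counts against the probe
            time.sleep(interval_seconds)
        return successes / probes

    # e.g. require "99.99% of probes succeed" over the measurement window:
    # print(probe_success_rate("https://example.com/healthz") >= 0.9999)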


Coming next

The second half of Chapter 3: Embracing Risk, titled “Risk Tolerance of Services”, where we discuss how to set reliability goals, partner with product, appreciate the customer, and embrace risk.
