Availability & Reliability — how cloud changes the game — Part 2, Key Concepts

Iain Sinclair
google-cloud-jp
Mar 17, 2020

The Japanese version is available here.

Part 1 is here (EN)

TL;DR

As we transition to the cloud, we need to adapt the way we think about availability and reliability in order to get the most out of the cloud platform and achieve the kind of continuous availability that internet services such as Google Search, Gmail, and many others achieve in practice.

Introduction

This post is the second in a series that covers the following topics:

  1. Background and the goal (previous post)
  2. Key concepts (design patterns, SRE, etc — this post)
  3. GCP Services for high availability

Many factors affect reliability

The focus areas for this post are:

  • Architecture
  • Design patterns
  • Platforms & frameworks

I’ll also briefly touch on operations, change management and the network.

Other factors that can have a big impact on availability and reliability, but that are out of scope for this post, include hardware, security, natural disasters, human factors, bugs, race conditions and so forth.

Basic Concepts

In the traditional view of availability we talk about the number of 9’s in the availability target or SLA. Availability is considered in a simplistic, binary way: the system is either up or down, and no consideration is given to partial availability, degraded performance or user experience.

Additionally, planned and unplanned downtime are considered separately, and planned downtime does not count against the target.

Some common availability targets expressed in 9’s and their allowed downtime

You could define availability in a more meaningful way, though. For example, if you have 100 users and 20% of them could not work for 10 minutes, that might count as 2 minutes (20% × 10 minutes) against the SLA.
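
As a minimal sketch of that user-weighted way of counting downtime, using the same numbers as the example (in Java):

```java
/** Sketch: count downtime weighted by the fraction of users affected. */
public class WeightedDowntime {
    public static void main(String[] args) {
        int totalUsers = 100;
        int affectedUsers = 20;     // 20% of users could not work
        double outageMinutes = 10;

        // 10 minutes of outage for 20% of users counts as 2 minutes against the SLA.
        double chargedMinutes = outageMinutes * affectedUsers / totalUsers;
        System.out.println(chargedMinutes + " minutes charged against the SLA");
    }
}
```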

And what about planned downtime? Well, that is going to be a thing of the past.

Site Reliability Engineering (SRE)

Site reliability engineering is what you get when you employ software engineers to run your operations. A key point in this is the automation of systems operation tasks.

According to SRE practice, we consider availability in an aggregate way.

  • Aggregate availability: successful requests / total requests
  • SLIs (Service Level Indicators): errors, latency, and throughput
  • SLOs (Service Level Objectives): based on SLIs, stricter than the SLA, they determine the Error Budget

Reference: https://landing.google.com/sre/books/

The Error Budget is a key concept — it represents the trade-off between velocity (speed of development and deployment) and reliability.

Tracking the SLI over time

EDIT: note that if this is really MTD then the red line cannot go up and down; it would only ever be flat (at best) or move downwards until the start of the next period, when it gets reset. A better practice is to use a 28-day rolling window for the SLO and monitor/alert based on a shorter (say 1~12 hours) rolling window so as to be able to react quickly to a trend that would blow out the SLO in the 28-day window :)

In the example above:

  • SLI: % of responses < 20ms & success response
  • SLO: 99.95%
  • Error budget: 0.05% of requests in period (100% - SLO%)
  • SLA: 99.9%

If, while observing how the SLI is tracking against the SLO, you realize that you have already broken the SLO, or are about to, then it is time to fix the issue.

This is how the Error Budget represents the trade-off between velocity and reliability: if you are within the SLO (i.e., within the error budget), keep releasing new features; otherwise, you are required to stop releasing new features and fix the problems that are consuming the error budget.
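
To make the numbers concrete, here is a minimal sketch in Java of tracking aggregate availability and the error budget for the 99.95% SLO above over a 28-day rolling window. The request counts are hypothetical; a real system would pull them from monitoring.

```java
/**
 * Minimal sketch of error budget accounting for a 28-day rolling window.
 * The request counts are hypothetical stand-ins for real monitoring data.
 */
public class ErrorBudget {

    static final double SLO = 0.9995; // 99.95% of requests must succeed within 20ms

    public static void main(String[] args) {
        long totalRequests = 10_000_000L; // total requests in the window (assumed)
        long badRequests   = 3_200L;      // errors or responses slower than 20ms (assumed)

        // Aggregate availability = successful requests / total requests
        double availability = (double) (totalRequests - badRequests) / totalRequests;

        // Error budget = (100% - SLO%) of all requests in the window
        long budget    = Math.round(totalRequests * (1 - SLO)); // 5,000 requests
        long remaining = budget - badRequests;                  // 1,800 requests left

        System.out.printf("Availability: %.4f%%%n", availability * 100);
        System.out.printf("Error budget: %d, consumed: %d, remaining: %d%n",
                budget, badRequests, remaining);

        if (remaining <= 0) {
            System.out.println("Error budget exhausted: freeze releases and fix reliability.");
        } else {
            System.out.println("Within error budget: keep shipping features.");
        }
    }
}
```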

Canary Deployments

Canary deployments are a great tool for increasing availability. They are at the core of how Google manages releases for its products and services and a major contributing factor to reducing or eliminating the need for planned downtime and reducing the risk of making releases in general.

A canary deployment introduces a new version gradually without the need for downtime:

N is the % of traffic each cluster receives

In the above illustration, myapp is being upgraded from v1 to v2. Two new clusters are created for the v1 baseline and the v2 canary with identical resource allocation. They each get the same N% of traffic diverted to them and the canary is assessed. N is increased in steps until the decision is made to send 100% of traffic to the new version.

Some important considerations include:

  • Have agreed success criteria for the release
  • Look out for unexpected changes in resource usage, error rate, etc
  • Increase N in steps if all OK, such as 1%, 5%, 10%, 20%
  • The baseline has the same resource allocation as the canary in order to allow direct comparison of resource consumption and error rates
  • It should always be possible to go back to the old version even after a full cutover to the new version

Canary deployments fit well with the idea of aggregate availability — you may be able to stay under the error budget and continue to meet the SLO if 1% of your users experience issues for 10 minutes during a failed canary release. However, if that was 100% of your users you probably blew it.
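
As a rough sketch of that progression in Java (not a real canary system; the steps, soak time, success criterion, and the routing/metrics calls are all assumptions standing in for whatever your platform provides):

```java
import java.util.List;

/**
 * Sketch of a canary progression loop: the baseline and canary clusters get the
 * same share of traffic, and N is only increased while the canary's error rate
 * stays close to the baseline's. The routing and metrics calls are placeholders.
 */
public class CanaryRollout {

    // Traffic steps as suggested in the post: 1%, 5%, 10%, 20%, then full cutover.
    static final List<Integer> STEPS = List.of(1, 5, 10, 20);
    static final double MAX_ERROR_RATE_DELTA = 0.001; // agreed success criterion (assumed)

    public static void main(String[] args) throws InterruptedException {
        for (int n : STEPS) {
            routeTraffic("myapp-v1-baseline", n); // baseline and canary get identical
            routeTraffic("myapp-v2-canary", n);   // traffic and resource allocation

            Thread.sleep(60_000); // soak time per step (assumed)

            double delta = errorRate("myapp-v2-canary") - errorRate("myapp-v1-baseline");
            if (delta > MAX_ERROR_RATE_DELTA) {
                System.out.println("Canary unhealthy at " + n + "%: rolling back to v1.");
                routeTraffic("myapp-v2-canary", 0);
                return;
            }
            System.out.println("Canary healthy at " + n + "%: increasing traffic.");
        }
        System.out.println("All steps passed: cutting 100% of traffic over to v2.");
    }

    // Placeholders for the load balancer / service mesh and the monitoring system.
    static void routeTraffic(String cluster, int percentOfTraffic) { /* ... */ }
    static double errorRate(String cluster) { return 0.0; }
}
```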

Cascading Failures

A cascading failure is a failure that grows over time as a result of positive feedback and a lack of negative feedback — i.e., a snowball effect. Typically, a failure in one component causes additional pressure on other parts of the system, causing further failures.

A simple cluster of 2 nodes

In the simple example above, there is a cluster of two servers, each operating at 60% utilization. If one of these servers fails, the other must absorb the full load — roughly 120% of a single node’s capacity — so it is overwhelmed too and the cluster is down.

Reference: https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/

It gets more complex, though: this may have been a service consumed by other components. How do they react when this cluster becomes unavailable?

A more complex example

In this case we have a business-critical common service, such as a payment service, that is dependent on a number of external or “legacy” services with lower reliability.

Scenario:

  1. External Service 1 is experiencing problems — it is taking a very long time to respond and may even be silently dropping requests.
  2. The mission critical Common service is implemented naively: it holds connections to the external service open for a long time
  3. The client does not wait; it keeps generating retries to the Common service, which it attempts to pass on to External Service 1
  4. The Common service runs out of resources (such as pooled connections) and fails

Now we can’t access the other external services that are working fine because the common service is dead.

In essence, the problem is the lack of brakes on demand by the client. If we assume that the external service cannot be improved then the Common service needs a way to signal to the Client that it cannot accept more requests while implementing ways to free up or recycle its resources used to communicate with the external services.

The problem of cascading failures becomes exponentially worse as the number of services and dependencies increases.

Next we’ll look at some design patterns that can be used to prevent cascading failures from causing catastrophic failure.

Design Patterns

There are several design patterns that are useful for preventing cascading failures. We’ll look at four of them:

  1. Circuit Breaker
  2. Exponential backoff
  3. Fail fast
  4. Handshaking

1. Circuit Breaker

Circuit Breaker

In the above diagram, the circuit breaker component is responsible for things like:

  • Error rate tracking
  • Timeout based on some 99.?th percentile
  • Clean up
  • Fallback alternatives, such as to a cache or asynchronous service
  • Error response
  • Logging

The Circuit Breaker needs a way to determine whether an error is an isolated case or the rate of errors has exceeded a threshold. When that happens the circuit is “opened” so no more requests go through for a while; back off and try a few requests later.

The timeout should not be too long — avoid holding up resources and causing cascading failures.

Fallback alternatives may be things like serving a cached response or dropping a message onto an asynchronous messaging service to be processed later.

As such, Circuit Breakers handle the dirty work of making client-service communications more robust and separate this concern from the core application code.
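
A minimal sketch of the pattern in Java, assuming a simple consecutive-failure threshold and a fixed cool-off period (production implementations typically track error rates over a window and add a half-open state — libraries such as Resilience4j provide this):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/**
 * Tiny circuit breaker sketch: count consecutive failures, open the circuit
 * once a threshold is reached, and only let requests through again after a
 * cool-off period. Thresholds and the fallback are illustrative assumptions.
 */
public class CircuitBreaker {

    private static final int FAILURE_THRESHOLD = 5;
    private static final Duration OPEN_DURATION = Duration.ofSeconds(30);

    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public synchronized <T> T call(Supplier<T> request, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // e.g. cached response, or enqueue for async processing
        }
        try {
            T result = request.get(); // the call itself should carry a short timeout
            consecutiveFailures = 0;  // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= FAILURE_THRESHOLD) {
                openedAt = Instant.now(); // open: stop sending requests for a while
            }
            return fallback.get();
        }
    }

    private boolean isOpen() {
        if (openedAt == null) {
            return false;
        }
        if (Duration.between(openedAt, Instant.now()).compareTo(OPEN_DURATION) > 0) {
            openedAt = null;              // cool-off elapsed: allow a trial request
            consecutiveFailures = 0;
            return false;
        }
        return true;
    }
}
```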

2. Exponential backoff

Exponential backoff is implemented at the client in order to avoid tying up resources and increasing pressure on a struggling service. The sequence of events looks like this:

  1. Make a request
  2. Fails → wait 1 second + random jitter, retry
  3. Fails → wait 2 seconds + random jitter, retry
  4. Fails → wait 4 seconds + random jitter, retry
  5. … up to a maximum_backoff time.
  6. Until retries = maximum_retries

Example of Gmail backing off when the network is down

Random jitter just makes sure that the potentially thousands of clients that just got an error don’t all retry at exactly the same time and really break things.
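
A sketch of that retry loop in Java; the base delay, cap, retry limit and jitter range are assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Sketch of exponential backoff with jitter, following the sequence above. */
public class Backoff {

    static final long BASE_DELAY_MS  = 1_000;  // first wait: ~1 second
    static final long MAX_BACKOFF_MS = 32_000; // maximum_backoff (assumed)
    static final int  MAX_RETRIES    = 6;      // maximum_retries (assumed)

    public static <T> T callWithBackoff(Supplier<T> request) throws InterruptedException {
        long delay = BASE_DELAY_MS;
        for (int attempt = 0; ; attempt++) {
            try {
                return request.get();                        // 1. make the request
            } catch (RuntimeException e) {
                if (attempt >= MAX_RETRIES) {
                    throw e;                                 // give up after maximum_retries
                }
                long jitter = ThreadLocalRandom.current().nextLong(1_000); // random jitter
                Thread.sleep(delay + jitter);                // wait, then retry
                delay = Math.min(delay * 2, MAX_BACKOFF_MS); // 1s, 2s, 4s, ... up to the cap
            }
        }
    }
}
```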

3. Fail Fast

When responding to a request:

  • Best case: respond within SLO
  • OK case: exceed response time SLO but respond successfully
  • Bad case: take a long time to respond with failure
  • Worst case: never respond

The recommended best practice is to set an appropriately short limit on the time allowed for your own response, so that failures are returned quickly rather than slowly or never.
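
One way to do this is to put a hard deadline on your own work and return an error as soon as it is exceeded. A sketch in Java (the pool size and the way the timeout is surfaced are assumptions):

```java
import java.util.concurrent.*;

/** Sketch of failing fast by bounding our own response time. */
public class FailFast {

    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    public static <T> T callWithDeadline(Callable<T> work, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(work);
        try {
            // Best case: respond within the SLO. Otherwise give up quickly with
            // a clear failure instead of responding late or never.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // free the resources held by the slow call
            throw new RuntimeException("Deadline of " + timeoutMs + "ms exceeded", e);
        }
    }
}
```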

4. Handshaking

This is based on an agreement between a client and a service, such as:

  • Response time SLO by service
  • When and how client should back off

For example,

  • Overloaded service responds with 429 Too Many Requests or 503 Service Unavailable
  • Client performs exponential back off
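
On the service side, the handshake can be as simple as shedding load explicitly once a concurrency limit is reached so that clients know to back off. A sketch in Java using the JDK’s built-in HTTP server (the limit, port and Retry-After value are assumptions):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.Semaphore;

/**
 * Sketch of server-side load shedding: at the concurrency limit the service
 * answers immediately with 429 and a Retry-After hint rather than queueing
 * work it cannot handle.
 */
public class LoadSheddingServer {

    private static final Semaphore capacity = new Semaphore(100); // max in-flight requests

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", LoadSheddingServer::handle);
        server.start();
    }

    static void handle(HttpExchange exchange) throws IOException {
        if (!capacity.tryAcquire()) {
            // Overloaded: tell the client to back off instead of timing out.
            exchange.getResponseHeaders().set("Retry-After", "2");
            exchange.sendResponseHeaders(429, -1);
            exchange.close();
            return;
        }
        try {
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        } finally {
            capacity.release();
            exchange.close();
        }
    }
}
```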

Notes

HTTP 5xx and 429 response codes are typical of cases requiring backoff on GCP services.

  • 429 Too Many Requests
  • 503 Service Unavailable

When dealing with a highly available, distributed service such as provided by the cloud platform, isolated 503 responses should not be interpreted as the service being down. Make the determination of “down” based on aggregation of failures over a set period of time.

The Google HTTP Client Library for Java provides an easy way to retry on transient failures.
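
For example, a request can be configured to back off and retry on unsuccessful responses roughly like this (a sketch; the endpoint is a placeholder, and the defaults — such as which status codes trigger a retry — should be checked against the library documentation for your version):

```java
import com.google.api.client.http.GenericUrl;
import com.google.api.client.http.HttpBackOffUnsuccessfulResponseHandler;
import com.google.api.client.http.HttpRequest;
import com.google.api.client.http.HttpRequestFactory;
import com.google.api.client.http.HttpResponse;
import com.google.api.client.http.javanet.NetHttpTransport;
import com.google.api.client.util.ExponentialBackOff;

/** Sketch: retry unsuccessful responses with exponential backoff. */
public class RetryExample {

    public static void main(String[] args) throws Exception {
        HttpRequestFactory requestFactory = new NetHttpTransport()
                .createRequestFactory((HttpRequest request) ->
                        // Attach exponential backoff to unsuccessful responses.
                        request.setUnsuccessfulResponseHandler(
                                new HttpBackOffUnsuccessfulResponseHandler(new ExponentialBackOff())));

        // "https://example.com/api" is a placeholder endpoint.
        HttpResponse response = requestFactory
                .buildGetRequest(new GenericUrl("https://example.com/api"))
                .execute();
        System.out.println(response.getStatusCode());
    }
}
```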

Summary

In this post we examined the SRE way of looking at availability and how the idea of an error budget guides the tradeoff between velocity and reliability. We also examined cascading failures, what causes them and how to prevent them with some handy design patterns.

Support for key concepts like these is baked into the platform and SDKs.

In the next post we’ll build on this and take a look at some of the GCP services most commonly used in mission critical and customer facing systems through the lens of availability and reliability.

Recommended reading: Release It!, 2nd Edition, by Michael T. Nygard https://books.google.com/books/about/Release_It.html?id=md4uNwAACAAJ

Be sure to check out the many great articles in the publications below:
(JP) https://medium.com/google-cloud-jp
(EN) https://medium.com/google-cloud

Iain Sinclair is a Customer Engineer at Google Cloud and a certified Google Cloud Professional Architect.