As we transition to the cloud, we need to adapt our thinking about availability and reliability in order to get the most out of the cloud platform and achieve the kind of continuous availability that internet services such as Google Search, Gmail, and many others achieve in practice.
This post is the first in a series that will go through the following topics:
- Background and the goal (this post)
- Key concepts (design patterns, SRE, etc)
- GCP Services for high availability
- About cloud platform APIs
- Wrap Up
My interest in availability and cloud began when I came across a recording of a talk at GOTO 2012 by Adrian Cockcroft, then cloud architect at Netflix. I think it is still interesting today, even though modern cloud platforms and Netflix have made a lot of progress since then. Key points were:
- Getting out of the datacenter business and focusing on business differentiation
- Microservices architecture
- Chaos monkeys!
- Continuous integration and delivery
- Contributing to open source
Very high availability was a top priority, and they would deliberately take down services and entire zones in order to prove that the system could handle it.
Availability/Reliability and the User Experience
There are a couple of basic principles that underlie this article:
- Mission-critical business systems must be built with the failure of underlying infrastructure, platforms, frameworks, and dependencies as a given, and must be verified to work across a wide variety of failure modes.
- Reliability and availability should be considered from the user's perspective, ensuring that users are not impacted even when something major has gone wrong.
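The first principle implies verification, not just design: you deliberately break a dependency and check that users are still served. Here is a minimal fault-injection sketch; the `Catalog` service, its cache, and the failure flag are hypothetical stand-ins for real dependencies:

```python
# A minimal fault-injection sketch. The Catalog service and its
# in-memory cache are hypothetical; a real system would inject
# failures into actual backends (as Netflix's Chaos Monkey does).

class Catalog:
    def __init__(self):
        self.db_available = True
        self.cache = {"item-1": "Widget"}  # last-known-good data

    def lookup(self, item_id: str) -> str:
        if not self.db_available:
            # Degraded mode: serve stale data from the cache
            # rather than failing the request outright.
            return self.cache[item_id]
        return self.cache.get(item_id, "Widget")

def test_survives_database_outage():
    catalog = Catalog()
    catalog.db_available = False  # inject the failure
    # The user-visible behavior must still work.
    assert catalog.lookup("item-1") == "Widget"

test_survives_database_outage()
print("catalog survives database outage")
```

The point is that the failure case is exercised continuously, in tests or in production, instead of being discovered during a real outage.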
By the time this series is complete, I hope you will have a clear idea of how to adopt these principles in practical terms on GCP.
Traditional HA & BCP/DR
Traditional, on-premises High Availability architectures, Business Continuity Planning, and Disaster Recovery may be characterised as follows:
- Having two, Main + Failover, data centers
- Over provisioning & idle infrastructure
- Database log shipping
- RPOs (Recovery Point Objectives) = accepted amount of data loss
- RTOs (Recovery Time Objectives) = accepted time to fail over and restore service
- Regular downtime for updates and maintenance
The end result is poor availability and high cost.
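To make RPO and RTO concrete, here is a minimal sketch of the arithmetic behind a traditional log-shipping setup; the interval and failover-step figures are hypothetical, not from any particular system:

```python
# Hypothetical figures for a traditional log-shipping DR setup.
LOG_SHIP_INTERVAL_MIN = 15        # logs shipped to the failover site every 15 min
FAILOVER_STEPS_MIN = [5, 20, 10]  # detect outage, restore/replay logs, redirect traffic

# Worst case: the primary fails just before the next log shipment,
# so every transaction since the last shipment is lost.
worst_case_rpo_min = LOG_SHIP_INTERVAL_MIN

# RTO is the sum of the failover steps performed in sequence.
rto_min = sum(FAILOVER_STEPS_MIN)

print(f"Worst-case RPO: {worst_case_rpo_min} minutes of lost data")
print(f"RTO: {rto_min} minutes until service is restored")
```

With numbers like these, users can lose up to a quarter-hour of data and wait over half an hour for service to return, which is exactly the trade-off cloud-native designs aim to avoid.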
Cloud Native Availability
In contrast to this we have the idea of Cloud Native Availability:
- Live updates: no downtime for upgrades and updates; this is a key differentiator of Google and GCP.
- Pay for what you use, use what you pay for: active redundancy is built in, with no idle, wasted resources.
- Graceful degradation of the user experience: in the worst case, users see reduced functionality rather than an outage.
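Graceful degradation can be sketched very simply; the recommendation service and its static fallback below are hypothetical, but the pattern of serving a reduced response instead of an error is the essential idea:

```python
def fetch_personalized_recommendations(user_id: str) -> list[str]:
    """Hypothetical backend call; here it always fails to simulate an outage."""
    raise TimeoutError("recommendation backend unavailable")

def get_recommendations(user_id: str) -> list[str]:
    # Graceful degradation: if the personalization backend is down,
    # serve a static, non-personalized list instead of an error page.
    try:
        return fetch_personalized_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        return ["Popular title A", "Popular title B", "Popular title C"]

print(get_recommendations("user-123"))  # falls back to the static list
```

The user still gets a working page; only the quality of the experience degrades while the dependency is unavailable.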
Is the goal 100% availability?
High availability has a cost: it requires a large, ongoing investment of time and money, and innovation and business advancement may suffer.
However, downtime and instability have a cost too: innovation and business advancement may suffer when you are too busy keeping the system running. Reputation with users and customers can be damaged, and the instability puts stress on the team.
So if the answer is “not really”, then what is the goal?
The best tradeoff with Cloud Native
The goal is to strike the best trade-off between velocity and reliability.
Note that cost is considered constant — invest what is needed in order to achieve the velocity and reliability your business needs.
The theory I will develop is that the best possible balance is achieved by going with Cloud Native design and “the best of Google”.
End of Part 1
This is the end of Part 1 “Background and the goal”. In Part 2 “Key concepts” I’ll go deeper into design patterns and touch on Site Reliability Engineering and zero-downtime deployment.