Availability & Reliability — how cloud changes the game

Iain Sinclair
Dec 11, 2019 · 3 min read

A Japanese version of this post is also available.

TL;DR

As we move to the cloud, we need to adapt how we think about availability and reliability. Doing so lets us get the most out of the cloud platform and approach the kind of continuous availability that internet services such as Google’s search, Gmail, and many others achieve in practice.

Introduction

This post is the first in a series that will go through the following topics:

  1. Background and the goal (this post)
  2. Key concepts (design patterns, SRE, etc)
  3. GCP Services for high availability
  4. About cloud platform APIs
  5. Wrap Up

Background

My interest in availability and cloud began when I came across a recording of a talk at GOTO 2012 by Adrian Cockcroft, then lead cloud architect at Netflix. I think the talk is still interesting today, even though modern cloud platforms and Netflix have made a lot of progress since then. The key points were:

  • Getting out of the datacenter business and focusing on business differentiation
  • Microservices architecture
  • Chaos monkeys!
  • Continuous integration and delivery
  • Contributing to open source

Very high availability was a top priority, and they would deliberately take services and zones down to prove that the system could handle it.

Availability/Reliability and the User Experience

There are a couple of basic principles that underlie this article:

  • Mission-critical business systems must be built with the failure of underlying infrastructure, platforms, frameworks and dependencies as a given, and must be verified to work under a wide variety of failure modes.
  • Reliability and availability should be considered from the user's perspective, ensuring that users are not impacted even when something major has gone wrong.
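
As a rough illustration of both principles, the sketch below simulates a failure and then checks what the user sees. Everything in it is hypothetical: the `Replica` class, the zone names and the `inject_failure` helper are made up for this example and are not a GCP or Netflix API.

```python
import random


class Replica:
    """A hypothetical service replica that can be switched off to simulate failure."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve(replicas, request):
    """Try each replica in turn; the user only sees an error if all of them fail."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue  # failure is expected, so move on to the next replica
    raise RuntimeError("no healthy replica available")


def inject_failure(replicas, rng):
    """Take one randomly chosen replica down, in the spirit of a chaos experiment."""
    victim = rng.choice(replicas)
    victim.healthy = False
    return victim


replicas = [Replica("zone-a"), Replica("zone-b"), Replica("zone-c")]
rng = random.Random(42)  # fixed seed so the experiment is repeatable
victim = inject_failure(replicas, rng)
response = serve(replicas, "GET /search")
print(f"took down {victim.name}; user still got: {response}")
```

The point of the exercise is the last line: the check is made from the user's side of the system, not by inspecting the failed component.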

By the time this series is complete, I hope that you will have a clear idea of how to apply these principles in practice on GCP.

Traditional HA & BCP/DR

Traditional, on-premises High Availability architectures, Business Continuity Planning and Disaster Recovery may be characterised as follows:

  • Two data centers (main + failover)
  • Over-provisioning and idle infrastructure
  • Database log shipping
  • RPOs (Recovery Point Objectives) = accepted loss of data
  • RTOs (Recovery Time Objectives) = accepted time to fail over and restore service
  • Regular downtime for updates and maintenance

[Figure: The traditional way to do HA/BCP/DR]

The end result is poor availability and high cost.
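
To make RPO and RTO concrete, here is a small worked example. All of the numbers (a 15 minute RPO, a 4 hour RTO, and the timestamps) are assumptions chosen for illustration, not figures from any real system.

```python
from datetime import datetime, timedelta


def data_loss_window(last_shipped_log, failure_time):
    """With log shipping, anything written after the last shipped log is lost."""
    return failure_time - last_shipped_log


def outage_duration(failure_time, service_restored):
    """Time during which users could not use the service."""
    return service_restored - failure_time


# Hypothetical objectives agreed with the business.
rpo = timedelta(minutes=15)
rto = timedelta(hours=4)

# Hypothetical incident timeline.
last_shipped_log = datetime(2019, 12, 11, 9, 50)
failure_time = datetime(2019, 12, 11, 10, 0)
service_restored = datetime(2019, 12, 11, 13, 30)

loss = data_loss_window(last_shipped_log, failure_time)
outage = outage_duration(failure_time, service_restored)

print(f"data loss window: {loss} (within RPO: {loss <= rpo})")
print(f"outage: {outage} (within RTO: {outage <= rto})")
```

In this made-up incident both objectives are met, yet users still lost ten minutes of data and were without the service for three and a half hours. Meeting an RPO and an RTO means the loss was accepted in advance, not that it was avoided.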

Cloud Native Availability

In contrast to this we have the idea of Cloud Native Availability:

Live updates: no downtime for upgrades and updates. This is a key differentiator of Google and GCP.

Pay for what you use, use what you pay for: Active redundancy built in, no wasted resources.

Graceful degradation of the user experience: in the worst case, users see reduced functionality rather than a full outage.
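
Graceful degradation can be sketched in a few lines. The example below assumes a hypothetical search feature that normally uses a personalization service and falls back to generic results when that dependency fails. The function names and the `recommender_up` flag are invented for this illustration.

```python
def personalized_results(query, recommender_up):
    """Hypothetical dependency that may be unavailable."""
    if not recommender_up:
        raise TimeoutError("recommendation service did not respond")
    return [f"{query}: result tailored to you"]


def generic_results(query):
    """Cheaper fallback with no dependency on the recommender."""
    return [f"{query}: popular result"]


def search(query, recommender_up=True):
    """Return the best answer available instead of failing the whole request."""
    try:
        return {"results": personalized_results(query, recommender_up), "degraded": False}
    except TimeoutError:
        # The user still gets an answer, just a less tailored one.
        return {"results": generic_results(query), "degraded": True}


print(search("cloud availability"))
print(search("cloud availability", recommender_up=False))
```

The `degraded` flag is a design choice in this sketch: it lets the caller, or a monitoring system, tell that the reduced experience was served.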

[Figure: Examples of applications many of us use daily and just expect to be available all of the time]

Is the goal 100% availability?

High availability has a cost: it requires a large ongoing investment in time and money, and innovation and business advancement may suffer as a result.

However, downtime and instability have a cost too: innovation and business advancement may suffer when you are too busy keeping the system running. Reputation with users and customers can also be damaged, and overall it puts stress on the team.
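
To put numbers on both sides of this, the arithmetic below converts an availability target into the downtime it permits per year. The targets listed are common examples, not recommendations, and the year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% available -> about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the permitted downtime by a factor of ten: 99.9% leaves a little under nine hours a year, while 99.999% leaves only about five minutes. That shrinking margin is why every additional nine tends to cost more than the last.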

So if the answer is “not really”, then what is the goal?

The best trade-off with Cloud Native

The goal is to make the best trade-off between velocity and reliability.

[Figure: Optimize the trade-off between availability and speed, for a given cost]

Note that cost is treated as a constant here: invest what is needed to achieve the velocity and reliability your business needs.

The theory I will develop is that the best possible balance is achieved by going with Cloud Native design and “the best of Google”.

End of Part 1

This is the end of Part 1, “Background and the goal”. In Part 2, “Key concepts”, I’ll go deeper into design patterns and touch on Site Reliability Engineering and zero-downtime deployment.

Written by Iain Sinclair · google-cloud-jp
