As we transition to the cloud, we need to adapt our thinking about availability and reliability in order to get the most out of the cloud platform and achieve the kind of continuous availability that internet services such as Google Search, Gmail, and many others achieve in practice.
This post is the first in a series that will go through the following topics:
- Background and the goal (this post)
- Key concepts (design patterns, SRE, etc)
- GCP Services for high availability
- About cloud platform APIs
- Wrap Up
My interest in availability and cloud began when I came across a recording of a talk at GOTO 2012 by Adrian Cockcroft, then cloud architect at Netflix. I think it is still interesting today, even though modern cloud platforms and Netflix have made a lot of progress since then. Key points were:
- Getting out of the datacenter business and focusing on business differentiation
- Microservices architecture
- Chaos monkeys!
- Continuous integration and delivery
- Contributing to open source
Very high availability was a top priority, and they would deliberately take down services and entire zones in order to prove that the system could handle it.
Availability/Reliability and the User Experience
There are a couple of basic principles that underlie this article:
- Mission-critical business systems must be built with the failure of underlying infrastructure, platforms, frameworks, and dependencies as a given, and must be verified to work across a wide variety of failure modes.
- Reliability and availability should be considered from the user's perspective, ensuring that users are not impacted even when something major has gone wrong.
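The first principle implies verification, not just design: you deliberately break a dependency and check that users are still served. Here is a minimal fault-injection sketch; the `Catalog` service, its cache, and the failure flag are hypothetical stand-ins for real dependencies:

```python
# A minimal fault-injection sketch. The Catalog service and its
# in-memory cache are hypothetical; a real system would inject
# failures into actual backends (as Netflix's Chaos Monkey does).

class Catalog:
    def __init__(self):
        self.db_available = True
        self.cache = {"item-1": "Widget"}  # last-known-good data

    def lookup(self, item_id: str) -> str:
        if not self.db_available:
            # Degraded mode: serve stale data from the cache
            # rather than failing the request outright.
            return self.cache[item_id]
        return self.cache.get(item_id, "Widget")

def test_survives_database_outage():
    catalog = Catalog()
    catalog.db_available = False  # inject the failure
    # The user-visible behavior must still work.
    assert catalog.lookup("item-1") == "Widget"

test_survives_database_outage()
print("catalog survives database outage")
```

The point is that the failure case is exercised continuously, in tests or in production, instead of being discovered during a real outage.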
By the time this series is complete, I hope you will have a clear idea of how to adopt these principles in practical terms on GCP.
Traditional HA & BCP/DR
Traditional, on-premises High Availability architectures, Business Continuity Planning, and Disaster Recovery may be characterised as follows:
- Having two, Main + Failover, data centers
- Over provisioning & idle infrastructure
- Database log shipping
- RPOs (Recovery Point Objectives) = accepted amount of data loss
- RTOs (Recovery Time Objectives) = accepted time to fail over and restore service
- Regular downtime for updates and maintenance
The end result is poor availability and high cost.
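To make RPO and RTO concrete, here is a minimal sketch of the arithmetic behind a traditional log-shipping setup; the interval and failover-step figures are hypothetical, not from any particular system:

```python
# Hypothetical figures for a traditional log-shipping DR setup.
LOG_SHIP_INTERVAL_MIN = 15        # logs shipped to the failover site every 15 min
FAILOVER_STEPS_MIN = [5, 20, 10]  # detect outage, restore/replay logs, redirect traffic

# Worst case: the primary fails just before the next log shipment,
# so every transaction since the last shipment is lost.
worst_case_rpo_min = LOG_SHIP_INTERVAL_MIN

# RTO is the sum of the failover steps performed in sequence.
rto_min = sum(FAILOVER_STEPS_MIN)

print(f"Worst-case RPO: {worst_case_rpo_min} minutes of lost data")
print(f"RTO: {rto_min} minutes until service is restored")
```

With numbers like these, users can lose up to a quarter-hour of data and wait over half an hour for service to return, which is exactly the trade-off cloud-native designs aim to avoid.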
Cloud Native Availability
In contrast to this we have the idea of Cloud Native Availability:
- Live updates: no downtime for upgrades and updates; this is a key differentiator of Google and GCP.
- Pay for what you use, use what you pay for: active redundancy is built in, with no idle, wasted resources.
- Graceful degradation of the user experience: in the worst case, users see reduced functionality rather than an outage.
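Graceful degradation can be sketched very simply; the recommendation service and its static fallback below are hypothetical, but the pattern of serving a reduced response instead of an error is the essential idea:

```python
def fetch_personalized_recommendations(user_id: str) -> list[str]:
    """Hypothetical backend call; here it always fails to simulate an outage."""
    raise TimeoutError("recommendation backend unavailable")

def get_recommendations(user_id: str) -> list[str]:
    # Graceful degradation: if the personalization backend is down,
    # serve a static, non-personalized list instead of an error page.
    try:
        return fetch_personalized_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        return ["Popular title A", "Popular title B", "Popular title C"]

print(get_recommendations("user-123"))  # falls back to the static list
```

The user still gets a working page; only the quality of the experience degrades while the dependency is unavailable.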
Is the goal 100% availability?
High availability has a cost: it requires a large, ongoing investment of time and money, and innovation and business advancement may suffer.
However, downtime and instability have a cost too: innovation and business advancement may suffer when you are too busy keeping the system running. Reputation with users and customers can be damaged, and the instability puts stress on the team.
So if the answer is “not really”, then what is the goal?
The best tradeoff with Cloud Native
The goal is to strike the best trade-off between velocity and reliability.
Note that cost is considered constant — invest what is needed in order to achieve the velocity and reliability your business needs.
The theory I will develop is that the best possible balance is achieved by going with Cloud Native design and “the best of Google”.
End of Part 1
This is the end of Part 1 “Background and the goal”. In Part 2 “Key concepts” I’ll go deeper into design patterns and touch on Site Reliability Engineering and zero-downtime deployment.