Getting started when forming a reliability program on Google Cloud

Published in Google Cloud - Community · 5 min read · Feb 23, 2021

I have observed that when on-premises companies choose to migrate or extend into the cloud, part of the motivation is to increase reliability, but the execution needed to obtain that reliability is often fragmented. To implement a holistic approach, a reliability program should be formed to establish a model of what a reliable application looks like. In working with our Google Cloud customers, I have seen the following patterns arise when establishing a successful reliability program.

Establish a minimal viable reliability (MVR)

A majority of older applications were not designed to be cloud native and still have a monolithic design, but that doesn’t mean you can’t take advantage of the added reliability that Google Cloud offers. To take full advantage of a cloud reliability program, you should establish the minimal requirements an application must meet to be considered highly available on the cloud. These requirements should be at least as strict as the requirements that applied on-prem. Depending on your SLAs and organizational objectives, minimum configuration requirements should be set for the topics below:

Zonal and regional deployment
Load balance traffic
No single point of failure
Backups
High availability design
Detect failure
Capacity management

Examples of requirements are: load balance at all tiers, deploy the application across multiple regions, schedule backups and define RTO and RPO, respond to incidents within X time, eliminate bottlenecks, and predict peak traffic events. This is not an exhaustive list, but it should serve as a set of examples for identifying the minimum reliability requirements for launching on Google Cloud.
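One way to make minimum requirements like these enforceable is to write them down as checks that every application has to pass before launch. The sketch below is only an illustration: the `AppProfile` fields, the `mvr_gaps` function, and the two-region threshold are hypothetical names and values I made up for the example, not part of any Google Cloud API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AppProfile:
    """Hypothetical description of one application's deployment."""
    name: str
    regions: int                    # number of regions the app is deployed in
    load_balanced_tiers: int        # tiers that sit behind a load balancer
    total_tiers: int
    backups_scheduled: bool
    rto_minutes: Optional[float]    # None means no RTO has been defined
    rpo_minutes: Optional[float]    # None means no RPO has been defined


def mvr_gaps(app: AppProfile, min_regions: int = 2) -> List[str]:
    """Return the minimum requirements the application does not yet meet."""
    gaps = []
    if app.regions < min_regions:
        gaps.append(f"deployed in {app.regions} region(s), need {min_regions}")
    if app.load_balanced_tiers < app.total_tiers:
        gaps.append("not every tier is load balanced")
    if not app.backups_scheduled:
        gaps.append("backups are not scheduled")
    if app.rto_minutes is None or app.rpo_minutes is None:
        gaps.append("RTO and RPO are not both defined")
    return gaps


# Made-up example: a single-region monolith with a partially defined recovery plan.
legacy = AppProfile("billing", regions=1, load_balanced_tiers=1, total_tiers=3,
                    backups_scheduled=True, rto_minutes=None, rpo_minutes=60)
for gap in mvr_gaps(legacy):
    print(f"{legacy.name}: {gap}")
```

An empty list means the application meets the minimum bar; anything else is a launch blocker to discuss.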

Have a scoring matrix of what a top reliable application looks like

To establish a long-term reliability model for an application, a scoring matrix should be designed to identify areas for reliability improvement. The scoring should use weighted metrics based on what is important to the company. The items should include best practices for the configuration of services. Include design patterns such as load balancing, horizontal scaling, and aiming for statelessness. Add operational processes such as using infrastructure as code (IaC), progressive rollouts, automated emergency responses, and the elimination of toil. For more examples of what to include in your scoring matrix, see Google’s best practices documentation and its documents on reliability.

The end result of having a scoring matrix is to identify how deeply reliability patterns have been built into your application and which areas to improve if you want to increase its reliability.
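As a rough illustration of a weighted scoring matrix, here is a minimal sketch. The criteria, the weights, and the 0 to 5 rating scale are all assumptions chosen for the example; your own matrix should reflect what your company decides is important.

```python
# Hypothetical criteria and weights (they sum to 1.0 here, but the score
# is normalized so they do not have to).
WEIGHTS = {
    "load_balancing": 0.25,
    "horizontal_scaling": 0.20,
    "stateless_design": 0.15,
    "infrastructure_as_code": 0.15,
    "progressive_rollout": 0.15,
    "automated_emergency_response": 0.10,
}
MAX_RATING = 5


def reliability_score(ratings: dict) -> float:
    """Weighted average of 0-5 ratings, scaled to a 0-100 score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return 100 * weighted / (MAX_RATING * total_weight)


def weakest_areas(ratings: dict, count: int = 2) -> list:
    """Criteria with the largest weighted shortfall from a perfect rating."""
    shortfall = {name: WEIGHTS[name] * (MAX_RATING - ratings[name])
                 for name in WEIGHTS}
    return sorted(shortfall, key=shortfall.get, reverse=True)[:count]


# Made-up ratings for one application.
example = {
    "load_balancing": 5,
    "horizontal_scaling": 3,
    "stateless_design": 2,
    "infrastructure_as_code": 4,
    "progressive_rollout": 1,
    "automated_emergency_response": 0,
}
print(f"score: {reliability_score(example):.1f} / 100")
print("improve first:", weakest_areas(example))
```

Ranking by weighted shortfall, rather than by raw rating, points the team at the gaps that matter most to the company and not just the lowest numbers.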

Find the real SLA for an application, not the observed SLA

Quite a few times I have heard companies say that they have a five 9s application, but when reviewing the design, the components used, and the SLAs for those components, the math makes five 9s impossible (e.g., a single-region deployment with interdependent components). What people are often referring to instead is the observed SLA, which is obtained by sheer luck or some really good engineers.

In order to determine the real SLA, identify the SLA for each of your components and then determine whether they are interdependent or independent of the other components. An example is a virtual machine in a single zone connecting to Cloud SQL. Using the interdependent formula below, (99.5% * 99.95%) / 100, we obtain a 99.45% SLA. If we duplicated that model in another region, not including a global load balancer, the independent formula gives 100 - ((100 - 99.45%) * (100 - 99.45%)) / 100 = 99.996975%.

There are countless patterns that can occur, and my goal here is to give examples of how to use the formulas to calculate the real SLA, in order to see how reliably the application was actually built. The chart below gives the average targets depending on the service.

Interdependent Formula = (SLA% * SLA%) / 100

Independent Formula = 100 - ((100 - SLA%) * (100 - SLA%)) / 100

Both formulas divide by 100 because the inputs are percentages, so multiplying two of them would otherwise produce a value on the wrong scale.
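To make the arithmetic easy to repeat for your own architecture, here is a small sketch of both calculations. The function names are my own, and the code assumes that component failures are statistically independent, which real systems only approximate. It works on fractions internally, so the percentage scaling is handled in one place.

```python
from math import prod


def interdependent_sla(*slas: float) -> float:
    """Composite SLA in percent when every component must be up (a serial chain)."""
    return 100 * prod(s / 100 for s in slas)


def independent_sla(*slas: float) -> float:
    """Composite SLA in percent when any one redundant copy is enough."""
    return 100 * (1 - prod(1 - s / 100 for s in slas))


# A single-zone VM (99.5%) that depends on Cloud SQL (99.95%).
single_region = interdependent_sla(99.5, 99.95)

# The same stack duplicated in a second region, ignoring the load balancer.
two_regions = independent_sla(single_region, single_region)

print(f"single region: {single_region:.4f}%")
print(f"two regions:   {two_regions:.6f}%")
```

Keep in mind that anything placed in front of the redundant copies, such as a global load balancer, is itself an interdependent component and should be multiplied in with `interdependent_sla`.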

Adopt site reliability engineering (SRE) tooling

While not all companies can form a full SRE team because of the cost, there is high value in using the methods and tools that SRE provides. Two of the great tools SRE provides are identifying the SLOs/SLIs for an application and running a risk analysis. Defining your SLOs/SLIs will help you understand what really matters for the application’s reliability. It identifies the metrics that matter to the people you serve, rather than simply whether the server is up 24 hours a day (availability). Take a look at Google’s SRE book for help identifying SLO/SLI metrics. The other tool from the SRE playbook that I find very effective is the risk analysis. A risk analysis sheet helps you build a catalog of risks to reliability and determine whether you can accept certain risks within an error budget according to your defined SLA. If you can’t accept a risk, then you need to increase your error budget or find a way to remove the risk. For more information on SLOs and error budgets, refer to Google’s SRE workbook.
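To show how a risk catalog and an error budget fit together, here is a simplified sketch. It is not Google's risk analysis sheet: the cost model (incidents per year times minutes to detect and resolve times the fraction of users affected), the function names, and every number in the catalog are assumptions made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def error_budget_minutes(target_percent: float) -> float:
    """Minutes of allowed unavailability per year for an availability target."""
    return MINUTES_PER_YEAR * (1 - target_percent / 100)


def expected_bad_minutes(incidents_per_year: float, minutes_to_detect: float,
                         minutes_to_resolve: float, users_affected: float) -> float:
    """Expected yearly cost of one risk; users_affected is a fraction from 0 to 1."""
    return incidents_per_year * (minutes_to_detect + minutes_to_resolve) * users_affected


# Hypothetical catalog: (name, incidents/year, detect min, resolve min, users affected)
risks = [
    ("bad release rolled out to all users", 4, 10, 30, 1.0),
    ("zone outage with no failover", 1, 5, 120, 0.5),
    ("expired TLS certificate", 0.5, 15, 60, 1.0),
]

budget = error_budget_minutes(99.9)
costs = {name: expected_bad_minutes(*details) for name, *details in risks}
total = sum(costs.values())

# List the risks from most to least expensive, with their share of the budget.
for name, cost in sorted(costs.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {cost:.1f} min/year ({100 * cost / budget:.0f}% of budget)")
print(f"total {total:.1f} of {budget:.1f} budget minutes:",
      "acceptable" if total <= budget else "over budget")
```

Running the same catalog against a stricter target shows quickly which risks have to be removed or mitigated before that target is realistic.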

Identify reliability champions spread throughout the company

Reliability champions are needed to support the execution of reliability. They are inherently closer to the design and deployment of an application. These champions should be architects, engineers, or SREs who know the best practice configurations that enable reliability. They are the members who contribute to the reliability scoring matrix in order to identify the best possible solution.

Create a governance board for reliability

Create a governance board to establish a reliability framework: a set of rules and processes related to reliability. This team should be a diverse body of engineers and business leaders. Their sole responsibility is managing the items identified above and working them into the processes of the company. The execution of the process can be completed by this group, but it can also fall to reliability champions, product owners, architects, and engineers. This body establishes the requirements for the minimal viable reliability and the long-term goals for a reliable application. Both sets of requirements should be aligned with the business objectives.

While there are many other processes and topics that could be introduced into a reliability program, these seem to be recurring themes. I hope these concepts help you on your way to increasing reliability on the cloud.