Google Cloud Architecture Framework : Reliability

Amey Laddad
7 min read · Feb 23, 2024


This is the fifth article in my seven-part series on the Google Cloud Architecture Framework.

Article 1: Google Cloud Architecture Framework Overview
Article 2: Google Cloud Architecture Framework : System design
Article 3: Google Cloud Architecture Framework : Operational Excellence
Article 4: Google Cloud Architecture Framework : Security

A reliable cloud solution is one that continuously meets user and business operations demands for performance, availability, and data integrity at the expected levels. It offers a smooth and stable computing environment for both users and enterprises. It does this by combining cutting-edge infrastructure, robust architecture, strict security measures, proactive monitoring, and quick support.

“Reliability is the precondition for trust.” — Wolfgang Schäuble

This image is generated using DALL-E

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services.

Everyone in engineering, including the development, product management, operations, and site reliability engineering teams, is accountable for reliability. Everyone needs to take responsibility and be aware of the reliability goals, error budgets, and risks associated with their application. Teams should be able to prioritize work appropriately and escalate priority conflicts between reliability and product feature development.

Important Terminologies

Reference document for these definitions: Link

Service level: a measurement of how well a service performs the work that the user expects of it. You can describe this measurement in terms of user happiness and quantify it in several ways, depending on what the service does and what the user expects it to do or is told it can do.

Critical user journey (CUJ): a set of interactions a user has with a service to achieve a single goal — for example, a single click or a multi-step pipeline.

Service level indicator (SLI): a gauge of user happiness that can be measured quantitatively for a service level.

Service level objective (SLO): the level that you expect a service to achieve most of the time and against which an SLI is measured.

Service level agreement (SLA): a description of what must happen if an SLO is not met. Generally, an SLA is a legal agreement between providers and customers and might even include terms of compensation. In technical discussions about SRE, this term is often avoided.

Image Courtesy: defining-SLOs-graph.png (610×271) (google.com)

Reliability Core Principles

These are self-explanatory principles that are important to follow when creating a reliable cloud-based solution. We’ll discuss some of them in more detail later in this article.

● Reliability is your top feature.
● Reliability is defined by the user.
● 100% reliability is the wrong target.
● Reliability and rapid innovation are complementary.
● Define your reliability goals.
● Build observability into your infrastructure and applications.
● Design for scale and high availability.
● Create reliable operational processes and tools.
● Build efficient alerts.
● Build a collaborative incident management process.

Define your reliability goals

The first step towards a reliable solution is to establish suitable ways to measure the quality of service that your customers receive, so that you can maintain dependable operations.

Goals provide direction, motivation, and focus. Goals that are clear, measurable, achievable, relevant, and time-bound, and that are reviewed regularly, increase the likelihood of success.

Important aspects to consider while setting goals are:

A) Choose appropriate SLIs:

Selecting the right service level indicators (SLIs) is crucial if you want to know exactly how well your service works. For example:

⮚ The following SLIs are typical in systems that Serve data:
● Availability
● Latency
● Quality

⮚ The following SLIs are typical in systems that Process data:
● Coverage
● Correctness
● Freshness
● Throughput

⮚ The following SLIs are typical in systems that Store data:
● Durability
● Throughput and latency

B) Choose SLIs and set SLOs based on the user experience:

Set your SLO just high enough that almost all users are happy with your service, and no higher. Because of network connectivity or other transient client-side issues, your customers might not notice brief reliability issues in your application, allowing you to lower your SLO.

If you can’t measure the customer experience and define goals around it, you can run a competitive benchmark analysis.

C) Iteratively improve SLOs:

Revisit SLOs quarterly, or at least annually, and confirm that they continue to accurately reflect user happiness and correlate well with service outages. Make sure that they cover current business needs.

Define SLOs

An SLO is a target level of reliability for a service.

Service Level Objectives (SLOs) are crucial tools for assessing, defining, and enhancing service performance and dependability. In the end, they contribute to an organization’s overall success by promoting continuous improvement, supporting risk management, enhancing customer happiness, and coordinating technical and commercial goals.

An SLO is composed of the following values:

SLI: For example, the ratio of the number of responses with HTTP code 200 to the total number of responses.
Duration: The time period in which a metric is measured. This period can be calendar-based.
Target: For example, a target percentage of good events to total events (such as 99.9%) that you expect to meet for a given duration.
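
To make the three parts concrete, here is a minimal Python sketch of an SLO as a small data structure. The class name `SLO`, its fields, and the sample numbers are illustrative assumptions, not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the three SLO parts listed above."""

    sli_name: str       # what is measured, e.g. share of HTTP 200 responses
    duration_days: int  # measurement period
    target: float       # required ratio of good events, e.g. 0.999

    def sli(self, good_events: int, total_events: int) -> float:
        """Ratio of good events to total events (1.0 when there is no traffic)."""
        if total_events == 0:
            return 1.0
        return good_events / total_events

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the measured SLI reaches the target."""
        return self.sli(good_events, total_events) >= self.target

    def allowed_bad_events(self, total_events: int) -> int:
        """Error budget for the period, expressed as a number of bad events."""
        return round((1 - self.target) * total_events)


if __name__ == "__main__":
    slo = SLO(sli_name="http_200_ratio", duration_days=30, target=0.999)
    print(slo.is_met(good_events=999_500, total_events=1_000_000))
    print(slo.allowed_bad_events(total_events=1_000_000))
```

With a 99.9% target and one million requests in the period, the service can serve up to 1,000 bad responses before the SLO is missed.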

An effective strategy is to create SLOs that center on the most significant user interactions with the product, which is a collection of services. So, to develop an effective SLO, it’s important to consider critical user journeys and availability.

Choose an SLI

You need a measurement to establish whether an SLO is being met. That measurement is called a service level indicator (SLI). An SLI gauges the quality of a certain service that you provide to your customer. Ideally, the SLI is linked to a recognized CUJ.

Metrics come in several types. Select the one that best fits your use case:

Counter: For example, the number of errors that occurred up to a given point of measurement.
Distribution: For example, the number of events that populate a particular measurement segment for a given time period.
Gauge: For example, the actual value of a measurable part of the system. This type of metric can increase or decrease.
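
The difference between the three types is easier to see in code. The following is a plain-Python sketch; the class names `Counter`, `Gauge`, and `Distribution` are illustrative and do not come from a monitoring library.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Cumulative count that only goes up, e.g. total errors so far."""

    value: int = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


@dataclass
class Gauge:
    """Current value of something measurable; it can go up or down."""

    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Distribution:
    """Number of events per bucket, e.g. request latencies in milliseconds."""

    upper_bounds: list[float]  # sorted, inclusive upper bound of each bucket
    counts: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # One bucket per bound, plus an overflow bucket for larger values.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.upper_bounds, value)] += 1


if __name__ == "__main__":
    latency_ms = Distribution(upper_bounds=[100, 300, 1000])
    for sample in (50, 100, 250, 5000):
        latency_ms.observe(sample)
    print(latency_ms.counts)
```

A latency SLI such as "share of requests served within 300 ms" is usually derived from a distribution, while an availability SLI is usually a ratio of two counters.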

How to decide whether an SLI metric is good?

Image Courtesy: defining-SLOs-bad-versus-good.png (2500×780) (google.com)

As the image above shows, with the bad SLI, user unhappiness doesn’t correspond directly with a negative event such as service degradation, slowness, or an outage, and the SLI fluctuates independently of user happiness. With the good SLI, the SLI and user happiness correlate, the different happiness levels are clear, and there are far fewer irrelevant fluctuations.

A good SLI metric has the following characteristics:

● The metric directly relates to user happiness.
● Metric deterioration correlates with outages.
● The metric provides a good signal-to-noise ratio.
● The metric scales monotonically, and approximately linearly, with customer happiness.

SLO burn rate & Alerts

The rate at which an outage exposes users to errors and exhausts the error budget is known as the SLO burn rate. You can find out how long until a service breaches its SLO by taking a measurement of your burn rate.
An effective strategy is to send out alerts based on the SLO burn rate.
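
As a rough sketch, burn rate is commonly computed as the observed error ratio divided by the error budget (1 minus the SLO target). The function names below are hypothetical, and the calculation assumes that traffic and the error ratio stay constant.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than "sustainable" the error budget is being spent.

    A burn rate of 1 uses up exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_ratio / budget


def hours_until_budget_exhausted(
    rate: float, period_days: float, budget_remaining: float = 1.0
) -> float:
    """Hours until the remaining budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_days * 24 * budget_remaining / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.005, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours left: {hours_until_budget_exhausted(rate, period_days=30):.0f}")
```

For example, a 0.5% error ratio against a 99.9% target is a burn rate of about 5, which would use up a full 30-day budget in about 6 days.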

If 100% of requests fail during the specified interval, the time required to surpass an objective is displayed in the following table:

Table Content Courtesy: SLOs and alerts | Cloud Architecture Center | Google Cloud
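
The arithmetic behind such a table is simple: during a total outage, the time to breach the objective is the error budget (1 minus the objective) multiplied by the length of the period. The sketch below uses a hypothetical function name and assumes constant traffic over the period.

```python
from datetime import timedelta


def time_to_breach_at_full_outage(objective: float, period: timedelta) -> timedelta:
    """Outage time that uses up the whole error budget for the period.

    Assumes every request fails during the outage and that traffic is
    constant across the period (both are simplifying assumptions).
    """
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    return period * (1.0 - objective)


if __name__ == "__main__":
    for objective in (0.99, 0.999, 0.9999):
        for days in (30, 7, 1):
            budget = time_to_breach_at_full_outage(objective, timedelta(days=days))
            minutes = budget.total_seconds() / 60
            print(f"{objective:.2%} over {days:>2} days: {minutes:.1f} min")
```

For a 99.9% objective over 30 days, this gives 43.2 minutes of total outage.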

When to alert

When to take action depending on your SLO burn rate is a crucial question. In general, you should page someone to address an issue right away if you would use up all of your error budget in less than a day.
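
That rule of thumb can be written as a small check. This is only a sketch: the function name, its parameters, and the default 30-day period are assumptions, and it presumes the current error ratio stays constant.

```python
def should_page(
    error_ratio: float,
    slo_target: float,
    period_days: float = 30,
    budget_remaining: float = 1.0,
    page_if_exhausted_within_hours: float = 24,
) -> bool:
    """Page when the remaining error budget would be gone in less than a day."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    if error_ratio <= 0:
        return False  # nothing is failing, so nothing is being burned
    # Burn rate: how many times faster than sustainable the budget is spent.
    burn = error_ratio / budget
    hours_left = period_days * 24 * budget_remaining / burn
    return hours_left < page_if_exhausted_within_hours


if __name__ == "__main__":
    print(should_page(error_ratio=0.05, slo_target=0.999))   # fast burn
    print(should_page(error_ratio=0.005, slo_target=0.999))  # slower burn
```

With a 99.9% target, a 5% error ratio would exhaust a full 30-day budget in roughly 14 hours, so it pages. A 0.5% error ratio would take about 6 days, so it does not page unless most of the budget is already spent.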

Measuring the failure rate is not always straightforward. A string of small errors can look alarming in the moment, yet be short-lived and have little effect on your SLO. Similarly, a system that is slightly broken for a long time can accumulate enough errors to cause an SLO violation.

The diagram below shows how to adopt SLO burn rates:

Image Courtesy: adopting-slos-burn-rate.png (1149×659) (google.com)

The following table shows a baseline set of SLO alerts suggested by Google Cloud.

Table Content Courtesy: SLOs and alerts | Cloud Architecture Center | Google Cloud

Design for scale and high availability

A reliable service continues to respond to customer requests when there’s a high demand on the service or when there’s a maintenance event.

Reliability design principles and best practices:

● Implement exponential backoff with randomization in the error retry logic of client applications.
● Implement a multi-region architecture with automatic failover for high availability.
● Use load balancing to distribute user requests across shards and regions.
● Design the application to degrade gracefully under overload.
● Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.
● Establish disaster recovery procedures and test them periodically.
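
The first practice in the list, exponential backoff with randomization, can be sketched as follows. `TransientError` and `retry_with_backoff` are hypothetical names, the defaults are arbitrary, and the randomization shown is the "full jitter" variant (one of several possible choices).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles on every retry, bounded by max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry in lockstep after the same outage.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. In production code you would also restrict retries to idempotent operations.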

This image is generated using DALL-E

For cloud computing environments to be successful and credible, reliability is essential. By prioritizing reliability and investing in strong infrastructure, technologies, and processes, cloud providers can deliver consistent performance, availability, and data integrity, and so satisfy the changing needs of users and organizations.

This was the fifth article in my seven-part series on the Google Cloud Architecture Framework. My upcoming articles will go into more detail about the remaining two pillars, along with some best practices for creating and managing a well-architected framework on GCP.

Thank you for reading this article. Your time is appreciated.
Until next time, stay curious!
