Measuring Reliability in GCP: Step-by-Step SLO Creation Guide Using Cloud Operations Sandbox

Published in

Google Cloud - Community

10 min read · Sep 10, 2021

In this step-by-step guide, I will demonstrate how to configure SLOs in Cloud Operations using our learning environment, Cloud Operations Sandbox.

After a short terminology introduction, I’ll introduce Cloud Operations Sandbox. Then, we’ll create SLOs for a web-based e-commerce app, Hipster Shop. Lastly, we’ll learn how to configure the SLOs we defined in GCP’s Cloud Operations suite.

If you are not familiar with SRE’s concepts and terminology, please review SRE fundamentals (SLIs vs SLAs vs SLOs) before moving to the next section.

In SRE, we use a customer-centric approach to drive every activity, from determining metrics to capacity planning and change management. As long as your users are happy, you can prioritize velocity, but when they perceive your service as unreliable, you should prioritize reliability. To improve your users’ experience, you must first define SLIs and SLOs.

SLI: Metrics that describe users’ experiences.
SLO: Targets for the overall health of a service.
SLA: Contractual obligations.

SLI, SLO, SLA recap

The scope for SLIs and SLOs is a user journey. Your users use your service to achieve a set of goals, and the most important ones are called Critical User Journeys (CUJs).

User Interacts with a Service to Achieve a goal.

Creating SLI/O Step By Step

For this guide, we’ll use Cloud Operations Sandbox, a click-to-deploy, open-sourced learning environment. Cloud Operations Sandbox helps practitioners gain an understanding of how to use Google Cloud’s Operations Suite, and provides the capability to apply SRE practices in an isolated cloud environment with synthetic traffic similar to a production environment.

Cloud Operations Sandbox is a quick and easy `playground environment` for evaluating Cloud Operations under conditions as close as possible to a real-world production environment. Start here: cloud-ops-sandbox.dev

Cloud Operations Sandbox includes a cloud-native microservices demo application called Hipster Shop.

Screenshot of the Hipster Shop homepage.

Hipster Shop is a web-based e-commerce application with a frontend and a series of backend services. The application consists of 11 microservices written in various languages that talk to each other over gRPC. The Hipster Shop demonstrates the use of technologies like Kubernetes/GKE, Istio, cloud monitoring and logging, and gRPC.

Users of Hipster Shop can browse items, add them to the cart, and purchase them. To develop SLIs it’s important to understand how the application is built.

To start with, let’s review Hipster Shop’s flows:

Users access the application through the Frontend.
Then purchases are handled by CheckoutService.
The CheckoutService depends on CurrencyService to handle conversions.
Other services such as RecommendationService, ProductCatalogService, and AdService provide the frontend with the content it needs to render the page.

For more information please refer to the public repo: github.com/GoogleCloudPlatform/cloud-ops-sandbox

SLO Creation Process

To ensure user happiness, you have to define it first. Our users are using our service to achieve a set of goals, the most important of which are the Critical User Journeys (CUJs).

After you have defined your CUJs, you want to identify the metrics that most closely describe the user experience. These metrics are your SLIs. You will then use those SLIs to define your SLO targets.

1. SLO Process: Identify our CUJ

Step 1: List the critical user journeys and order them by business impact: Checkout, Add to Cart, and Browse Products.

A few of the critical actions that users can do as part of shopping on Hipster Shop are:

Browse products,
Check out,
and Add to cart.

Next, ask yourself which of these are most important to the business, and use the answer as the prioritization criterion for your CUJs.

Checkout is not the only critical user journey, but it is tied directly to the business’s revenue, so we have prioritized it first.

If you find it challenging to prioritize CUJs, try to put yourself in the user’s shoes. In our example, as a user, how many times have you browsed or added items to a shopping cart online without purchasing the items?

CUJ: As a shopper, I want to purchase (check out) items in the store.

For the rest of this example, ‘Checkout’ will be the basis of the SLOs, but you could apply the same techniques to other critical user journeys.

The Checkout CUJ uses the following components of our architecture

2. SLO Process: SLI creation

After you identify the CUJ and the interaction you will focus on, you’ll want to choose which metrics will represent the user experience most accurately.

Given that our application, Hipster Shop, serves end-user e-commerce traffic, the users’ experience when performing different actions on the frontend during checkout should remain consistent. For that reason, we should define SLIs for request availability (how many requests are successful), latency (how long a request takes), quality, and other indicators.

When deciding what SLI types to use, I recommend reviewing the SLI Menu from The Art of SLOs handbook, page 7. The handbook has a handy guide for drafting SLIs depending on the type of interaction: request/response, data processing, or storage.

SLI Menu: Request/Response, Data Processing, Storage

The next step is to define the SLI Specification. The SLI Specification is an assessment of a service outcome that you think matters to your users. You will want to express it as a proportion: the number of `good` events divided by the total number of valid events.

SLI Specification: for availability, the proportion of valid events served successfully
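Written as a formula, the specification is a simple ratio. This is only a restatement of the good/valid proportion described above, not an additional definition from the handbook:

```latex
\[
\text{SLI} = \frac{\text{number of good events}}{\text{number of valid events}} \times 100\%
\]
```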

Lastly, you will refine that specification into a detailed SLI implementation by adding a measurement method, for example: application-level metrics, logs processing, front-end infrastructure metrics, synthetic clients/data, or client-side instrumentation (to learn more, see The Art of SLOs handbook, page 16).

SLI Implementation: event + success criteria + where/how you record the SLI.

In the case of the checkout CUJ, the goal is for the checkout functionality to maintain acceptable availability. To achieve that, you need to measure how many users tried to check out and how many of those requests succeeded, so the number of successful requests is the “good” metric.

It’s important to detail what specifically is going to be measured and where you are planning on measuring it.

The proportion of HTTP GET requests for /checkout_service/response_counts that do not have a 5XX status (3XX and 4XX are excluded from the error count), measured at the Istio service

Why are 3XX and 4XX status codes excluded from the error count? Because 3XX and 4XX indicate redirects and client-side errors, in this specific case you can treat them as “the system working as intended”. Therefore, you don’t want to count these requests as errors against the availability SLO.
Please note that this shouldn’t be generalized. Make this decision deliberately, and review these errors in your own system to check whether they represent a system failure.
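To make the success criterion concrete, here is a minimal Python sketch of how the availability SLI could be computed from a list of response status codes. The function names and the sample data are my own, for illustration only. In the Sandbox the counting is done for you by Istio and Cloud Monitoring, so treat this as a way to reason about the definition rather than as the actual implementation.

```python
def is_good(status_code: int) -> bool:
    """A request counts as good unless the server answered with a 5XX status."""
    return not 500 <= status_code <= 599


def availability_sli(status_codes: list[int]) -> float:
    """Return the proportion of good requests, as a percentage."""
    if not status_codes:
        raise ValueError("no valid events in the measurement window")
    good = sum(1 for code in status_codes if is_good(code))
    return 100.0 * good / len(status_codes)


# Hypothetical sample: 96 successes, 1 redirect, 1 client error, 2 server errors.
sample = [200] * 96 + [302, 404, 500, 503]
print(f"{availability_sli(sample):.1f}%")  # prints 98.0%
```

Note that the redirect and the client error count as good here, which mirrors the decision explained above. If 4XX responses in your system can indicate a real failure, `is_good` is the place to change that.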

3. SLO Process: SLO

After defining the SLIs, it’s time to set the SLO (Service Level Objective), the target for the SLI during a specified time window, for example, a month or a quarter.

The SLO should include a target expressed as a percentage (the ratio of good to total events), for example 99.99%, and a specific measurement window. That window should be long enough to let you make strategic decisions and prioritize reliability when needed (to learn more, read: Choosing an appropriate time window).

Step 3: SLO Creation: 1) Determine SLO target goals 2) Determine SLO measurement period

If you want to make changes to your system based on tracking these SLOs, it is important to set an achievable target. If your target is overly ambitious, the SLOs will quickly become an ignored nuisance. If your customers are happy with your service today, you’re probably doing OK, and you might want to base your targets on historical data that align with your users’ expectations. Those targets are not set in stone, and you should revisit those targets and iterate on them to align with the business requirements.

In our case, you can assume our SLO will be 99% availability, according to the historical data trends. If you do not have historical data, try to check competitors and benchmarks for the type of journey in question to examine the user’s expectations. You might consider breaking up journey phases, for example, if a manual action is needed between interactive services versus report generation (for more, refer to the Buy-In currency CUJ in The Art of SLOs handbook, page 20).

99% of Checkout requests in the past 28 days are successful
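A 99% target implies an error budget: the 1% of requests that are allowed to fail in the window. The sketch below works through that arithmetic in Python. The request and failure totals are hypothetical numbers I picked for illustration, not real Hipster Shop traffic, and the function names are my own.

```python
from fractions import Fraction

SLO_TARGET = Fraction("0.99")  # the 99% availability target from the SLO above


def error_budget(total_requests: int, slo_target: Fraction = SLO_TARGET) -> int:
    """Requests that may fail in the window before the SLO is missed."""
    return int(total_requests * (1 - slo_target))


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_target: Fraction = SLO_TARGET) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        raise ValueError("not enough traffic to allow any failures")
    return float(1 - Fraction(failed_requests, budget))


# Hypothetical 28-day totals, not real Hipster Shop traffic.
total, failed = 1_000_000, 2_500
print(error_budget(total))                       # prints 10000
print(f"{budget_remaining(total, failed):.0%}")  # prints 75%
```

I used `Fraction` instead of floats so that the budget is exact. With plain floats, `1 - 0.99` is not exactly `0.01`, and truncating the result can be off by one request.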

Step 4:

After you have defined your SLIs and SLOs, it’s time to turn those definitions into tangible dashboards and alerts. In GCP, our tool to implement SLIs and SLOs is Cloud Operations Suite.

Implementation using Cloud Operations Suite

Cloud Operations Suite provides service-oriented monitoring, which means that you are configuring SLIs, SLOs, and Burn Rate Alerts for a ‘service’.

The first step in creating an SLO is to ingest the metric data that your SLIs will use. For GKE services this comes out of the box, but you can also ingest additional data. Then you need to define your service, your indicator (the SLI), your target (the SLO), and lastly the Burn Rate Alert.

Steps: 1: Define Service, 2: Define SLI, 3: Define SLO, 4: Define Alert.

One of the advantages of using Cloud Operations Sandbox is that it integrates many observability metrics from the Hipster Shop microservices with Cloud Operations Suite out of the box. You can find all the Hipster Shop microservices under Monitoring → Services → Services Overview:

For our use case, you need to choose the checkout service. As you can see, there are two checkout services. That is because Cloud Operations Sandbox’s services use Istio. Services using Istio are automatically detected, and Cloud Operations Suite creates them in Cloud Monitoring for us. However, to demonstrate how you can create your own services, Cloud Operations Sandbox also deploys custom services using Terraform (view Terraform configuration). Hence, we have two services: one auto-detected because Istio is used, and another explicitly defined in Terraform code.

To explore the existing SLOs, choose the custom Checkout Service (this was created by the Terraform code in Cloud Operations Sandbox):

Screenshot of Services in Cloud Operations.

Screenshot of the Checkout service in Cloud Operations.

Next, you will choose the Checkout Service that is auto-detected (this was auto-detected because Istio was used):

Screenshot of Create SLO in the Checkout service.

1. To configure the availability SLI and SLO you defined in the previous section, you will choose → `Create SLO`.
2. In the first screen, you will choose your SLI Type: Availability

Create Service Level Objective screen: Set your SLI.

3. Then, you will choose the performance metric: Availability. Your SLI Implementation will be: service/server/request_count, which is the percentage of successful HTTP GET requests for service/request_count, as measured by Istio. In this screen, you can also see historical data for that metric.

Create Service Level Objective screen: define your SLI.

4. Then, you should configure your SLO. In this case, it is: 99% of Checkout requests in the past 28 days are successful. The 28-day rolling window is usually more aligned with the customer experience and is a good general interval for making strategic decisions and planning ahead (to learn more, read: Choosing an appropriate time window).

Before saving, you can review the SLO and see what the JSON configuration should look like (that can be used as part of automation):
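As a rough idea of what such a configuration contains, the Python sketch below builds a body in the shape of the Cloud Monitoring API’s ServiceLevelObjective resource (a goal, a rolling period, and a basic availability SLI). The display name is hypothetical, and the JSON the console shows for your service may contain different or additional fields, so compare it with the `Review and Save` screen before reusing it.

```python
import json

WINDOW_DAYS = 28  # rolling window from the SLO defined earlier

slo = {
    # Hypothetical display name, chosen for this example.
    "displayName": "99% - Availability - Rolling 28 days",
    # Target: 99% of requests are good.
    "goal": 0.99,
    # Durations are expressed in seconds with an "s" suffix.
    "rollingPeriod": f"{WINDOW_DAYS * 24 * 60 * 60}s",
    # Basic SLI of type availability, as selected in the console.
    "serviceLevelIndicator": {
        "basicSli": {"availability": {}},
    },
}

print(json.dumps(slo, indent=2))
```

Keeping the definition as data like this is what makes it usable for automation: the same structure can be stored in version control and reviewed like any other change.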

Update your SLO: 'Review and Save' screen.

5. After that, you can see the new SLO on the `checkout service` overview page. Lastly, you can create SLO burn rate alerts for it:

Checkout Service screen in Cloud Operations after creating SLO. Button to 'Create Alerting Policy' is now below the SLO.

In the next screen, you will set the alert’s condition, who to notify, how (email, webhook, etc.), and any additional instructions to include:
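When choosing the alert condition, it helps to know what a burn rate is: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the error budget exactly over the full window, and anything higher spends it faster. The Python sketch below illustrates that calculation under assumed numbers. The last-hour counts and the threshold of 10 are hypothetical, and this is the common definition rather than a description of how Cloud Monitoring evaluates the condition internally.

```python
WINDOW_DAYS = 28  # rolling window of the SLO


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    return (failed / total) / (1.0 - slo_target)


def should_alert(rate: float, threshold: float) -> bool:
    """Fire when the budget is being spent faster than the chosen threshold."""
    return rate > threshold


# Hypothetical numbers for the last hour: 50 failures out of 1,000 requests.
rate = burn_rate(failed=50, total=1_000, slo_target=0.99)
print(f"burn rate: {rate:.1f}")
print(f"budget lasts about {WINDOW_DAYS / rate:.1f} days at this pace")
print(f"alert: {should_alert(rate, threshold=10.0)}")
```

In this example the service is failing about five times faster than the SLO allows, which would exhaust a 28-day budget in under a week, yet it stays below the hypothetical threshold of 10. Picking the threshold and lookback period is a trade-off between catching problems early and avoiding noisy alerts.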

After the alert is created, you can see it and any incidents that are being triggered by the SLO in the service screen:

Incident triggered in the service screen

In case an alert is triggered, you will be able to see it in the Alerting screen:

Incident triggered in the Alerting screen.

After you finish, you can see the service details, Alert timelines, SLOs, and additional information under the services screen:

Services Overview screen after creating SLOs.

Now that you have learned how to draft the checkout availability SLO and implement it in Cloud Operations, I would like to encourage you to draft and implement additional SLOs, such as a latency SLO, cover additional CUJs like Add to Cart, or even leverage these learnings to draft and configure SLIs and SLOs for your own services.

A journey to SRE includes both implementing SRE practices like SLOs that are covered here, and fostering an enabling culture. Below, you can find additional resources, including a collection of publicly available resources by resource type and level.

Additional Resources:

SRE books (SRE, Workbook, Building Secure & Reliable Systems)
Collection of SRE Public Resources for GCP Customers
Coursera: Site Reliability Engineering: Measuring and Managing Reliability
SLO Monitoring in GCP’s Cloud Operations