Creating a Dashboard with Stackdriver SLI Monitoring Metrics

Charles
Google Cloud - Community
7 min read · Aug 4, 2018

If you really want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful requests. A Service-Level Indicator (SLI) is a direct measurement of a service’s behavior and can be used when setting a Service-Level Objective (SLO).

When you build your applications on a cloud infrastructure like Google Cloud Platform, you want to build the app so that the operations (DevOps or SRE) teams can monitor it effectively. Stackdriver Transparent SLIs provide detailed metrics for over 130 Google Cloud services and report those metrics as experienced from your individual project(s). These SLI metrics can be used in Stackdriver Monitoring dashboards, along with other relevant metrics for your applications, to help speed up your operations teams' root-cause analysis.

Using Stackdriver Transparent SLI metrics, you can break down the metrics for each service by the following dimensions:

  • Service name
  • Method
  • API version
  • Credential ID
  • Location
  • Protocol (HTTP / gRPC)
  • HTTP Response Code (e.g. 402)
  • HTTP Response Code class (e.g. 4xx)
  • gRPC Status Code

These are fine-grained metrics reported for your specific project, giving you detailed insight into your application's usage of Google Cloud services.

So, how does this help you with observability? Put simply, Stackdriver Transparent SLIs give you the ability to view the interactions between your software and Google Cloud services. You can more easily determine whether your application code is the root cause of an issue or whether the Google Cloud services are part of the root cause.
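To make the dimensions above concrete: in the Monitoring API they surface as resource and metric labels on the Transparent SLI metrics. The helper below is a hypothetical sketch (not from the article) that builds a filter string selecting the SLI request count for one service, optionally narrowed to a response-code class:

```python
# Sketch: build a Cloud Monitoring filter for the Transparent SLI
# request-count metric. The service name passed in is an example;
# "service" and "method" are resource labels on the consumed_api
# resource type, while response codes are metric labels.
def sli_filter(service, response_code_class=None):
    """Return a Monitoring API filter string for the SLI request count."""
    parts = [
        'metric.type="serviceruntime.googleapis.com/api/request_count"',
        'resource.type="consumed_api"',
        f'resource.label.service="{service}"',
    ]
    if response_code_class:
        parts.append(f'metric.label.response_code_class="{response_code_class}"')
    return " AND ".join(parts)

print(sli_filter("cloudfunctions.googleapis.com", "4xx"))
```

The same string can be pasted into the Metrics Explorer filter field or passed to the Monitoring API's time-series list call.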

I recently built a dashboard for an app which included metrics using the Stackdriver SLIs. The app included Cloud Functions logic, communication via Pub/Sub, calls to the Cloud Vision and Video Intelligence APIs, and a BigQuery dataset. I explored the Stackdriver SLI metrics and then used them to create a monitoring dashboard using Stackdriver Monitoring. Here are the steps that I used to evaluate which metrics to use and then to create the dashboard for my app.

Step 1. Deploy the app.

I deployed this app which includes a series of 4 Cloud Functions which call the Vision and Video Intelligence APIs for images and videos, respectively.

The app is a backend for processing image and video files which classifies them as explicit or not. The functions are orchestrated using Cloud Storage notifications and Cloud Pub/Sub. The results of the Vision and Video Intelligence APIs are stored in BigQuery for analysis.

Step 2. Select the metrics for the dashboard.

The Site Reliability Engineering (SRE) book (which you can read online for free) talks in depth about SRE practices. Chapter 6 of the book covers monitoring distributed systems and describes the four golden monitoring signals as the following:

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

The Stackdriver SLI metrics provide request latency, request count, request sizes and response sizes for GCP service calls. These SLI metrics cover the latency, traffic, error and saturation signals described in the SRE book, which provides good coverage for the Google Cloud services used by my app.

In addition to the Stackdriver SLI metrics, I also wanted to monitor the service-specific metrics for the app components such as Pub/Sub, Cloud Storage, BigQuery and Cloud Functions. Taken together with the Stackdriver SLI metrics, these metrics provide deep insights into the behavior of the app.

Step 3. Explore the metrics for the app services.

I used Metrics Explorer in the Stackdriver Monitoring UI to inspect the metric details available for each of the Stackdriver SLI metrics.

First, I looked at the request latency metric by entering the Resource Type “Consumed_api” and selecting the metric “Request latencies”. I grouped the metrics by service to display the individual latencies and then selected the Aggregation of “99th percentile”.

I also grouped the service by method to look at the individual service call methods and their latency. This is a useful metric to explore if an overall service such as cloudfunctions.googleapis.com has a high latency. Breaking it out by method allowed me to see the latencies for each specific method such as google.cloud.functions.v1.CloudFunctionService.UpdateFunction.
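The Metrics Explorer settings described above can also be expressed as a Monitoring API filter plus aggregation. The sketch below uses the v3 API field names; the 300-second alignment period is my assumption, not something from the article:

```python
# Sketch: the latency exploration above as an API-style query.
# The filter selects the SLI latency distribution; the aggregation
# reduces it to a 99th-percentile line per service and method.
latency_filter = (
    'metric.type="serviceruntime.googleapis.com/api/request_latencies" '
    'resource.type="consumed_api"'
)

latency_aggregation = {
    "alignmentPeriod": "300s",                     # assumed window
    "perSeriesAligner": "ALIGN_DELTA",             # align the distribution values
    "crossSeriesReducer": "REDUCE_PERCENTILE_99",  # "99th percentile" in the UI
    "groupByFields": [
        "resource.label.service",  # one line per service...
        "resource.label.method",   # ...broken out by method
    ],
}
```

Grouping by `resource.label.method` is what lets a slow method like `google.cloud.functions.v1.CloudFunctionService.UpdateFunction` stand out from the service-level aggregate.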

Next, I explored the request count metric. I selected the Resource Type “Consumed_api” and selected the metric “Request count”. I grouped the metrics by “service” to display the individual counts and then selected the Aggregation of “sum”. This provided the rate of requests received by the various services and aligned well with the traffic metric for the app that I wanted to monitor.

To get the error metric, I used the same Resource Type of “Consumed_api” and metric of “Request count”, but this time I added a filter for “response_code != 200” to look at traffic that returned error status codes.

To see the gRPC errors, I used the same graph, but substituted "grpc_status_code != 0" for the filter and then grouped by "grpc_status_code". Both are useful to review the errors returned from the GCP service calls.
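The two error filters above can be written out as Monitoring API filter strings. This is a sketch, not from the article; the comparisons mirror the UI filters described, assuming the response-code labels compare as numbers:

```python
# Sketch: filter strings for the two error views described above.
# HTTP errors: any request whose response code is not 200.
http_error_filter = (
    'metric.type="serviceruntime.googleapis.com/api/request_count" '
    'resource.type="consumed_api" '
    "metric.label.response_code != 200"
)

# gRPC errors: any request whose gRPC status code is not OK (0).
grpc_error_filter = (
    'metric.type="serviceruntime.googleapis.com/api/request_count" '
    'resource.type="consumed_api" '
    "metric.label.grpc_status_code != 0"
)
```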

This exploration provided insights into the SLI metrics that I would include in my dashboard. I also did a similar exploration for the resource types that I knew were part of my project, including the following:

  • cloud_function
  • pubsub_topic
  • gcs_bucket
  • bigquery_dataset

Step 4. Build the dashboard.

Now that I had explored the metrics, I had a good sense for the metrics that I would include on the charts in my dashboard. I then started a new dashboard in the Stackdriver Monitoring UI by selecting "Dashboard => Create dashboard". I clicked the "Add Chart" button to add each chart.

Request rate

The first chart that I added was the SLI monitoring request count. The purpose of this chart was to display the request rate for the Google Cloud services. This chart may be useful to understand which services are generating higher call rates.

  • Resource Type: Consumed_api
  • Metric: request_count
  • Group By: service
  • Aggregation: sum

Request latencies

The purpose of this chart was to display the latencies for the Google Cloud services. This chart may be useful when correlated with the overall latency of your application to help identify whether latency in Google Cloud services is causing or contributing to your overall app latency.

  • Resource Type: Consumed_api
  • Metric: request_latencies
  • Group By: service
  • Aggregation: 99th percentile

Error rate

The next chart that I added covered the error response rates using the SLI monitoring metric request count, but filtered for error requests. For this chart, I added 2 different metrics to cover both HTTP response codes and gRPC status codes. The purpose of this chart was to display the error rates for the Google Cloud services. This chart may be useful to understand whether your apps are generating errors based on failed calls to Google Cloud services.

HTTP error responses:

  • Resource Type: Consumed_api
  • Metric: request_count
  • Filter: response_code_class != 2xx
  • Group By: service, response_code
  • Aggregation: sum

gRPC error responses:

  • Resource Type: Consumed_api
  • Metric: request_count
  • Filter: grpc_status_code != 0
  • Group By: service, grpc_status_code
  • Aggregation: sum

Response size

The last SLI monitoring metric that I included was the response size. The purpose of this chart was to display the response sizes for the Google Cloud services. This metric may be useful in debugging high latency in your app services and identifying which parts of your app may be optimized to handle larger results.

  • Resource Type: Consumed_api
  • Metric: response_sizes
  • Group By: service
  • Aggregation: 99th percentile

That’s it for the Stackdriver SLI metrics on the dashboard. I also added the Stackdriver monitoring metrics for Pub/Sub, Cloud Storage, Cloud Functions and BigQuery. You can check out the details for these metrics on GitHub. Here’s the full dashboard for reference.
