Automating Application Dashboard Creation for Services on GKE/Istio

Yuri Grinshteyn
Google Cloud - Community
6 min read · Jan 7, 2020

Introduction

One of the more interesting concepts I’ve been hearing more and more about both from customers and from folks I respect and follow in the industry is the idea of “monitoring as code”. This is, of course, a subset of the “everything as code” movement, but it’s something that really resonates with me. Specifically, I have talked to lots of folks recently who are interested in automating the setup of the monitoring configuration when new services or new projects are rolled out. That generally includes two main things — alerting and dashboards. Automating the creation of alerting policies in Stackdriver has been available for some time through the use of the relevant API or Terraform. Now, the same automation is available for Stackdriver dashboards!

Dashboards

Credit to https://twitter.com/ronsoak

It has been my long-held opinion that dashboards in general are over-emphasized in monitoring because they don’t really help with problem detection or problem resolution. The former is better addressed via well-defined SLOs and alerting, and the latter via good observability and ad-hoc querying capabilities. Nevertheless, dashboards are widely used by nearly everyone involved in service reliability and are a must-have capability in any monitoring setup. Stackdriver dashboards are very popular, and most of my customer conversations include helping folks visualize data in one way or another. One of the most common questions I field is essentially “what should go in our dashboard?”

Thankfully, the answer to that has been well documented in the Monitoring Distributed Systems chapter in the SRE book. If you have nothing else, start with the “golden signals” — traffic, errors, latency, and saturation.

We can take a basic set of services running in GKE and managed by the Istio service mesh as an example. I’ve previously written on the value of Istio for observability and monitoring and its integration with Stackdriver here. Separately, I created a tutorial for building a dashboard showing “golden signal” data from such a system using Grafana here. In general, we are after something that looks like this (note that I’m using the beta release of Monitoring in the Cloud Console here):

Services Dashboard in Stackdriver

Let’s take a look at how to create a dashboard focused on the “golden signals” of service health.

Application Dashboard

Request Rates

The first “golden signal” is “traffic” — essentially a measure of how much user activity the service is responding to. Thankfully, Istio provides this natively for every service in the mesh. The metric is called “server_request_count”, and the first chart in my dashboard, “Request Rates by Service”, uses that metric and groups the results by the “destination_service_name” label. Here’s its configuration in detail:

Request Rates Chart

Errors

The next chart, “Errors by Service”, uses the same metric and grouping options; the only difference is that the data is then filtered to count only requests where the response code is not 200. This is a rough approach, since it also counts 3xx redirects and 4xx errors, which are often the result of misconfigured or misbehaving clients, but it suffices to illustrate the point. Here’s the configuration for that chart:

Errors Chart

Latency

The last chart again uses an Istio metric — this time, it’s “Server Response Latencies”, grouped by “destination service name” and using the 99th percentile aggregation. I am not filtering out errors for this one, though it might be a good idea. Here’s how it’s configured:

Latency Chart

Automation

Creating such a dashboard manually in Stackdriver is not difficult, assuming the data is available, but it is toil that should be automated where possible. With the beta release of the Dashboards API, it now can be! You can also use the API to, for example, copy dashboards between Workspaces or to drive further standardization through automation. Let’s take a look at the details.

API

The API is documented here, but the general idea is pretty straightforward: we simply call the projects.dashboards.create method, passing a Dashboard object, which contains a name, description, and a set of Widget objects that specify the charts themselves.

Structure

Because Widgets are the basic building blocks we create a dashboard from, let’s start there. We already have our chart definitions, and it’s just a matter of converting them to the JSON representation expected by the API. We start by defining the dashboard itself, which needs:

  1. Name, formatted as “projects/<project ID or number>/dashboards/<ID>”; for creation, we can leave this empty
  2. displayName — the name to be shown when the dashboard is actually created
  3. A “root” object — the actual content to be displayed, which contains a Widget
  4. Default options to be applied to the dashboard at load time

The root is where we define the actual content of the dashboard. For our example, we need to create a dashboard with two columns in it and three charts. So the root of the dashboard is a GridLayout with two columns. It then needs to include three additional widgets, each of which will specify an xyChart. Each chart needs to specify a DataSet object, which in turn uses a TimeSeriesFilter to actually query the data.
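The nesting described above can be sketched as plain Python dicts. This is only a sketch: the field names follow the Dashboards API reference, and `dashboard_skeleton` is a helper of my own, not part of the API.

```python
# Skeleton of the Dashboard object described above, built as a plain
# dict ready to be serialized to JSON. The "name" field is omitted
# because the API assigns it on creation.

def dashboard_skeleton(display_name, columns, widgets):
    """Top-level Dashboard body: a gridLayout holding chart widgets."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": str(columns),  # int64 fields are JSON strings
            "widgets": widgets,       # one widget per xyChart
        },
    }

skeleton = dashboard_skeleton("Services Dashboard", 2, [])
```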

Implementation

Let’s take a look at how we can actually implement our dashboard using this information.

Request Rates

Here’s the JSON definition for the Request Rates chart. I created this manually, but you can use the API Explorer to create your own baseline and simply populate it with parameters.

Request rates chart definition
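In case the embedded definition doesn’t render for you, here’s a reconstruction as a Python dict. The metric type, resource type, and aligner below are my assumptions for an Istio-on-GKE mesh, written against the current shape of the API (timeSeriesQuery wrapping timeSeriesFilter); verify them against the Dashboards API reference and your own mesh.

```python
# Reconstruction of the "Request Rates by Service" chart definition.
# The metric type and resource type below are assumptions for an
# Istio-on-GKE mesh; check what your mesh actually writes.

request_rates_chart = {
    "title": "Request Rates by Service",
    "xyChart": {
        "dataSets": [{
            "timeSeriesQuery": {
                "timeSeriesFilter": {
                    "filter": (
                        'metric.type="istio.io/service/server/request_count" '
                        'resource.type="k8s_container"'
                    ),
                    "aggregation": {
                        # A rate aligner for this delta metric, plus a sum
                        # reducer grouped by destination service.
                        "perSeriesAligner": "ALIGN_RATE",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                            "metric.label.destination_service_name"
                        ],
                    },
                },
            },
        }],
    },
}
```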

Some things to note in the definition:

  • The filter specifies the resource and metric. This is also where we would specify, for example, the cluster name, service name, or other attributes if needed.
  • We’re using the default aligner and the sum reducer to simply total up the requests and group them by the destination_service_name metric label.

Error Rates

Next, we need to define our error rates chart. Here’s that definition:

Error rates chart definition
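A reconstruction of that definition as a Python dict follows; as before, the metric and resource types are assumptions for an Istio-on-GKE mesh rather than a verbatim copy of the gist.

```python
# Reconstruction of the "Errors by Service" chart definition. It is
# identical to the request rates chart except for one extra filter
# clause excluding successful (HTTP 200) responses.

errors_chart = {
    "title": "Errors by Service",
    "xyChart": {
        "dataSets": [{
            "timeSeriesQuery": {
                "timeSeriesFilter": {
                    "filter": (
                        'metric.type="istio.io/service/server/request_count" '
                        'resource.type="k8s_container" '
                        'metric.labels.response_code != "200"'
                    ),
                    "aggregation": {
                        "perSeriesAligner": "ALIGN_RATE",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [
                            "metric.label.destination_service_name"
                        ],
                    },
                },
            },
        }],
    },
}
```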

Note that this time we’re using an additional filter to only count requests where the response code is not 200. Otherwise, this definition is the same as the request counts chart.

Latencies

Finally, we define the latencies chart:

Latency chart definition
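Reconstructed as a Python dict (again a sketch, with the metric and resource types assumed for an Istio-on-GKE mesh):

```python
# Reconstruction of the latency chart definition. The distribution
# metric is aligned with ALIGN_DELTA and reduced to the 99th
# percentile per destination service; the filter keeps only
# successful requests.

latency_chart = {
    "title": "Latencies by Service",
    "xyChart": {
        "dataSets": [{
            "timeSeriesQuery": {
                "timeSeriesFilter": {
                    "filter": (
                        'metric.type="istio.io/service/server/response_latencies" '
                        'resource.type="k8s_container" '
                        'metric.labels.response_code = "200"'
                    ),
                    "aggregation": {
                        "perSeriesAligner": "ALIGN_DELTA",
                        "crossSeriesReducer": "REDUCE_PERCENTILE_99",
                        "groupByFields": [
                            "metric.label.destination_service_name"
                        ],
                    },
                },
            },
        }],
    },
}
```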

We’re using the 99th percentile reducer and, this time, filtering the data to just successful requests.

Dashboard

It’s time to put it all together — here’s the full dashboard definition:

Note that we’re specifying a 4-column gridLayout, which actually results in a 3-column dashboard. We also specify each chart as a widget in the layout. We submit this as a POST request body to https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/dashboards (as per the documentation) and get this dashboard as a result:
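Putting it together in Python, a self-contained sketch of the request body and submission: the chart bodies are left empty here to keep it short (each holds an xyChart definition as described in the sections above), the curl command assumes gcloud credentials, and note that the GA Dashboards API is served under the v1 path, which may differ from the beta path used in this post.

```python
import json

# Self-contained sketch of the full request body. The three chart
# bodies are elided ({} placeholders); in practice each one holds an
# xyChart definition with its dataSets and aggregation.

charts = [
    {"title": "Request Rates by Service", "xyChart": {}},
    {"title": "Errors by Service", "xyChart": {}},
    {"title": "Latencies by Service", "xyChart": {}},
]

dashboard = {
    "displayName": "Services Dashboard",
    "gridLayout": {
        "columns": "4",  # int64 fields are JSON strings in Google APIs
        "widgets": charts,
    },
}

body = json.dumps(dashboard, indent=2)

# Submit with an authenticated POST, for example:
#   curl -X POST \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type: application/json" \
#     -d "$body" \
#     "https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/dashboards"
```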

Services Dashboard created via API call

Voila! We now have a way to automate the creation of dashboards in our workspaces!

In conclusion…

I hope you find this useful and start creating and automating your own dashboards using the API based on this simple example. In my next post, I’m going to repeat this exercise and create a dashboard to visualize the health of my Kubernetes cluster that the services are running on. Until then — thanks for reading!

Yuri Grinshteyn
Google Cloud - Community

CRE at Google Cloud. I write about observability in Google Cloud, especially as it relates to SRE practices.