Building monitoring dashboards for fun and profit

Yuri Grinshteyn
Google Cloud - Community
Nov 18, 2020

In my last two posts, I explored the new “out of the box” GKE monitoring dashboard and used it both to set up alerting against an important resource and to drill down from an alert to figure out a problem. That dashboard is certainly great if you care about the entirety of your GKE fleet, but what if your primary charter is to maintain the reliability of a specific service? Well, for that, you’d need to create a service-specific dashboard. I’ve covered creating dashboards before, and you can see me talk about automating their creation using Terraform in this episode of Stack Doctor. But now there’s a new Dashboard Editor, and I thought this would be a great opportunity to revisit this topic and use the new Editor to create an example dashboard capturing some key signals.

What’s in a dashboard?

As ever, the first question is — what should go on a dashboard? I covered this pretty well in both my post and video on automating dashboard creation, but let’s review quickly. My recommendation here is usually based on the Monitoring Distributed Systems chapter in the SRE book. If you have nothing else, start with the “golden signals” — traffic, errors, latency, and saturation.

From there, it’s just a matter of figuring out the best indicator for each of these. For example, if you have a way to count requests to your service, request count is a great way to measure traffic. If not, you can look at something like network usage for your workload. If you have a way to count responses by status code, that’s a good way to count errors. If not, maybe counting log entries with a severity of ERROR or higher is good enough. You can even measure something like container restarts to track application crashes. If you’re tracking response latency using something like OpenTelemetry, fantastic! If not, no worries: maybe you’re writing out transaction times in logs, and you can create a log-based metric to track them. And finally, you can always start with a basic infrastructure metric like CPU or memory use to track saturation.
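To make that fallback logic concrete, here is a minimal sketch that lists candidate indicators for each signal, most direct first, and picks the first one that is actually available. The metric type strings are my assumptions about typical Google Cloud metric names and should be checked against the metrics list; the log-based latency metric name is purely hypothetical.

```python
# Candidate indicators per golden signal, ordered from most to least direct.
# All metric type strings are assumptions to verify against the metrics list.
GOLDEN_SIGNALS = {
    "traffic": [
        "loadbalancing.googleapis.com/https/request_count",
        "kubernetes.io/pod/network/received_bytes_count",
    ],
    "errors": [
        "logging.googleapis.com/log_entry_count",  # filter on severity ERROR or higher
        "kubernetes.io/container/restart_count",
    ],
    "latency": [
        "loadbalancing.googleapis.com/https/total_latencies",
        "logging.googleapis.com/user/transaction_time",  # hypothetical log-based metric
    ],
    "saturation": [
        "kubernetes.io/container/cpu/limit_utilization",
        "kubernetes.io/container/memory/limit_utilization",
    ],
}


def pick_indicator(signal, available):
    """Return the first preferred metric for `signal` found in `available`, else None."""
    for metric in GOLDEN_SIGNALS[signal]:
        if metric in available:
            return metric
    return None


# Example: no request counter exists, so traffic falls back to network usage.
available_metrics = {
    "kubernetes.io/pod/network/received_bytes_count",
    "kubernetes.io/container/cpu/limit_utilization",
}
for signal in GOLDEN_SIGNALS:
    print(signal, "->", pick_indicator(signal, available_metrics))
```

A `None` result is a useful prompt in its own right: it marks a signal you have no way to observe yet.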

The important thing is that you have a good at-a-glance representation of the health of your service.

How do I dashboard?

As I’ve previously discussed, you have three options when creating a new dashboard: the UI, the API, or Terraform. This time, I wanted to focus on the UI specifically because there’s a new Dashboard Editor!

Following up on my post about troubleshooting, I figured I’d create a rather simple dashboard showing the basics of the health of my GKE workload. Let’s dive right in.

New stuff!

It is immediately apparent that the new editor looks quite different. There’s a list of widgets on the left, and a grid view in the actual dashboard pane. For comparison, here’s what the editor used to look like when you went to create a new dashboard:

Widgets

Instead of using the ADD CHART link at the top right, you start by selecting a widget. There are three new widget types — Gauge, Scorecard, and Text. The other four (Line, Stacked Area, Stacked Bar, and Heatmap) are familiar and still there. Let’s take a look at the new ones!

Gauge

The Gauge widget finally delivers a way for you to represent a single number on a dashboard as an indicator of health — along with a way to color-code it so that you can see whether things are healthy at a glance. Here’s a basic example:

This chart is showing how much of the CPU limit is being utilized by all containers averaged across a specific cluster (of course, you can use other aggregation types like Min/Max/Sum/99th percentile). One thing I really love about this is the ability to specify the ranges for the chart to change colors. For example, here’s how to set a warning threshold that will turn the chart yellow:

You can also set a danger threshold to turn it red:

For my purpose, this is actually a pretty good way to represent the “saturation” signal — if the containers, on average, are not close to the limit, there should be plenty of capacity left. If I care about a specific service, I’d probably want to filter this down further:
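If you prefer to keep dashboards as code, the same gauge can be described as a widget definition for the dashboards API. The sketch below is my own approximation of that JSON shape rather than something exported from the editor: the field names follow the API as I understand it, and the threshold values, cluster name, and container name are placeholders.

```python
import json


def gauge_widget(cluster, container=None, warning=0.7, danger=0.9):
    """Build a gauge widget for average CPU limit utilization in one cluster.

    Assumption: the dict mirrors the dashboards API widget schema
    (scorecard + gaugeView + thresholds); verify against the API reference.
    """
    filter_parts = [
        'metric.type="kubernetes.io/container/cpu/limit_utilization"',
        'resource.type="k8s_container"',
        f'resource.label."cluster_name"="{cluster}"',
    ]
    if container is not None:
        # Narrow the gauge down to a single service's containers.
        filter_parts.append(f'resource.label."container_name"="{container}"')

    return {
        "title": "CPU limit utilization",
        "scorecard": {
            "timeSeriesQuery": {
                "timeSeriesFilter": {
                    "filter": " ".join(filter_parts),
                    "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN",
                        # Average across containers; other reducers work too.
                        "crossSeriesReducer": "REDUCE_MEAN",
                    },
                }
            },
            "gaugeView": {"lowerBound": 0.0, "upperBound": 1.0},
            "thresholds": [
                {"value": warning, "color": "YELLOW", "direction": "ABOVE"},
                {"value": danger, "color": "RED", "direction": "ABOVE"},
            ],
        },
    }


# "my-cluster" and "frontend" are placeholder names.
print(json.dumps(gauge_widget("my-cluster", container="frontend"), indent=2))
```

Swapping `REDUCE_MEAN` for a max or percentile reducer gives a more pessimistic view of saturation, which may suit a service with uneven load.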

Scorecard

Next up — the Scorecard widget! It also allows you to show a single numeric value on the dashboard. However, unlike a gauge, it tracks the value over time:

Similarly to a gauge, you can use warning and danger thresholds to have the widget change colors:

You can also change how the value over time is displayed. The line is the default option, but you can also use a bar chart:

The Icon option will simply show you whether the value is within the desired range (with a green checkmark):

Or out of the desired range, with a red icon:

For our purposes, I really like this widget for visualizing latency. It’s an easy way to quickly see whether your service is meeting a specific performance objective.
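As a rough sketch, a latency scorecard could be written as a widget definition like the one below. The structure is my approximation of the dashboards API schema, the metric type is a hypothetical custom metric, and the thresholds are placeholder values in milliseconds. I only cover the line and bar displays here because I am not certain how the Icon option is represented in the API.

```python
import json

# Spark chart types assumed to correspond to the line and bar displays.
SPARK_TYPES = {"SPARK_LINE", "SPARK_BAR"}


def scorecard_widget(metric_type, warning, danger, spark="SPARK_LINE"):
    """Build a scorecard that shows the latest value plus its recent history."""
    if spark not in SPARK_TYPES:
        raise ValueError(f"unsupported spark chart type: {spark}")
    return {
        "title": "Response latency",
        "scorecard": {
            "timeSeriesQuery": {
                "timeSeriesFilter": {
                    "filter": f'metric.type="{metric_type}"',
                    "aggregation": {
                        "alignmentPeriod": "60s",
                        "perSeriesAligner": "ALIGN_MEAN",
                        "crossSeriesReducer": "REDUCE_MEAN",
                    },
                }
            },
            "sparkChartView": {"sparkChartType": spark},
            "thresholds": [
                {"value": warning, "color": "YELLOW", "direction": "ABOVE"},
                {"value": danger, "color": "RED", "direction": "ABOVE"},
            ],
        },
    }


# Hypothetical custom metric reporting latency in milliseconds.
widget = scorecard_widget(
    "custom.googleapis.com/my_service/latency_ms", warning=300, danger=500
)
print(json.dumps(widget, indent=2))
```

Setting the danger threshold to your latency objective makes the widget turn red exactly when the objective is missed.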

Text

The final new widget type is Text. At first glance, it’s not much to get excited about:

But this widget really shines once you select the Markdown option — this essentially turns it into a rich text editor (provided you’re comfortable writing Markdown). This is a great way to, for example, add documentation directly to the dashboard that can describe exactly what a user is looking at.

For example, you can use this widget to document the dashboard you’ve created so far and help orient a new user, like this:
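In a dashboard definition, a Markdown text widget is about as simple as widgets get. This sketch assumes the API represents it as a `text` object with `content` and `format` fields; the wording of the documentation itself is only an illustration.

```python
import json


def text_widget(title, markdown):
    """Build a text widget rendered as Markdown (field names assumed from the API)."""
    return {"title": title, "text": {"content": markdown, "format": "MARKDOWN"}}


# Illustrative documentation for the widgets built so far.
DOC = "\n".join(
    [
        "## How to read this dashboard",
        "",
        "* **CPU limit utilization**: saturation; yellow or red means little headroom.",
        "* **Response latency**: red means the latency objective is being missed.",
    ]
)

print(json.dumps(text_widget("About this dashboard", DOC), indent=2))
```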

Layout options

By now, you might have noticed that you have the option to do things like move widgets around and resize them. This is all part of the new “Mosaic” option in the editor:

Because you can resize charts as you see fit, you have a whole host of options for how your dashboard looks, and you’re no longer limited to the basic columns view. For example, you can add a request count chart to your dashboard that spans both of the columns we’ve created so far:

And now, you have a dashboard that represents three of the four “golden signals” — traffic (as request count), saturation (as measured by CPU limit usage), and latency.
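The layout described here can be sketched as a Mosaic definition in which each tile has a position and a size on a grid. This is my approximation of the API shape (a 12-column grid with `xPos`, `yPos`, `width`, and `height` per tile). The widgets are title-only placeholders, so a real dashboard would need actual widget bodies, and the overlap check is my own convenience, not something the API requires you to write.

```python
import json


def tile(widget, x, y, width, height):
    """Place a widget on the mosaic grid."""
    return {"xPos": x, "yPos": y, "width": width, "height": height, "widget": widget}


def overlaps(a, b):
    """True if two tiles share at least one grid cell."""
    return (
        a["xPos"] < b["xPos"] + b["width"]
        and b["xPos"] < a["xPos"] + a["width"]
        and a["yPos"] < b["yPos"] + b["height"]
        and b["yPos"] < a["yPos"] + a["height"]
    )


def mosaic_dashboard(name, tiles, columns=12):
    """Assemble a dashboard body, rejecting tiles that overflow or overlap."""
    for i, a in enumerate(tiles):
        if a["xPos"] + a["width"] > columns:
            raise ValueError(f"tile {i} extends past column {columns}")
        for b in tiles[i + 1:]:
            if overlaps(a, b):
                raise ValueError("tiles overlap")
    return {"displayName": name, "mosaicLayout": {"columns": columns, "tiles": tiles}}


# Placeholder widgets: gauge and scorecard side by side,
# with the request count chart spanning both columns underneath.
dashboard = mosaic_dashboard(
    "My service health",
    [
        tile({"title": "CPU limit utilization"}, x=0, y=0, width=6, height=4),
        tile({"title": "Response latency"}, x=6, y=0, width=6, height=4),
        tile({"title": "Request count"}, x=0, y=4, width=12, height=4),
    ],
)
print(json.dumps(dashboard, indent=2))
```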

Basic vs Advanced configuration

The last new thing you should be aware of is the new distinction between configuring a chart using the Basic and Advanced configuration options. For example, if you add a Heatmap widget to your dashboard, you may see a message like this letting you know that you need to use the Advanced mode to configure it:

That option is at the top of the chart configuration options:

While the Basic mode should serve most users in most cases, there are times when you may, for example, need to explicitly specify the chart alignment. Switching to Advanced mode lets you do just that:

For our heatmap, you can then select options like the preprocessing step and the alignment to get the exact chart you want.
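For comparison, here is roughly what those Advanced settings might look like when written out in a dashboard definition, where the alignment is always explicit. This is a sketch under several assumptions: the metric is a hypothetical distribution-valued latency metric, and the aligner and reducer shown are the combination I would expect a heatmap to need rather than values taken from the editor.

```python
import json


def heatmap_widget(metric_type, period_seconds=60):
    """Build a heatmap chart with an explicitly specified alignment.

    Assumptions: the metric is distribution-valued, and ALIGN_DELTA with
    REDUCE_SUM is a suitable aggregation for it.
    """
    return {
        "title": "Latency distribution",
        "xyChart": {
            "dataSets": [
                {
                    "plotType": "HEATMAP",
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type="{metric_type}"',
                            "aggregation": {
                                "alignmentPeriod": f"{period_seconds}s",
                                "perSeriesAligner": "ALIGN_DELTA",
                                "crossSeriesReducer": "REDUCE_SUM",
                            },
                        }
                    },
                }
            ]
        },
    }


# Hypothetical distribution metric for request latency.
print(json.dumps(heatmap_widget("custom.googleapis.com/my_service/latency"), indent=2))
```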

In summary

I am really excited about this new feature — and I’m really looking forward to seeing the new dashboards that y’all are going to create with the new widget, layout, and configuration options. Thanks for reading, and, as always — let me know what you’d like me to tackle next. These days more than ever — stay healthy out there!

CRE at Google Cloud. I write about observability in Google Cloud, especially as it relates to SRE practices.