o11y Done Wrong — Lessons Learned from Monitoring Production Systems

Moshe Beladev
Published in torq
Jul 20, 2023 · 5 min read

You wake up dizzy at 3 AM to your “nice friend” PagerDuty yelling, “Something is broken in production. Please come and help me!” You open one of your system monitoring dashboards, and your first thought is, “What the heck is going on 🤯?”

Does that sound familiar?

I found myself in this exact situation, looking for a needle in a haystack among a pile of metrics, visualizations, and overwhelming data, which, as we all know, puts a heavy burden on our cognitive load.

Two approaches to visualizing data. Where do you feel more of the cognitive load burden?
Image credit: Arngren, https://xd.adobe.com/ideas/wp-content/uploads/2020/07/6-ways-to-reduce-cognitive-load-for-a-better-ui-3.png.webp

In this post, I’ll share what I learned after that incident, along with a bit about our o11y stack and some practical tips for designing, implementing, and maintaining dashboards that I have found very useful.

Before we go forward, let’s go back

At Torq, we help organizations automate their security workflows. That means that we handle a very critical and sensitive process in our customers’ organizations, and they trust us and our system’s reliability.

As a result, we invest a lot in monitoring our systems to make sure our critical services are healthy and, if not, to quickly pinpoint the problematic areas.

We utilize many tools to achieve our system o11y, but the core parts are:

Metrics

We collect metrics from our infrastructure services (K8s, Temporal, etc.), as well as custom metrics we define for high-level or more focused events.
With Grafana and Alertmanager, we visualize and alert on abnormal metrics streamed into Prometheus.
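
To make this concrete, here is a minimal sketch (in Go, with the official Prometheus client library) of what one of those custom, event-focused metrics might look like. The metric name workflow_runs_total, its label, and the port are made up for the example; the real point is that the service exposes a /metrics endpoint for Prometheus to scrape, and Grafana and Alertmanager work off the resulting time series.

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // workflowRuns is a hypothetical "high-level event" metric: it counts
    // workflow executions, labeled by outcome.
    var workflowRuns = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "workflow_runs_total",
            Help: "Number of workflow executions, labeled by outcome.",
        },
        []string{"outcome"}, // e.g. "success" or "failure"
    )

    func main() {
        prometheus.MustRegister(workflowRuns)
        workflowRuns.WithLabelValues("success").Inc()

        // Prometheus scrapes this endpoint; Grafana and Alertmanager
        // work off the resulting time series.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }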

Google Traces & Logs

These let us trace requests across services, and they integrate nicely with Logs, which we mainly use for error reporting (“no news = good news”).
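
As a rough illustration of the “no news = good news” discipline, only the failure path below writes a log line, so a quiet error stream means a healthy service. The workflowRequest type and process function are made up for the example; the sketch uses Go’s standard log/slog package.

    package main

    import (
        "errors"
        "log/slog"
    )

    type workflowRequest struct{ ID string }

    // process stands in for the real business logic.
    func process(req workflowRequest) error {
        return errors.New("upstream timeout") // simulate a failure
    }

    func handleRequest(req workflowRequest) {
        if err := process(req); err != nil {
            // Only failures are reported; a quiet error log means a healthy path.
            slog.Error("workflow processing failed", "workflow_id", req.ID, "err", err)
            return
        }
        // Success paths stay silent: no news = good news.
    }

    func main() {
        handleRequest(workflowRequest{ID: "wf-123"})
    }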

We also created a super-cool open-source CLI tool called gtrace to work with Google Traces much more effectively — but that’s for another post.

Is what was good before still good now?

Let’s go back to the night of our lovely incident. After finding and fixing the issue, we went back to sleep and woke up the next morning to analyze what went wrong. The biggest problem I had was that our dashboard was filled with irrelevant data that caught my attention and made me waste time on it.

As a result, we decided to re-evaluate our monitoring dashboard and, more importantly, our monitoring strategy.

Here is a digestible list of the most valuable tips we picked up along the way:

Define your KPI — how do you know if a latency of 500 ms is good, bad, or normal? Setting a clear goal is important for every project, and especially for monitoring ones. You have to set a goal to know what you’re aiming for (be ambitious, but not too ambitious). Sometimes, especially at the beginning, it’s hard to decide on those goals. My suggestion is to pick something reasonable and schedule a revisit to tune it later.

Graphs must answer a question — a dashboard should tell a story or answer a question. Keep graphs simple and focused on a specific question. This is another reminder of the great UNIX idiom: “do one thing and do it well.”

Show only the most relevant information — if the question behind a graph is “which servers are affected by an issue right now?”, display only the problematic ones instead of listing all the servers and indicating which are healthy and which are not (see the sketch after this list).

Review the process — as we do with code and design documents, getting feedback is a crucial part of keeping quality up, sharing knowledge, and making sure your teammates (and even future you) understand the dashboard, which reduces the voodoo parts.

Decide what you (do not) monitor — we have a lot to monitor, and this is the hard part, in my opinion. To keep things focused, we must decide what is important enough to be monitored and what is not. Too many alerts and too much data can drown out the parts that matter most.
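
To illustrate the “show only the problematic ones” tip, a PromQL expression such as up == 0 returns just the scrape targets that are currently down, instead of every target with a health flag. Here is a small sketch of running that query through the Prometheus Go client; the address localhost:9090 is an assumption, and up is Prometheus’s built-in per-target health metric.

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/prometheus/client_golang/api"
        v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    func main() {
        // Assumed Prometheus address; adjust for your environment.
        client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
        if err != nil {
            panic(err)
        }
        promAPI := v1.NewAPI(client)

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // "up == 0" returns only the targets that are down right now,
        // instead of listing every target and flagging which are healthy.
        result, warnings, err := promAPI.Query(ctx, `up == 0`, time.Now())
        if err != nil {
            panic(err)
        }
        if len(warnings) > 0 {
            fmt.Println("warnings:", warnings)
        }
        fmt.Println("unhealthy targets:", result)
    }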

Last but not least, getting familiar with prevalent monitoring approaches can help you focus, find inspiration while building your dashboards, and clarify their objectives. Let’s explore several well-known methods and see where each one fits best.

USE method
  • Utilization: % of time the resource is busy (node CPU usage)
  • Saturation: # of work items the resource has to do (queue backlog)
  • Errors: # of error events

This method is more focused on reporting causes of issues and is mostly used for infrastructure resources such as CPU, memory, network, etc.
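
As a rough sketch of what USE-style instrumentation could look like for an in-process work queue (all metric names here are made up): a backlog gauge covers Saturation, a busy-seconds counter lets rate() approximate Utilization, and an error counter covers Errors.

    package main

    import (
        "errors"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Hypothetical metrics for a worker pool treated as a "resource".
    var (
        queueBacklog = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "worker_queue_backlog",
            Help: "Items waiting to be processed (Saturation).",
        })
        busySeconds = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "worker_busy_seconds_total",
            Help: "Time spent processing items; rate() approximates Utilization.",
        })
        workErrors = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "worker_errors_total",
            Help: "Failed work items (Errors).",
        })
    )

    type job func() error

    func worker(jobs chan job) {
        for j := range jobs {
            queueBacklog.Set(float64(len(jobs))) // backlog left in the channel
            start := time.Now()
            if err := j(); err != nil {
                workErrors.Inc()
            }
            busySeconds.Add(time.Since(start).Seconds())
        }
    }

    func main() {
        prometheus.MustRegister(queueBacklog, busySeconds, workErrors)
        jobs := make(chan job, 100)
        go worker(jobs)
        jobs <- func() error { return errors.New("simulated failure") }

        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }

For a single worker, rate(worker_busy_seconds_total[5m]) in Grafana gives the fraction of time it was busy, which is exactly the Utilization panel the USE method calls for.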

RED method
  • Rate: requests per second
  • Errors: # of failed requests
  • Duration: time it takes for these requests to be handled

This method is more suitable for services, especially in a microservices environment. It is good for alerting and for measuring SLAs, and a well-designed RED dashboard is a good proxy for the user experience.
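
Here is a minimal sketch of RED instrumentation for an HTTP service, again using the Prometheus Go client (the handler name, metric names, and port are assumptions): a single counter gives you Rate and, through its code label, Errors, while a histogram gives you Duration.

    package main

    import (
        "net/http"
        "strconv"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        requests = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Requests by handler and status code (Rate + Errors).",
            },
            []string{"handler", "code"},
        )
        duration = prometheus.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "Request latency (Duration).",
                Buckets: prometheus.DefBuckets,
            },
            []string{"handler"},
        )
    )

    // statusRecorder captures the status code the wrapped handler writes.
    type statusRecorder struct {
        http.ResponseWriter
        code int
    }

    func (r *statusRecorder) WriteHeader(code int) {
        r.code = code
        r.ResponseWriter.WriteHeader(code)
    }

    // red records Rate, Errors (via the code label) and Duration per request.
    func red(name string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
            start := time.Now()
            rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
            next.ServeHTTP(rec, req)
            requests.WithLabelValues(name, strconv.Itoa(rec.code)).Inc()
            duration.WithLabelValues(name).Observe(time.Since(start).Seconds())
        })
    }

    func main() {
        prometheus.MustRegister(requests, duration)
        http.Handle("/metrics", promhttp.Handler())
        http.Handle("/hello", red("hello", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        })))
        http.ListenAndServe(":8080", nil)
    }

From these two series you can chart requests per second with rate(), the error ratio by filtering on code=~"5..", and latency percentiles with histogram_quantile(), which is usually all a service-level RED dashboard needs.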

There is a nice quote by Grafana that I like: “The USE method tells you how happy your machines are, the RED method tells you how happy your users are.”

The Four Golden Signals
The great Google SRE handbook states that if you can measure only 4 metrics of a user-facing system, your focus should be on these four:

  • Latency: time taken to serve a request
  • Traffic: how much demand is placed on the system
  • Errors: rate of requests that are failing
  • Saturation: how “full” the system is

This method is very similar to the RED one, but it also includes saturation.
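
Saturation is the one signal the RED sketch above does not cover. A simple proxy for it is the number of requests currently in flight; the fragment below is meant to be added to that earlier sketch (the metric name is, again, an assumption), with inFlight registered alongside the other metrics.

    // A fragment meant to be added to the RED sketch above. inFlight is a
    // hypothetical gauge tracking requests currently being handled, a simple
    // Saturation proxy; register it with prometheus.MustRegister alongside
    // the other metrics.
    var inFlight = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "http_requests_in_flight",
        Help: "Requests currently being handled (Saturation).",
    })

    func withSaturation(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            inFlight.Inc()
            defer inFlight.Dec()
            next.ServeHTTP(w, r)
        })
    }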

Let’s take a look at how these theoretical concepts are represented in dashboards and visualizations.

Playground time

Grafana Labs provides a public sandbox where you can experiment with queries and visualizations and draw inspiration from many examples. I’ll showcase some dashboards that effectively illustrate the concepts we discussed.

Four Golden Signals visualization of a generic service

[When presenting data, it’s important to set thresholds that provide context for whether the data falls outside its normal range]

Well-defined KPIs can aid in focusing on what really matters

To Sum Up

I hope you enjoyed learning about our approach to monitoring and that it sparked your interest in examining your own monitoring strategies. More importantly, I hope it prompted you to consider ways to make your response to future production incidents more streamlined and efficient.
