I’ve spent quite a bit of time looking at defining and configuring SLOs in Service Monitoring. And lately, I’ve been getting lots of questions about what happens next — once the SLO is configured, folks want to know how to use alerting to be notified about potential, imminent, and in-progress SLO violations. Service Monitoring provides SLO error budget burn alerts to accomplish just that, but using these alerts is not always intuitive. I set out to try these out for myself and document what I found along the way. Let’s see what happens!

In theory…

The SRE Workbook has a whole chapter devoted to alerting on SLOs. I don’t think I need to reproduce it here, but there are two classes of considerations that are very…


In my last two posts, I explored the new “out of the box” GKE monitoring dashboard and used it both to set up alerting against an important resource and to drill down from an alert to figure out a problem. That dashboard is certainly great if you care about the entirety of your GKE fleet, but what if your primary charter is to maintain the reliability of a specific service? Well, for that, you’d need to create a service-specific dashboard. I’ve covered creating dashboards before, and you can see me talk about automating their creation using Terraform in this episode of Stack Doctor. …


In my last post, I reviewed the new GKE monitoring dashboard and used it to quickly find a GKE entity of interest. From there, I set up an alert on container restarts using the in-context “create alerting policy” link in the entity details pane. This time, I wanted to have a go at troubleshooting an incident using this setup.

The setup

The app

You can see the full code for the simple demo app I’ve created to test this here. The basic idea is that it exposes two endpoints — a / endpoint, which is just a “hello world”, and a /crashme endpoint, which uses Go’s os.Exit(1) to terminate the process. I then created a container image using Cloud Build and deployed it to GKE. …


My interest in observability in Google Cloud developed in large part in the context of working with GCP customers running workloads on GKE, and one of my very first posts here covered using Stackdriver for those workloads. The very first episode of Stack Doctor also went over what were at the time the “new” GKE monitoring capabilities. This was over two years ago, and there have been some great updates to those capabilities since then. I thought it was time to revisit Cloud Ops for GKE, have another look at the dashboards, and try out the new capabilities. …


Some time ago, I looked at using the Service Monitoring API to create basic SLOs against “out of the box” services like App Engine. This functionality has seen a lot of updates since then, and there’s now Terraform support for creating custom services and SLOs. I wanted to have a go at this myself to see how it works.

Creating the service

SLO Monitoring does a great job of identifying services for you if you’re using things like Istio, App Engine, or Cloud Endpoints. But what if your service is on GCE, for example? …


One of the main benefits of using an all-in-one observability suite like Stackdriver is that it provides all of the capabilities you may need. Specifically, your metrics, traces, and logs are all in one place, and with the GA release of Monitoring in the Cloud Console, that’s more true than ever before. However, each of these data elements is still mostly independent, and I wanted to try to unify two of them: traces and logs.

The idea for the project was inspired by the excellent work Alex Amies did in his Reference Guide on using OpenCensus to measure Spanner performance and troubleshoot latency. …


In my last post, I tackled my first project with OpenTelemetry and built a basic demo to show how to use distributed tracing and the Stackdriver exporter. I chose Go for that exercise because at the time it was the only language that had a Stackdriver exporter for tracing available. This time, I wanted to attempt using OpenTelemetry for metric instrumentation and noticed that opentelemetry-go does not have a Stackdriver exporter for metrics ready yet. I attempted to use the Prometheus exporter instead, but could not figure out how to make it play nice with the Mux router and switched to Node.js …


Service Level Objectives, or SLOs, are one of the fundamental principles of site reliability engineering. We use them to precisely quantify the reliability target we want to achieve in our service. We also use their inverse, error budgets, to make informed decisions about how much risk we can take on at any given time. This lets us determine, for example, whether we can go ahead with a push to production or infrastructure upgrade.

However, Stackdriver has never given us the ability to actually create, track, alert, and report on SLOs — until now. The Service Monitoring API was released to public beta at NEXT London in the fall, and I wanted to take the opportunity to try it out. …


Toward the end of last year, I had the good fortune of publishing a reference guide on using OpenCensus for distributed tracing. In it, I covered distributed tracing fundamentals, like traces, spans, and context propagation, and demonstrated using OpenCensus to instrument a simple pair of frontend/backend services written in Go. Since then, the OpenCensus and OpenTracing projects have merged into OpenTelemetry, a “single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.” …


Introduction

One of the more interesting concepts I’ve been hearing more and more about both from customers and from folks I respect and follow in the industry is the idea of “monitoring as code”. This is, of course, a subset of the “everything as code” movement, but it’s something that really resonates with me. Specifically, I have talked to lots of folks recently who are interested in automating the setup of the monitoring configuration when new services or new projects are rolled out. That generally includes two main things — alerting and dashboards. Automating the creation of alerting policies in Stackdriver has been available for some time through the use of the relevant API or Terraform. …

About

Yuri Grinshteyn

CRE at Google Cloud. I write about observability in Google Cloud, especially as it relates to SRE practices.
