Some time ago, I looked at using the Service Monitoring API to create basic SLOs against “out of the box” services like App Engine. This functionality has seen a lot of updates since then, and there’s now Terraform support for creating custom services and SLOs. I wanted to have a go at this myself to see how it works.
Creating the service
SLO Monitoring does a great job of identifying services for you if you’re using things like Istio, App Engine, or Cloud Endpoints. But what if your service is on GCE, for example? In this case, you need to define it as a custom service, which will then allow you to define SLOs against it.
Here’s how to define a “monitoring service” in Terraform:
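A minimal sketch of such a definition, assuming the `google_monitoring_custom_service` resource from the Google provider and a provider block that already points at your project. The resource label, service ID, and display name are placeholders, not values from my setup:

```hcl
# Sketch only: "my_service", "my-custom-service", and "My Custom Service"
# are placeholder names chosen for illustration.
resource "google_monitoring_custom_service" "my_service" {
  # Must be unique within the project.
  service_id   = "my-custom-service"

  # The name shown in the Cloud Console's Services list.
  display_name = "My Custom Service"
}
```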
The service definition is very simple: you just provide a service ID unique to your project and a display name. Once you run “terraform apply”, the service is visible in the Console.
From there, you can use the UI to create an SLO against it.
Note that you have to use “Other” as the metric, because custom services don’t have an “out of the box” understanding of availability and latency. So you need a good SLI for your service. You could use a log-based metric, a metric emitted by the Google Cloud Load Balancer if you’re using one, or a custom metric written by the service itself. Let’s take a look at defining an SLO using the last of these.
Defining the SLO
Here’s how to define an SLO using Terraform:
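A hedged sketch of a request-based SLO using the `google_monitoring_slo` resource. It assumes a custom service resource labeled `my_service` exists in the same configuration, that the service runs on GCE, and that it writes two custom metrics; the metric names, SLO ID, and display name are all hypothetical:

```hcl
resource "google_monitoring_slo" "availability" {
  # The basics: which service this SLO belongs to, plus its ID and name.
  # Assumes a google_monitoring_custom_service labeled "my_service".
  service      = google_monitoring_custom_service.my_service.service_id
  slo_id       = "availability-slo"
  display_name = "99% of requests succeed (rolling 1 day)"

  # The SLI: request-based, computed from a "total" and a "bad" time series.
  # Both metric types below are placeholders for your own custom metrics.
  request_based_sli {
    good_total_ratio {
      total_service_filter = join(" AND ", [
        "metric.type=\"custom.googleapis.com/my_service/request_count\"",
        "resource.type=\"gce_instance\"",
      ])
      bad_service_filter = join(" AND ", [
        "metric.type=\"custom.googleapis.com/my_service/error_count\"",
        "resource.type=\"gce_instance\"",
      ])
    }
  }

  # The goal: 99% of requests are good over a rolling 1-day period.
  goal                = 0.99
  rolling_period_days = 1
}
```

Referencing the service through its resource attribute, rather than hard-coding the ID, lets Terraform work out that the service must be created before the SLO.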
There are 3 main things to consider here:
- The basics — the resource ID, the SLO ID, the service you’re defining the SLO against, and the SLO display name.
- The SLI — are you going to be using a request- or windows-based SLI? If request-based — how will you count total requests and differentiate between good and bad requests?
In my example, I’m using a service that’s been instrumented to emit two separate metrics — one to count all requests and another to count errors. This makes things quite simple.
- The goal — what’s the actual target for your SLO?
In this example, the goal is that 99% of requests are successful over a rolling 1-day period.
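To make the goal concrete, the request-based SLI can be read as the ratio below, assuming the two metrics count total requests and errors over the compliance period (the 10,000-request figure that follows is purely illustrative):

```latex
\text{SLI} = \frac{N_{\text{total}} - N_{\text{errors}}}{N_{\text{total}}} \ge 0.99
\quad\Longrightarrow\quad
N_{\text{errors}} \le 0.01 \times N_{\text{total}}
```

So if the service handled 10,000 requests in a day, the error budget for that day would be 100 failed requests.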
Creating and validating
At this point, run “terraform plan” to make sure everything is correct.
If everything looks correct, run “terraform apply” to create the service and the SLO(s).
Note that my file has two SLOs: a request-based one for availability and a windows-based one for latency. Together with the service itself, that’s why 3 resources are being created.
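For reference, a windows-based latency SLO might look something like the sketch below. This is not my exact definition: it assumes a custom service resource labeled `my_service`, a hypothetical custom latency metric reported in milliseconds, and made-up values for the window, threshold, and goal:

```hcl
resource "google_monitoring_slo" "latency" {
  # Assumes a google_monitoring_custom_service labeled "my_service".
  service      = google_monitoring_custom_service.my_service.service_id
  slo_id       = "latency-slo"
  display_name = "Mean latency under 500 ms in 95% of 5-minute windows"

  windows_based_sli {
    # Each 5-minute window is judged good or bad as a whole.
    window_period = "300s"

    # A window is good when the mean of the time series stays within range.
    metric_mean_in_range {
      # Placeholder metric type; substitute your own latency metric.
      time_series = join(" AND ", [
        "metric.type=\"custom.googleapis.com/my_service/latency_ms\"",
        "resource.type=\"gce_instance\"",
      ])

      range {
        # Upper bound in the metric's own units (assumed milliseconds here).
        max = 500
      }
    }
  }

  # Illustrative target: 95% of windows are good over a rolling 1-day period.
  goal                = 0.95
  rolling_period_days = 1
}
```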
At this point, you can go back to the console and check your new service.
I’ve clearly not set my availability target correctly (or my service is having some serious issues) — I should absolutely revisit this before I take the next step to set up an error budget burn alert on this.
I’m really excited to see service and SLO support come to Terraform, and I hope lots of folks will take advantage of this to extend their automation capabilities. At this point, all of the major monitoring primitives can be created automatically once a project is up — this is great news! Thanks for reading, and let me know what you think!