Get SLOs from Istio Service Mesh, Sloth & Grafana Mimir

Julien Leloup · Published in Shadow Tech Blog · Oct 21, 2023

What are SLOs and why do I care?

If there is one concept that is the epitome of Site Reliability Engineering, it is probably Service Level Objectives (SLOs). Whenever someone in your company asks “How can we tell if we are stable after a new release?”, the answer is an SLO of some sort.

The same goes for discussions around infrastructure costs, or the delicate balance between developing new features and spending time refactoring: in all those moments, you need a metric to gauge how stable the service is; otherwise, how would you know if you are overspending, or if you need to slow down and raise the bar a little?

The fact is, while SLOs are quite easy to understand at a macro level, they become a much broader and more complicated topic once you start diving in. Moreover, as observability continues to evolve, more tools emerge, bringing with them increased complexity and technical obstacles to overcome.

In this article, I will demonstrate one way to implement SLOs using a set of tools running in Kubernetes clusters.

A bit more on SLO vocabulary

To become more familiar with SLO usage, let's talk a bit about Error Budget and Burn Rate. This will help later on while reading SLOs on a dashboard, trust me on that.

Error Budget: How much instability is accepted in the system?

An SLO of 100% is unrealistic: it would require a huge amount of effort and resources, and even then it would be nearly impossible to achieve.

As soon as you are targeting a reasonable SLO, you implicitly admit a degree of instability in your system: some requests could fail or take too long to be processed without dropping your SLO below its threshold. This is your error budget.

Burn Rate: How fast are you consuming your error budget during this period?

Each time a request fails, it consumes a portion of your error budget. By looking at how often this happens, you can tell how fast your error budget is being consumed. A burn rate of “1” is the equilibrium point: your entire error budget will be exhausted exactly at the end of the period, without compromising your SLO.
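
A quick worked example, assuming a 99.9% availability objective measured over a rolling 30-day window (the numbers are illustrative, not tied to this article’s setup):

Objective    = 99.9% of requests served properly, over 30 days
Error budget = 100% - 99.9% = 0.1% of requests
             ≈ 43 minutes of full downtime (0.001 × 30 × 24 × 60), if traffic were constant
Burn rate    = observed error rate / budgeted error rate
             → a steady 0.3% error rate means a burn rate of 3,
               exhausting the budget in ~10 days instead of 30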

Our current toolbox

Here is the observability stack that will support this computation:

Our observability stack
  • We use Prometheus to collect metrics from targets within all our Kubernetes clusters. Classic but efficient.
  • Metrics are forwarded to Grafana Mimir for long-term storage and to help Prometheus scale. This means that our Prometheus instances only retain the most recent data, making them unsuitable for computing SLOs over extended periods.
  • Also keep in mind that multi-tenancy is enabled in Grafana Mimir: each Kubernetes cluster producing data is one tenant in Mimir. This lets us apply different limits to each tenant, helping to prevent the noisy neighbor problem (a remote_write sketch follows this list).
  • The same monitoring cluster hosting Mimir also hosts the usual suspects: Grafana for dashboarding & Alertmanager for alerting.
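
For reference, the forwarding part boils down to a Prometheus remote_write section along these lines; this is a minimal sketch, and the gateway address is a placeholder for your own setup:

remote_write:
  - url: http://<your mimir gateway>/api/v1/push
    headers:
      X-Scope-OrgID: dev    # one tenant per producing cluster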

Defining SLOs for HTTP services

Now that we have outlined our observability stack, we need to talk a bit more about which SLO we intend to distill through it.

In traditional HTTP-based applications, two signals are commonly found in SLOs:

  • Availability: How many requests were served properly?
  • Latency: How many requests were served promptly?

Expressed as percentages, these two SLOs offer a good representation of the user experience. However, they are not used directly in alerts: naive threshold alerts on them tend to be either too noisy or too slow, and the Google SRE workbook provides a detailed chapter on this topic. To create “Multiwindow, Multi-Burn-Rate Alerts,” we need to refine these metrics further.
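
To make that concrete, here is a minimal sketch of such an alerting rule, following the SRE workbook’s numbers for a 99.9% objective over 30 days. The recording rule names and the service label are assumptions for illustration; Sloth will generate its own equivalents later in this article.

- alert: MyServiceHighErrorRateFast
  # Page only when both the 1h and 5m windows burn the budget at 14.4x
  # the sustainable rate, i.e. 2% of the 30-day budget gone in one hour.
  expr: |
    slo:error_ratio:rate1h{service="my-service"} > (14.4 * 0.001)
    and
    slo:error_ratio:rate5m{service="my-service"} > (14.4 * 0.001)
  labels:
    severity: critical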

Now that we have a better understanding of the goal we want to achieve, let’s bring more tools to the table!

The big plan

Since showing often beats explaining, let’s spoil the ending and show the final stack we are going to implement:

How SLOs are made

This is a bit more complicated than our previous stack, isn’t it? Fear not: we are going to walk through the role of each new addition, and everything will make sense very soon!

Sourcing raw materials

First, we need a method to generate the raw metrics about our HTTP requests.

A Service Mesh like Istio is a good option for gaining insights into the HTTP requests processed by a given service, though you could achieve similar results with other sources, such as your Ingress Controller, or the HTTP metrics natively exposed by your application framework.

Istio proxies exposing HTTP metrics for Prometheus

All those sources are not created equal though: each step in the HTTP request path has its purpose and imposes various constraints, like authentication, rate limiting, or caching.

Istio sits quite low in the request path, meaning that requests sampled there into metrics have usually gone through a lot already. These metrics are closer to the application and tell the story of what is happening in the very last steps. If you want to tell the same story from the point of view of a user, you may want to use a higher-level source like the Ingress Controller, or even probes running outside of your infrastructure that call your services as real users would.
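
For reference, the raw material looks roughly like this: Istio’s sidecars expose counters such as istio_requests_total, which a PromQL query can turn into an error ratio. The label values below are illustrative:

# Error ratio over the last 5 minutes, as reported by the destination sidecars
sum(rate(istio_requests_total{reporter="destination",
                              destination_service_name="my-service-name",
                              response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{reporter="destination",
                              destination_service_name="my-service-name"}[5m]))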

Remember, the higher you go, the more components will influence the result of the SLO: it’s your call to decide whether to include the Web Application Firewall in the stability assessment of the application it protects, or to assign different teams to oversee these components, each with their SLOs to monitor.

Boil them down

We have HTTP metrics from Istio in our system: Prometheus houses the most recent ones, but the real action occurs in Grafana Mimir, where we store the metrics history, ready for dashboarding and alerting.

Now, let’s introduce the final two components to the mix.

First, the real star of the show: Sloth is a tool that can transform Prometheus metrics into SLO records and alerts based on the best practices discussed above (the multiwindow, multi-burn-rate approach).

Here is an example of such a resource:

---
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: my-first-slo
spec:
  labels:
    owner: my-team
  service: my-service
  slos:
    # 99.9 = 1 failing request (5xx and 429) every 1000 requests
    - name: "requests-availability"
      objective: 99.9
      description: "My Service SLO based on availability for HTTP request responses."
      sli:
        plugin:
          id: "sloth-common/istio/v1/availability"
          options:
            namespace: "my-namespace"
            service: "my-service-name"
      alerting:
        name: MyServiceHighErrorRate
        annotations:
          summary: High error rate on My Service requests.
        pageAlert:
          labels:
            severity: warning
        ticketAlert:
          labels:
            severity: critical

In this example, we utilize a Sloth plugin specifically designed for Istio-based SLOs, as it is simpler to implement. You can explore the available plugins in the Sloth documentation. Remember, these are just shortcuts; you can craft your own PromQL queries based on whatever criteria you deem relevant.
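
For instance, here is a hedged sketch of the same availability SLI written without the plugin, assuming Sloth’s events-based SLI with errorQuery and totalQuery fields, where {{.window}} is a placeholder Sloth fills in for each time window. This fragment would replace the sli block of the resource above:

sli:
  events:
    errorQuery: |
      sum(rate(istio_requests_total{reporter="destination",
        destination_service_namespace="my-namespace",
        destination_service_name="my-service-name",
        response_code=~"5..|429"}[{{.window}}]))
    totalQuery: |
      sum(rate(istio_requests_total{reporter="destination",
        destination_service_namespace="my-namespace",
        destination_service_name="my-service-name"}[{{.window}}]))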

By running Sloth in our observability cluster and supplying it with custom resources of the type PrometheusServiceLevel, we initiate a process that generates a set of PromQL rules which read the raw metrics and store their outputs as:

  • Prometheus recording rules: Sloth fills in various records with the results of the queries over a variety of time windows.
  • Prometheus alerts: based on those records, you can ask Sloth to create alerts following the multiwindow, multi-burn-rate best practices.

Both are written as PrometheusRule custom resources, a well-established way of dynamically configuring a Prometheus instance with rules through the prometheus-operator.
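
For a rough idea of the output, here is a heavily trimmed sketch of the kind of PrometheusRule Sloth generates. The rule and label names follow Sloth’s conventions as I remember them and are approximate; the real resource contains one recording rule per time window, plus metadata and alerting rules.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-first-slo    # generated from the PrometheusServiceLevel
spec:
  groups:
    - name: sloth-slo-sli-recordings-my-service-requests-availability
      rules:
        - record: slo:sli_error:ratio_rate5m
          expr: |
            sum(rate(istio_requests_total{destination_service_name="my-service-name", response_code=~"5..|429"}[5m]))
            /
            sum(rate(istio_requests_total{destination_service_name="my-service-name"}[5m]))
          labels:
            sloth_service: my-service
            sloth_slo: requests-availability
            sloth_window: 5m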

The problem is that we don’t want to process these records and alerts in Prometheus! Our stack mostly relies on Mimir, with Prometheus serving merely as a collecting agent. Moreover, all these resources reside in our observability cluster where the local Prometheus lacks access to the metrics we are requesting.

Loading Prometheus rules into Grafana Mimir

This is where our last component enters the scene: we can rely on Grafana Agent to help us load PrometheusRule resources into Grafana Mimir Ruler.

A Flow mode component called mimir.rules.kubernetes was recently added for this purpose. By deploying a small Grafana Agent in our observability cluster and configuring it to use this component, we can ensure that the PrometheusRules created by Sloth are processed by Grafana Mimir, which has full access to the necessary metrics!

One last thing though: remember that our Grafana Mimir runs with multi-tenancy enabled? This means that rules executed there must be associated with the right tenant. The component’s configuration options let us create one component instance per tenant and use Kubernetes label selectors to dictate which rules are loaded into which tenant.

Sloth producing rules & Grafana Agent loading them

The configuration would look something like this:

mimir.rules.kubernetes "dev" {
  address   = "http://<your mimir ruler>.svc.cluster.local:8080"
  tenant_id = "dev"

  // Only load PrometheusRules labeled for this tenant (or for all tenants).
  rule_selector {
    match_expression {
      key      = "tenant"
      operator = "In"
      values   = ["dev", "all"]
    }
  }
}

mimir.rules.kubernetes "staging" {
  address   = "http://<your mimir ruler>.svc.cluster.local:8080"
  tenant_id = "staging"

  rule_selector {
    match_expression {
      key      = "tenant"
      operator = "In"
      values   = ["staging", "all"]
    }
  }
}

A few notes though:

  • This configuration lacks authentication between the agent and Mimir. While this setup is acceptable for initial stages within the cluster, you might consider adopting authentication similar to what you have for external access to Mimir (you do have authentication for this, right? 🙂).
  • The label selector matches the tenant name or a wildcard value “all”, which makes it possible to deploy a single PrometheusRule applicable to all tenants simultaneously (see the labeled-rule sketch below).
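
As an illustration, the selectors above would pick up any PrometheusRule carrying a matching tenant label, such as this hand-written skeleton (how the label ends up on the rules Sloth generates depends on your setup):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-first-slo-rules
  labels:
    tenant: dev        # matched by the "dev" mimir.rules.kubernetes component
spec:
  groups: []           # recording and alerting rules omitted for brevity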

What you finally get

At the end of the day, you gain two major outcomes from this implementation:

  • A dashboard created by Sloth for integration into your Grafana.
Sloth dashboard in Grafana
Warning: while the current value of the SLO is quite straightforward, error budget & burn rates require a bit of getting used to.

Remember that a burn rate of “1” means that your service will consume exactly its error budget by the end of the calculation period (a month, typically).

Lower than that, you are fine; higher than that, you are at risk of running out of error budget at some point, so you should act to improve stability before it is too late.
  • A set of alerts to notify you of SLO violations, enabling your team to investigate potential issues promptly.

Conclusions & opportunities

There you have it: from producing the raw metrics to displaying them in dashboards & getting alerted.

Of course, the examples from this article rely on the simplest use case: we are computing SLOs for HTTP-based applications, with Istio producing the necessary metrics for us.

Things become more complex when dealing with other types of workloads, such as batches or workers. In these scenarios, you must redefine what availability and latency mean. Perhaps you might consider the number of failed batches for availability and the processing time of tasks for latency.

Depending on the metrics you employ, leveraging one of Sloth’s SLI plugins might not be feasible. However, you can create your own PromQL queries, albeit with more effort.
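
As a purely hypothetical illustration (the batch_jobs_completed_total counter and its status label are assumptions, not something Istio or Sloth provide out of the box), an availability SLI for a batch worker could be expressed with the same events-based approach:

slos:
  - name: "batch-availability"
    objective: 99.0
    sli:
      events:
        # Hypothetical counter of finished jobs, labeled with their final status.
        errorQuery: sum(rate(batch_jobs_completed_total{status="failed"}[{{.window}}]))
        totalQuery: sum(rate(batch_jobs_completed_total[{{.window}}]))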

The observability stack depicted in this article might differ from yours. Here are some additional considerations that might pertain to your situation:

  • Grafana Agent is unnecessary if you prefer to use Prometheus directly to process the Prometheus rules created by Sloth: the prometheus-operator will do the heavy lifting just fine.
  • If you use Thanos instead of Grafana Mimir: prometheus-operator can also be used to load those rules into Thanos Ruler (as long as you deploy this component using the ThanosRuler custom resource from this operator).
  • You might opt to deploy Grafana Agent in each of your clusters, configuring it to load rules from those clusters into a single tenant in your central Grafana Mimir. This approach requires exposing the Mimir Ruler API and implementing authentication, but it might suit your GitOps process if you prefer deploying PrometheusServiceLevel resources in each cluster rather than a central observability cluster.
