Thanos Ruler and Prometheus Rules — a match made in heaven.

How to use Thanos Ruler in Kubernetes with Prometheus Rules and make your life easier.

Hélia Barroso
5 min read · Jun 7, 2023

If you are using Prometheus for metrics collection and Thanos for a Global View, you may find it beneficial to use a lower retention time for your metrics in Prometheus, so as to leverage Object Storage for long-term metrics. However, opting for a lower retention time may present challenges when your teams need Alerting and Recording Rules to be evaluated over longer periods than your retention time allows.

My team encountered this scenario while moving away from a costly legacy Virtual Machine infrastructure, where we used a 5-day retention period in Prometheus. In Kubernetes we wanted to go for a much lower retention time, but Rules were a consideration, especially for an SLO tool, which required even larger windows for Rule evaluation than we previously allowed.

To make our dreams of a lower retention time come true we went searching, and it didn’t take long to find that, once more, Thanos had a solution for it. This is what we learned in the process.

Disclaimer

You need some familiarity with Monitoring concepts, with the Prometheus and Thanos architectures, and with Kubernetes and the Prometheus Operator. For a brief introduction you can check this article.

Overview

There are various ways to use Prometheus alongside Thanos, depending on your specific requirements. Thanos Sidecar, Thanos Receiver, or both can do the job. For a comprehensive understanding of all Thanos components you can check their documentation.

Thanos Overview

Our current setup uses Thanos Sidecar in a similar way to what is described above. In each Kubernetes cluster we have a Prometheus instance accompanied by a Thanos Sidecar, and a global Thanos Query to achieve a Global View in Grafana. Everything is deployed relying mostly on community Helm charts, with our custom configurations on top, such as the 4-hour retention time for Prometheus.
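For reference, the Prometheus side of this setup can be sketched in the chart values roughly as below. This assumes the kube-prometheus-stack layout of the community Prometheus Operator chart; the secret name is illustrative:

prometheus:
  prometheusSpec:
    retention: 4h                      # short local retention; long-term data lives in Object Storage
    thanos:                            # setting this block makes the Operator inject the Thanos Sidecar
      objectStorageConfig:
        key: objstore.yml
        name: general-objstore-secret  # illustrative secret name
  thanosService:
    enabled: true                      # gRPC Service so the global Thanos Query can reach the Sidecar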

In order to evaluate Alerting and Recording Rules for longer than 4 hours, and with our eyes set on Thanos Ruler, we had some additional considerations before starting the work. The Rules needed to be as dynamic as possible, as they are created in a myriad of ways and by different tools. Additionally, with no previous experience with Ruler, we also needed to consider possible downsides to using it.

Going for Ruler with our eyes open

By definition, Thanos Rule, or Ruler, evaluates Recording and Alerting Rules against a Query API and then sends the results directly to remote storage. In a way, it works as a combination of Prometheus + Thanos Sidecar, but without the metric scraping and querying capabilities provided by Prometheus.

One thing to take into consideration with Ruler: since it relies on a Query API to get metrics for evaluation, Query reliability is crucial to ensure Ruler functions properly. The Thanos documentation recommends setting up certain alerts to manage this risk, which I highly recommend.
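As a sketch of what such an alert can look like, evaluated by Prometheus itself rather than by Ruler: the metric comes from the Prometheus rule engine embedded in Ruler, but the resource name, labels and threshold below are illustrative and should be checked against the Thanos runbooks.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-ruler-meta-alerts        # illustrative name
  labels:
    release: prometheus                 # whatever label your Prometheus ruleSelector matches
spec:
  groups:
    - name: thanos-ruler.health
      rules:
        - alert: ThanosRuleHighRuleEvaluationFailures
          expr: |
            sum by (job) (
              rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])
            ) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Thanos Ruler is failing to evaluate Rules, results may be missing.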

When deciding how to deploy Ruler, we initially looked into the Bitnami Thanos charts, since we already use them for the rest of our Thanos deployment. However, as they require the Rules to be passed via ConfigMap, making it work would mean changing the way we create our current Rules, or making some major changes to the charts, neither of which was an ideal solution.

Next, we explored the Prometheus and Prometheus Operator charts as potential options. The Prometheus Operator offers the PrometheusRule Custom Resource Definition (CRD), which we were already using for our Recording and Alerting Rules, and we could use a ruleSelector to match Prometheus Rules for Thanos Ruler in a similar manner to what we do for Prometheus. It was a perfect match.

Not all examples in charts are born equal

Having selected the charts to use, the remaining steps involved adding all the necessary configuration to Ruler, such as Thanos Query, Alertmanager, Storage and the ruleSelector. However, we encountered challenges with the Alertmanager configuration. Initially, we attempted to use the configuration examples provided with the chart, but those didn’t work. After some digging, the solution was to add the configuration via extraSecret:

thanosRuler:
  thanosRulerSpec:
    alertmanagersConfig:
      key: alertmanager-configs.yaml
      name: thanosruler-alertmanager-config
    objectStorageConfig:
      key: objstore.yml
      name: general-objstore-secret
    ruleSelector:
      matchLabels:
        app: ruler
    volumes:
      - name: object-storage
        secret:
          secretName: general-objstore-secret
    queryEndpoints:
      - http://global-thanos-query-dns
  extraSecret:
    name: thanosruler-alertmanager-config
    data:
      alertmanager-configs.yaml: |
        alertmanagers:
          - static_configs: ["alertmanager-dns"]
            scheme: http
            timeout: 30s
            api_version: v1

Regarding Storage, we define it via objstoreConfig in the Thanos charts, which creates a Secret that we also reuse for Ruler. For Thanos it was just a question of providing the DNS of the global Thanos Query, and since they run in the same Kubernetes cluster, we only need to provide the Service DNS. We took a similar approach for the Alertmanager DNS.
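For context, the objstore.yml referenced in that Secret follows the standard Thanos object storage configuration format. An S3-flavoured sketch, where the bucket, endpoint and credentials are placeholders:

type: S3
config:
  bucket: thanos-metrics              # placeholder bucket name
  endpoint: s3.<region>.amazonaws.com
  access_key: <access-key>
  secret_key: <secret-key>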

Additionally, for the Rules, we added the ruleSelector to match a label that we defined on the Prometheus Rules, as in the example below. This is also an easy way to differentiate them from the Rules to be matched by Prometheus.
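And if you want to be explicit about Prometheus not picking up the Ruler-only Rules, the opposite selector can be set on the Prometheus side. A sketch, again assuming the kube-prometheus-stack values layout:

prometheus:
  prometheusSpec:
    ruleSelector:
      matchExpressions:
        - key: app
          operator: NotIn          # Prometheus evaluates everything except the Rules labelled for Ruler
          values:
            - ruler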

Prometheus Rules and Unexpected wins

After deploying Ruler, it was just a question of creating the Prometheus Rules we wanted Ruler to evaluate, with the proper labels. An example below:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: ruler
  name: generic-prometheus-rule
  namespace: namespace
spec:
  groups:
    - name: recording_rules
      partial_response_strategy: warn
      rules:
        - record: example_record   # illustrative record name
          expr: vector(0)
          labels:
            my_label: example

Some things we also found while working with Ruler: for some Recording Rules with bigger evaluation windows, opting for a Partial Response was the way to go. Dealing with the occasional timeout on Thanos Store and Query didn’t affect the reliability of the Rules when we are talking about windows of days. It is not recommended for Alerting, though, as it can cause issues to be missed. Once more, the alerts recommended to manage this are the way to go.
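Coming back to the Partial Response point, a long-window group for Ruler could look like the snippet below, inside a PrometheusRule like the one above. The metric and record names are made up for the example:

groups:
  - name: slo_long_window
    interval: 5m
    partial_response_strategy: warn                # tolerate partial results from Store/Query timeouts
    rules:
      - record: http_requests:error_ratio_rate30d  # hypothetical record name
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[30d]))
            /
          sum(rate(http_requests_total[30d]))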

But an unexpected win with Ruler and the Alerts we defined was that some of the timeouts were related to issues with Prometheus, and not with Thanos Store or Query. Inadvertently, we gained in Ruler another tool to monitor Prometheus.

Final Thoughts

Not much to say after this, besides that the process of digging through the charts and the Prometheus Operator to find a solution was a lot of fun, but I might be biased, since I have been working with Observability tools for 3 years. I hope this saves you some time if you need to use Ruler in Kubernetes. Any questions and feedback are welcome 😄.


Hélia Barroso

Passionate about Observability & SRE. I also spend a lot of time with cross stitch and reading books. DevOps Engineer @Five9