How We Improved Our Monitoring Stack With Only a Few Small Changes

Aviv Jacobs
Riskified Tech

--

When thinking about monitoring and observability, every improvement or upgrade we want to make often seems like a big change.

This is not necessarily true.

We need to align with the organization’s needs and with our customers’ needs (internal users), and let those guide us to the best solution we can reach.

Our customers may be the company’s developers, the team that manages the company’s monitoring itself, or anyone else who uses our monitoring system.

Since there are many ways and tools to build and run a monitoring system, it can be confusing to decide what to focus on, which tools you need, which features are relevant for you, and where the pros outweigh the cons and not the other way around.

In this post, I will tell you about our process of improving our monitoring system at Riskified.

Where we started — current state

At Riskified, we have multiple clusters, and our main monitoring tool is the well-known Kube-Prometheus-stack, which consists of Prometheus, Grafana, and Alertmanager for each cluster, alongside Thanos (and a lot of other tools for the complete picture of observability and monitoring) to centralize metrics from all environments in one place. It’s a monorepo under SRE control, and some of the developers’ configurations are stored there. We also have a centralized Grafana so we can view all environments in one place.

What to consider when improving your monitoring system

As a team, we scheduled a first meeting to discuss our goals and pain points regarding our monitoring system. In that meeting, we questioned everything we use in the monitoring stack and asked as many questions as we could.

We started with our pain points and goals:

  • Bottleneck on changes in the monorepo (Kube-Prometheus-stack) — We wanted to give the development teams control over their rules and alerts, and split the Alertmanager configuration so that each team could manage the parts relevant to them without needing our approval, while the SRE team would manage the main configuration.
  • Prometheus crashed a lot due to high cardinality in the “customers” series, and we had lost control of it.
  • Hardcoded secrets in the Alertmanager configuration that we wanted to remove.
  • Inability to silence alerts easily across all environments — we had multiple Alertmanagers (one per cluster), which made them hard to manage (silences, grouping, etc.).

To solve these pain points and achieve our goals, we raised a lot of questions. If you manage a monitoring system yourself, maybe you have also thought about some of these:

  • How do we deal with Prometheus’ high cardinality?
  • Do we want a Prometheus per namespace?
  • Should we split the Prometheus stack into several charts or leave it as it is now — one chart (Kube-Prometheus-stack)?
  • Should we use VictoriaMetrics?
  • Should we move to Grafana Enterprise?
  • Should we implement a single Alertmanager using Thanos Ruler to serve all environments?

A lot of the ideas that came up were rejected immediately; the rest we wrote down in a document. Our guidelines were cost predictions, identification of risks, and making sure the change wouldn’t have a massive impact on the developer teams.

Overall, we wanted what everybody wants — a better monitoring system with less maintenance.

Should we have used a single Alertmanager? Pros and cons

The first thing we focused on was a single Alertmanager, which means one for all our environments. As I mentioned before, part of our pain was that we had one Alertmanager per cluster, which made it hard to manage (silences/grouping, etc.).

So we decided to make a list of pros and cons and explore best practices for a single Alertmanager.

This is what I came up with:

Option 1: What we currently have — multiple Alertmanagers, one for each cluster

Pros:

  1. It works, and no changes are required
  2. Configuration is pretty straightforward using the Kube-Prometheus-stack, and the development teams are used to it and familiar with it — no learning curve
  3. Works in an integrated way with Prometheus to evaluate alert rules and send notifications via email, Jira, Slack, and other supported systems
  4. Helps with high availability — multiple Alertmanagers are strongly recommended for production systems
  5. In a multi-Prometheus environment, the best practice is to generate alerts as close to the data as possible

Cons:

  1. Multiple points to navigate between (Alertmanager for each cluster).
  2. Hard to scale and manage. Every change has to be applied in each environment.

Option 2: Thanos Ruler

“The Thanos Ruler component allows recording and alerting rules to be processed across multiple Prometheus instances. A ThanosRuler instance requires at least one queryEndpoint which points to the location of Thanos Queriers or Prometheus instances. The queryEndpoints are used to configure the --query argument(s) of the Thanos rulers.” (from Thanos’ documentation)
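To make this concrete, here is a minimal sketch of what a ThanosRuler resource could look like. The namespace, endpoint URLs, and label selector are illustrative assumptions, not our actual configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: thanos-ruler
  namespace: monitoring        # assumed namespace
spec:
  # Where the ruler runs its queries; these would point to the Thanos Queriers
  queryEndpoints:
    - dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
  # Pick up PrometheusRule objects carrying this (hypothetical) label
  ruleSelector:
    matchLabels:
      role: thanos-ruler
  # A single, central Alertmanager that would receive the fired alerts (hypothetical URL)
  alertmanagersUrl:
    - http://alertmanager.monitoring.svc.cluster.local:9093
```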

Pros:

  1. Part of Thanos that we already use
  2. With a single Alertmanager and the power of Prometheus labels, alerts can be grouped by cluster, application, or whatever else suits us. An event such as a rack failure that causes alerts in multiple Prometheus servers can be grouped back into one notification in the Alertmanager (see the sketch right after this list).
  3. We already have centralized Prometheus (Thanos) and Grafana. Alertmanager is the last piece of the puzzle.
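As a rough illustration of that kind of grouping, an Alertmanager route can group incoming alerts by a cluster label; the receiver name and label names below are assumptions, not our real setup:

```yaml
# Sketch: group alerts coming from many Prometheus servers into one
# notification per cluster/alertname pair (label and receiver names assumed).
route:
  receiver: sre-slack
  group_by: ['cluster', 'alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```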

Cons:

  1. A single Alertmanager is not recommended for production systems that need high availability.
  2. “Ruler has conceptual tradeoffs that might not be favorable for most use cases. The main tradeoff is its dependence on query reliability. For Prometheus, it is unlikely to have alert/recording rule evaluation failure as the evaluation is local. For Ruler, the read path is distributed since, most likely, Ruler is querying Thanos Querier, which gets data from remote Store APIs. This means that query failures are more likely to happen. That’s why a clear strategy on what will happen to alerts during query unavailability is the key.” (from Thanos Documentation)
  3. Best practice says alerts should be generated as close to the data as possible, which a centralized Ruler goes against.

When searching for information about a single Alertmanager, we came across the Prometheus Operator. We already had it installed in all of our environments, since it’s part of the Kube-Prometheus-stack, but apparently there are a lot of nice features of the Prometheus Operator that we weren’t familiar with.

That got us to our third and last option.

Option 3: Don’t change to a single Alertmanager, but improve the existing setup using the Prometheus Operator

“The Prometheus Operator provides Kubernetes native deployment and management of Prometheus and related monitoring components. The purpose of this project is to simplify and automate the configuration of a Prometheus-based monitoring stack for Kubernetes clusters.” (from GitHub)

Pros:

  1. Gives us the option to split the Alertmanager configuration into multiple AlertmanagerConfig resources, each of which declaratively specifies a subsection of the Alertmanager configuration, allowing alerts to be routed to custom receivers and inhibit rules to be set. It also gives the dev teams control.
  2. Already part of the Kube-Prometheus-stack.
  3. In a multi-Prometheus environment, the best practice is to generate alerts as close to the data as possible.

After a POC of those 3 options, we saw that if we only changed a few small things using the Prometheus-operator, we could resolve most of the problems I mentioned earlier (and get all the pros of option 1).

So option 3 it was, and that got us to the following changes:

  1. Instead of having one configuration file for the Alertmanager with duplicated sections for each team:
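As a simplified, hypothetical sketch of what that single file looked like (team names, channels, and webhook URLs are made up for illustration):

```yaml
# Sketch of the old, monolithic alertmanager.yaml: every team's routes and
# receivers lived in one SRE-owned file, with secrets hard-coded in it.
route:
  receiver: default
  routes:
    - match:
        team: payments
      receiver: payments-slack
    - match:
        team: checkout
      receiver: checkout-slack
receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        api_url: https://hooks.slack.com/services/XXX   # hard-coded secret
  - name: payments-slack
    slack_configs:
      - channel: '#payments-alerts'
        api_url: https://hooks.slack.com/services/YYY   # hard-coded secret
  - name: checkout-slack
    slack_configs:
      - channel: '#checkout-alerts'
        api_url: https://hooks.slack.com/services/ZZZ   # hard-coded secret
```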

Each team manages its configuration using the AlertmanagerConfig resource of the prometheus-operator:
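A minimal sketch of such an AlertmanagerConfig, assuming a hypothetical payments team with a Slack receiver (the resource name, namespace, channel, and Secret reference are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: payments-alerts      # owned by the team, not by SRE
  namespace: payments        # lives in the team's own namespace
spec:
  route:
    receiver: payments-slack
    groupBy: ['alertname']
    repeatInterval: 4h
  receivers:
    - name: payments-slack
      slackConfigs:
        - channel: '#payments-alerts'
          # The webhook URL is read from a Secret instead of being hard-coded
          apiURL:
            name: payments-slack-webhook
            key: url
```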

2. Instead of having hard-coded secrets, we used a native Alertmanager configuration file stored in a Kubernetes secret.
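In practice this can look something like the sketch below: the raw alertmanager.yaml is kept in a Kubernetes Secret (which can be populated from a secret store rather than committed to the repo), and the Alertmanager resource references it via the Prometheus Operator’s configSecret field. Names and values are placeholders:

```yaml
# Sketch: the native Alertmanager configuration lives in a Secret.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    route:
      receiver: default
      group_by: ['alertname']
    receivers:
      - name: default
---
# The Alertmanager resource points at that Secret instead of embedding
# credentials in the Helm values.
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  configSecret: alertmanager-main-config
```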

3. To resolve the pain of managing silences across the multiple Alertmanagers we had, one per environment, we moved alert silencing to PagerDuty. That made silences easier to manage and kept them in one place.

Wrapping up

We went through a process of improving our monitoring system and, as a result, resolved most of the pain points we initially faced.

It just goes to show that sometimes a few small changes can make a big difference. It was a great learning experience for us, and we hope that our journey can serve as a source of inspiration for others.

If you’re working on your own monitoring system, we’d love to hear about it.

What challenges have you faced, and how have you overcome them?

Sharing our experiences can help us all continue to improve and optimize our systems.
