
Scaling the Grafana Observability Stack

André Gomes

--

Grafana, with its myriad of products, has cemented itself as an option for building a telemetry collection, storage and visualization solution. When I joined my current company, a previous SRE had put together an initial setup to start collecting some observability data, composed of:

This setup was deployed on each Kubernetes cluster we had, with Grafana living on a “tools” cluster that connected to every data source.

It worked, for what it was, but we were only handling a handful of clusters, all in the same AWS region. With more clusters, more regions and more data ingestion on the horizon, all while providing good visibility for our devs, we needed to give some thought to how we wanted the services above (and more!) to be connected.

Grafana’s Services

The big three are Loki, Mimir and Tempo, which make up the LGTM stack. Each of these is a backend aggregation system for a different type of data: logs, metrics and traces. Grafana provides the entry point for querying and visualizing data coming from multiple data sources that use one of these backends.

The three are similar to each other, in the sense that they all receive, store, and make data queryable. If we figure out how to scale one of the big three, we can replicate the setup for the other two types of data collection.

The extra point here concerns metrics: they are already very useful on their own, but paired with alerts they become a front line for problem detection. We can first determine how to manage metrics, extrapolate that to logs and traces, and finish by also setting up alerts.

Metrics

Our starting point was a setup using only Prometheus, deployed via the Prometheus Operator, obtained from the kube-prometheus-stack Helm Chart:

This works well out of the box, with the Operator handling ServiceMonitor and PodMonitor resources for /metrics endpoint discovery, and PrometheusRule resources for alerts.
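For reference, a minimal ServiceMonitor might look like the sketch below; the names, labels and port are hypothetical and only illustrate the discovery mechanism.

```yaml
# Minimal ServiceMonitor sketch; names, labels and the port name are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack   # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app                    # labels of the Service exposing the metrics port
  endpoints:
    - port: http                     # named port on that Service
      path: /metrics
      interval: 30s
```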

The setup is self-contained, meaning it can be easily replicated across other clusters; all that needs to be done is to add a new data source in Grafana.
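Besides clicking through the UI, the data source can also be added declaratively through Grafana's provisioning files; a sketch, with a placeholder name and URL:

```yaml
# Grafana data source provisioning sketch; the name and URL are placeholders.
apiVersion: 1
datasources:
  - name: Prometheus (cluster-a)
    type: prometheus
    access: proxy
    url: http://prometheus.cluster-a.internal:9090
    isDefault: false
```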

The downsides of this setup are the scattered data and the lack of long-term storage. Both can be mitigated by using Mimir.

At the end of 2021, the Prometheus project had already recognised that the course of action for metric collection and storage would be to deploy collectors where necessary and send data to a centralised location via remote write. Prometheus could already do this, but if you are only scraping metrics and forwarding them, you don't need other capabilities such as local storage, querying and alerting. For that reason, a new mode was created: Prometheus Agent Mode: https://prometheus.io/blog/2021/11/16/agent/

In this mode, Prometheus is stripped down to the bare minimum needed to act as a collector and forwarder of metrics. The kube-prometheus-stack Helm Chart allows deploying a PrometheusAgent instance instead of a Prometheus instance, resulting in the following schema:

This architecture based on collectors and aggregators was already specified some years ago, but still works well enough today.
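As a rough sketch, enabling agent mode in the kube-prometheus-stack chart and remote writing to a central Mimir could look like the values below. The exact value names can differ between chart versions, and the Mimir endpoint and tenant header are placeholders:

```yaml
# kube-prometheus-stack values sketch: Prometheus in agent mode, remote writing
# to a central Mimir. Value names may differ between chart versions; the Mimir
# endpoint and tenant ID are placeholders.
prometheus:
  agentMode: true
  prometheusSpec:
    remoteWrite:
      - url: http://mimir-gateway.observability.svc/api/v1/push
        headers:
          X-Scope-OrgID: cluster-a   # tenant ID, if Mimir multi-tenancy is enabled

# The workload clusters no longer need their own Alertmanager or Grafana.
alertmanager:
  enabled: false
grafana:
  enabled: false
```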

Our starting point was the Prometheus Operator, which at this point only serves to deploy a forwarder of metrics. To expand this architecture to the remaining telemetry types, we can replace the Prometheus Operator with a different service: the OpenTelemetry Operator.

This Operator introduces the OpenTelemetryCollector Custom Resource, which configures data collection for all three types of telemetry. In the case of metrics, an extra component, the Target Allocator, needs to be configured so that the ServiceMonitor and PodMonitor resources can still be used for metrics endpoint discovery.
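A sketch of such a collector, assuming a collector image that ships the prometheus receiver and the prometheusremotewrite exporter (e.g. the contrib distribution); the namespace and Mimir endpoint are placeholders:

```yaml
# OpenTelemetryCollector sketch with the Target Allocator enabled so that
# existing ServiceMonitor / PodMonitor resources keep working.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: metrics
  namespace: observability
spec:
  mode: statefulset          # statefulset mode lets the Target Allocator shard targets
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true          # discover ServiceMonitor and PodMonitor resources
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: [] # scrape targets are handed out by the Target Allocator
    exporters:
      prometheusremotewrite:
        endpoint: http://mimir-gateway.observability.svc/api/v1/push
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```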

Logs and Traces

The setup specified above is the baseline that can now be used for the other types of telemetry data.

With the architecture in place, the Collectors can be configured to also forward logs and traces to the respective backends, and the backends can be configured to receive this data.
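A sketch of what a log-and-trace collector could look like, assuming Loki 3.x with its native OTLP endpoint enabled and a Tempo distributor listening on OTLP/gRPC; all endpoints and names are placeholders, and the host log volume mounts needed by the filelog receiver are omitted for brevity:

```yaml
# Collector sketch that tails container logs and receives traces, forwarding
# them to Loki and Tempo. Endpoints are placeholders.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: logs-and-traces
  namespace: observability
spec:
  mode: daemonset            # one collector per node to pick up container logs
  config:
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      otlphttp/loki:
        endpoint: http://loki-gateway.observability.svc/otlp
      otlp/tempo:
        endpoint: tempo-distributor.observability.svc:4317
        tls:
          insecure: true     # assuming in-cluster traffic without TLS
    service:
      pipelines:
        logs:
          receivers: [filelog]
          exporters: [otlphttp/loki]
        traces:
          receivers: [otlp]
          exporters: [otlp/tempo]
```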

The most likely scenario is that these clusters will be running on a cloud-managed service such as EKS, GKE or AKS. Here it is important to keep in mind the amount of data traffic and the costs that come with it: if the backends are deployed in high-availability mode (spread across availability zones), data will also be transferred across availability zones, which can add an undesired cost.

Extra: Alerts

Similar to Prometheus, Mimir also has a ruler and an Alertmanager component, meaning it can receive alert rule definitions and fire alerts.

Mimir provides three native ways to manage alert rules:

  • Via the mimirtool CLI tool
  • Via the grafana/mimir/operations/mimir-rules-action GitHub Action
  • Via the HTTP configuration API

But a more Kubernetes-native way also exists: using Grafana Alloy (previously Grafana Agent), we can configure the mimir.rules.kubernetes component to discover PrometheusRule resources and apply them to a Mimir instance.
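A sketch of both pieces, using the grafana/alloy Helm chart; the chart value layout, the Mimir ruler address and the tenant ID are assumptions that should be checked against the versions in use. First, an illustrative rule:

```yaml
# An illustrative PrometheusRule that Alloy would pick up and load into Mimir's
# ruler; the alert itself is only an example.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts
  namespace: observability
spec:
  groups:
    - name: example
      rules:
        - alert: HighPodRestartRate
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Pod {{ $labels.pod }} is restarting frequently
```

And the Alloy side, passed as the chart's configuration string:

```yaml
# grafana/alloy Helm chart values sketch: the Alloy configuration is passed as
# a string, and mimir.rules.kubernetes watches PrometheusRule resources and
# pushes them to Mimir's ruler. Address and tenant ID are placeholders.
alloy:
  configMap:
    content: |
      mimir.rules.kubernetes "default" {
        address   = "http://mimir-ruler.observability.svc:8080"
        tenant_id = "cluster-a"
      }
```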

This article serves as an information dump of some research on how to structure an observability stack based on Kubernetes and Grafana tools, ending with a possible set of tools to use and a way to make them work together.
