Long-Term and Scalable HA Prometheus Clusters at Airy

Yudi A Phanama
Published in Airy ♥ Science · Aug 22, 2019

Airy leverages several CNCF landscape projects, including Kubernetes, Istio, and Prometheus. We run several Kubernetes clusters for production, staging, and general-purpose infrastructure tools, and store all metrics, including Istio and Envoy traffic metrics, in Prometheus in each cluster.

The Challenge

Prometheus wasn’t designed to be a scalable, durable long-term data store. By design, it can’t be horizontally scaled, as it stores data locally. Also, by default, it only retains about two weeks’ worth of data, with no guarantee of durability. Vanilla Prometheus deployments won’t be enough for HA and long-term use cases. To achieve them, we need a different approach!

We investigated several possible solutions. Let’s go through them one by one!

Prometheus Federation

Prometheus provides a way to aggregate metrics across servers via Prometheus federation. A federating Prometheus scrapes aggregated series from other Prometheus servers, so we store less data that carries the same meaning, and queries for that information become cheaper. The federating Prometheus can also be configured with a longer retention time. This helps with Prometheus’ scalability problems, and the smaller data volume might let us keep data for longer. But it still doesn’t change the fact that Prometheus wasn’t built to be durable long-term storage.
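For context, federation is just a regular scrape job on the aggregating Prometheus that pulls from the /federate endpoint of the others. A minimal sketch (the job name, match[] selector, and target address below are illustrative, not our actual configuration):

#federation scrape config (sketch)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 1m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # pull only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - 'prometheus-leaf.monitoring.svc:9090'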

Prometheus Remote-Write

Prometheus provides a set of interfaces to federate its storage to other, more durable storage systems. With this, we can send Prometheus data to storage that is specifically designed for long-term retention. But this would add undesirable operational and complexity overhead to maintain those additional storage systems. We also still want to use the PromQL API to query metrics, and using those remote storage systems would mean using other query methods. Of course, we can use Prometheus remote-read, but that kind of query is expensive.
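For illustration, remote-write and remote-read are configured directly in prometheus.yml. A minimal sketch, assuming a generic storage adapter; the URLs are placeholders, not a system we actually run:

#remote write/read (sketch)
remote_write:
  - url: "http://remote-storage-adapter.monitoring.svc:9201/write"
remote_read:
  - url: "http://remote-storage-adapter.monitoring.svc:9201/read"
    read_recent: false   # serve recent data from the local TSDB instead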

Thanos

Thanos is an open-source project by Improbable-Eng, designed to make Prometheus scalable and to store its data long term. It is a set of components that can be integrated with your current live Prometheus deployment to turn it into a highly available, scalable metrics system with virtually unlimited storage. Thanos is fairly easy to set up and maintain, as we only need to apply several Kubernetes manifests and use a managed datastore like AWS S3 or GCS.

We chose Thanos as the best solution because it is the easiest to set up and provides us a centralized Prometheus dashboard/querier.

Our Setup

Before adopting Thanos, we already had several Prometheis (yes, that is the plural form of Prometheus) running live in our production and staging clusters. Integrating a new solution while keeping that live data accessible would be a challenge! Fortunately, we can implement Thanos on a running Prometheus and still upload all of its data to S3!

To implement Thanos, we have several mandatory and optional components. The mandatory components are the sidecar, store gateway, querier, and compactor. The optional component is the ruler. For clarity, we’ll add a “thanos-” prefix to identify each component for the rest of this post.

The thanos-sidecar functions as the data uploader, reading local Prometheus data and uploading it to the S3/GCS bucket. The thanos-store-gateway is used to query historical data stored in the bucket. The thanos-querier is our centralized Prometheus querier; we send PromQL queries via the thanos-querier. The thanos-compactor compacts the data stored in the S3/GCS bucket, making queries on historical data more efficient. The optional thanos-ruler can be used to evaluate Prometheus recording and alerting rules, but it has a tradeoff that made us opt out of using it. For more information on how Thanos and its components work, please visit the docs!

Our Thanos Setup

Each Prometheus replica must be uniquely identifiable via Prometheus’ external_labels configuration. With the standard HA StatefulSet and ConfigMap setup, this is quite a challenge, as each replica consumes the same set of external_labels from the same ConfigMap file. To solve this, thanos-sidecar provides a templating mechanism to generate Prometheus config files with unique labels. Instead of letting Prometheus use the file from the ConfigMap directly, we’ll use the sidecar-generated file for each Prometheus replica. This means we have to change the Prometheus config file name in the ConfigMap. We’ll cover this and provide the manifests below!

First, we have to add the sidecar to the StatefulSet so that it runs alongside the current Prometheus. All the sidecars must be reachable by the thanos-querier, so we’ll also need a headless Service for the thanos-querier to reach the individual pods. Istio is known to have issues with headless services, so in our case, we had to work around this (you can find the workaround in our post; if you are not using Istio, you don’t need it!). Additionally, several Thanos components need a configuration to interact with the S3 bucket, so we’ll create a ConfigMap for that as well! Here are the Prometheus StatefulSet, ConfigMap, and headless Service manifests after integrating with Thanos.

#statefulset (trimmed; only the parts relevant to Thanos are shown)

containers:
  - name: prometheus-server
    args:
      - --storage.tsdb.retention.time=3d
      - --query.timeout=5m
      - --config.file=/etc/prometheus-shared/prometheus.yml

  - name: thanos
    args:
      - sidecar
      - --log.level=info
      - --tsdb.path=/data
      - --reloader.config-file=/etc/config/prometheus.yml.tmpl
      - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml
    env:
      - name: POD_NAME                 # used to render $(POD_NAME) in the template
        valueFrom:
          fieldRef:
            fieldPath: metadata.name

#configmap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server
  namespace: monitoring
data:
  prometheus.yml.tmpl: |-
    global:
      evaluation_interval: 30s
      scrape_interval: 30s
      scrape_timeout: 30s
      external_labels:
        federated_cluster: cluster-A
        replica: cluster-A-$(POD_NAME)
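And here is a sketch of the headless Service that lets the thanos-querier reach each sidecar’s gRPC endpoint (the Service name and selector labels are assumptions; adjust them to match your own StatefulSet; 10901 is the sidecar’s default gRPC port):

#headless service (sketch)
apiVersion: v1
kind: Service
metadata:
  name: prometheus-headless
  namespace: monitoring
spec:
  clusterIP: None              # headless: gives every pod its own DNS record
  selector:
    app: prometheus-server
  ports:
    - name: grpc
      port: 10901              # Thanos sidecar gRPC (StoreAPI) port
      targetPort: 10901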

You can see that we now have a new container (thanos) in the prometheus-server StatefulSet specification as a sidecar.

We named the config file in the ConfigMap prometheus.yml.tmpl, then supplied the sidecar with two flags, --reloader.config-file and --reloader.config-envsubst-file, which are the input and output of the templated config file, respectively. What we are actually templating is the replica label in external_labels, supplying it with $(POD_NAME), which gets substituted with each Prometheus pod’s name. If you have a running Prometheus server, you can easily add the sidecar by editing your manifests with the kubectl edit command!

Thanos components also use a shared ConfigMap which defines the object storage to be used to upload metrics data. Here’s the YAML manifest.

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-s3-config
  namespace: monitoring
data:
  storage.yaml: |
    type: S3
    config:
      bucket: "bucket-name"
      endpoint: "s3.us-east-1.amazonaws.com"
      insecure: false
      signature_version2: false

If you use S3, Thanos uses the S3 endpoint to determine which S3 API region to reach. You can find the list of S3 endpoints here.

After integrating the sidecar and ConfigMap, we then deploy the other components: thanos-store-gateway and thanos-compactor. The store gateway functions as the gateway for old data, querying historical metrics from our datastore/S3. By old data, we mean data that is no longer stored locally on our Prometheus servers because it has been deleted by the Prometheus retention policy. We deploy it as a StatefulSet with a headless Service so that the thanos-querier can reach the thanos-store-gateway’s individual pods. Here are the manifests.
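A trimmed sketch of the thanos-store-gateway StatefulSet and its headless Service (the container image is omitted, the names and labels are illustrative, and it mounts the thanos-s3-config ConfigMap shown above):

#store-gateway (sketch)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  serviceName: thanos-store-gateway
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
    spec:
      containers:
        - name: thanos-store-gateway
          args:
            - store
            - --data-dir=/data                              # local cache of the bucket index
            - --objstore.config-file=/etc/thanos/storage.yaml
          ports:
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: thanos-s3-config
              mountPath: /etc/thanos
      volumes:
        - name: thanos-s3-config
          configMap:
            name: thanos-s3-config
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  clusterIP: None            # headless, so the querier can reach each pod
  selector:
    app: thanos-store-gateway
  ports:
    - name: grpc
      port: 10901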

Thanos-compactor compacts the historical data in our datastore and downsamples it to speed up queries served by the thanos-store-gateway. The compactor downsamples raw data in the datastore into 5-minute and 1-hour resolutions. We can also tune data retention through its command-line flags! With thanos-compactor, we can reduce the size of the data and speed up our queries considerably, with the tradeoff of data resolution. Thanos-compactor is a singleton, which means we can’t replicate this component, nor run another instance in other clusters against the same datastore/S3 bucket. We also have to turn thanos-compactor off when we manually modify data in the datastore/S3. We use a StatefulSet to deploy it. Here’s the manifest.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: thanos-compactor
  name: thanos-compactor
  namespace: monitoring
spec:
  podManagementPolicy: OrderedReady
  replicas: 1                        # singleton!
  selector:
    matchLabels:
      app: thanos-compactor
  serviceName: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos-compactor
          args:
            - compact
            - --data-dir=/data
            - --sync-delay=30m
            # optional retention tuning, e.g.:
            # - --retention.resolution-raw=180d
            # - --retention.resolution-5m=180d
            # - --retention.resolution-1h=0d   # 0d keeps downsampled data forever

Note that we must run only one replica of thanos-compactor, as it is a singleton. We also must not deploy another one in a different cluster if it would access the same bucket!

The last needed component is thanos-querier. The querier acts as a centralized query gateway for users. It implements the Prometheus Query API, which means we can treat it as a Prometheus data source for Grafana. It also has a Query UI similar to Prometheus’. The thanos-querier needs to be able to reach the individual Prometheus sidecars and thanos-store-gateways, both those in the same cluster and those outside the querier’s cluster. We manually specify the endpoints with the --store flag in the querier. Thanos-querier is a stateless service, so we’ll use a Kubernetes Deployment! Here are the Deployment and Service manifests.


#querier deployment (trimmed)
spec:
  containers:
    - name: thanos-query
      args:
        - query
        - --log.level=info
        - --query.auto-downsampling
        - --query.replica-label=replica
        - --query.partial-response
        - --query.timeout=5m
        - --query.max-concurrent=200
        # the --store endpoints below are illustrative; point them at your
        # sidecars' and store gateways' gRPC addresses (dns+ resolves all A records)
        - --store=dns+prometheus-headless.monitoring.svc.cluster.local:10901
        - --store=dns+thanos-store-gateway.monitoring.svc.cluster.local:10901

Notice that we also supply --query.replica-label, which the thanos-querier needs to deduplicate data coming from the HA Prometheus replicas.

After deploying all the needed components, we can access the Query UI through the thanos-querier service. You’ll have to add the needed Ingress or Istio Gateway resources to access it from outside the cluster; a sketch of the Istio variant follows.
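A minimal sketch using Istio’s Gateway and VirtualService (the hostname and Service name are placeholders; 10902 is thanos-query’s default HTTP port):

#istio gateway + virtualservice (sketch)
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: thanos-query-gateway
  namespace: monitoring
spec:
  selector:
    istio: ingressgateway          # use Istio's default ingress gateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "thanos.example.com"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  hosts:
    - "thanos.example.com"
  gateways:
    - thanos-query-gateway
  http:
    - route:
        - destination:
            host: thanos-query.monitoring.svc.cluster.local
            port:
              number: 10902        # thanos-query HTTP port

Here’s the Thanos Query UI: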

Thanos Querier Query UI

To integrate it with Grafana, we just add a new Prometheus data source to a running Grafana:

Adding Thanos to Grafana Data Source
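If you provision Grafana declaratively, the same thing can be expressed as a data source provisioning file (a sketch; the URL assumes the in-cluster thanos-query Service on its default HTTP port):

#grafana datasource provisioning (sketch, e.g. provisioning/datasources/thanos.yaml)
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus             # thanos-querier speaks the Prometheus API
    access: proxy
    url: http://thanos-query.monitoring.svc.cluster.local:10902
    isDefault: false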

Results

We have been running Thanos alongside our Prometheus for over a month at the time of this writing. Previously, our Prometheus retention time was 14 days. With Thanos, we can shorten it; we now use a retention time of 5 days. For data older than 5 days, the thanos-store-gateway queries the old data from our S3 datastore. Here’s an example 45-day Grafana query:

Grafana Query to Thanos

Query time is also quite fast for our use case. Here’s how long it takes to load around 14 Grafana panels with a 45-day time range (the screenshot above), measured with Google Chrome’s network inspection tool:

Browser Network Inspection

Additionally, we now have a centralized query gateway: the thanos-querier! We can query every connected Prometheus server through it! This reduces the operational overhead of having to query different Prometheus endpoints when you operate multiple Kubernetes clusters. We can also use a Grafana variable to switch between Kubernetes/Prometheus clusters using the federated Prometheus labels!
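As an illustration, a Grafana dashboard variable backed by the Thanos data source can list the clusters from the federated_cluster external label we set earlier (label_values is Grafana’s Prometheus variable query function; up is just a metric that every target exposes):

#grafana template variable "cluster", queried against the Thanos data source
label_values(up, federated_cluster)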

Caveats & Wrapping Up

Running Thanos has a couple of caveats that we don’t consider deal-breakers: extra cost and possible latency overhead compared to vanilla Prometheus. To run its components, Thanos needs additional compute, memory, and storage resources. The architecture also introduces additional hops to reach the metrics (querier->store-gateway, store-gateway->datastore/S3, querier->sidecar, and sidecar->Prometheus). The documentation covers the cost and performance considerations clearly. Depending on your use case, these caveats might or might not be a problem!

To wrap up, we have explored Thanos, a set of components that turns our Prometheus deployments into a long-term TSDB solution with a highly available setup. With Thanos, we’ve greatly enhanced our metrics system, providing users with a single centralized query pane and long-term data storage, while making our Prometheus stack scalable and highly available!
