Thanos: Highly Available Prometheus Kubernetes Clusters

Avner Zini
HiredScore Engineering
Mar 17, 2022

In a world where thousands of services and applications are deployed in multiple infrastructures, monitoring in highly-available environments has become an essential part of every development process.

In this article, I will present the thought process and the lessons we’ve learned from using Thanos to store Prometheus metrics from multiple clusters in our new EKS multi-cluster infrastructure.

Introduction

As HiredScore’s products and client base grew larger, we started the transition to Kubernetes and moved fast to adopt it. One of our important blockers, and probably the biggest, was the monitoring infrastructure. We had some experience with the Prometheus / Grafana stack for monitoring, and we knew we wanted to build a better, highly available, and resilient infrastructure, with data retention that would be feasible and cost-efficient, and that would keep us prepared for HiredScore’s hyper-growth.

CNCF promotes several projects that address these monitoring pain points and enable monitoring with high availability, data retention, and cost-effectiveness.

Requirements

  • A single point of observability that aggregates all the data from all clusters in any region.
  • Highly available and resilient infrastructure for Prometheus.
  • Data retention for all our application data.
  • Cost-efficient solution.

We chose to implement the Kube-Prometheus solution by Bitnami & Kube-Thanos by Thanos-io. The solution worked out very well and answered all of our needs.

Let’s meet the players:

Prometheus — a free software application used for event monitoring and alerting. It records real-time metrics in a time series database built using an HTTP pull model, with flexible queries and real-time alerting.

Thanos — an open-source CNCF Sandbox project that builds upon Prometheus components to create a global-scale highly available monitoring system. It seamlessly extends Prometheus in a few simple steps.

How does it work?

As you can see in the diagram, each EKS cluster runs two Prometheus pods in the same namespace that monitor the cluster by scraping its workloads. Each Prometheus pod keeps the last couple of hours of data in a dedicated PVC; after the defined retention time, the data is uploaded to the S3 bucket by the Thanos sidecar. This way we only pay for a small amount of local storage and keep everything else in a centralized place (S3).

To display the data from the k8s clusters in Grafana, we created a dedicated cluster that is responsible for collecting all the real-time (last ~2 hours) data directly from each cluster over gRPC, by connecting to the thanos-sidecar container (exposed on port 10901 by default), and the long-range data from the S3 bucket (ObjectStore).

Let’s deep dive into the implementation details:

  1. The first phase was to implement kube-prometheus along with Thanos sidecar in each cluster.
  2. The second phase was to implement kube-thanos in the “aggregation” cluster. It is responsible for collecting the real-time data from all the clusters, and the retained data that was sent to the S3 bucket (ObjectStore).

Sounds great, so how do we actually do it?

First Phase

Here we focus on how to deploy and configure Prometheus along with Thanos sidecar in each cluster that we want to monitor.

Create a namespace named monitoring in each cluster:
kubectl create ns monitoring

Create a storage class to enable Prometheus to persist data:

kubectl apply -f prometheus-storage-class.yaml -n monitoring
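A minimal sketch of such a storage class for EKS, assuming EBS-backed gp2 volumes and the in-tree provisioner (the name and parameters are placeholders to adapt):

# prometheus-storage-class.yaml (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer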

Install kube-prometheus:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Copy the relevant values that you want to configure into your local folder.
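Assuming the Bitnami repo added above, one way to do that is to dump the chart’s default values into a local file and edit it:

helm show values bitnami/kube-prometheus > values.yaml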

Some changes that need to be applied in the values:

Step 1:

Make Prometheus highly available:
Set the Prometheus replica count — the number of Prometheus replicas desired (at least 2)
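In the Bitnami chart this is the prometheus.replicaCount value (key names may vary between chart versions); a sketch:

# values.yaml excerpt (sketch)
prometheus:
  replicaCount: 2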

Step 2:

Define Prometheus pod resource limits — set them to avoid Prometheus consuming all of the available resources.
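A sketch of the corresponding values (the numbers are illustrative; size them for your workload):

# values.yaml excerpt (sketch)
prometheus:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 4Gi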

Step 3:

Enable Thanos sidecar creation
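In the Bitnami chart this is controlled by prometheus.thanos.create (key name may vary by version):

# values.yaml excerpt (sketch)
prometheus:
  thanos:
    create: true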

Step 4:

Change the Thanos sidecar service type from ClusterIP to LoadBalancer — this will create an AWS Classic Load Balancer endpoint that exposes the sidecar on the gRPC port (10901). We can then route this endpoint via Route 53 to a DNS name such as thanos-prometheus-(cluster_name).
Expose the Thanos endpoint in your cluster under prometheus.thanos.service:
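A sketch of the corresponding values (key names follow the Bitnami chart and may vary by version):

# values.yaml excerpt (sketch)
prometheus:
  thanos:
    service:
      type: LoadBalancer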

Now, after creating the CLB, we need to reference it in the kube-thanos manifests, which we will get to later, in the second phase.

Step 5:

Disable compaction and define retention — this is a very important step for uploading the data via the Thanos sidecar (see the values sketch after the list below):

  • In order to use the Thanos sidecar upload, these two values have to be equal: --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration. By default, they are both set to 2 hours.
  • The Prometheus retention is recommended to be no lower than three times the min block duration, so 6 hours.
  • Additional explanation can be found here
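A sketch of the corresponding chart values (key names follow the Bitnami chart and may vary between versions):

# values.yaml excerpt (sketch)
prometheus:
  disableCompaction: true
  retention: 6h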

Step 6:

Enable the config secret — by enabling the Object Storage Config we can write data to S3 or any other supported object store, ensuring persistence of our long-term data.
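A sketch of the corresponding chart values, assuming the Bitnami chart exposes the operator’s objectStorageConfig under prometheus.thanos (key names may vary by version); the secret name matches the one created below:

# values.yaml excerpt (sketch)
prometheus:
  thanos:
    objectStorageConfig:
      secretName: thanos-objstore-config
      secretKey: thanos.yaml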

The source file thanos-storage-config.yaml has to follow the Thanos object storage configuration format.
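A minimal sketch, assuming an S3 bucket accessed with static credentials (bucket, region, and keys are placeholders; an IAM role can replace the keys):

# thanos-storage-config.yaml (sketch)
type: S3
config:
  bucket: "<bucket-name>"
  endpoint: "s3.<region>.amazonaws.com"
  access_key: "<ACCESS_KEY>"
  secret_key: "<SECRET_KEY>"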

It is worth mentioning that, currently, we can use only a single S3 bucket (ObjectStore).

Create the secret using the following command :
kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=thanos-storage-config.yaml

Step 7:

Now we can install/upgrade the helm chart with our relevant customizations.

helm install kube-prometheus -f values.yaml bitnami/kube-prometheus -n monitoring

or
helm upgrade kube-prometheus -f values.yaml bitnami/kube-prometheus -n monitoring

If you made it this far, you should by now have running Prometheus pods with Thanos sidecar containers that, on the one hand, serve the freshly scraped data over gRPC and, on the other hand, upload the data (after ~2 hours) to the S3 bucket (ObjectStore). Congrats!
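A quick sanity check for this phase (the exact pod and container names depend on your release name and chart version; after ~2 hours the sidecar logs should show blocks being uploaded to the bucket):

kubectl get pods -n monitoring
kubectl logs -n monitoring <prometheus-pod> -c thanos-sidecar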

Second Phase

Here we focus on how to deploy and configure Thanos on the main observability cluster. As mentioned before, it is responsible for collecting all the data from all the clusters we deployed in the first phase.

For that, we use kube-thanos manifests. We found that for our purpose we need to implement only the query and the store parts.

Step 1:

Installing and customizing kube-thanos:
Create a namespace named thanos in the main observability cluster:
kubectl create ns thanos

You can choose to clone the kube-thanos repository and use the manifests folder, or to compile the kube-thanos manifests yourself. The latter doesn’t require you to keep a copy of the entire repository, only the manifest files.
The full instructions can be found in the kube-thanos README.md.
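For example, the clone option looks like this (the repository layout may change between releases):

git clone https://github.com/thanos-io/kube-thanos.git
cd kube-thanos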

Step 2:

After you have completed the first step, we will take care of the communication between thanos-query-deployment.yaml and the clusters from the first phase. For that, we need to add this:
- --store=dnssrv+_grpc._tcp.thanos-prometheus-<cluster_name>.<domain_name>:10901

into the args section, once per cluster, using the Thanos sidecar gRPC endpoint we exposed and defined above (Step 4).
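In context, the args section of thanos-query-deployment.yaml ends up looking roughly like this (the cluster and domain names are placeholders, and the surrounding flags are whatever kube-thanos generated for your version):

# thanos-query-deployment.yaml excerpt (sketch)
args:
  - query
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:9090
  # one --store line per monitored cluster
  - --store=dnssrv+_grpc._tcp.thanos-prometheus-<cluster_name>.<domain_name>:10901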

Step 3:

Now, we’ll take care of the communication between thanos-store and the S3 bucket (ObjectStore) that we configured data to be sent to in the first phase. So, as in the first phase, we need to create a secret with the name expected by thanos-store-statefulSet.yaml as part of the environment injected into the Thanos store pods:
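For reference, the part of thanos-store-statefulSet.yaml that injects the configuration looks roughly like this (as generated by kube-thanos; names may differ between versions), which is where the expected secret name comes from:

# thanos-store-statefulSet.yaml excerpt (sketch)
env:
  - name: OBJSTORE_CONFIG
    valueFrom:
      secretKeyRef:
        name: thanos-objectstorage
        key: thanos.yaml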

We can then reuse the same source file from the first phase, thanos-storage-config.yaml, and create the secret for thanos-store:

kubectl -n thanos create secret generic thanos-objectstorage --from-file=thanos.yaml=thanos-storage-config.yaml

Step 4:

Install Thanos manifests:
kubectl apply -f manifests -n thanos

Now the cycle should be closed: Thanos receives the real-time data from the other clusters through the thanos-query deployment, and the retained data from the S3 bucket (ObjectStore) through the thanos-store statefulset.
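To close the loop with Grafana (mentioned at the beginning), the Thanos Query HTTP endpoint can be added as a regular Prometheus data source. A provisioning sketch, assuming the default thanos-query service name and HTTP port from kube-thanos:

# Grafana data source provisioning (sketch; service name and port are assumptions)
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query.thanos.svc.cluster.local:9090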

Conclusion

  • Thanos made us change our perception about making Prometheus highly available, durable, and cost-efficient.
  • Implementing Thanos and Prometheus on many Kubernetes clusters will require a lot of effort but it’s worthwhile if you care about ensuring a highly available Prometheus.
  • Personally, it was one of the most challenging projects I’ve ever had. I wish I had had this article when I started it.

I want to thank Yossi Cohn and Regev Golan; without their help, none of the above would have happened.
I hope you enjoyed reading this article and hopefully, it has inspired you to rethink and improve your monitoring stack ✌️ & ❤️
