High Availability Kubernetes Monitoring using Prometheus and Thanos

Introduction

appfleet team
appfleet
18 min read · Mar 4, 2020


The need for Prometheus High Availability

Kubernetes adoption has grown multifold in the past few months, and it is now clear that Kubernetes is the de facto standard for container orchestration. Likewise, Prometheus is considered an excellent choice for monitoring both containerized and non-containerized workloads. Monitoring is an essential aspect of any infrastructure, and we should make sure that our monitoring set-up is highly available and highly scalable in order to match the needs of an ever-growing infrastructure, especially in the case of Kubernetes.

Therefore, today we will deploy a clustered Prometheus set-up which is not only resilient to node failures, but also archives data appropriately for future reference. Our set-up is also very scalable, to the extent that we can span multiple Kubernetes clusters under the same monitoring umbrella.

Present scenario

The majority of Prometheus deployments use persistent volumes for pods, while Prometheus is scaled using a federated set-up. However, not all data can be aggregated through federation, and you often need an additional mechanism to manage the Prometheus configuration as you add more servers.

The Solution

Thanos aims to solve the above problems. With the help of Thanos, we can not only run multiple instances of Prometheus and de-duplicate data across them, but also archive data in long-term storage such as GCS or S3.

Implementation

Thanos Architecture

Image Source: https://thanos.io/quick-tutorial.md/

Thanos consists of the following components:

  • Thanos Sidecar: This is the main component that runs alongside Prometheus. It reads Prometheus data and archives it in the object store. It also manages Prometheus’ configuration and lifecycle. To distinguish each Prometheus instance, the sidecar injects external labels into the Prometheus configuration. The sidecar can run queries against the Prometheus server’s PromQL interface; it also listens on the Thanos gRPC protocol and translates queries between gRPC and REST.
  • Thanos Store: This component implements the Store API on top of historical data in an object storage bucket. It acts primarily as an API gateway and therefore does not need significant amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times.
  • Thanos Query: The Query component listens on HTTP and translates queries to the Thanos gRPC format. It aggregates query results from different sources and can read data from both Sidecar and Store. In an HA setup, it also deduplicates the results.

Run-time deduplication of HA groups

Prometheus is stateful and does not allow its database to be replicated. This means that increasing availability by running multiple Prometheus replicas is not straightforward. Simple load balancing will not work: after a crash, for example, a replica might be back up, but querying it will show a small gap for the period it was down. A second replica may have been up during that time, but it could be down at another moment (e.g. during a rolling restart), so load balancing across the replicas will not work well.

Thanos Querier instead pulls data from both replicas and deduplicates those signals, filling any gaps transparently to the Querier’s consumer.
  • Thanos Compact: The compactor component applies the compaction procedure of the Prometheus 2.0 storage engine to block data stored in object storage. It is generally not concurrency safe and must be deployed as a singleton against a bucket.
    It is also responsible for downsampling data: it performs 5m downsampling after 40 hours and 1h downsampling after 10 days.
  • Thanos Ruler: It evaluates rules in the same way Prometheus does. The only difference is that it can communicate with Thanos components.

Configuration

Prerequisite

In order to completely understand this tutorial, the following are needed:

  1. Working knowledge of Kubernetes and using kubectl
  2. A running Kubernetes cluster with at least 3 nodes (for the purpose of this demo a GKE cluster is being used)
  3. An Ingress Controller and familiarity with Ingress objects (for the purpose of this demo the Nginx Ingress Controller is being used). This is not mandatory, but it is highly recommended in order to reduce the number of external endpoints created.
  4. Creating credentials to be used by Thanos components to access object store (in this case GCS bucket)
  5. Create 2 GCS buckets and name them as prometheus-long-term and thanos-ruler
  6. Create a service account with the role as Storage Object Admin
  7. Download the key file as json credentials and name it as thanos-gcs-credentials.json
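  8. Create a Kubernetes secret from the credentials. The monitoring namespace must already exist when this command runs; it is created by the first manifest in the deployment steps that follow.
    kubectl create secret generic thanos-gcs-credentials --from-file=thanos-gcs-credentials.json -n monitoring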

Deploying various components

Deploying Prometheus Service Accounts, ClusterRole and ClusterRoleBinding

The manifest for this step creates the monitoring namespace, along with the service account, ClusterRole, and ClusterRoleBinding needed by Prometheus.
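
The manifest itself is not reproduced in this text. As a rough sketch of what such a manifest could look like, assuming a service account named monitoring and a read-only rule set covering the objects Prometheus usually needs for service discovery (both are assumptions, not the tested production manifest):

```yaml
# Illustrative sketch only; resource names and the rule list are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring            # assumed service account name
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
rules:
  # Read-only access to the objects Prometheus uses for service discovery.
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Allows scraping non-resource /metrics endpoints such as the API server's.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring
```

A ClusterRole (rather than a namespaced Role) is used in this sketch because Prometheus typically discovers targets across all namespaces.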

Deploying Prometheus Configuration configmap

This ConfigMap holds a template for the Prometheus configuration file. The Thanos sidecar reads the template and generates the actual configuration file, which is in turn consumed by the Prometheus container running in the same pod. It is extremely important to add the external_labels section to the config file so that the Querier can deduplicate data based on those labels.
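
To make the role of external_labels concrete, here is a minimal sketch of such a template. The ConfigMap name, the label names cluster and replica, the scrape settings, and the POD_NAME variable are all assumptions made for illustration; the $(POD_NAME) placeholder relies on the sidecar's config reloader substituting environment variables when it renders the file.

```yaml
# Illustrative sketch only; names, label values and scrape jobs are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf      # assumed name
  namespace: monitoring
data:
  prometheus.yaml.tmpl: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        # Identical on every replica of this HA group.
        cluster: prometheus-ha
        # Unique per replica; replaced with the pod name by the sidecar.
        replica: $(POD_NAME)
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
```

With labels like these, two replicas scraping the same target produce series that differ only in the replica label, which is what allows the Querier to treat them as duplicates.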

Deploying Prometheus Rules configmap

This creates our alert rules, which will be relayed to Alertmanager for delivery.
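
As an example of the shape such a ConfigMap could take, the sketch below defines a single alert. The ConfigMap name, file name, rule name and threshold are hypothetical, and the expression assumes kube-state-metrics is being scraped.

```yaml
# Illustrative sketch only; the rule shown is a made-up example.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules            # assumed name
  namespace: monitoring
data:
  alert-rules.yaml: |-
    groups:
      - name: deployment
        rules:
          - alert: DeploymentReplicasUnavailable
            # Fires when a deployment has had unavailable replicas for 5 minutes.
            expr: kube_deployment_status_replicas_unavailable > 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.deployment }} has unavailable replicas"
```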

Deploying Prometheus Stateful Set

It is important to understand the following about the manifest for this step:

  1. Prometheus is deployed as a stateful set with 3 replicas, and each replica provisions its own persistent volume dynamically.
  2. The Prometheus configuration is generated by the Thanos sidecar container from the template file we created above.
  3. Thanos handles data compaction, so we set --storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h. Making both values equal turns off Prometheus’ own local compaction.
  4. The Prometheus stateful set is labelled thanos-store-api: true so that each pod is discovered by the headless service we will create next. The Thanos Querier uses this headless service to query data across all Prometheus instances. We apply the same label to the Thanos Store and Thanos Ruler components so that the Querier discovers them too and can use them for querying metrics.
  5. The path to the GCS bucket credentials is provided through the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the credentials file is mounted at that path from the secret we created as part of the prerequisites.
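
The points above can be sketched in a trimmed-down stateful set. This is not the tested production manifest: the image tags are placeholders, and the mount paths, volume names, ConfigMap name, service account name and storage size are assumptions. Resource limits, security context and the rules volume are left out for brevity.

```yaml
# Illustrative sketch only; <version> is a placeholder, paths and names are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceName: prometheus-service          # assumed governing service name
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"           # matched by the headless service
    spec:
      serviceAccountName: monitoring       # assumed service account name
      containers:
        - name: prometheus
          image: prom/prometheus:<version>
          args:
            - --config.file=/etc/prometheus-shared/prometheus.yaml
            - --storage.tsdb.path=/prometheus/
            - --web.enable-lifecycle       # lets the sidecar trigger config reloads
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
          ports:
            - name: http
              containerPort: 9090
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus/
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
        - name: thanos
          image: quay.io/thanos/thanos:<version>
          args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://127.0.0.1:9090
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            # Template in, rendered config out (shared with the Prometheus container).
            - --reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl
            - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: true
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-server-conf   # assumed ConfigMap name
        - name: prometheus-config-shared
          emptyDir: {}
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
  volumeClaimTemplates:
    - metadata:
        name: prometheus-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi                  # assumed size
```

The emptyDir volume is what lets the two containers share the rendered configuration file: the sidecar writes it and Prometheus reads it.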

Deploying Prometheus Services

We create a separate service for each Prometheus pod in the stateful set. These are not strictly needed and exist only for debugging purposes. The purpose of the thanos-store-gateway headless service has been explained above. We will later expose the Prometheus services using an ingress object.
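
A headless service of this kind could look roughly like the sketch below. The port number matches the gRPC port used elsewhere in this article; the port name is an assumption.

```yaml
# Illustrative sketch only.
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None                # headless: DNS returns the individual pod IPs
  ports:
    - name: grpc
      port: 10901
      targetPort: 10901
  selector:
    thanos-store-api: "true"     # any pod carrying this label is included
```

Because the service is headless, a DNS lookup returns every matching pod rather than a single virtual IP, which is what the Querier needs in order to reach each Store API endpoint individually.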

Deploying Thanos Querier

This is one of the main components of Thanos deployment. Note the following:

  1. The container argument --store=dnssrv+thanos-store-gateway:10901 helps to discover all components from which metric data should be queried.
  2. The thanos-querier service provides a web interface to run PromQL queries. It also has the option to de-duplicate data across various Prometheus clusters.
  3. This is the endpoint we give Grafana as the datasource for all dashboards.
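
A pared-down version of the Querier deployment and its service might look as follows. The image tag is a placeholder, and the replica label name (replica) is an assumption that has to match whatever label is set in external_labels. The service maps port 9090 to 10902, which is the default HTTP port of Thanos components; flag names can differ between Thanos releases, so check them against the version you run.

```yaml
# Illustrative sketch only; <version> is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:<version>
          args:
            - query
            # Label that differs between HA replicas and is ignored when deduplicating.
            - --query.replica-label=replica
            # Discover Store API endpoints behind the headless service.
            - --store=dnssrv+thanos-store-gateway:10901
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
---
apiVersion: v1
kind: Service
metadata:
  name: thanos-querier
  namespace: monitoring
spec:
  selector:
    app: thanos-querier
  ports:
    - name: http
      port: 9090
      targetPort: 10902
```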

Deploying Thanos Store Gateway

This will create the store component which serves metrics from object storage to the Querier.
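
A minimal sketch of the store component is shown below, assuming the same bucket and credentials secret as the sidecar. The image tag is a placeholder, and the data directory and mount paths are assumptions.

```yaml
# Illustrative sketch only; <version> is a placeholder, paths are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  replicas: 1
  serviceName: thanos-store-gateway
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        thanos-store-api: "true"     # makes the pod discoverable by the Querier
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:<version>
          args:
            - store
            - --data-dir=/data       # local cache of block metadata
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: grpc
              containerPort: 10901
          volumeMounts:
            - name: data
              mountPath: /data
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: true
      volumes:
        - name: data
          emptyDir: {}               # safe to lose; rebuilt from the bucket on startup
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
```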

Deploying Thanos Ruler

Now, if you start an interactive shell in the same namespace as our workloads and check which pods the thanos-store-gateway service resolves to, you will see something like this:

The IPs returned correspond to our Prometheus pods, thanos-store and thanos-ruler. This can be verified by listing the pods together with their IPs (for example with kubectl get pods -o wide) and comparing the addresses.

Deploying Alertmanager

This will create our alertmanager deployment which will deliver all alerts generated as per Prometheus rules.
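
The Alertmanager deployment needs a routing configuration. The fragment below is a hypothetical minimal example: the ConfigMap name, receiver name, timings and webhook URL are placeholders chosen for illustration, not values from the original set-up.

```yaml
# Illustrative sketch only; the receiver and its URL are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager                 # assumed name
  namespace: monitoring
data:
  config.yml: |-
    route:
      receiver: default
      group_by: ["alertname"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: default
        webhook_configs:
          - url: http://alert-receiver.example.com/hook
```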

Deploying kube-state-metrics

The kube-state-metrics deployment is needed to expose important metrics about the state of Kubernetes objects, which are not natively exposed by the kubelet and hence are not directly available to Prometheus.

Deploying Node-Exporter Daemonset

The Node-Exporter daemonset runs a node-exporter pod on each node and exposes very important node-related metrics which can be pulled by the Prometheus instances.

Deploying Grafana

This will create our Grafana Deployment and Service which will be exposed using our Ingress Object. We should add Thanos-Querier as the datasource for our Grafana deployment. In order to do so:

  1. Click on Add DataSource
  2. Set Name: DS_PROMETHEUS
  3. Set Type: Prometheus
  4. Set URL: http://thanos-querier:9090
  5. Save and Test.

You can now build your custom dashboards or simply import dashboards from grafana.net. Dashboard #315 and #1471 are good to start with.
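
If you would rather not click through the UI, the same datasource can be declared in a Grafana provisioning file. The sketch below mirrors the values from the steps above; marking it as the default datasource and mounting the file under Grafana's provisioning/datasources directory are assumptions about how your Grafana deployment is set up.

```yaml
# Illustrative sketch only; mirrors the UI steps above.
apiVersion: 1
datasources:
  - name: DS_PROMETHEUS
    type: prometheus
    access: proxy                      # Grafana's backend issues the queries
    url: http://thanos-querier:9090
    isDefault: true
```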

Deploying the Ingress Object

This is the final piece in the puzzle. The ingress object exposes all our services outside the Kubernetes cluster so that we can access them. Make sure you replace <yourdomain> with a domain name that you control and can point at the Ingress Controller’s service.
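
As a sketch of what such an ingress object could look like, the example below routes two hostnames to the Querier and Grafana. The ingress name, the nginx ingress class, and the Grafana service name and port are assumptions; <yourdomain> is the same placeholder used in the text.

```yaml
# Illustrative sketch only; service names and ports other than thanos-querier:9090 are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
spec:
  ingressClassName: nginx
  rules:
    - host: thanos-querier.<yourdomain>.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-querier
                port:
                  number: 9090
    - host: grafana.<yourdomain>.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
```

Further hosts for the Prometheus, Alertmanager and Ruler services can be added in the same way.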

You should now be able to access Thanos Querier at http://thanos-querier.<yourdomain>.com. It will look something like this:

Make sure deduplication is selected.

If you click on Stores, you can see all the active endpoints discovered by the thanos-store-gateway service.

Now you can add Thanos Querier as the datasource in Grafana and start creating dashboards.

Kubernetes Cluster Monitoring Dashboard

Kubernetes Node Monitoring Dashboard

Conclusion

Integrating Thanos with Prometheus lets you scale Prometheus horizontally. And since Thanos Querier can pull metrics from other querier instances, you can pull metrics across clusters and visualize them in a single dashboard.

We are also able to archive metric data in an object store, which gives our monitoring system virtually unlimited storage, and to serve metrics from the object storage itself. A major part of the cost of this set-up can be attributed to the object storage (S3 or GCS). This can be reduced further by applying appropriate retention policies.

However, achieving all this requires quite a bit of configuration on your part. The manifests provided above have been tested in a production environment. Feel free to reach out should you have any questions around them.

appfleet is a cloud platform offering edge compute for containers and web applications.