HA Kubernetes Monitoring using Prometheus and Thanos

MetricFire
The MetricFire Blog
Jan 6, 2020 · 4 min read

Table of Contents

  1. Introduction
  2. Why Integrate Prometheus with Thanos?
  3. Thanos Overview
    3.1 Thanos architecture
    3.2 Thanos Sidecar
    3.3 Thanos Store
    3.4 Thanos Query
    3.5 Thanos Compact
    3.6 Thanos Ruler

  4. Thanos Configuration
  5. Deployment
  6. Grafana Dashboards
  7. Conclusion

1. Introduction

In this article, we will deploy a clustered Prometheus setup that integrates Thanos. It is resilient against node failures and ensures appropriate data archiving. The setup is also scalable: it can span multiple Kubernetes clusters under the same monitoring umbrella. Finally, we will visualize and monitor all of our data in accessible and beautiful Grafana dashboards. For better code formatting, check out this same article on the MetricFire blog.

2. Why Integrate Prometheus with Thanos?

Prometheus is typically scaled using a federated setup, and its deployments use a persistent volume for each pod. However, not all data can be aggregated through federation, and you often need yet another tool to manage Prometheus configuration across instances. To address these issues, we will use Thanos. Thanos allows you to create multiple instances of Prometheus, deduplicate their data, and archive it in long-term storage like GCS or S3.

3. Thanos Overview

3.1 Thanos Architecture

The components of Thanos are sidecar, store, query, compact, and ruler. Let’s take a look at what each one does.

3.2 Thanos Sidecar

  • The main component that runs alongside Prometheus
  • Reads and archives Prometheus data in the object store
  • Manages Prometheus’s configuration and lifecycle
  • Injects external labels into the Prometheus configuration to distinguish each Prometheus instance
  • Can run queries on Prometheus servers’ PromQL interfaces
  • Listens on the Thanos gRPC protocol and translates queries between gRPC and REST
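To make the above concrete, here is a minimal sketch of how the sidecar can be added as an extra container in the Prometheus pod spec. The image tag, volume names, and the thanos-gcs-config.yaml object-store configuration file are illustrative assumptions, not the exact manifest of this setup:

# Illustrative sketch: Thanos sidecar container added to the Prometheus pod spec
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "sidecar"
    - "--tsdb.path=/prometheus"                      # the volume Prometheus writes its TSDB blocks to
    - "--prometheus.url=http://127.0.0.1:9090"       # Prometheus running in the same pod
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"  # GCS bucket config (see section 4)
  ports:
    - name: http-sidecar
      containerPort: 10902                           # HTTP endpoint for metrics and status
    - name: grpc
      containerPort: 10901                           # Store API consumed by Thanos Query
  volumeMounts:
    - name: prometheus-storage                       # shared with the Prometheus container
      mountPath: /prometheus
    - name: thanos-gcs-config
      mountPath: /etc/thanos
      readOnly: true

Note that the external labels themselves (for example a replica label) are set in the Prometheus configuration; the sidecar exposes them to the rest of the Thanos cluster.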

3.3 Thanos Store

  • Implements the Store API on top of historical data in an object storage bucket
  • Acts primarily as an API gateway and therefore does not need significant amounts of local disk space
  • Joins a Thanos cluster on startup and advertises the data it can access
  • Keeps a small amount of information about all remote blocks on a local disk in sync with the bucket
  • This data is generally safe to delete across restarts at the cost of increased startup times
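As an illustration of the points above, a Thanos Store container can be configured roughly as follows. The image tag and paths are assumptions; the key detail is that it points at the same object-store configuration and only needs a small local disk for cached block metadata:

# Illustrative sketch: Thanos Store gateway container
- name: thanos-store
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "store"
    - "--data-dir=/data"                             # small local cache of block index metadata
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"
    - "--grpc-address=0.0.0.0:10901"                 # Store API consumed by Thanos Query
    - "--http-address=0.0.0.0:10902"                 # metrics and status pages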

3.4 Thanos Query

  • Listens on HTTP and translates queries to the Thanos gRPC format
  • Aggregates query results from different sources, and can read data from the Sidecar and the Store
  • In an HA setup, Thanos Query even deduplicates the results

A note on run-time deduplication of HA groups: Prometheus is stateful and does not allow replication of its database, so it is not easy to increase high availability simply by running multiple Prometheus replicas.

Simple load balancing will not work either. Say one replica crashes: it may come back up, but querying it will show a gap for the period during which it was down. This is not fixed by adding a second replica, because that one could also be down at any moment, for example during a rolling restart. These scenarios show how load balancing can fail.

Thanos Query pulls the data from both replicas and deduplicates those signals, filling in any gaps transparently for the Querier’s consumers.
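A minimal sketch of the Querier’s arguments is shown below. The store endpoints and the replica label name are assumptions and depend on how the sidecars, the store gateway, and the Prometheus external labels are set up in your cluster:

# Illustrative sketch: Thanos Query container
- name: thanos-query
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "query"
    - "--http-address=0.0.0.0:10902"                 # Prometheus-compatible HTTP API and UI, used by Grafana
    - "--query.replica-label=replica"                # external label that identifies HA Prometheus replicas
    - "--store=thanos-store-gateway:10901"           # assumed gRPC endpoint of the store gateway
    - "--store=thanos-sidecar-0:10901"               # ...and of each Prometheus sidecar
    - "--store=thanos-sidecar-1:10901"

With --query.replica-label set, results from both replicas are merged and deduplicated before they are returned to the caller.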

3.5 Thanos Compact

  • Applies the compaction procedure of the Prometheus 2.0 storage engine to block data in object storage
  • Is generally not safe to run concurrently and must be deployed as a singleton against a bucket
  • Responsible for downsampling data: 5 minute downsampling after 40 hours and 1 hour downsampling after 10 days
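For illustration, a compactor deployed as a singleton against the bucket might look roughly like this. The retention values are assumptions added to show where raw and downsampled data retention is controlled, not recommendations from this article:

# Illustrative sketch: Thanos Compactor container (run exactly one per bucket)
- name: thanos-compactor
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "compact"
    - "--wait"                                       # keep running and compact/downsample continuously
    - "--data-dir=/var/thanos/compact"               # scratch space for compaction work
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"
    - "--retention.resolution-raw=30d"               # assumed retention for raw data
    - "--retention.resolution-5m=120d"               # assumed retention for 5m downsampled data
    - "--retention.resolution-1h=1y"                 # assumed retention for 1h downsampled data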

3.6 Thanos Ruler

Thanos Ruler basically does the same thing as the Querier, but for Prometheus’ rules. The only difference is that it can communicate with Thanos components.
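As a rough sketch, and assuming a thanos-querier service, an in-cluster Alertmanager, and rules mounted from a ConfigMap (all hypothetical names), the Ruler’s arguments could look like this:

# Illustrative sketch: Thanos Ruler container
- name: thanos-ruler
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "rule"
    - "--data-dir=/var/thanos/rule"
    - "--eval-interval=30s"
    - "--rule-file=/etc/thanos-ruler/*.rules.yaml"   # rules mounted from an assumed ConfigMap
    - "--query=thanos-querier:10902"                 # evaluate rule expressions against Thanos Query
    - "--alertmanagers.url=http://alertmanager:9093" # assumed Alertmanager endpoint
    - "--objstore.config-file=/etc/thanos/thanos-ruler-gcs-config.yaml"  # config pointing at the thanos-ruler bucket
    - '--label=ruler_cluster="gke"'                  # label attached to the metrics it produces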

4. Thanos Configuration

Prerequisites: in order to follow this tutorial completely, you will need the following:

  1. Working knowledge of Kubernetes and kubectl
  2. A running Kubernetes cluster with at least 3 nodes (we will use GKE)
  3. An Ingress controller and Ingress objects (we will use the Nginx Ingress Controller); this is not mandatory, but it is highly recommended in order to reduce the number of external endpoints
  4. Credentials for the Thanos components to access the object store (in this case, a GCS bucket):
    a. Create 2 GCS buckets and name them prometheus-long-term and thanos-ruler
    b. Create a service account with the Storage Object Admin role
    c. Download the key file as JSON credentials and name it thanos-gcs-credentials.json
    d. Create a Kubernetes secret from the credentials, as shown in the following snippet:
kubectl create secret generic thanos-gcs-credentials \
    --from-file=thanos-gcs-credentials.json -n monitoring
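The sidecar, store, compactor, and ruler all read the bucket details from a small object-store configuration file. Below is a sketch of what such a file can look like for GCS; the file name is an assumption, and the service-account key can either be inlined in the service_account field or supplied by mounting the thanos-gcs-credentials secret and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable on the container:

# thanos-gcs-config.yaml (illustrative), referenced via --objstore.config-file
type: GCS
config:
  bucket: "prometheus-long-term"   # bucket created in step (a); the ruler points at the thanos-ruler bucket instead
  # service_account: |
  #   { ...contents of thanos-gcs-credentials.json... }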

5. Deployment

Deploying Prometheus Service Accounts, ClusterRole and ClusterRoleBinding: The following manifest creates the monitoring namespace, plus the service accounts, ClusterRole and ClusterRoleBinding needed by Prometheus.
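The full manifest is available in the article linked below; the sketch here only illustrates the general shape of such a manifest (a namespace, a service account, and a ClusterRole bound to it), and the exact RBAC rules in the article’s version may differ:

# Illustrative sketch: namespace, service account, and RBAC for Prometheus
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring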

For all of the deployment code, read the rest of this article on the MetricFire website.
