HA Kubernetes Monitoring using Prometheus and Thanos

MetricFire
The MetricFire Blog
Jan 6, 2020 · 4 min read

Table of Contents

  1. Introduction
  2. Why Integrate Prometheus with Thanos?
  3. Thanos Overview
    3.1 Thanos architecture
    3.2 Thanos Sidecar
    3.3 Thanos Store
    3.4 Thanos Query
    3.5 Thanos Compact
    3.6 Thanos Ruler

  4. Thanos Configuration
  5. Deployment
  6. Grafana Dashboards
  7. Conclusion

1. Introduction

In this article, we will deploy a clustered Prometheus setup that integrates Thanos. It is resilient against node failures and ensures appropriate data archiving. The setup is also scalable: it can span multiple Kubernetes clusters under the same monitoring umbrella. Finally, we will visualize and monitor all of our data in accessible and beautiful Grafana dashboards. For better code formatting, check out this same article on the MetricFire blog.

2. Why Integrate Prometheus with Thanos?

Prometheus is typically scaled using a federated setup, and its deployments use a persistent volume for each pod. However, not all data can be aggregated through federation, and you often need yet another tool to manage Prometheus configuration across instances. To address these issues, we will use Thanos. Thanos allows you to create multiple instances of Prometheus, deduplicate their data, and archive it in long-term storage like GCS or S3.

3. Thanos Overview

3.1 Thanos Architecture

The components of Thanos are sidecar, store, query, compact, and ruler. Let’s take a look at what each one does.

3.2 Thanos Sidecar

  • The main component that runs alongside Prometheus
  • Reads and archives Prometheus data in the object store
  • Manages Prometheus’s configuration and lifecycle
  • Injects external labels into the Prometheus configuration to distinguish each Prometheus instance
  • Can run queries on Prometheus servers’ PromQL interfaces
  • Listens on the Thanos gRPC protocol and translates queries between gRPC and REST
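To make the above concrete, here is a minimal sketch of how the sidecar can be added as an extra container in the Prometheus pod spec. The image tag, volume names, and the thanos-gcs-config.yaml object-store configuration file are illustrative assumptions, not the exact manifest of this setup:

# Illustrative sketch: Thanos sidecar container added to the Prometheus pod spec
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "sidecar"
    - "--tsdb.path=/prometheus"                      # the volume Prometheus writes its TSDB blocks to
    - "--prometheus.url=http://127.0.0.1:9090"       # Prometheus running in the same pod
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"  # GCS bucket config (see section 4)
  ports:
    - name: http-sidecar
      containerPort: 10902                           # HTTP endpoint for metrics and status
    - name: grpc
      containerPort: 10901                           # Store API consumed by Thanos Query
  volumeMounts:
    - name: prometheus-storage                       # shared with the Prometheus container
      mountPath: /prometheus
    - name: thanos-gcs-config
      mountPath: /etc/thanos
      readOnly: true

Note that the external labels themselves (for example a replica label) are set in the Prometheus configuration; the sidecar exposes them to the rest of the Thanos cluster.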

3.3 Thanos Store

  • Implements the Store API on top of historical data in an object storage bucket
  • Acts primarily as an API gateway and therefore does not need significant amounts of local disk space
  • Joins a Thanos cluster on startup and advertises the data it can access
  • Keeps a small amount of information about all remote blocks on a local disk in sync with the bucket
  • This data is generally safe to delete across restarts at the cost of increased startup times
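As an illustration of the points above, a Thanos Store container can be configured roughly as follows. The image tag and paths are assumptions; the key detail is that it points at the same object-store configuration and only needs a small local disk for cached block metadata:

# Illustrative sketch: Thanos Store gateway container
- name: thanos-store
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "store"
    - "--data-dir=/data"                             # small local cache of block index metadata
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"
    - "--grpc-address=0.0.0.0:10901"                 # Store API consumed by Thanos Query
    - "--http-address=0.0.0.0:10902"                 # metrics and status pages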

3.4 Thanos Query

  • Listens on HTTP and translates queries to the Thanos gRPC format
  • Aggregates query results from different sources, and can read data from the Sidecar and the Store
  • In an HA setup, Thanos Query even deduplicates the results

A note on run-time deduplication of HA groups: Prometheus is stateful and does not allow replication of its database, so it is not easy to increase high availability simply by running multiple Prometheus replicas.

Simple load balancing will not work either. Say one replica crashes: it may come back up, but querying it will show a gap for the period during which it was down. This is not fixed by adding a second replica, because that one could also be down at any moment, for example during a rolling restart. These scenarios show how load balancing can fail.

Thanos Query pulls the data from both replicas and deduplicates those signals, filling in any gaps transparently for the Querier’s consumers.
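A minimal sketch of the Querier’s arguments is shown below. The store endpoints and the replica label name are assumptions and depend on how the sidecars, the store gateway, and the Prometheus external labels are set up in your cluster:

# Illustrative sketch: Thanos Query container
- name: thanos-query
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "query"
    - "--http-address=0.0.0.0:10902"                 # Prometheus-compatible HTTP API and UI, used by Grafana
    - "--query.replica-label=replica"                # external label that identifies HA Prometheus replicas
    - "--store=thanos-store-gateway:10901"           # assumed gRPC endpoint of the store gateway
    - "--store=thanos-sidecar-0:10901"               # ...and of each Prometheus sidecar
    - "--store=thanos-sidecar-1:10901"

With --query.replica-label set, results from both replicas are merged and deduplicated before they are returned to the caller.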

3.5 Thanos Compact

  • Applies the compaction procedure of the Prometheus 2.0 storage engine to block data in object storage
  • Is generally not safe to run concurrently and must be deployed as a singleton against a bucket
  • Responsible for downsampling data: 5 minute downsampling after 40 hours and 1 hour downsampling after 10 days
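For illustration, a compactor deployed as a singleton against the bucket might look roughly like this. The retention values are assumptions added to show where raw and downsampled data retention is controlled, not recommendations from this article:

# Illustrative sketch: Thanos Compactor container (run exactly one per bucket)
- name: thanos-compactor
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "compact"
    - "--wait"                                       # keep running and compact/downsample continuously
    - "--data-dir=/var/thanos/compact"               # scratch space for compaction work
    - "--objstore.config-file=/etc/thanos/thanos-gcs-config.yaml"
    - "--retention.resolution-raw=30d"               # assumed retention for raw data
    - "--retention.resolution-5m=120d"               # assumed retention for 5m downsampled data
    - "--retention.resolution-1h=1y"                 # assumed retention for 1h downsampled data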

3.6 Thanos Ruler

Thanos Ruler basically does the same thing as the Querier, but for Prometheus’ rules. The only difference is that it can communicate with Thanos components.
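As a rough sketch, and assuming a thanos-querier service, an in-cluster Alertmanager, and rules mounted from a ConfigMap (all hypothetical names), the Ruler’s arguments could look like this:

# Illustrative sketch: Thanos Ruler container
- name: thanos-ruler
  image: quay.io/thanos/thanos:v0.8.1                # assumed image tag
  args:
    - "rule"
    - "--data-dir=/var/thanos/rule"
    - "--eval-interval=30s"
    - "--rule-file=/etc/thanos-ruler/*.rules.yaml"   # rules mounted from an assumed ConfigMap
    - "--query=thanos-querier:10902"                 # evaluate rule expressions against Thanos Query
    - "--alertmanagers.url=http://alertmanager:9093" # assumed Alertmanager endpoint
    - "--objstore.config-file=/etc/thanos/thanos-ruler-gcs-config.yaml"  # config pointing at the thanos-ruler bucket
    - '--label=ruler_cluster="gke"'                  # label attached to the metrics it produces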

4. Thanos Configuration

Prerequisites: in order to follow this tutorial completely, you will need the following:

  1. Working knowledge of Kubernetes and kubectl
  2. A running Kubernetes cluster with at least 3 nodes (we will use GKE)
  3. An Ingress controller and Ingress objects (we will use the Nginx Ingress Controller); this is not mandatory, but it is highly recommended in order to reduce the number of external endpoints
  4. Credentials for the Thanos components to access the object store (in this case, a GCS bucket):
    a. Create 2 GCS buckets and name them prometheus-long-term and thanos-ruler
    b. Create a service account with the Storage Object Admin role
    c. Download the key file as JSON credentials and name it thanos-gcs-credentials.json
    d. Create a Kubernetes secret from the credentials, as shown in the following snippet:
kubectl create secret generic thanos-gcs-credentials \
    --from-file=thanos-gcs-credentials.json -n monitoring
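The sidecar, store, compactor, and ruler all read the bucket details from a small object-store configuration file. Below is a sketch of what such a file can look like for GCS; the file name is an assumption, and the service-account key can either be inlined in the service_account field or supplied by mounting the thanos-gcs-credentials secret and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable on the container:

# thanos-gcs-config.yaml (illustrative), referenced via --objstore.config-file
type: GCS
config:
  bucket: "prometheus-long-term"   # bucket created in step (a); the ruler points at the thanos-ruler bucket instead
  # service_account: |
  #   { ...contents of thanos-gcs-credentials.json... }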

5. Deployment

Deploying Prometheus Service Accounts, ClusterRole and ClusterRoleBinding: The following manifest creates the monitoring namespace, plus the service accounts, ClusterRole and ClusterRoleBinding needed by Prometheus.
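The full manifest is available in the article linked below; the sketch here only illustrates the general shape of such a manifest (a namespace, a service account, and a ClusterRole bound to it), and the exact RBAC rules in the article’s version may differ:

# Illustrative sketch: namespace, service account, and RBAC for Prometheus
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring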

For all of the deployment code, read the rest of this article on the MetricFire website.
