Monitoring Kubernetes Workloads with Thanos and Prometheus Operator
Are your applications running on Kubernetes? Are they highly scalable, and are you happy with the way they work? Wait a minute, how are you monitoring them? Ahh, Prometheus, right? Cool. Did you ever wonder how scalable and highly available your Prometheus setup is?
Before you answer, here is a mail from your boss asking you to find out the number of http_requests that your website received last Xmas. Or let's make this Indian style: your boss wants to know how many customers visited your website (the total number of http_requests) last Sankranthi, a year ago. You try your Prometheus / Grafana servers and realize that the metrics are no longer there. What do you tell your boss now?
Well, before this situation actually arises, let us fix it by using Thanos. Thanos is a tool for setting up a highly available Prometheus with long-term storage capabilities. Thanos is open source and is a CNCF Incubating project. The features of Thanos are:
- Unlimited retention of Prometheus metrics in supported object stores such as GCS, S3, Azure Blob, Swift, and Tencent COS.
- A global query view that lets us see metrics from multiple Prometheus instances spread across various namespaces and clusters.
- Compatibility with your existing monitoring tools like Prometheus and Grafana.
- Downsampling of historical data for a massive query speedup over large time ranges, along with configurable retention policies.
What is the entire story all about? (TLDR)
- Getting to know Thanos Components.
- Implementing HA-Prometheus with Thanos, Prometheus Operator, and GCS ( Object Store ).
- Basic understanding of Prometheus.
- GitHub Link: https://github.com/pavan-kumar-99/medium-manifests
- GitHub Branch: thanos
Understanding Thanos Components
Before we start, let us first understand the components of Thanos in detail. When I first studied Thanos, I had a hard time understanding how it works, which components it needs to be fully functional, and what role each component plays. So let us demystify each component and its usage, with the help of a very useful architecture diagram from the official Thanos website.
a) Thanos Sidecar
Thanos Sidecar is deployed as a sidecar container in the Prometheus Pod. (Sidecar containers are containers that run alongside the main container in the same pod.) It is one of the components that interact with your object storage (i.e. S3, GCS, Azure Blob, etc.) and is responsible for uploading TSDB blocks to it. Prometheus produces a new block every two hours, and the Thanos Sidecar uploads each of these blocks to the object storage once it is available. Let me show you the logs of a sample Thanos Sidecar container uploading TSDB blocks to GCS.
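To get a feel for the two-hour cadence, here is a minimal sketch that works out which block window a sample falls into. It assumes Prometheus' default 2h block range with windows aligned to the Unix epoch in UTC; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: the default Prometheus block range of 2 hours.
BLOCK_RANGE = timedelta(hours=2)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def block_window(ts: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) window, aligned to the Unix epoch, that contains ts."""
    width = int(BLOCK_RANGE.total_seconds())
    offset = int((ts - EPOCH).total_seconds())
    start = EPOCH + timedelta(seconds=(offset // width) * width)
    return start, start + BLOCK_RANGE


# Hypothetical sample scraped at 09:30 UTC.
sample_time = datetime(2021, 1, 14, 9, 30, tzinfo=timezone.utc)
start, end = block_window(sample_time)
print(f"sample at {sample_time:%H:%M} belongs to block {start:%H:%M}-{end:%H:%M} UTC")
```

A sample scraped at 09:30 lands in the 08:00-10:00 window, so the sidecar cannot ship it to the bucket before Prometheus has written that block to disk, which only happens after the window closes.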
b) Thanos Querier / Query
The Thanos Querier / Query is a stateless component that implements the Prometheus HTTP v1 API to query data in a Thanos cluster. It gathers the data needed to evaluate a PromQL query from the underlying Store APIs over gRPC. A store can be any data source that implements the gRPC Store API:
- Prometheus ( with the Thanos sidecar enabled, discovered via a headless service ).
- Object storage like S3 or GCS, via the Store Gateway.
- Another Thanos Querier ( can be from a different cluster ).
Thanos Querier UI showing the various stores ( discovered through the Prometheus sidecar and through another store from a different cluster ).
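Because the Querier speaks the Prometheus HTTP v1 API, any Prometheus client can talk to it. The sketch below only builds the URL for an instant query; the host name, port, and the metric name http_requests_total are assumptions for illustration. The dedup parameter is a Thanos addition that asks the Querier to merge series coming from HA Prometheus replicas.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build a Prometheus-compatible instant query URL for a Thanos Querier."""
    params = {
        "query": promql,
        # Thanos-specific parameter: deduplicate series from HA replicas.
        "dedup": "true" if dedup else "false",
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical Querier address and metric name.
url = instant_query_url("http://thanos-query.example:9090", "sum(http_requests_total)")
print(url)
```

Pointing Grafana's Prometheus data source at the same base URL works the same way, since Grafana issues the same API calls.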
c) Thanos Query Frontend
The Thanos Query Frontend is a service that is put in front of the Thanos Querier to improve the read path. It splits a long query into multiple shorter queries based on a configured split interval. This allows the sub-queries to be evaluated in parallel and balances the load better across Queriers. It also caches query results, which makes repeated long-range queries more efficient. Currently, an in-memory cache (FIFO cache) and Memcached are supported.
Its UI is similar to the Thanos Querier UI, but it additionally enables features like query splitting and caching of queries.
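The splitting and caching idea can be sketched in a few lines. This is not the Query Frontend's actual code, just an illustration under the assumption that sub-ranges are aligned to the split interval (one day here); split_range, cached_eval, and fake_evaluate are hypothetical names.

```python
def split_range(start: int, end: int, interval: int) -> list:
    """Split [start, end] (Unix seconds) into sub-ranges aligned to interval."""
    parts = []
    cursor = start
    while cursor < end:
        next_boundary = (cursor // interval + 1) * interval
        parts.append((cursor, min(next_boundary, end)))
        cursor = next_boundary
    return parts


cache = {}


def cached_eval(query: str, sub_range: tuple, evaluate):
    """Evaluate a sub-query once and serve repeats from an in-memory cache."""
    key = (query, sub_range)
    if key not in cache:
        cache[key] = evaluate(query, sub_range)
    return cache[key]


DAY = 24 * 60 * 60
# A 3-day query starting at noon is cut at the midnight boundaries.
parts = split_range(start=12 * 60 * 60, end=12 * 60 * 60 + 3 * DAY, interval=DAY)

calls = []


def fake_evaluate(query, sub_range):
    # Stand-in for a real call to the Querier.
    calls.append(sub_range)
    return f"{query}@{sub_range}"


# Running the same query twice hits the backend only once per sub-range.
for _ in range(2):
    results = [cached_eval("up", part, fake_evaluate) for part in parts]

print(len(parts), len(calls))
```

Aligning the cuts to fixed boundaries is what makes the cache useful: two dashboards asking for overlapping time ranges end up sharing the same sub-range keys.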
d) Thanos Store Gateway ( Thanos Store )
Thanos Store Gateway acts as an API Gateway between your Thanos cluster and the Object store. This is one of the components that require access to your Object storage. It implements the Store API on top of historical data in an object storage bucket. It keeps a small amount of information about all remote blocks on the local disk and keeps it in sync with the bucket.
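One way to picture that "small amount of information" is the time range of every block, which lets the gateway decide which blocks a query needs before it fetches anything from the bucket. The sketch below is only an illustration of that idea, not the Store Gateway's implementation; the block IDs are placeholders rather than real ULIDs, and the field names loosely mirror the minTime and maxTime values found in a block's meta.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    ulid: str      # placeholder ID, not a real ULID
    min_time: int  # inclusive, milliseconds since epoch
    max_time: int  # exclusive, milliseconds since epoch


def blocks_for_range(blocks, start_ms: int, end_ms: int) -> list:
    """Return only the blocks whose time range overlaps [start_ms, end_ms]."""
    return [b for b in blocks if b.min_time <= end_ms and b.max_time > start_ms]


HOUR_MS = 60 * 60 * 1000
local_index = [
    BlockMeta("block-a", 0 * HOUR_MS, 2 * HOUR_MS),
    BlockMeta("block-b", 2 * HOUR_MS, 4 * HOUR_MS),
    BlockMeta("block-c", 4 * HOUR_MS, 6 * HOUR_MS),
]

# A query for hours 3 to 5 only needs block-b and block-c.
needed = blocks_for_range(local_index, 3 * HOUR_MS, 5 * HOUR_MS)
print([b.ulid for b in needed])
```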
e) Thanos Compactor ( Compactor )
As we know, Prometheus periodically compacts its blocks of data to improve query efficiency. In the same way, the compactor scans the objects stored in object storage ( like AWS S3, GCS, Azure Blob, etc. ) and applies compaction wherever necessary. This component also downsamples the data to make queries over large time ranges more efficient.
By default, the compactor runs to completion and exits once it has compacted the objects. To keep it running indefinitely, add the --wait flag when starting the compactor. It must be deployed as a singleton against a bucket. The compactor usually needs 100–300GB of local disk space to process the data locally. In ideal cases, 50–70GB would suffice unless your metrics are really huge.
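Downsampling is easier to picture with a toy example. The sketch below groups raw samples into fixed windows and keeps a few aggregates per window. The 5-minute window and the choice of count, sum, min, and max are assumptions for illustration; the real compactor works on TSDB chunks with its own encoding.

```python
from collections import defaultdict


def downsample(samples, window_s: int = 300) -> dict:
    """Group (timestamp_s, value) samples into fixed windows and keep aggregates."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_s].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical raw samples scraped every 60 seconds.
raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 5.0), (360, 7.0)]
agg = downsample(raw)
print(agg)
```

Five raw samples shrink to two aggregated points here. Over a year of data the same reduction is what makes a long-range query touch far fewer points.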
f) Thanos Ruler
The Ruler evaluates Prometheus recording and alerting rules against a chosen Query API. You can think of the Ruler as a simplified Prometheus that does not require a sidecar, does not scrape targets, and does not serve PromQL queries itself (it exposes no Query API).
g) Thanos Receive
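Thanos Receive implements the Prometheus Remote Write API, so Prometheus can push samples to it continuously. The Thanos Sidecar alone is not sufficient for this, because it only uploads completed blocks, so the data in object storage would always lag behind by the block length (typically 2 hours). Read more about Thanos Receive here.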
h) Thanos Tools
Thanos tools are a set of extra subcommands that provide capabilities beyond the other Thanos components. A few of them are
1) thanos tools bucket web: This is used to inspect bucket blocks from a Web UI.
2) thanos tools bucket ls: This is used to list all blocks in the specified bucket.
3) thanos tools bucket replicate: This is used to replicate blocks from one object storage bucket to another.
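If you script these tools, a small helper that assembles the command line keeps things tidy. This is a minimal sketch: the helper name and the bucket.yml path are hypothetical, and it covers only the web and ls subcommands, which read the bucket settings from the file passed via --objstore.config-file.

```python
def bucket_tool_argv(subcommand: str, objstore_config_file: str) -> list:
    """Build the argument list for a 'thanos tools bucket' subcommand."""
    allowed = {"web", "ls"}
    if subcommand not in allowed:
        raise ValueError(f"unsupported subcommand: {subcommand}")
    return [
        "thanos", "tools", "bucket", subcommand,
        f"--objstore.config-file={objstore_config_file}",
    ]


# Hypothetical config file describing the GCS bucket.
argv = bucket_tool_argv("ls", "bucket.yml")
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run on a machine where the thanos binary is installed.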
Well, these are the various components of Thanos in detail. In Part 2 of this article, we go hands-on: I explain how to set up an HA Prometheus with the Thanos sidecar pushing TSDB blocks to a GCS bucket, how to install a Thanos cluster using Bitnami's Thanos Helm chart, and how to integrate Thanos with Prometheus and Grafana.
Until next time…..
Deep Dive into Thanos-Part II
Monitoring Kubernetes Workloads with Thanos and Prometheus Operator