Monitoring multiple OKE clusters with Prometheus, Thanos and Grafana — Part 2

Ali Mukadam
Published in Oracle Developers
Dec 2, 2021

In Part 1, we looked at some of the reasons we want to use Thanos, a highly available solution with long-term storage capabilities for Prometheus. We also deployed all the components of Thanos in our admin Verrazzano cluster. However, we are getting metrics for only 1 cluster. In this article, we will look at how we can monitor multiple clusters.

Recall that:

Effectively, this makes the Singapore cluster our command center:

We now want to be able to monitor the other clusters too. To do so, we are going to install Prometheus with the Thanos sidecar in each region.

Multi-cluster, multi-region Thanos deployment

First, create a bucket called thanos in each region.

On the operator host, make 3 copies of thanos-sin-storage.yaml and rename them by region, e.g. thanos-syd-storage.yaml, thanos-mum-storage.yaml, thanos-tok-storage.yaml. Edit each file and change its object storage endpoint and region. The object storage endpoint has the following format:

<object_storage_namespace>.compat.objectstorage.<region>.oraclecloud.com
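
As a minimal sketch of the format above, the following helper prints the endpoint for each managed region. The namespace `mynamespace` is a placeholder, and the region identifiers are assumed to match the Sydney, Mumbai and Tokyo clusters; substitute the values for your own tenancy.

```shell
#!/usr/bin/env bash
# Build the S3-compatible Object Storage endpoint for one region.
# $1 = object storage namespace (placeholder), $2 = OCI region identifier.
objstore_endpoint() {
  local namespace="$1" region="$2"
  printf '%s.compat.objectstorage.%s.oraclecloud.com\n' "$namespace" "$region"
}

# Region identifiers are assumptions based on the file names used in this article.
for region in ap-sydney-1 ap-mumbai-1 ap-tokyo-1; do
  objstore_endpoint "mynamespace" "$region"
done
```

Each printed value is what goes into the endpoint field of the matching thanos-<region>-storage.yaml file.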

Recall that we had also installed kubectx and for our multi-cluster purpose, we had equated 1 cluster to 1 context. For each of the managed clusters, repeat the following:

## change the context name every time
kubectx sydney
kubectl create ns monitoring
## change the file name every time
kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos-syd-storage.yaml=thanos-syd-storage.yaml
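
If you prefer not to repeat these steps by hand, they can be wrapped in a small loop. This is only a sketch: the `run` and `setup_cluster` helpers are hypothetical, and the context names other than sydney are assumptions. By default it prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
set -eu

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the monitoring namespace and the object storage secret in one cluster.
# $1 = kubectx context name, $2 = short region code used in the file name.
setup_cluster() {
  local context="$1" code="$2"
  local file="thanos-${code}-storage.yaml"
  run kubectx "$context"
  run kubectl create ns monitoring
  run kubectl -n monitoring create secret generic thanos-objstore-config \
    --from-file="${file}=${file}"
}

# Context names other than sydney are assumptions; adjust to your own contexts.
setup_cluster sydney syd
setup_cluster mumbai mum
setup_cluster tokyo tok
```

Review the printed commands first, then rerun with DRY_RUN=0 to apply them. Because of `set -e`, the script stops at the first failure, for example if the monitoring namespace already exists.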

Deploy Prometheus with Sidecar

Next, we will deploy Prometheus with the sidecar in each region. As before, create a copy of the prometheusvalues.yaml file for each cluster e.g. prometheusvalues-sydney.yaml etc. Edit each file and change the following:

prometheus.thanos.objectStorageConfig.secretKey: thanos-syd-storage.yaml
prometheus.thanos.service.type: ClusterIP
prometheus.thanos.service.annotations: {}
prometheus.externalLabels:
  cluster: "syd"

You can now deploy Prometheus in each region. Remember to change the context and file every time:

kubectx sydney
helm install prometheus bitnami/kube-prometheus \
  --namespace monitoring \
  -f prometheusvalues-sydney.yaml

Verify that all the Prometheus pods are running properly in each region:

kubectl -n monitoring get pods

Next, deploy Thanos in all “managed” regions. First, make a copy of the thanosvalues.yaml we created in Part 1, then ensure you update the following parameters:

objstoreConfig
## update the endpoint and region
query.service.type: LoadBalancer
query.service.annotations:
  oci.oraclecloud.com/oci-network-security-groups: "nsg_id"
  service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "50"
  service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
  service.beta.kubernetes.io/oci-load-balancer-subnet1: "subnet_id"
  service.beta.kubernetes.io/oci-load-balancer-internal: "true"
  service.beta.kubernetes.io/oci-load-balancer-security-list-management-mode: "All"

You can obtain the nsg_id and subnet_id values from the OCI console. Ensure you get their values for each region. Deploy Thanos in each region:

kubectx sydney
helm install thanos bitnami/thanos \
  --namespace monitoring \
  -f thanos-syd.yaml

Verify that all Thanos pods are running correctly in each region:

kubectl -n monitoring get pods
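
Rather than switching context and eyeballing the pod list in each region, you could let kubectl wait for readiness. The sketch below assumes the context names match the regions (only sydney appears in this article) and uses a hypothetical `check_pods` helper; it prints the commands by default.

```shell
#!/usr/bin/env bash
set -eu

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Block until every pod in the monitoring namespace is Ready, or time out.
# $1 = kubectl context name.
check_pods() {
  run kubectl --context "$1" -n monitoring wait \
    --for=condition=Ready pods --all --timeout=300s
}

# Context names other than sydney are assumptions.
for context in sydney mumbai tokyo; do
  check_pods "$context"
done
```

Note that `kubectl wait` with the Ready condition also waits on pods that have already completed, so this check only suits namespaces where every pod is long-running.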

Updating Prometheus and Thanos in Admin region

With the above architecture, we no longer need to expose the sidecar in the admin region as a LoadBalancer service. Instead, set the query.service.type to ClusterIP and comment/remove the annotations. We can update the list of stores as well (note that I now have only 2 managed regions because I messed up the Tokyo cluster while testing something else):

stores:
- 10.1.2.5:10901
- 10.2.2.28:10901

Change the context to admin and run helm upgrade:

kubectx admin
helm upgrade thanos bitnami/thanos -f thanosvalues.yaml

If you access Thanos Query, you can now see 2 queries, 1 store and no sidecar:

Let’s use Thanos to find the amount of memory allocated and still in use by each cluster. We run a query and then we use the externalLabels we set in each cluster:
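
The same kind of query can also be sent to Thanos Query's Prometheus-compatible HTTP API. The sketch below is an illustration only: the metric `go_memstats_alloc_bytes` is an assumption about the query, and the URL is a hypothetical address (for example one reached through a port-forward). The API call only runs when RUN_QUERY=1 is set.

```shell
#!/usr/bin/env bash
# Hypothetical address of the Thanos Query HTTP endpoint; adjust to your setup.
THANOS_QUERY_URL="${THANOS_QUERY_URL:-http://localhost:9090}"

# PromQL: Go heap bytes allocated and still in use, summed per cluster label.
# The metric name is an assumption; the cluster label comes from externalLabels.
memory_by_cluster_query() {
  echo 'sum by (cluster) (go_memstats_alloc_bytes)'
}

# Only call the API when explicitly requested.
if [ "${RUN_QUERY:-0}" = "1" ]; then
  curl -sG "${THANOS_QUERY_URL}/api/v1/query" \
    --data-urlencode "query=$(memory_by_cluster_query)"
fi
```

Grouping by the cluster label is what separates the results per region, since every series uploaded by a sidecar carries the external label of its cluster.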

Let’s look at Grafana. This time we will use the Cluster Detail Dashboard (id: 10856). Import it as before, select the Thanos data source and open the dashboard. It will appear empty.

The reason is that the dashboard shows the metrics of the last 30 minutes and the data has not yet been written to object storage. Change the time range; you might need to wait at least 2 hours until the first data has been written to OCI Object Storage:

Add a region filter to Grafana dashboard

We now also want to be able to examine a specific cluster:

  1. Click on the Dashboard settings, then Variables on left menu.
  2. Click New to create a new variable
  3. Name it as cluster
  4. Set the type to Ad hoc filters
  5. Click Update, followed by Save Dashboard

You will now see a filter at the top of the Dashboard:

Click on the + icon, select cluster and then select the cluster you wish to inspect:

The values here will be those you set in the externalLabels parameter for each cluster. After you select 1 cluster, you should see the values in the various panels change:

Updated Grafana values after selecting a cluster

Summary

Now, we can monitor the performance of various resources in OCI across many regions, VCNs and even tenancies simultaneously. This exercise also helped me understand Thanos considerably better, and I have come to realize that this is 1 of many variations when deploying Thanos as a long-term, highly available solution for Prometheus. Each variation has its advantages and disadvantages, with possible regulatory implications (if you need to conform to these) necessitating infrastructural, architectural and financial tradeoffs. In a future post, we will hopefully look at these.

I want to thank my colleagues Shaun Levey and Venugopal Naik for their thoughtful suggestions and ideas.
