Kubernetes monitoring with Verrazzano, Prometheus, Thanos and Workload Identity on Oracle Cloud

Ali Mukadam
Oracle Developers
Sep 27, 2023

I have written several times before about using Prometheus and Thanos to monitor OKE (Kubernetes). In this article, I’ll chart our journey so far of adding Thanos support in OCI.

Step 0: Using the S3 interface in OCI Object Storage to store metrics

The very first iteration used the S3-compatible interface so that Thanos could read and write TSDB blocks to OCI Object Storage. It felt like a workaround, and because the access and secret keys had to be embedded in the object store config, many users were understandably not particularly comfortable with it. Nevertheless, it was effective, and we could use it to monitor multiple clusters too.
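For reference, a minimal Thanos object store config for that approach looked something like the sketch below. The bucket, namespace and region values are placeholders; OCI’s S3-compatible endpoint follows the `<namespace>.compat.objectstorage.<region>.oraclecloud.com` pattern:

```yaml
type: S3
config:
  bucket: "thanos"
  endpoint: "mynamespace.compat.objectstorage.ap-sydney-1.oraclecloud.com"
  region: "ap-sydney-1"
  access_key: "<access key>"
  secret_key: "<secret key>"
```

The access and secret keys here are OCI Customer Secret Keys, which is exactly the credential-embedding that made users uneasy.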

Step 1: Making OCI Object Storage a 1st class citizen among Thanos storage providers

Our next step was to make OCI Object Storage a first-class citizen among supported Thanos storage providers. Doing so allowed us to support both key-based and instance principal authentication. For key-based authentication, you would create a Secret to hold the authentication information:

type: OCI
config:
  provider: "raw"
  bucket: ${bucket_name}
  compartment_ocid: "ocid1.compartment.oc1....."
  region: ${region}
  tenancy_ocid: "ocid1.tenancy.oc1....."
  user_ocid: "ocid1.user.oc1....."
  fingerprint: "12:d3:4c:..."
  privatekey: |
    -----BEGIN RSA PRIVATE KEY-----
    <replaceme>
    -----END RSA PRIVATE KEY-----

While it doesn’t address the security aspect, this brought OCI Object Storage support to the same level as S3. Importantly, Thanos had also added the Receive model. As a refresher, Thanos has 2 deployment models:

  1. The Sidecar:
Thanos with the Sidecar model

In the Sidecar model, the Thanos sidecar runs in the Prometheus Pod and uploads the TSDB to the selected object storage.
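With the Prometheus Operator, enabling the sidecar typically amounts to pointing the Prometheus custom resource at the Secret holding the object store config. A sketch, assuming the Secret and resource names below; the exact field layout can vary between prometheus-operator versions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: verrazzano-monitoring
spec:
  # Injects the Thanos sidecar into the Prometheus pod;
  # the sidecar uploads completed TSDB blocks to object storage.
  thanos:
    objectStorageConfig:
      name: objstore-config   # Secret containing the config shown earlier
      key: objstore.yml
```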

  2. The Receiver:

Thanos with the Receive model

In the Receive model, Prometheus writes the metrics continuously to the Thanos Receiver which then uploads to Object Storage.
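On the Prometheus side, this is a remote_write configuration pointing at the Receiver’s ingest endpoint. A sketch, assuming the service name and namespace below; 19291 is the Receiver’s default remote-write port:

```yaml
# prometheus.yml fragment: stream samples to Thanos Receive
remote_write:
  - url: "http://thanos-receive.verrazzano-monitoring.svc:19291/api/v1/receive"
```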

Which one is most appropriate for your cluster depends on several factors, and this article does a very good comparison of both approaches. Using the Receive method and key-based authentication, it is now also possible for remote Kubernetes clusters running outside of OCI, say in edge environments, to ship their metrics directly to OCI Object Storage. There are other variations to support remote clusters (e.g. running a Receiver on OKE), but our focus in this article is on the Object Storage integration.

Step 2: Adding instance principal support

As we now have native OCI Object Storage support, we could also implement instance principal authentication. Instance principal works by making compute instances part of an OCI dynamic group and creating policies that pre-authorize them to work with OCI services. In both cases, you must ensure the minimum level of privilege, i.e. the user or the dynamic group must be able to use OCI Object Storage only, and only to read and write specific buckets.

However, in Kubernetes, you do not by default control where pods are scheduled, so for Thanos to work using instance principal, you must ensure it lands on worker nodes that are part of the dynamic group. I’ve detailed all the steps you need to take in this previous article. Suffice it to say that it requires some planning, and you need to know:

  1. OCI IAM
  2. Kubernetes and OKE
  3. Thanos
  4. And the intersection of the above 3.
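With the native provider, switching to instance principal is then just a change of the objstore config: no user OCID, fingerprint or private key is needed. A sketch, with placeholder bucket and compartment values:

```yaml
type: OCI
config:
  provider: "instance-principal"
  bucket: "thanos"
  compartment_ocid: "ocid1.compartment.oc1....."
```

On the IAM side, the worker nodes must belong to a dynamic group with a policy along the lines of `Allow dynamic-group <group-name> to manage objects in compartment <compartment-name> where target.bucket.name = 'thanos'` (names here are illustrative).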

Step 3: OKE Workload Identity

The key-based method is dead simple but requires storing the credentials in a Secret. Even with limited access, this is something many users baulk at. The instance principal method does not use keys, but it requires a fair bit of understanding of 3 different systems, the areas where they overlap, and elaborate planning.

What our users were looking for was the best of both worlds:

  • simplicity and ease-of-use
  • security without storing the key

Step forward OKE Workload Identity. Announced in March, it allows a workload running in OKE to use OCI services by assigning access at the pod level via Kubernetes service accounts. As such, it’s considerably simpler, yet at the same time more secure and more cost-effective: you no longer need to run a dedicated node pool just to attain the necessary permissions and isolation. The one caveat is that you need OKE Enhanced Clusters in order to use it.
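Instead of a dynamic group of instances, the IAM policy matches the pod’s Kubernetes service account directly. A sketch of such a policy; the compartment name, namespace, service account and cluster OCID are placeholders for your own values:

```
Allow any-user to manage objects in compartment <compartment-name> where all {
  request.principal.type = 'workload',
  request.principal.namespace = 'verrazzano-monitoring',
  request.principal.service_account = 'thanos-storegateway',
  request.principal.cluster_id = 'ocid1.cluster.oc1.....'
}
```

Because the policy is scoped to a specific service account in a specific cluster, only the Thanos pods bound to that service account can touch the bucket.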

OKE Enhanced Clusters

What are OKE Enhanced Clusters, you ask? On OCI, the control plane of a basic OKE cluster is free. Yep, we charge you exactly $0.00 for the privilege of managing your Kubernetes control plane.

You only get charged for what you use:

  • compute for worker nodes
  • storage
  • network resources such as VCN, Load Balancers
  • and network egress (which is generous and considerably cheaper than in other cloud providers):
Source: https://www.linkedin.com/posts/justinfsmith_the-numbers-speak-for-themselves-according-activity-7042547072130027520-qZnT

OKE Basic clusters allow you to run up to 1000 nodes in a Kubernetes cluster.

OKE Enhanced clusters, on the other hand, allow you to run more than 1000 nodes as well as use productivity features such as OKE Workload Identity, Add-ons and Virtual Nodes (aka Serverless OKE). All this for a cool $0.10 per cluster per hour.

Step 4: Adding OKE Workload Identity support to Thanos

Now that OKE supports Workload Identity, we could add support for this authentication method to Thanos. This magical piece of work was done and contributed upstream by my colleague Fred Tibbitts. After considerable further testing, we’ve now added it to Verrazzano too and released it in v1.6.7.

Let’s take it for a spin with the Terraform module for Verrazzano.

Testing Thanos with the Terraform module

First, clone the module:

git clone https://github.com/oracle-terraform-modules/terraform-oci-verrazzano.git

Follow the instructions to create OKE clusters in multiple regions: https://oracle-terraform-modules.github.io/terraform-oci-verrazzano/multi/pub-ep.html. Remember to set the cluster type to Enhanced:

cluster_type = "enhanced"

Before generating the scripts to install Verrazzano, set the following values:

get_kubeconfigs     = true
install_verrazzano  = true
verrazzano_version  = "1.6.7"
grafana             = true
prometheus          = true
prometheus_operator = true
rancher             = true
thanos = {
  bucket_name      = "thanos"
  bucket_namespace = "<replaceme>"
  enabled          = "true"
  integration      = "sidecar"
  storage_gateway  = "true"
}

Run terraform apply again to upload the scripts to the operator host, then ssh to the operator host to perform the installation. First, install the Verrazzano Platform Operator (my managed cluster is in the Melbourne region):

cd /home/opc/vz/operator

for cluster in admin melbourne; do
  bash install_vz_operator_$cluster.sh
done

Before installing, create the Thanos Object Storage configuration:

cd /home/opc/vz/clusters

for cluster in admin melbourne; do
  kubectx $cluster
  kubectl create namespace verrazzano-monitoring
  kubectl create secret generic objstore-config -n verrazzano-monitoring \
    --from-file=objstore.yml=thanos_${cluster}_storage.yaml
done

If you inspect the contents of the object store config file, you will see something like this:

type: OCI
config:
  provider: "oke-workload-identity"
  bucket: dev-thanos
  region: "ap-sydney-1"

Install the Admin cluster:

cd /home/opc/vz/clusters
bash install_vz_cluster_admin.sh

Followed by the managed cluster:

cd /home/opc/vz/clusters
for cluster in melbourne; do
  bash install_vz_cluster_$cluster.sh
done

In about 5 minutes, the installation should complete. Follow the rest of the steps to complete the registration of the managed cluster:

cd /home/opc/vz/certs
for cluster in melbourne; do
  bash create_cert_secret_$cluster.sh
done

cd /home/opc/vz/cm
bash create_api_cm.sh

cd /home/opc/vz/clusters
for cluster in melbourne; do
  bash create_vmc_$cluster.sh
done

for cluster in melbourne; do
  bash register_vmc_$cluster.sh
done

On the operator host, run the following commands:

kubectx admin

# to retrieve the verrazzano user password
vz_access.sh

# to retrieve the urls
vz status

Log in to Thanos and you should see your managed cluster’s Thanos registered. It may take a couple of hours before the sidecar starts shipping TSDB blocks to OCI Object Storage:

In Grafana, you can also observe the Thanos Dashboard:

You can also filter and observe selected clusters from the pull down at the top of the dashboard.

Summary

The combination of Prometheus and Thanos provides an effective way to monitor multiple OKE clusters on OCI. Thanos itself provides 2 main deployment models, both of which can be deployed on OCI.

In the latest Verrazzano release, we’ve improved Thanos support for OCI Object Storage integration so that you can use OKE Enhanced Clusters and OKE Workload Identity for an easier and yet more secure integration. It also makes it easier to automate the deployment of Prometheus and Thanos.

I would like to thank my colleague Fred Tibbitts for his work on adding OKE Workload Identity support to the Thanos object store, as well as adding Thanos itself to Verrazzano.
