Deploy Thanos Receive with native Oracle Cloud Object Storage on Kubernetes

Ali Mukadam
Published in Oracle Developers · Nov 29, 2022

In a previous article, we deployed Thanos as a highly available, long-term storage solution for Prometheus. In the process, we used the sidecar model: Thanos ran as a sidecar container in the Prometheus pod, retrieved the metrics, and wrote them to object storage.

In this article, we are going to explore two alternatives:

  1. We will use the Thanos Receive model to receive the metrics data directly from Prometheus.
  2. We will also use the new native OCI Object Storage integration to store the TSDB data.

In the process, we will also highlight a new feature in the Terraform OKE module, namely support for defined tags.

Let’s start by looking at how the new OCI Object Storage for Thanos works.

OCI Object Storage for Thanos

After a lengthy gestation period, we were able to contribute a native integration for OCI to the Thanos project.

There are two heroes behind this successful contribution:

  • Aaron Tam, who wrote most if not all of the integration.
  • Avi Miller. You know what you did and we are grateful.

I’m sure you are wondering why you should use the native integration when the S3 interface works well enough. Well, when configuring the S3 integration for Thanos with OCI Object Storage, you basically need to specify the access and secret key in the object store config:

type: S3
config:
  bucket: "<bucket_name>"
  endpoint: "<object_storage_namespace>.compat.objectstorage.<region>.oraclecloud.com"
  region: "<region>"
  aws_sdk_auth: false
  access_key: "access_key"
  insecure: false
  signature_version2: false
  secret_key: "secret_key"

However, with the new Thanos OCI integration, you have different authentication options. The first and most obvious is using the “raw” provider where you specify the key:

type: OCI
config:
  provider: "raw"
  bucket: ""
  compartment_ocid: ""
  tenancy_ocid: ""
  user_ocid: ""
  region: ""
  fingerprint: ""
  privatekey: ""
  passphrase: ""

Today, you have to specify these values directly. But we can imagine that in a subsequent, improved version, the authentication parameters could come from a secret instead. Said secret could be stored encrypted in OCI Vault and retrieved via the External Secrets Operator. When this happy day eventually comes, rotating your private key and fingerprint becomes much easier.
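To make the idea concrete, here is a sketch of what that could look like with the External Secrets Operator's Oracle Vault provider. Everything here is hypothetical: the vault OCID, region, and secret name are placeholders, and the exact provider fields depend on your ESO version, so check its documentation.

# SecretStore pointing at an OCI Vault (hypothetical OCID and region)
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: oci-vault
  namespace: monitoring
spec:
  provider:
    oracle:
      vault: "ocid1.vault.oc1..example"
      region: "ap-sydney-1"
---
# ExternalSecret that materializes the vault secret as the objstore.yml key
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: oci-vault
    kind: SecretStore
  target:
    name: thanos-objstore-config
  data:
    - secretKey: objstore.yml
      remoteRef:
        key: thanos-objstore-config

The ExternalSecret would keep the Kubernetes secret in sync with the vault, so rotating the key becomes a matter of updating the vault secret.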

A second and more secure option available in the new OCI integration is to use an OCI instance principal:

type: OCI
config:
  provider: "instance-principal"
  bucket: "thanos"
  compartment_ocid: "ocid1.compartment.oc1..a"

Instance principals work on the premise that since you (or your applications) are going to be making API calls from a compute instance, you can authorize the compute instance itself to make those calls. This way, you don't need to flog your keys around in secrets, files, and other such primitive methods. To use the instance principal method, there are two things you need to do:

  1. Create a dynamic group so that the compute instance is a member.
  2. Create policies to give the dynamic group the necessary permissions to interact with whichever OCI services are required (and at what level).

There are a few ways to specify this dynamic group membership:

  • Using the instance OCID, e.g. instance.id = ‘ocid1.instance.oc1…’
  • Using the instance’s compartment OCID, e.g. instance.compartment.id = ‘ocid1.compartment.oc1…’
  • Using defined tags: tag.<tagnamespace>.<tagkey>.value=’<tagvalue>’

You can use the first option if you want to test or if your cluster is small, but it's really not scalable. The second option is far too broad: it will make every instance in any region of the compartment a member of this dynamic group, even those which are not part of your cluster. You could use it in a developer environment, where you want to test and you have created a compartment specifically for your dev environment.

Now, when you eliminate the not-so-great options, whatever remains, however complex, shall be the chosen solution. Ergo, using defined tags is the recommended way. In an OKE setting, as long as your worker nodes carry the necessary defined tags, they become members of this dynamic group. We'll discuss this in more detail shortly, but bear with me for now.

Once the group membership is defined, we need to give this dynamic group access to object storage. You can do this by creating a policy with the following statements:

Allow dynamic-group thanos to manage buckets in compartment id ocid1.compartment.oc1..a
Allow dynamic-group thanos to manage objects in compartment id ocid1.compartment.oc1..a

Now, the worker nodes have just enough permissions to interact with OCI Object Storage. But we don’t want all the worker nodes to be part of this dynamic group and have access. So, how do we impose this restriction such that only certain nodes can run Thanos successfully? Let’s take a small detour and dive deeper into OKE node pools.

OKE module, node pools and defined tags

When you create an OKE (Kubernetes) cluster, you can group and manage your worker nodes using node pools. A node pool is essentially a group of compute instances that have the same configuration and function as worker nodes for a cluster. Compute instances in a node pool share, among other things, the following attributes:

  • Kubernetes version
  • Image used to provision the worker node
  • Compute shape e.g. number of OCPUs, memory and block volume allocated as well as whether to use virtual machines or bare metal
  • CPU architecture (e.g. Intel, AMD, ARM) or GPU shapes
  • Node labels
  • Freeform and defined tags

In an OKE cluster, you can have many node pools each with their own attributes and size e.g. the diagram below shows 3 node pools of varying shapes and sizes to meet mixed performance workload requirements:

Multiple node pools with different shapes and sizes

Likewise, you can have node pools with mixed architectures, all within the same cluster:

Multiple node pools with different CPU architectures

By using labels, you can ensure that certain pods running specific applications land on specific worker nodes that are most suitable for them or most cost effective for your needs.

Combine these two and you can build specialized node pools, which you can further tune via node pool-specific cloud-init, e.g.:
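As an illustration, a node pool-specific cloud-init fragment could look like the sketch below. The tuning values are purely hypothetical, and note that on OKE any custom cloud-init must still run the standard OKE bootstrap so the node can join the cluster:

#cloud-config
runcmd:
  # hypothetical tuning for a data-heavy node pool
  - sysctl -w vm.max_map_count=262144
  - echo never > /sys/kernel/mm/transparent_hugepage/enabled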

In this way, you can right size your cluster and limit the use of costly resources e.g. GPU nodes or bare metal if required.

In our use case, we want to further restrict Thanos pods to run on the worker nodes of a specific node pool only. To achieve this, we configure the defined tags for the node pools. With the defined tags, we can now create a dynamic group with instance principal access to object storage.

Enough theory, show me the real stuff

Let’s start by creating the defined tags. We’ll use the OCI Console here, but you can also use Terraform, the CLI, etc. to create your tags. Search for Tag Namespaces in the OCI Console and create one:

The one we’ll be using here is ‘cn’ (short for cloud native). Click on the tag namespace and create a tag key definition:

You can use a static value or you can further constrain this to a specific list of values. In this case, I’ll be using the tag key ‘role’ which has a static value:

tag keys
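If you would rather do this with Terraform, a minimal sketch using the OCI provider could look like this (var.compartment_id is an assumed variable):

resource "oci_identity_tag_namespace" "cn" {
  compartment_id = var.compartment_id
  name           = "cn"
  description    = "Cloud native tag namespace"
}

resource "oci_identity_tag" "role" {
  tag_namespace_id = oci_identity_tag_namespace.cn.id
  name             = "role"
  description      = "Role of the worker node, e.g. thanos or prometheus"
}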

Finally, create a bucket called ‘thanos’ in Object Storage that will be used to store the TSDB blocks.
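If you prefer the CLI, something like the following should do (substitute your own compartment OCID and Object Storage namespace):

oci os bucket create --name thanos \
  --compartment-id <compartment_ocid> \
  --namespace-name <object_storage_namespace>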

Create the OKE cluster

Next, using the Terraform OKE module, create a couple of node pools using a variable input file terraform.tfvars (sample here):

allow_worker_ssh_access = true

node_pools = {
  np1 = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 2,
    memory             = 32,
    autoscale          = true,
    node_pool_size     = 1,
    max_node_pool_size = 3,
    boot_volume_size   = 150,
    label              = { app = "prometheus", pool = "np1" },
    node_defined_tags  = { "cn.role" = "prometheus" }
  }
  np2 = {
    shape              = "VM.Standard.E4.Flex",
    ocpus              = 2,
    memory             = 32,
    autoscale          = true,
    node_pool_size     = 1,
    max_node_pool_size = 3,
    boot_volume_size   = 150,
    label              = { app = "thanos", pool = "np2" },
    node_defined_tags  = { "cn.role" = "thanos" }
  }
}

Also set allow_worker_ssh_access to true so we can test the instance principal later; you can disable it again afterwards. Run terraform apply to create your cluster and node pools.

From an infrastructure perspective, this is what we want to achieve:

Prometheus pods will run on worker nodes of node pool 1 and Thanos pods will run on worker nodes of node pool 2.

From a functional perspective, this is what we want to achieve (we’ll skip the Nginx Ingress controller and the application pods in this article):

We want Prometheus to use its remote write feature to send the metrics it has scraped to Thanos Receive, which will then write them to OCI Object Storage.

Use the OCI Console to check the Kubernetes labels for the worker nodes in node pool 1:

and in node pool 2:

Create a dynamic group and policy to access Object Storage

Create a dynamic group and set the following rule:

tag.cn.role.value='thanos'


Next, create a policy to give the dynamic group access to object storage. The policy needs the following two statements:

Allow dynamic-group thanos to manage buckets in compartment id ocid1.compartment.oc1..a
Allow dynamic-group thanos to manage objects in compartment id ocid1.compartment.oc1..a
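If you prefer to keep the IAM pieces in code too, a rough Terraform equivalent might look like the sketch below. The variable names are assumptions; dynamic groups and policies are IAM resources, created here at the tenancy level:

resource "oci_identity_dynamic_group" "thanos" {
  compartment_id = var.tenancy_ocid
  name           = "thanos"
  description    = "Worker nodes tagged with cn.role = thanos"
  matching_rule  = "tag.cn.role.value='thanos'"
}

resource "oci_identity_policy" "thanos" {
  compartment_id = var.tenancy_ocid
  name           = "thanos-object-storage"
  description    = "Thanos access to Object Storage"
  statements = [
    "Allow dynamic-group thanos to manage buckets in compartment id ${var.compartment_ocid}",
    "Allow dynamic-group thanos to manage objects in compartment id ${var.compartment_ocid}",
  ]
}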

Use Cloud Shell or SSH to the worker node in np2 and install the OCI CLI so you can test the instance principal:

sudo dnf install -y oracle-olcne-release-el8
sudo dnf config-manager --disable ol8_olcne13
sudo dnf config-manager --disable ol8_olcne14
sudo dnf config-manager --enable ol8_olcne15
sudo dnf install -y python3-oci-cli

Set authentication method to instance_principal and test access to Object Storage:

export OCI_CLI_AUTH=instance_principal
oci os bucket list --compartment-id ocid1.compartment.oc1..a --namespace-name <object storage namespace>

You can find the object storage namespace in your tenancy page in the OCI Console. You should see something like the following:

{
  "data": [
    {
      "compartment-id": "ocid1.compartment.oc1..aa",
      "created-by": "",
      "defined-tags": null,
      "etag": "",
      "freeform-tags": null,
      "name": "thanos",
      "namespace": "",
      "time-created": ""
    }
  ]
}

If you can retrieve the bucket, then the instance principal is working on the node pool and we can proceed with the Thanos configuration. You can skip this test if you want; its purpose is only to illustrate instance principals.

Deploying Thanos

Log in to the operator host or use cloud shell and create a namespace:

kubectl create namespace monitoring

Create a file called storage.yaml and fill in the compartment id value accordingly:

type: OCI
config:
  provider: "instance-principal"
  bucket: "thanos"
  compartment_ocid: "ocid1.compartment.oc1.."

Note that this is the bare minimum configuration. There are additional parameters which you can further configure. Please refer to the documentation.

Next, create a secret out of it:

kubectl -n monitoring create secret generic thanos-objstore-config --from-file=objstore.yml=storage.yaml

Add the bitnami Thanos helm chart repo:

helm repo add bitnami https://charts.bitnami.com/bitnami

Generate a values manifest file:

helm show values bitnami/thanos > thanos.yaml

Edit the thanos.yaml file and set the following:

image:
  tag: 0.29.0
existingObjstoreSecret: "thanos-objstore-config"
queryFrontend:
  enabled: true
bucketweb:
  enabled: true
compactor:
  enabled: true
storegateway:
  enabled: true
receive:
  enabled: true

Remember, we also want to restrict the Thanos pods to the worker nodes in node pool 2. To achieve this, look for all the nodeSelectors in thanos.yaml and set them as follows:

nodeSelector:
  app: thanos

The nodeSelector ensures that the Thanos pods land on worker nodes labeled app = “thanos”; in this case, the worker nodes in node pool 2. This in turn ensures that the Thanos pods can use the instance principal method to access OCI Object Storage. Where does this app = “thanos” come from, you ask? When we created the node pools, we set it as a label:

np2 = {
  shape              = "VM.Standard.E4.Flex",
  ocpus              = 2,
  memory             = 32,
  autoscale          = true,
  node_pool_size     = 1,
  max_node_pool_size = 3,
  boot_volume_size   = 150,
  label              = { app = "thanos", pool = "np2" },
  node_defined_tags  = { "cn.role" = "thanos" }
}

We can now deploy Thanos:

helm install thanos bitnami/thanos --namespace monitoring -f thanos.yaml

Find the worker nodes for Thanos using the label:

kubectl get nodes --show-labels | grep app=thanos
10.0.99.78 Ready node 44m v1.23.4 app=thanos,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=VM.Standard.E4.Flex,beta.kubernetes.io/os=linux

Let’s verify if the Thanos pod landed on the right worker node:

kubectl -n monitoring describe pod thanos-query-69854d896d-lrlzh
Name: thanos-query-69854d896d-lrlzh
Namespace: monitoring
Priority: 0
Node: 10.0.99.78/10.0.99.78

Lastly, check the pods. They should all have started successfully:

kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
thanos-bucketweb-846685d5df-wzgfs 1/1 Running 0 84s
thanos-compactor-85c7cbf4c6-zslwb 1/1 Running 0 84s
thanos-query-69854d896d-lrlzh 1/1 Running 0 84s
thanos-query-frontend-86699dcbfb-pczxr 1/1 Running 0 84s
thanos-receive-0 1/1 Running 0 84s
thanos-storegateway-0 1/1 Running 0 84s

If you wish to be even more selective, the only three components that really need instance principal access are those that interact with Object Storage (see the sketch after this list):

  • Compactor
  • Receive
  • Storegateway
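In that case, rather than setting every nodeSelector, you could pin just these three components to node pool 2 in thanos.yaml, e.g. (a sketch using the Bitnami chart's per-component nodeSelector values):

compactor:
  nodeSelector:
    app: thanos

receive:
  nodeSelector:
    app: thanos

storegateway:
  nodeSelector:
    app: thanos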

Deploy Prometheus

We’ll use kube-prometheus-stack to deploy Prometheus:

helm repo add kps https://prometheus-community.github.io/helm-charts

Generate a values manifest file:

helm show values kps/kube-prometheus-stack > kps.yaml

Edit the kps.yaml file and set the following:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    remoteWrite:
      - url: http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive

As we did for Thanos, locate the nodeSelectors and set them so the Prometheus pods land on the worker nodes in node pool 1, using the label (app = “prometheus”) we specified for that pool:

nodeSelector:
  app: prometheus

We can now deploy Prometheus:

helm install prometheus kps/kube-prometheus-stack --namespace monitoring -f kps.yaml

Verify the pods are running:

kubectl -n monitoring get pods

NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 1 (107s ago) 2m8s
prometheus-grafana-66b6bf7789-249vn 3/3 Running 0 2m22s
prometheus-kube-prometheus-operator-549866c8dd-dbd88 1/1 Running 0 2m22s
prometheus-kube-state-metrics-7944d98645-4q77n 1/1 Running 0 2m22s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 2m8s
prometheus-prometheus-node-exporter-4jms5 1/1 Running 0 2m21s
prometheus-prometheus-node-exporter-svbs4 1/1 Running 0 2m22s
thanos-bucketweb-846685d5df-wzgfs 1/1 Running 0 16m
thanos-compactor-85c7cbf4c6-zslwb 1/1 Running 0 16m
thanos-query-69854d896d-lrlzh 1/1 Running 0 16m
thanos-query-frontend-86699dcbfb-pczxr 1/1 Running 0 16m
thanos-receive-0 1/1 Running 0 16m
thanos-storegateway-0 1/1 Running 0 16m

Let’s look at the logs for Prometheus:

kubectl -n monitoring logs -f prometheus-prometheus-kube-prometheus-prometheus-0
ts=2022-11-29T12:07:30.244Z caller=main.go:543 level=info msg="Starting Prometheus Server" mode=server version="(version=2.39.1, branch=HEAD, revision=dcd6af9e0d56165c6f5c64ebbc1fae798d24933a)"
ts=2022-11-29T12:07:30.244Z caller=main.go:548 level=info build_context="(go=go1.19.2, user=root@273d60c69592, date=20221007-15:57:09)"
ts=2022-11-29T12:07:30.244Z caller=main.go:549 level=info host_details="(Linux 5.4.17-2136.310.7.1.el8uek.x86_64 #2 SMP Wed Aug 17 15:14:08 PDT 2022 x86_64 prometheus-prometheus-kube-prometheus-prometheus-0 (none))"
ts=2022-11-29T12:07:30.244Z caller=main.go:550 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-11-29T12:07:30.244Z caller=main.go:551 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-11-29T12:07:30.246Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-11-29T12:07:30.247Z caller=main.go:980 level=info msg="Starting TSDB ..."
ts=2022-11-29T12:07:30.248Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-11-29T12:07:30.252Z caller=head.go:551 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2022-11-29T12:07:30.252Z caller=head.go:595 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=2.955µs
ts=2022-11-29T12:07:30.252Z caller=head.go:601 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2022-11-29T12:07:30.252Z caller=head.go:672 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-11-29T12:07:30.252Z caller=head.go:709 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=24.287µs wal_replay_duration=264.923µs wbl_replay_duration=161ns total_replay_duration=316.331µs
ts=2022-11-29T12:07:30.253Z caller=main.go:1001 level=info fs_type=XFS_SUPER_MAGIC
ts=2022-11-29T12:07:30.253Z caller=main.go:1004 level=info msg="TSDB started"
ts=2022-11-29T12:07:30.253Z caller=main.go:1184 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2022-11-29T12:07:30.261Z caller=dedupe.go:112 component=remote level=info remote_name=c6fe10 url=http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive msg="Starting WAL watcher" queue=c6fe10
ts=2022-11-29T12:07:30.261Z caller=dedupe.go:112 component=remote level=info remote_name=c6fe10 url=http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive msg="Starting scraped metadata watcher"
ts=2022-11-29T12:07:30.261Z caller=dedupe.go:112 component=remote level=info remote_name=c6fe10 url=http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive msg="Replaying WAL" queue=c6fe10
ts=2022-11-29T12:07:30.261Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.261Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.262Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.262Z caller=kubernetes.go:326 level=info component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.357Z caller=main.go:1221 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=103.821285ms db_storage=1.363µs remote_storage=824.336µs web_handler=500ns query_engine=841ns scrape=325.239µs scrape_sd=1.202864ms notify=18.124µs notify_sd=425.578µs rules=94.403577ms tracing=6.311µs
ts=2022-11-29T12:07:30.357Z caller=main.go:965 level=info msg="Server is ready to receive web requests."
ts=2022-11-29T12:07:30.357Z caller=manager.go:943 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-11-29T12:07:30.387Z caller=main.go:1184 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2022-11-29T12:07:30.391Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.391Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.392Z caller=kubernetes.go:326 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.392Z caller=kubernetes.go:326 level=info component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-11-29T12:07:30.471Z caller=main.go:1221 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=84.211257ms db_storage=1.412µs remote_storage=78.379µs web_handler=300ns query_engine=942ns scrape=88.718µs scrape_sd=892.646µs notify=11.611µs notify_sd=229.146µs rules=78.660912ms tracing=6.243µs
ts=2022-11-29T12:07:36.032Z caller=dedupe.go:112 component=remote level=info remote_name=c6fe10 url=http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive msg="Done replaying WAL" duration=5.771154497s

We can see that Prometheus is able to reach Thanos Receive. Let’s come back in a couple of hours, once Prometheus has started sending TSDB blocks to Thanos Receive:

ts=2022-11-29T15:07:34.926Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1669723651686 maxt=1669730400000 ulid=01GK1YVNABQ27EXKYSPZ7PAW7Q duration=323.274ms
ts=2022-11-29T15:07:34.944Z caller=head.go:1192 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=16.760054ms
ts=2022-11-29T17:00:01.420Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1669730400427 maxt=1669737600000 ulid=01GK259HMD5R7ZYXNAN5Q2HT9B duration=382.941853ms
ts=2022-11-29T17:00:01.439Z caller=head.go:1192 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=18.006233ms
ts=2022-11-29T19:00:01.443Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1669737600427 maxt=1669744800000 ulid=01GK2C58WCJBJD33MW1MSHBM4X duration=406.65534ms
ts=2022-11-29T19:00:01.458Z caller=head.go:1192 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=13.797336ms

And in the Thanos Receive logs, we can see it uploading the TSDB blocks to OCI Object Storage:

level=info ts=2022-11-29T15:07:35.634586334Z caller=compact.go:519 component=receive component=multi-tsdb tenant=default-tenant msg="write block" mint=1669723651686 maxt=1669730400000 ulid=01GK1YVNY44QT5JNJM8SS4RWV1 duration=398.359924ms
level=info ts=2022-11-29T15:07:35.656935812Z caller=head.go:1192 component=receive component=multi-tsdb tenant=default-tenant msg="Head GC completed" caller=truncateMemory duration=20.755986ms
level=info ts=2022-11-29T15:07:52.64378284Z caller=shipper.go:334 component=receive component=multi-tsdb tenant=default-tenant msg="upload new block" id=01GK1YVNY44QT5JNJM8SS4RWV1
level=info ts=2022-11-29T17:00:01.477942982Z caller=compact.go:519 component=receive component=multi-tsdb tenant=default-tenant msg="write block" mint=1669730400427 maxt=1669737600000 ulid=01GK259HKJBTJET2W57PG7CNDH duration=467.532064ms
level=info ts=2022-11-29T17:00:01.501835652Z caller=head.go:1192 component=receive component=multi-tsdb tenant=default-tenant msg="Head GC completed" caller=truncateMemory duration=22.437929ms
level=info ts=2022-11-29T17:00:22.584036806Z caller=shipper.go:334 component=receive component=multi-tsdb tenant=default-tenant msg="upload new block" id=01GK259HKJBTJET2W57PG7CNDH
level=info ts=2022-11-29T19:00:01.523908074Z caller=compact.go:519 component=receive component=multi-tsdb tenant=default-tenant msg="write block" mint=1669737600427 maxt=1669744800000 ulid=01GK2C58W441B8RY8HTKNYT9DJ duration=495.405003ms
level=info ts=2022-11-29T19:00:01.540641622Z caller=head.go:1192 component=receive component=multi-tsdb tenant=default-tenant msg="Head GC completed" caller=truncateMemory duration=14.962991ms
level=info ts=2022-11-29T19:00:22.500773034Z caller=shipper.go:334 component=receive component=multi-tsdb tenant=default-tenant msg="upload new block" id=01GK2C58W441B8RY8HTKNYT9DJ

Finally, verify that the TSDB blocks have been written to the bucket.
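You can eyeball the bucket in the OCI Console, or list its objects with the OCI CLI (fill in the namespace placeholder as before):

oci os object list --bucket-name thanos --namespace-name <object storage namespace>

You should see the blocks' objects grouped under ULID-named prefixes, one per uploaded block.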

This confirms a successful integration between Thanos Receive and OCI Object Storage using the new native Object Storage integration.

Access with Grafana

We can now access Grafana to look at how our cluster is performing. By default, the Grafana that comes pre-installed with kube-prometheus-stack is configured to use Prometheus as its default data source. Since Thanos Query exposes a Prometheus-compatible API, we only need to add a new data source of type Prometheus and point it at the Thanos Query Frontend service URL:

http://thanos-query-frontend.monitoring:9090/

Make sure the Thanos data source is set as default:

Save and test to make sure the data source is working. You can now view the pre-built dashboards.
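Alternatively, instead of clicking through the Grafana UI, you can provision the data source declaratively in kps.yaml via the chart's additionalDataSources value, e.g. (a sketch; if you mark Thanos as the default, you may also need to unset the default flag on the bundled Prometheus data source):

grafana:
  additionalDataSources:
    - name: Thanos
      type: prometheus
      url: http://thanos-query-frontend.monitoring:9090/
      isDefault: true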

Recommendations

  1. Don’t confuse labels and tags. Labels are used by Kubernetes to assign pods to certain nodes, whereas tags are used by OCI to determine dynamic group membership (or not) of the compute nodes.
  2. When assigning privileges to dynamic groups via policies, always apply the principle of least privilege. This ensures each component has the minimum permissions to do what it is supposed to and nothing more.
  3. Thanos comes with several components (Sidecar, Receive, Storegateway, Query, Query Frontend, etc.) and can be deployed in a couple of architectures (sidecar, receive). When combined with your infrastructure/cloud, a few more deployment options become available. Therefore, take the time to understand the purpose of each component and how you can best use them to determine the best fit for your architecture. Don’t just go with whatever has been published.

Summary

In this article, we’ve shown you three main things:

  1. How to use the new OCI Object Storage integration with Thanos, and the two authentication methods you can use to configure it.
  2. How to configure Prometheus Remote Write to send TSDB blocks to Thanos Receive, and how to use the two together on OKE with OCI Object Storage.
  3. How to use the new defined tags feature to create specialized node pools and, if you need authenticated access to some OCI services from the worker nodes, how to use instance principals to configure more restricted and secure access.

The combination of the above three opens a whole new realm of monitoring opportunities:

  • You can now have a more flexible architecture which can accommodate the simultaneous monitoring of multiple Kubernetes clusters running in different regions in OCI, or in other cloud providers from a single location.
  • You can also monitor existing Kubernetes clusters in your private data centers without having to punch holes in your firewall. Instead, you can get Prometheus to ship the metrics data directly to a Thanos Receive that can run either locally or remotely in OCI.
  • You can monitor any type of system using Prometheus without the need for an entire Kubernetes cluster. All you need to do is tell Prometheus where your Receive is running and how to ship your data. Think of systems running on edge with low power, low compute, low memory and low storage. You can now monitor them and store their long term metrics in OCI Object Storage at a more affordable cost.
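For that last scenario, all a standalone Prometheus needs is a remote_write block pointing at your Receive endpoint, e.g. (a sketch with a hypothetical URL; in practice you would front the endpoint with TLS and authentication):

# prometheus.yml on an edge host
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
remote_write:
  - url: https://thanos-receive.example.com/api/v1/receive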

In a future post, we’ll explore some of these scenarios. In the meantime, you can also read about how Medallia (who also use OCI) uses Thanos to monitor Kubernetes clusters across 40 data centers.

Once again, I would like to take the opportunity to thank my colleagues Aaron and Avi for their efforts in making OCI Object Storage a first class citizen for Thanos storage.

If you’re curious about the goings-on of Oracle Developers in their natural habitat, come join us on our public Slack channel! We don’t mind being your fish bowl 🐠
