How to Store Telemetry Metrics in AWS S3 Using Thanos
Learn how to store telemetry metrics in long-term storage such as AWS S3 using Thanos
In this article, we will deploy Thanos in AWS ECS to store telemetry data in an S3 bucket. First, we will see how to generate telemetry metrics from a Python application built with FastAPI and the OpenTelemetry SDK. Then we will configure the OpenTelemetry Collector to process the generated metrics. Finally, we will look at how Thanos is configured to receive the metrics sent from the OpenTelemetry Collector and store them.
What is Thanos?
Thanos is a set of open-source components that turn a Prometheus setup into a highly available system with long-term storage capabilities. These components provide a global query view and unlimited metrics retention by extending the system with an object storage service such as AWS S3. Thanos also supports the Prometheus Query API, which allows tools such as Grafana to query and visualize metrics, all in one place. Thanos’s main goals are operational simplicity and retaining Prometheus’s reliability properties.
The components mentioned above can be deployed independently of each other, which allows you to use or test different configurations depending on your needs.
Architecture
The following diagram shows an application load balancer used to route traffic to an application called “Demo app”. This application will send telemetry metrics to the OpenTelemetry collector to be processed and then sent to a Thanos receiver to store them in S3. The “Demo app”, OpenTelemetry collector, and Thanos receiver are containers:
- AWS ECS will host three containers: the demo app, the OpenTelemetry collector, and the Thanos receiver.
- An application load balancer will route traffic to the demo app container. The demo app was built using FastAPI and instrumented with OpenTelemetry.
- Once the demo app receives traffic, telemetry metrics will be generated and sent to the OpenTelemetry collector.
- The OpenTelemetry Collector receives the metrics, processes them, and exports them to the Thanos receiver.
- Thanos receiver implements the Prometheus Remote Write API to receive the metrics and then uploads TSDB (time series database) blocks to an AWS S3 bucket every 2 hours by default.
Prerequisites
To follow this article, you should have the following things ready:
- An AWS Account.
- Linux OS or Windows Subsystem for Linux (WSL) if you have Windows.
- AWS CLI configured.
- Docker.
- Git.
- Terraform.
Components
Now we’re going to see how each component is configured to serve its purpose in this demonstration. We’ll take a look at how the Demo App was coded to generate telemetry metrics and how both OpenTelemetry Collector and Thanos Receiver are configured to process the metrics and store them in S3.
Demo App
The Demo App is an application built with FastAPI, with the OpenTelemetry SDK included in its source code. It has two paths defined: /hello returns a “Hello!” message, and /uuid returns a random UUID, a string composed of letters, numbers, and dashes. Metrics will be generated by the OpenTelemetry SDK every time the /uuid path is accessed. This is the code for the demo app:
import os
import uuid

from fastapi import FastAPI
from opentelemetry import metrics
# from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

app = FastAPI()

# Service name is required for most backends
service_name = os.getenv("OTEL_SERVICE_NAME") or "test-service"
print(f"service name: {service_name}")

resource = Resource(attributes={
    SERVICE_NAME: service_name
})

# Export over OTLP/HTTP if an endpoint is configured, otherwise print to the console
metrics_endpoint_environment_variables = ["OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"]
if any(variable in os.environ for variable in metrics_endpoint_environment_variables):
    print(f"endpoint: {os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT') or os.getenv('OTEL_EXPORTER_OTLP_METRICS_ENDPOINT')}")
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())
else:
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())

meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)

@app.get("/hello")
def read_hello():
    return {"message": "Hello!"}

@app.get("/uuid")
def read_uuid():
    myuuid = uuid.uuid4()
    return {"message": str(myuuid)}

# Instrument the app, but skip the /hello path so only /uuid generates metrics
FastAPIInstrumentor.instrument_app(app, excluded_urls="hello")
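As a side note on the /uuid response: uuid.uuid4() produces a random version-4 UUID, whose string form always follows an 8-4-4-4-12 hexadecimal pattern. A quick standard-library sketch, independent of the app above, shows the shape:

```python
import re
import uuid

# String form of a version-4 UUID: 36 characters in an
# 8-4-4-4-12 hexadecimal pattern, separated by dashes.
value = str(uuid.uuid4())

# Version 4 fixes the first digit of the third group to "4",
# and the first digit of the fourth group to 8, 9, a, or b.
UUID4_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

assert len(value) == 36
assert UUID4_PATTERN.match(value)
print(value)
```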
OpenTelemetry Collector
The OpenTelemetry Collector is a vendor-agnostic implementation that receives, processes, and exports telemetry data. The code below is the configuration for the Collector. Receivers collect telemetry data from one or multiple sources, and exporters send the collected data to one or more destinations. The service section enables components in the Collector based on the configuration found in the receivers and exporters sections: configuring a component does not enable it; it also has to be listed under service. At least one receiver and one exporter are needed.
The following example is a Collector configuration with one receiver, the otlp (OpenTelemetry) receiver, which listens on port 4317 for gRPC and port 4318 for HTTP. There are also two exporters: the debug exporter, which writes detailed information about every telemetry record to the Collector logs, and a Prometheus remote write exporter, which sends data to Prometheus or any Prometheus remote-write-compatible backend:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed
  prometheusremotewrite/thanos:
    endpoint: "http://localhost:10908/api/v1/receive"
    namespace: demo-app
    external_labels:
      thanos: "true"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug, prometheusremotewrite/thanos]
The options configured for the Prometheus remote write exporter are:
- endpoint: the URL to send metrics to. This is mandatory.
- external_labels: a map of label names and values to be attached to each metric data point. This is optional, but for this example it is required because Thanos needs a label to accept the metrics sent to it.
- namespace: a prefix attached to the names of all exported metrics. This is optional.
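To make these two options concrete, here is a small illustrative sketch in plain Python. This is not the exporter’s actual implementation (the metric name and label below are made up, and the real exporter also sanitizes names), but it shows what conceptually happens to each metric sample before it is sent: the namespace is prepended to the metric name, and the external labels are merged into its label set.

```python
def apply_exporter_options(metric_name, labels, namespace=None, external_labels=None):
    """Illustrative only: prepend the namespace to the metric name and
    merge the configured external labels into the sample's label set."""
    if namespace:
        metric_name = f"{namespace}_{metric_name}"
    merged = dict(labels)
    merged.update(external_labels or {})
    return metric_name, merged

# "http_server_duration" is a hypothetical metric emitted by the demo app.
name, labels = apply_exporter_options(
    "http_server_duration",
    {"http_method": "GET"},
    namespace="demo-app",
    external_labels={"thanos": "true"},
)
print(name)    # demo-app_http_server_duration
print(labels)  # {'http_method': 'GET', 'thanos': 'true'}
```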
The Thanos Receiver
Here we’re going to see how the Thanos Receiver can be configured to store the metrics received in an S3 bucket.
How do I configure the storage?
Thanos uses object storage as the primary storage for metrics and their metadata. The following example shows how to configure an AWS S3 bucket as object storage with a YAML file:
type: S3
config:
  bucket: "test-thanos-demo"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  aws_sdk_auth: true
  signature_version2: false
  sse_config:
    type: "SSE-S3"
prefix: "demo-app"
Here, you can see the AWS S3 configuration options for object storage:
- The type S3 needs to be specified.
- The name of the bucket and the regional endpoint are mandatory.
- Although in this example a region was specified, it’s not necessary.
- With aws_sdk_auth set to true, Thanos will use the default authentication methods of the AWS SDK based on known environment variables (AWS_PROFILE, AWS_WEB_IDENTITY_TOKEN_FILE, etc.) and known AWS config files (~/.aws/config). This avoids setting up the access_key and secret_key keys.
- AWS requires signature v4, which is why signature_version2 was set to false.
- S3 Server-Side Encryption (SSE) can be configured in the sse_config settings. SSE-S3, SSE-KMS, and SSE-C are supported. When set to SSE-S3, nothing else needs to be configured.
- The prefix field is optional. It can be used to store your metrics in a separate folder in your S3 bucket.
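To illustrate what prefix does, the short sketch below (plain Python, with a made-up block identifier; real Thanos block IDs are ULIDs) composes the kind of object keys that end up in the bucket, nesting everything under the prefix as if it were a folder:

```python
def object_key(prefix, block_id, filename):
    """Illustrative only: compose an S3 object key for a TSDB block file,
    nested under the optional prefix "folder" when one is configured."""
    parts = [part for part in (prefix, block_id, filename) if part]
    return "/".join(parts)

# "01HYPOTHETICALBLOCKID" is a made-up block identifier.
print(object_key("demo-app", "01HYPOTHETICALBLOCKID", "meta.json"))
# demo-app/01HYPOTHETICALBLOCKID/meta.json

# Without a prefix, blocks sit at the root of the bucket.
print(object_key(None, "01HYPOTHETICALBLOCKID", "chunks/000001"))
# 01HYPOTHETICALBLOCKID/chunks/000001
```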
The Receiver command
This command is defined as a parameter in the AWS ECS task definition. It’s equivalent to passing a command to a container started with docker run, or to the CMD instruction in a Dockerfile. It is required so that the Thanos container is configured to receive metrics from the Collector. The command is defined in the Terraform configuration files, which means the Thanos container will be started with it once all components have been deployed:
command = [
"receive",
"--remote-write.address=0.0.0.0:10908",
"--label=thanos=\"true\"",
"--objstore.config-file=/bucket.yaml"
]
- The receive subcommand is mandatory.
- --remote-write.address is the most important setting because it exposes the endpoint that will receive metrics.
- The --label flag indicates a label the metrics need to have so that Thanos accepts them. This flag will be deprecated in the future, but it still needs to be used right now.
- --objstore.config-file indicates the path to a YAML file with the object store configuration shown previously.
Deploy all the components in AWS ECS
The following is a step-by-step guide to deploy the components previously described. First, clone the repository and go to the Terraform folder:
git clone https://github.com/josembar/thanos-s3-demo.git
cd thanos-s3-demo
cd aws-resources/terraform/
Once you are in the folder, initialize the project:
terraform init
The terraform init command initializes a working directory containing Terraform configuration files. Check the resources that will be created by running the following command:
terraform plan
The previous command lets you preview the changes that Terraform plans to make to your infrastructure without making any actual changes. To deploy, run the command below:
terraform apply
This command executes the actions proposed by terraform plan. It will require confirmation; after confirming, the deployment will start, and Terraform will print an apply-complete message when it finishes.
Wait for a few minutes, then run this command:
terraform output demo_app_url
An output called demo_app_url was declared in the Terraform configuration. Outputs expose information from the resources created or updated after running terraform apply. This one returns the URL of the demo app, which will look similar to this:
"http://thanos-receiver-demo-alb-1543402685.us-east-1.elb.amazonaws.com/uuid"
The hostname of this URL is the load balancer’s public DNS alias, and the /uuid path is defined in the demo app’s source code. Access this URL in any web browser and refresh it a few times to generate metrics.
Thanos will then upload TSDB (time series database) blocks of metrics to S3 every two hours.
Finally, destroy the resources:
terraform destroy
Running the command above will destroy all the resources previously created. It will require confirmation, the same as terraform apply.
Conclusions
- Thanos is an open-source tool; there’s no cost for using it. The costs associated with a Thanos implementation come from compute and storage.
- There are other open-source alternatives for storing metrics, such as Timescale and InfluxDB. Both offer cloud solutions that require a subscription, charge per month based on usage, and in most cases include support. Thanos does not have this kind of offering, which matters because a fully managed service with support is something some organizations want or require.
- InfluxDB and Timescale can also be run without a subscription, but that brings extra tasks such as provisioning and managing file systems, SSD drives, PostgreSQL (for Timescale), and backups. With Thanos, an object storage service stores the metrics, so there is no need to provision hard drives or databases.
- Using AWS S3 as object storage makes it possible to optimize costs and manage backups with S3 features such as lifecycle configuration and object replication.
References
- GitHub — josembar/thanos-s3-demo
- Thanos Receive
- Thanos Object Storage & Data Format
- Quick start | OpenTelemetry
- OpenTelemetry Prometheus Remote Write Exporter
- Configuration | OpenTelemetry
- OpenTelemetry FastAPI Instrumentation
- Exporters | OpenTelemetry
- OpenTelemetry SDK Environment Variables
- OTLP Exporter Configuration | OpenTelemetry