How to Store Telemetry Metrics in AWS S3 Using Thanos
Learn how to store telemetry metrics in long-term storage such as AWS S3 using Thanos
In this article, we will deploy Thanos in AWS ECS to store telemetry data in an S3 bucket. First, we will see how to generate telemetry metrics from a Python application built with FastAPI and the OpenTelemetry SDK. Then we will configure the OpenTelemetry Collector to process the generated metrics. Finally, we will look at how Thanos is configured to receive the metrics sent from the OpenTelemetry Collector and store them.
What is Thanos?
Thanos is a set of open-source components that turn a Prometheus setup into a highly available system with long-term storage capabilities. These components provide a global query view and unlimited metrics retention by extending the system with an object storage service such as AWS S3. Thanos also supports the Prometheus Query API, which allows tools such as Grafana to query and visualize metrics, all in one place. Thanos’s main goals are operational simplicity and retaining Prometheus’s reliability properties.
The components mentioned above can be deployed independently of each other, which allows you to use or test different configurations depending on your needs.
Architecture
The following diagram shows an application load balancer used to route traffic to an application called “Demo app”. This application will send telemetry metrics to the OpenTelemetry collector to be processed and then sent to a Thanos receiver to store them in S3. The “Demo app”, OpenTelemetry collector, and Thanos receiver are containers:
- AWS ECS will host three containers: the demo app, the OpenTelemetry collector, and the Thanos receiver.
- An application load balancer will route traffic to the demo app container. The demo app was built using FastAPI and instrumented with OpenTelemetry.
- Once the demo app receives traffic, telemetry metrics will be generated and sent to the OpenTelemetry collector.
- The OpenTelemetry Collector receives the metrics, processes them, and exports them to the Thanos receiver.
- Thanos receiver implements the Prometheus Remote Write API to receive the metrics and then uploads TSDB (time series database) blocks to an AWS S3 bucket every 2 hours by default.
Prerequisites
To follow this article, you should have the following things ready:
- An AWS Account.
- Linux OS or Windows Subsystem for Linux (WSL) if you have Windows.
- AWS CLI configured.
- Docker.
- Git.
- Terraform.
Components
Now we’re going to see how each component is configured to serve its purpose in this demonstration. We’ll take a look at how the Demo App was coded to generate telemetry metrics and how both OpenTelemetry Collector and Thanos Receiver are configured to process the metrics and store them in S3.
Demo App
The Demo App is an application built with FastAPI, with the OpenTelemetry SDK included in its source code. It has two paths defined: /hello returns a “Hello!” message, and /uuid returns a random UUID, a string composed of letters, numbers, and dashes. Metrics will be generated by the OpenTelemetry SDK every time the /uuid path is accessed. This is the code for the demo app:
import os
import uuid

from fastapi import FastAPI
from opentelemetry import metrics
# from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

app = FastAPI()

# Service name is required for most backends
service_name = os.getenv("OTEL_SERVICE_NAME") or "test-service"
print(f"service name: {service_name}")

resource = Resource(attributes={
    SERVICE_NAME: service_name
})

# Export over OTLP/HTTP if an endpoint is configured, otherwise print to the console
metrics_endpoint_environment_variables = ["OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"]
if any(variable in os.environ for variable in metrics_endpoint_environment_variables):
    print(f"endpoint: {os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT') or os.getenv('OTEL_EXPORTER_OTLP_METRICS_ENDPOINT')}")
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())
else:
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter())

meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(meter_provider)

@app.get("/hello")
def read_hello():
    return {"message": "Hello!"}

@app.get("/uuid")
def read_uuid():
    myuuid = uuid.uuid4()
    return {"message": str(myuuid)}

# Instrument the app, but skip the /hello path so only /uuid generates metrics
FastAPIInstrumentor.instrument_app(app, excluded_urls="hello")
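As a side note on the /uuid response: uuid.uuid4() produces a random version-4 UUID, whose string form always follows an 8-4-4-4-12 hexadecimal pattern. A quick standard-library sketch, independent of the app above, shows the shape:

```python
import re
import uuid

# String form of a version-4 UUID: 36 characters in an
# 8-4-4-4-12 hexadecimal pattern, separated by dashes.
value = str(uuid.uuid4())

# Version 4 fixes the first digit of the third group to "4",
# and the first digit of the fourth group to 8, 9, a, or b.
UUID4_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

assert len(value) == 36
assert UUID4_PATTERN.match(value)
print(value)
```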
OpenTelemetry Collector
The OpenTelemetry Collector is a vendor-agnostic implementation that receives, processes, and exports telemetry data. The code below is the configuration for the Collector. Receivers collect telemetry data from one or multiple sources, and exporters send the collected data to one or more destinations. The service section enables components in the Collector based on the configuration found in the receivers and exporters sections: configuring a component does not enable it; it also has to be listed under service. At least one receiver and one exporter are needed.
The following example is a Collector configuration with one receiver, the otlp (OpenTelemetry) receiver, which listens on port 4317 for gRPC and port 4318 for HTTP. There are also two exporters: the debug exporter, which writes detailed information about every telemetry record to the Collector logs, and a Prometheus remote write exporter, which sends data to Prometheus or any Prometheus remote-write-compatible backend:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed
  prometheusremotewrite/thanos:
    endpoint: "http://localhost:10908/api/v1/receive"
    namespace: demo-app
    external_labels:
      thanos: "true"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug, prometheusremotewrite/thanos]
The options configured for the Prometheus remote write exporter are:
- endpoint: the URL to send metrics to. This is mandatory.
- external_labels: a map of label names and values to be attached to each metric data point. This is optional, but for this example it is required because Thanos needs a label to accept the metrics sent to it.
- namespace: a prefix attached to the names of all exported metrics. This is optional.
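To make these two options concrete, here is a small illustrative sketch in plain Python. This is not the exporter’s actual implementation (the metric name and label below are made up, and the real exporter also sanitizes names), but it shows what conceptually happens to each metric sample before it is sent: the namespace is prepended to the metric name, and the external labels are merged into its label set.

```python
def apply_exporter_options(metric_name, labels, namespace=None, external_labels=None):
    """Illustrative only: prepend the namespace to the metric name and
    merge the configured external labels into the sample's label set."""
    if namespace:
        metric_name = f"{namespace}_{metric_name}"
    merged = dict(labels)
    merged.update(external_labels or {})
    return metric_name, merged

# "http_server_duration" is a hypothetical metric emitted by the demo app.
name, labels = apply_exporter_options(
    "http_server_duration",
    {"http_method": "GET"},
    namespace="demo-app",
    external_labels={"thanos": "true"},
)
print(name)    # demo-app_http_server_duration
print(labels)  # {'http_method': 'GET', 'thanos': 'true'}
```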
The Thanos Receiver
Here we’re going to see how the Thanos Receiver can be configured to store the metrics received in an S3 bucket.
How do I configure the storage?
Thanos uses object storage as the primary storage for metrics and their metadata. The following example shows how to configure an AWS S3 bucket as object storage with a YAML file:
type: S3
config:
  bucket: "test-thanos-demo"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  aws_sdk_auth: true
  signature_version2: false
  sse_config:
    type: "SSE-S3"
prefix: "demo-app"
Here, you can see the AWS S3 configuration options for object storage:
- The type S3 needs to be specified.
- The name of the bucket and the regional endpoint are mandatory.
- Although in this example a region was specified, it’s not necessary.
- With aws_sdk_auth set to true, Thanos will use the default authentication methods of the AWS SDK based on known environment variables (AWS_PROFILE, AWS_WEB_IDENTITY_TOKEN_FILE, etc.) and known AWS config files (~/.aws/config). This avoids setting up the access_key and secret_key keys.
- AWS requires signature v4, which is why signature_version2 was set to false.
- S3 Server-Side Encryption (SSE) can be configured in the sse_config settings. SSE-S3, SSE-KMS, and SSE-C are supported. When set to SSE-S3, nothing else needs to be configured.
- The prefix field is optional. It can be used to store your metrics in a separate folder in your S3 bucket.
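To illustrate what prefix does, the short sketch below (plain Python, with a made-up block identifier; real Thanos block IDs are ULIDs) composes the kind of object keys that end up in the bucket, nesting everything under the prefix as if it were a folder:

```python
def object_key(prefix, block_id, filename):
    """Illustrative only: compose an S3 object key for a TSDB block file,
    nested under the optional prefix "folder" when one is configured."""
    parts = [part for part in (prefix, block_id, filename) if part]
    return "/".join(parts)

# "01HYPOTHETICALBLOCKID" is a made-up block identifier.
print(object_key("demo-app", "01HYPOTHETICALBLOCKID", "meta.json"))
# demo-app/01HYPOTHETICALBLOCKID/meta.json

# Without a prefix, blocks sit at the root of the bucket.
print(object_key(None, "01HYPOTHETICALBLOCKID", "chunks/000001"))
# 01HYPOTHETICALBLOCKID/chunks/000001
```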
The Receiver command
This command is defined as a parameter in the AWS ECS task definition. It’s equivalent to passing a command to a container started with docker run, or to the CMD instruction in a Dockerfile. It is required so that the Thanos container is configured to receive metrics from the Collector. The command is defined in the Terraform configuration files, which means the Thanos container will be started with it once all components have been deployed:
command = [
"receive",
"--remote-write.address=0.0.0.0:10908",
"--label=thanos=\"true\"",
"--objstore.config-file=/bucket.yaml"
]
- The receive subcommand is mandatory.
- --remote-write.address is the most important setting because it exposes the endpoint that will receive metrics.
- The --label flag indicates a label the metrics need to have so that Thanos accepts them. This flag will be deprecated in the future, but it still needs to be used right now.
- --objstore.config-file indicates the path to a YAML file with the object store configuration shown previously.
Deploy all the components in AWS ECS
The following is a step-by-step guide to deploy the components previously described. First, clone the repository and go to the Terraform folder:
git clone https://github.com/josembar/thanos-s3-demo.git
cd thanos-s3-demo
cd aws-resources/terraform/
Once you are in the folder, initialize the project:
terraform init
The terraform init command initializes a working directory containing Terraform configuration files. Check the resources that will be created by running the following command:
terraform plan
The previous command lets you preview the changes that Terraform plans to make to your infrastructure without making any actual changes. To deploy, run the command below:
terraform apply
This command executes the actions proposed by terraform plan. It will require confirmation; after confirming, the deployment will start, and Terraform will print an apply-complete message when it finishes.
Wait for a few minutes, then run this command:
terraform output demo_app_url
An output called demo_app_url was declared in the Terraform configuration. Outputs expose information from the resources created or updated after running terraform apply. This one returns the URL of the demo app, which will look similar to this:
"http://thanos-receiver-demo-alb-1543402685.us-east-1.elb.amazonaws.com/uuid"
The hostname of this URL is the load balancer’s public DNS alias, and the /uuid path is defined in the demo app’s source code. Access this URL in any web browser and refresh it a few times to generate metrics.
Thanos will then upload TSDB (time series database) blocks of metrics to S3 every two hours.
Finally, destroy the resources:
terraform destroy
Running the command above will destroy all the resources previously created. It will require confirmation, the same as terraform apply.
Conclusions
- Thanos is an open-source tool; there’s no cost for using it. The costs associated with a Thanos implementation come from compute and storage.
- There are other open-source alternatives for storing metrics, such as Timescale and InfluxDB. Both offer cloud solutions that require a subscription, charge per month based on usage, and in most cases include support. Thanos does not have this kind of offering, which matters because a fully managed service with support is something some organizations want or require.
- InfluxDB and Timescale can also be run without a subscription, but that brings extra tasks such as provisioning and managing file systems, SSD drives, PostgreSQL (for Timescale), and backups. With Thanos, an object storage service stores the metrics, so there is no need to provision hard drives or databases.
- Using AWS S3 as object storage makes it possible to optimize costs and manage backups with S3 features such as lifecycle configuration and object replication.
References
- GitHub — josembar/thanos-s3-demo
- Thanos Receive
- Thanos Object Storage & Data Format
- Quick start | OpenTelemetry
- OpenTelemetry Prometheus Remote Write Exporter
- Configuration | OpenTelemetry
- OpenTelemetry FastAPI Instrumentation
- Exporters | OpenTelemetry
- OpenTelemetry SDK Environment Variables
- OTLP Exporter Configuration | OpenTelemetry