Let’s Learn How to Send Internal OTel Collector Telemetry to an Observability Backend
Configuring Your OTel Collector’s Internal Telemetry Endpoint
Many organizations using OpenTelemetry (OTel) these days rely not only on the OpenTelemetry API and SDK, but also on the OTel Collector. The Collector is a flexible and powerful data pipeline that lets you ingest telemetry data from multiple sources (including applications and infrastructure), transform the data, and export it to your backend(s) of choice for analysis. To say that the Collector is a critical component of your Observability landscape is an understatement.
So it stands to reason that, just as you observe your applications and infrastructure, you'd want to do the same for your OTel Collectors. Fortunately, the good folks working on the OpenTelemetry project have you covered: the Collector also emits its own internal telemetry, allowing you to observe it.
Learning about internal Collector telemetry has been on my OTel to-do list for a while, and I finally got around to it. 🎉 Today, I will share some of my learnings, specifically around exporting internal telemetry directly to an Observability backend.
Ready? Let’s do this!
Exporting Internal Telemetry
When I started playing around with observing my own Collector, I learned that there are four ways to emit the Collector's internal telemetry:
1- Self-ingesting and exporting, scraping metrics via its own Prometheus Receiver
Here, your Collector scrapes its own metrics via its own Prometheus receiver. This means that you must configure your Collector’s Prometheus receiver to scrape itself, like this:
prometheus:
  config:
    scrape_configs:
      # Collector metrics
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
          - targets: [ '0.0.0.0:8888' ]
And then configure your Collector's internal metrics exporter by adding a telemetry.metrics section to the Collector's service configuration section:
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            exporter:
              otlp:
                endpoint: http://0.0.0.0:4318
                protocol: http/protobuf
There are two pitfalls with this approach. First, it only applies to metrics (because Prometheus is only for metrics). Second, by sending the Collector's metrics back through its own pipeline for export, you end up in a self-monitoring loop, which, among other things, can degrade performance, since the Collector now has additional telemetry of its own to process.
You can check out a full example Collector configuration here.
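If you just want the general shape of it, here's a minimal sketch of how the pieces might fit together. It assumes the Collector is still serving its internal metrics on its default Prometheus endpoint at :8888, and it uses the debug exporter purely as a stand-in for whatever backend exporter you'd actually use:

receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape the Collector's own internal metrics endpoint (default :8888)
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: [ '0.0.0.0:8888' ]

exporters:
  # Stand-in exporter; swap in your backend's exporter (e.g. otlphttp)
  debug:
    verbosity: basic

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [debug]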
2- Self-ingesting, exporting via the Go Prometheus exporter
This is a similar scenario to the one above, except that your Collector exports its internal telemetry behind the scenes via the Go Prometheus exporter.
🌈 Fun fact: The Go Prometheus exporter is not the same as the Collector’s Prometheus exporter. The Collector’s exporter is designed to aggregate metrics from multiple resources/targets together, whereas the Go SDK exporter is designed to only handle metrics from a single resource. That makes sense, because the only metrics source here is the Collector’s own metrics, so using the Collector’s exporter would be overkill.
The Collector’s internal metrics exporter configuration would look like this:
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
Again, you run into the same pitfalls as the first approach.
3- Exporting directly to an Observability backend
This approach allows you to export internal Collector telemetry directly to an Observability backend, eliminating the need for the Collector to ingest, process, and export its own telemetry.
There are a few advantages to this setup.
First off, it enables you to export your telemetry directly to whatever OTLP backend you want — i.e. any backend that supports OTLP ingest. This can be the same backend for traces, logs, and metrics, or it can be different backends for each.
NOTE: I am a proponent of having a single backend for ingesting all signals to really deliver on Observability’s promise. You can read more about my thoughts here.
The second advantage is that you eliminate the self-monitoring loop that occurs with the first two approaches.
Below is my own service.telemetry configuration.
service:
  telemetry:
    resource:
      k8s.namespace.name: "${env:K8S_POD_NAMESPACE}"
      k8s.pod.name: "${env:K8S_POD_NAME}"
      k8s.node.name: "${env:K8S_NODE_NAME}"
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 60000
            exporter:
              otlp:
                protocol: http/protobuf
                temporality_preference: delta
                endpoint: https://${ENDPOINT}
                headers:
                  - name: Authorization
                    value: "Api-Token ${TOKEN}"
    logs:
      level: info
      output_paths: ["stdout"]
      error_output_paths: ["stderr"]
      processors:
        - batch:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://${ENDPOINT}
                headers:
                  - name: Authorization
                    value: "Api-Token ${TOKEN}"
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://${ENDPOINT}
                headers:
                  - name: Authorization
                    value: "Api-Token ${TOKEN}"
You can check out the full example here, which is configured to send Collector telemetry to Dynatrace.
Let’s break things down.
First, let's take a look at the resource section:
resource:
  k8s.namespace.name: "${env:K8S_POD_NAMESPACE}"
  k8s.pod.name: "${env:K8S_POD_NAME}"
  k8s.node.name: "${env:K8S_NODE_NAME}"
In my case, I’m running my Collector in Kubernetes using the OpenTelemetry Operator, and this section enriches my internal Collector telemetry with Kubernetes attributes for namespace, pod name, and node name when they’re exported.
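In case you're wondering where those environment variables come from: in an Operator-managed setup they can be injected via the Kubernetes Downward API. Here's a rough sketch of the relevant bit of an OpenTelemetryCollector resource (the resource name is illustrative, and your Operator version's apiVersion may differ):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: my-collector          # illustrative name
spec:
  env:
    # Populate the env vars referenced in service.telemetry.resource
    - name: K8S_POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName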
Next, I wanted to note that the exporter configuration is more or less the same for traces, logs, and metrics. They all look like this:
exporter:
  otlp:
    protocol: http/protobuf
    endpoint: https://${ENDPOINT}
    headers:
      - name: Authorization
        value: "Api-Token ${TOKEN}"
I'm using http/protobuf as my protocol, because Dynatrace accepts OTLP data via HTTP.
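If your backend takes OTLP over gRPC instead, I'd expect the same exporter block to look something like this sketch, with the protocol swapped and a gRPC endpoint (typically port 4317, with no /v1/<signal> path); the endpoint variable is still a placeholder:

exporter:
  otlp:
    protocol: grpc
    endpoint: https://${ENDPOINT}   # placeholder gRPC endpoint, typically port 4317
    headers:
      - name: Authorization
        value: "Api-Token ${TOKEN}"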
Since I'm using Dynatrace as my internal telemetry endpoint, my endpoints look like this:
- metrics: https://${DT_TENANT}.apps.dynatrace.com/api/v2/otlp/v1/metrics
- traces: https://${DT_TENANT}.apps.dynatrace.com/api/v2/otlp/v1/traces
- logs: https://${DT_TENANT}.apps.dynatrace.com/api/v2/otlp/v1/logs
Where ${DT_TENANT} is my Dynatrace tenant name.
You’ll need to consult your own backend’s docs to determine your endpoint URIs, but it’s safe to say that it will look very similar to what you use for your Collector’s OTLP exporter configuration. You’ll also need to use the same authorization token that you use for your backend’s OTLP exporter configuration.
But there's one gotcha here. The OTLP exporter configuration's authorization header is configured as a key/value pair:
endpoint: "https://${ENDPOINT}"
headers:
  Authorization: "Api-Token ${TOKEN}"
The internal telemetry's authorization header, on the other hand, is an array of attribute pairs: a name attribute whose value is Authorization, and a value attribute whose value is "Api-Token ${TOKEN}":
endpoint: https://${ENDPOINT}
headers:
  - name: Authorization
    value: "Api-Token ${TOKEN}"
While the Authorization: "Api-Token ${TOKEN}" configuration still exports metrics, using this configuration causes the temporality_preference: delta setting to be ignored:
metrics:
  level: detailed
  readers:
    - periodic:
        exporter:
          otlp:
            protocol: http/protobuf
            temporality_preference: delta
            endpoint: https://${DT_ENVIRONMENT}/api/v2/otlp/v1/metrics
            headers:
              - name: Authorization
                value: "Api-Token ${TOKEN}"
Since the temporality_preference: delta setting was ignored, my Collector metrics were exported with additional suffixes. For example, I was expecting to see the Collector's otelcol_process_memory_rss metric so that I could look at its memory consumption. Instead, otelcol_process_memory_rss_bytes was being exported.
After asking around, I learned that this was happening because _bytes is a Prometheus units suffix that normally gets truncated when you go through the Prometheus receiver (i.e. approach #1). I was initially using approach #1, so I hadn't seen the _bytes suffix. When exporting directly to a backend, however, you need to set temporality_preference: delta to make that _bytes suffix go bye-bye. 👋
NOTE: Big thanks to my friend Alex Boten for figuring out this workaround and for telling me about it.
PS: Another reason why I used temporality_preference: delta is that Dynatrace only accepts metrics with delta temporality via OTLP HTTP, and leaving it out would've caused any cumulative Collector metrics to be dropped.
Collector API reference
I was surprised that I couldn't find an API reference on the OTel Collector's GitHub repository, like I've seen in the OTel Operator repository. After asking around, I learned that the Collector uses the Go implementation of the declarative SDK configuration format, and that the API docs can be found here. Not the most user-friendly thing to read. Pass. But maybe you'll find it useful, so I'm including it here, just in case. 🤷♀️
After asking around some more, I learned about kitchen-sink.yaml in the opentelemetry-configuration repository. I would've never found it had it not been for folks in the OTel community pointing me here. Now, it's not the most intuitive file to read, because things in there don't translate 100% to your telemetry configuration in the Collector. That being said, it's nicer than trying to read that Go API, and it does show different combinations of configuration that apply to your internal telemetry. Like these lines, for metrics configuration.
And if you’re looking for various examples of Collector configs, you can also check out Juraci Paixão Kröhling’s otelcol-cookbook repository. Juraci is one of the maintainers of the OTel Collector, and sits on the OTel Governance Committee. And bonus: he’s a fellow Brazilian. 🇧🇷
4- Exporting to another Collector
Another option, similar to option #3, is to export the Collector's internal telemetry to a separate Collector dedicated to handling internal telemetry from the other Collectors in your fleet.
Here’s a sample configuration:
service:
  telemetry:
    resource:
      k8s.namespace.name: "${env:K8S_POD_NAMESPACE}"
      k8s.pod.name: "${env:K8S_POD_NAME}"
      k8s.node.name: "${env:K8S_NODE_NAME}"
    metrics:
      level: detailed
      readers:
        - periodic:
            interval: 60000
            exporter:
              otlp:
                protocol: http/protobuf
                temporality_preference: delta
                endpoint: https://${OTEL_COLLECTOR}
    logs:
      level: info
      output_paths: ["stdout"]
      error_output_paths: ["stderr"]
      processors:
        - batch:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://${OTEL_COLLECTOR}
    traces:
      processors:
        - batch:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://${OTEL_COLLECTOR}
Looks similar to option #3, except that your endpoint is another OTel Collector.
The advantage to this approach is that if you have multiple Collectors (which you likely will), you can funnel their internal telemetry through a single Collector, which can do additional processing (if you want), batch the telemetry from all the different Collectors together, and then export it to your Observability backend for analysis. This prevents your Observability backend from being bombarded by telemetry data from multiple Collectors.
This approach is the best of both worlds. You avoid the self-monitoring loop while not bombarding your Observability backend with data from too many sources.
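For reference, the dedicated Collector on the receiving end doesn't need anything fancy. Here's a minimal sketch, with the backend endpoint and token as placeholders:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://${BACKEND_ENDPOINT}   # placeholder backend endpoint
    headers:
      Authorization: "Api-Token ${TOKEN}"   # placeholder token

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]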
Further considerations with internal Collector telemetry
I know that I’ve covered a lot, but I did want to call out a few things to keep in mind when you’re configuring internal Collector telemetry.
Limit what you export
Emitting Collector telemetry is great, but make sure that you emit only what you need. Do you need your internal telemetry Collector logs.level configuration to be set to debug? Probably not. Similarly, consider tweaking your metrics interval to limit how often you're exporting metrics.
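For example, a leaner telemetry block might drop the metrics level from detailed to normal, stretch the export interval, and only log warnings and above. A rough sketch (the endpoint is a placeholder):

service:
  telemetry:
    metrics:
      level: normal            # instead of detailed
      readers:
        - periodic:
            interval: 300000   # every 5 minutes instead of every minute
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://${ENDPOINT}   # placeholder
    logs:
      level: warn              # skip info-level noise if you don't need it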
Remember that the more data you export, the more load it places on your Collector’s CPU and memory, no matter which method you’re using to export your Collector’s internal telemetry. More CPU and memory translates to more energy and more infrastructure cost.
Being mindful of the volume of telemetry you export can also help keep costs down for data ingest into your backend, depending on their pricing model.
Avoid the self-monitoring loop
As I mentioned earlier, the self-monitoring loop means that you send the Collector's internal telemetry back to the Collector itself, which increases its load, because that telemetry now runs through the Collector's own pipeline. Exporting directly to a backend (or to a dedicated Collector) lessens the load, because the internal telemetry bypasses the emitting Collector's pipeline.
The self-monitoring loop can increase CPU and memory usage, driving up cloud costs. In addition, increased CPU and memory usage consumes more energy, which is bad for the environment.
Plus…the subject under monitoring shouldn’t also be the monitor. Here’s a snippet from maintainer Tyler Helmuth from a conversation thread in the OTel Collector Slack channel that sums it up:
“If the collector is responsible for receiving and processing its own telemetry via one of its traces/metrics/logs pipeline and it is experiencing an issue, like memory_limiter blocking data, then the collector’s internal telemetry won’t be exported to the desired destination and you wont know the collector is having an issue. Similarly if the collector is having an issue and producing additional telemetry for that issue, such as error logs, and you send those logs back into the collector it’ll create a cycle of more error logs -> more logs to ingest -> more error logs and so on.”
Couldn’t have said it better myself!
Final Thoughts
I have to admit that I put off learning about internal Collector telemetry for a while because it looked hard. Having an actual use case that compelled me to learn about internal Collector telemetry helped me get over that hump, and made me realize that it wasn’t quite so scary.
Getting my mind around the three different approaches of sending internal Collector telemetry had my head spinning for a bit, but leaning on the OpenTelemetry community for help didn’t disappoint. I’ve said it before, and I’ll say it again. The OpenTelemetry community is absolutely lovely and thoughtful. Every time I ask a question on one of the OTel channels (the #otel-collector channel, in this case) on CNCF Slack, I am met with kindness, patience, and helpfulness. I am grateful for that, because it encourages people like me to ask questions and learn, and enables me to share my learnings with you. I’m also working on a pull request to help clarify all this stuff in the OTel docs, so stay tuned!
Hopefully now that you have a better understanding of internal Collector telemetry configuration, it won’t be scary for you to try this out for yourself.
And now, please enjoy this photo of my rat, Katie.
Until next time, peace, love, and code. 🖖💜👩💻