An Observer’s Guide To The OpenTelemetry Collector

Sehej Kaur
Engineering the Skies: Qantas Tech Blog
11 min read · Oct 5, 2023

Following on from our previous blog, Unlocking Observability with OpenTelemetry, this post takes a deep dive into the OpenTelemetry Collector component and its various deployment strategies.

What is OpenTelemetry Collector?

The OpenTelemetry Collector is a proxy between the various sources emitting telemetry data (i.e. traces, metrics and logs) and Observability backend platforms.

The OpenTelemetry project provides the collector as a vendor-agnostic implementation to receive, process and export telemetry data to one or more destinations such as databases, cloud storage, and open-source or vendor-based Observability platforms. Observability platforms then enable us to analyse and visualise the telemetry data in a meaningful way, giving us a bird's-eye view of system interactions and helping us troubleshoot issues.

Telemetry data flow

When or Why should we use a Collector?

If you are experimenting with OpenTelemetry in your development environment, you will find that sending data directly to the Observability backends is a great way to start. However, introducing a collector in the middle brings benefits for use cases such as:

  • Abstracting configuration of one or more observability backends
    Applications don't need to know or configure vendor-specific endpoints or where the telemetry data should be exported. Without a collector, if three different backends each store one type of telemetry (traces, metrics and logs), every application must carry configuration and access keys for all three. The collector provides a single, highly configurable place to manage multiple backends that can be changed easily when required, so applications are not burdened with knowing the telemetry routing destinations.
  • Filtering data
    Let's say we want to include or exclude spans, metrics or logs with certain attribute values such as "http.status"; we can use filter-based processors and write regular expressions for the matching conditions (see the sketch after this list). Please note that filtering spans requires a good amount of testing, as it can lead to orphaned telemetry if parent spans are dropped.
  • Transforming data
    This allows us to transform telemetry data; for example, we could rename metrics or add, update and delete certain span attributes.
  • Redacting any PII (Personally identifiable information)
    To mask sensitive data attributes on spans, we can redact PII in the collector using the redaction processor before the data is sent out to the Observability platforms.
  • Sampling data
    We would most likely send 100% of the telemetry data from various sources initially to understand the overall system interactions. With sampling, it's possible to send only a subset of the telemetry data. This can be configured for certain times of the day, specific environments, data with errors, or latency-sensitive services. Please refer to Sampling for more information.
  • Aggregation of telemetry data
    We could aggregate cumulative metrics or data points at different timestamps, or aggregate traces with non-200 HTTP status codes into a metric. Another use case is aggregating all the spans belonging to a trace so that the entire trace can be dropped. These are all achievable using the various processors available in the OpenTelemetry Contrib project, which we will talk about further below.
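To make the filtering and transforming use cases above more concrete, here is a minimal sketch of a filter processor (dropping spans for a hypothetical health-check endpoint, using the OTTL-style conditions supported by the contrib filter processor) and an attributes processor (inserting and deleting span attributes). The attribute keys, values and endpoint are placeholders for illustration only.

processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        # Drop spans for a hypothetical /health endpoint
        - 'attributes["http.target"] == "/health"'
  attributes/enrich:
    actions:
      # Add a deployment environment attribute (placeholder value)
      - key: deployment.environment
        value: production
        action: insert
      # Remove a hypothetical internal attribute before export
      - key: internal.debug_id
        action: delete

Processors only take effect once they are added to a pipeline in the service section, which we cover later in this post.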

While there are advantages to implementing an OpenTelemetry Collector, there are also costs and complexity involved in setting up and maintaining additional infrastructure resources, along with deployment, scaling and other architectural challenges.

In Qantas Loyalty, we have several technology teams focused on various business objectives. They utilise a combination of shared and distinct proprietary tools for application logs and metrics. We saw the need to unify some of these tools to gain an overarching view of our systems and, through distributed tracing, visualise how distinct business systems interact. Incorporating the OpenTelemetry collector within our architecture provides us with a way to abstract the various systems from the specifics tied to individual vendors. Another major criterion is being able to control what telemetry data should and should not be sent to Observability platforms.

Major components of a Collector

Collector components

Receivers
A receiver defines how the collector receives telemetry data, and there are different types of receivers for different data formats. The OpenTelemetry project developed the OpenTelemetry Protocol (OTLP) to facilitate the exchange of telemetry data between clients and servers. If applications are instrumented using the OpenTelemetry SDK, they emit telemetry data in OTLP format by design, which the collector can receive via an OTLP receiver over either gRPC or HTTP.
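On the application side, an OpenTelemetry SDK can be pointed at the collector's OTLP receiver using the standard OTLP exporter environment variables. A minimal, illustrative Docker Compose fragment is shown below; the service name, image and collector hostname are placeholder values.

my-app:
  image: my-app:latest                                          # placeholder application image
  environment:
    OTEL_SERVICE_NAME: "my-app"                                 # name reported with the telemetry
    OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"   # collector's OTLP gRPC receiver
    OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"                         # or http/protobuf against port 4318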

Processors
Once the collector receives the data, it can be processed for various use cases such as batching, filtering out some traces or metrics, adding metadata, and so on. OpenTelemetry recommends a default set of processors, such as the memory_limiter and batch processors.

Exporters
An exporter defines how and where the telemetry data is exported for storage and analysis. We can configure one or more exporters for each type of telemetry data. Exporter components can convert the collector's internal telemetry data format to the external backend's ingestion format before sending; for example, vendors provide exporters that convert telemetry data from OTLP format to their vendor-specific format.
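As a hedged example, an OTLP/HTTP exporter pointing at a vendor backend often looks like the fragment below. The endpoint URL and header name are placeholders; each vendor documents its own ingestion endpoint and authentication header.

exporters:
  otlphttp/vendor:
    endpoint: https://otlp.example-vendor.com   # placeholder ingestion endpoint
    headers:
      api-key: ${env:VENDOR_API_KEY}            # placeholder authentication header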

Extensions
Extension components are mainly for the collector itself and do not access telemetry data. For example, the health_check extension gives us a way to monitor the collector's health, and the pprof extension exposes the collector's performance profile.

Collector components configuration

The collector components are highly configurable and are generally configured in a YAML file. We will explain with the example config.yaml below:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 2s
    limit_percentage: 50
    spike_limit_percentage: 20
  batch:
    timeout: 5s
  resourcedetection:
    detectors: [ system, env, ec2, ecs ]
    timeout: 5s
    override: true
  redaction:
    allow_all_keys: false
    allowed_keys:
      - description
      - name
    blocked_values:
      - "4[0-9]{12}(?:[0-9]{3})?"

exporters:
  otlp/metrics:
    endpoint: ${env:METRICS_ENDPOINT}
  otlp/traces:
    endpoint: ${env:TRACES_ENDPOINT}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    path: "/health"
  memory_ballast:
    size_in_percentage: 25

Enabling the components
Now that we have all the components defined, let's enable them by adding a service section to the same config.yaml. This consists of:
- pipelines, which enable the receivers, processors and exporters for each type of telemetry data (traces, metrics and logs). Please note that the order of processors determines the order in which data is processed.
- extensions, which enables the configured extensions for the collector.
- telemetry, which configures the collector's own telemetry, for example the log level for the collector's logs and its self-monitoring metrics.

service:
  telemetry:
    logs:
      level: ${env:LOG_LEVEL}
      encoding: json
  extensions: [ health_check, memory_ballast ]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, redaction, resourcedetection, batch ]
      exporters: [ otlp/traces ]
    metrics:
      receivers: [ otlp ]
      processors: [ memory_limiter, resourcedetection, batch ]
      exporters: [ otlp/metrics ]

OpenTelemetry collector distributions

The OpenTelemetry project provides two pre-built binary distributions of the collector, along with a way to create a custom collector:

  1. Core OpenTelemetry collector (otelcol)
    This has all the core components developed as part of the OpenTelemetry project, for example OTLP receivers and exporters, plus some processors like the batch processor. Please refer to the core repository for all the available components.
  2. OpenTelemetry collector contribution (otelcol-contrib)
    This contains all the components contributed by the open-source community, including the core components. There is a multitude of processors and vendor-based receivers and exporters available in the contrib repository.
  3. Custom collector
    OpenTelemetry also provides a way to build our own custom collector, in which we can include only the components we require from the core or contrib repositories and also create our own components. We can do this via the OpenTelemetry Collector Builder (ocb) utility; a sketch of a builder manifest follows this list.
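As a minimal sketch, an ocb builder manifest lists the distribution details and the components to compile into the custom binary. The module versions below are placeholders and should match the collector release you target; the binary can then be built with something like ocb --config=builder-config.yaml.

# builder-config.yaml (illustrative; versions are placeholders)
dist:
  name: otelcol-custom
  description: Custom OpenTelemetry Collector build
  output_path: ./otelcol-custom

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.86.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.86.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/redactionprocessor v0.86.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.86.0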

Running the pre-built distributions locally
We can use the Docker images available for otelcol or otelcol-contrib to run the collector locally with our applications.
Please refer to our previous blog for how to run the collector alongside instrumented applications. Here is an example of running the collector using Docker Compose and the config.yaml defined earlier.

otel-collector:
  image: otel/opentelemetry-collector-contrib
  container_name: otel-collector
  ports:
    - "4317:4317"   # OTLP gRPC receiver
    - "4318:4318"   # OTLP HTTP receiver
    - "8889:8889"   # Prometheus exporter metrics
    - "8888:8888"   # Prometheus metrics exposed by the collector
    - "14250"       # Jaeger gRPC receiver
    - "14268"       # Jaeger HTTP Thrift receiver
    - "55678"       # OpenCensus receiver
    - "9411"        # Zipkin receiver
    - "1777:1777"   # pprof extension
    - "55679:55679" # zpages extension
    - "13133:13133" # health_check extension
  volumes:
    - ./collector:/etc/otel
  environment:
    TRACES_ENDPOINT: "destination for traces"
    METRICS_ENDPOINT: "destination for metrics"
    LOG_LEVEL: "debug"
  command: ["--config=/etc/otel/config.yaml"]
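Assuming the fragment above sits under the services key of a docker-compose.yaml alongside the instrumented applications, the collector can be started and its health verified with commands along these lines:

docker compose up -d otel-collector
curl http://localhost:13133/health   # served by the health_check extension configured earlier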

Collector deployment patterns

We can deploy a collector in multiple ways. The sections below describe the pros and cons, which could help you decide what suits your infrastructure best. We will describe the deployment patterns in the context of AWS Elastic Container Service (ECS) with EC2 instances.

Agent

A collector that runs with the application or on the same host as the application.
In the ECS with EC2 context, we could run collector(s) as a sidecar with the application container, provisioned in the same task definition. We could also provision collector(s) on the same EC2 host as the running applications by using the "daemon" scheduling strategy, which deploys one collector per EC2 host in the ECS cluster.
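A minimal CloudFormation-style sketch of the sidecar arrangement is shown below, assuming awsvpc networking so the two containers in the task share localhost. The task family, image names and config path are placeholders.

TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: app-with-otel-sidecar          # placeholder task family
    NetworkMode: awsvpc                    # containers in the task share localhost
    ContainerDefinitions:
      - Name: app
        Image: my-app:latest               # placeholder application image
        Environment:
          - Name: OTEL_EXPORTER_OTLP_ENDPOINT
            Value: http://localhost:4317   # sidecar collector's OTLP gRPC receiver
      - Name: otel-collector
        Image: otel/opentelemetry-collector-contrib
        Command: [ "--config=/etc/otel/config.yaml" ]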

For Kubernetes workloads, collectors in agent mode could be deployed either as a sidecar or using a DaemonSet.

Collector deployed as a sidecar

Pros

  • Simple to set up and begin with. Running a collector container alongside an instrumented application container allows the application to send telemetry data directly to the collector on localhost or via the linked container name, so there is no need for service discovery as the containers share the same network.
  • There is no need to configure batching, retries or encryption at the application level, as the collector runs close by on the same network.
  • We can re-use the sidecar container definition across multiple services, and we can scale the sidecar containers separately from the main application.
  • Running the collector on the same host as the application gives us the most monitoring insight into both container and host metrics, for example container CPU and memory usage. This can be helpful when troubleshooting issues during excessive load.
  • Implementing tail-based sampling is easier with the sidecar approach, as the telemetry data from an application always routes to the same collector instance.

Cons

  • Running collectors as sidecars increases the number of infrastructure resources and hence, potentially, the cost. For example, if there are 200 services, this results in the deployment of 400 active containers. Another consideration is scaling the number of EC2 instances, or even opting for larger EC2 instance types within the ECS cluster, to accommodate the increased load.
  • As the sidecar runs in coordination with the main application, any change to either results in re-deployment of the whole task. If the collector has changes such as version or configuration updates and it is used with multiple applications, we need to consider re-deploying all of those applications.

Gateway

Collector(s) running as standalone service(s) provisioned per cluster, data centre or region. This will usually involve running the collectors behind a load balancer to distribute the load.

Below are a few ways of deploying the collector as an independent service:

  1. Within an AWS ECS with EC2 infrastructure, we can deploy collector containers as a service and scale them as needed. We could register the tasks with an AWS Application Load Balancer (ALB). Some considerations to note:
    - The type and number of collector instances. To determine the desired number and type of active collector container tasks, we need a reasonable idea of the amount of telemetry data in Production workloads so we can scale accordingly.
    - A variation of this strategy can be used when we have a large amount of telemetry data and can't decide on the number of collector containers in Production. Using the "daemon" scheduling strategy, we can deploy one collector container per EC2 instance in an ECS cluster. This strategy can also be used to collect host metrics of the EC2 instances for infrastructure monitoring.
  2. Limited egress-only collectors for compliance, security and operational requirements. In a multi-team and multi-environment ecosystem, telemetry data can be sent to a first level of forwarding collectors, which then export the telemetry data to a limited number of regulated egress-only collectors. These collectors serve as the sole egress points to vendor-based Observability platforms.
    We could also use a combination of agent collectors sending to gateway collectors and allow only the gateway collectors as the egress points.
  3. For tail-based sampling requirements, we can add a special load-balancing collector as a first layer receiving telemetry data. The telemetry data is then forwarded to a second layer of collectors, which export the data to the Observability backends. The load-balancing collector is configured with a load-balancing exporter, which ensures all the spans of a single trace are sent to the same second-layer collector instance (see the sketch after this list).
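As a hedged sketch of the load-balancing layer described in point 3, the first-layer collector can use the contrib loadbalancing exporter so that spans sharing a trace ID always reach the same second-layer collector. The DNS hostname below is a placeholder for however the second layer is discovered.

exporters:
  loadbalancing:
    routing_key: traceID                        # keep all spans of a trace on one backend collector
    protocol:
      otlp:
        tls:
          insecure: true                        # placeholder; use proper TLS outside local testing
    resolver:
      dns:
        hostname: otel-gateway.internal.local   # placeholder DNS name resolving to the second layer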
Collectors deployed as a service

Pros

  • Can reduce the number of resources and the deployment complexity if we have a large number of applications.
  • Maintenance becomes easier as we can update and deploy the collector independently without re-deploying applications.
  • Collector configurations, including those for multiple teams or environments, as well as credential management, can be centralised in one location.
  • We could manage and organise a specific set of egress-only collectors for security and operational requirements.

Cons

  • TLS configuration and authentication mechanisms are additionally required in the collector service so that applications can send telemetry data securely.
  • Service discovery is required for the applications to transmit telemetry data.
  • As the gateway collectors run independently and focus on aggregating and forwarding telemetry data to the Observability backend, they provide less visibility into application container and host-specific metrics such as CPU and other resource utilisation.

Cost implications of the two patterns

Choosing between the two deployment patterns or using a combination of both, agent and gateway, can depend on various factors, including the complexity of your architecture and the number of resources required.

  • For an infrastructure with a substantial number of active containers, opting for the sidecar collector approach could result in a higher number of resources, and hence higher cost and maintenance effort.
  • Running the collector as a service might provide cost savings in some cases by requiring fewer instances to manage. However, as mentioned earlier, the trade-off involves added complexities such as service discovery.

It's useful to perform a cost analysis based on anticipated data volume, deployment scale and your cloud provider's pricing model to determine which approach aligns better with your monitoring needs.

In our journey at Qantas Loyalty, we initially started with the sidecar strategy. We soon realised that maintenance is one of the most important factors for us, as it involves multi-team and multi-environment configurations. Hence, we decided on a single code base to build a collector deployable with multi-team configurations. We prefer deploying it as a service (gateway pattern) in most teams' infrastructure, while leaving it up to individual teams to use it for any agent-pattern use cases.

Information has been prepared for information purposes only and does not constitute advice.
