OpenTelemetry Collection: High availability deployment patterns while using the load-balancing exporter
This article is the first in my series covering different aspects of observability data collection with OpenTelemetry. The series aims to document, with examples, some of the rougher edges of running the OpenTelemetry collector at scale.
We’ll be kicking the series off by covering two patterns for high availability and their implications for scaling the OpenTelemetry collector when it’s configured to use the load-balancing exporter.
So what is the load-balancing exporter?
The load-balancing exporter is an exporter shipped with the OpenTelemetry collector that enables layer-7, Trace-ID-aware load-balancing.
The core components that make Trace-ID load-balancing possible are:
- A DNS resolver that discovers new backend collector instances.
- Consistent hashing that distributes traces to the appropriate backend based on their Trace-ID.
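To make the second component concrete, here is a minimal Python sketch of consistent-hash routing by Trace-ID. It illustrates the idea only; the collector's actual implementation is in Go, and the class, virtual-node count, and backend names below are all illustrative.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Sketch of Trace-ID-aware routing via a consistent hash ring."""

    def __init__(self, backends, vnodes=100):
        # Place several virtual nodes per backend on the ring so load
        # spreads evenly even with a small number of backends.
        self._ring = sorted(
            (self._hash(f"{backend}-{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def backend_for(self, trace_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash,
        # wrapping around the ring if necessary.
        idx = bisect.bisect(self._keys, self._hash(trace_id)) % len(self._keys)
        return self._ring[idx][1]
```

The key property for our pipeline: every span carrying the same Trace-ID hashes to the same backend, and when a backend leaves the ring only the traces that were mapped to it move, while everything else stays put.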
First, we need to define the base architecture of our observability ingestion pipeline. Our pipeline utilizes a collector deployed as a Layer 7 Trace-ID aware load-balancer. Trace-ID aware load-balancing ensures traces don’t fragment, meaning that the spans that make up a trace arrive at the same backend collector.
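A load-balancer-tier collector configuration for this architecture might look something like the sketch below. The hostname and port values are illustrative; substitute the DNS name that resolves to your backend collector instances.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Illustrative: a DNS name resolving to all backend collectors.
        hostname: backend-collectors.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```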
Here we are using a single collector to load-balance across the horizontally scaled set of backend collectors. This load-balancer is a single point of failure for our ingestion system, which we hope to avoid by applying some common scaling/HA patterns. The single load-balancer collector also has to be scaled vertically as ingestion traffic increases, which is not an optimal way to scale out this service: adding resources means downtime.
Pattern #1 - Active + Standby
We can improve our initial design by adding a warm replica that is automatically promoted to active if the current active load-balancer fails.
We can achieve this with a small change to the deployed architecture: a sidecar plus a readiness check gives us simple leader election on our collector containers.
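On Kubernetes, the pattern could be sketched roughly as below. Everything here is illustrative: the leader-elector image, the `/is-leader` endpoint, and the port numbers are hypothetical stand-ins for whatever election sidecar you use. The idea is that the sidecar returns 200 only on the elected leader, so the readiness probe keeps the standby pod out of the Service's endpoints until it is promoted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-loadbalancer
spec:
  replicas: 2  # one active, one warm standby
  selector:
    matchLabels:
      app: otel-loadbalancer
  template:
    metadata:
      labels:
        app: otel-loadbalancer
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317
          readinessProbe:
            httpGet:
              path: /is-leader  # hypothetical endpoint served by the sidecar
              port: 8080
        - name: leader-elector
          image: example/leader-elector:latest  # hypothetical sidecar image
          ports:
            - containerPort: 8080
```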
Pattern #2 - Scaling past 1 active replica
We can improve even further on our initial design by increasing the number of replicas we have deployed. This solution provides horizontal scaling and rolling upgrades but also brings with it some tricky bits that are worth considering.
- Backend churn can fragment traces. Each collector configured as a load-balancer maintains its own consistent hash ring, updated from the DNS resolver; if backend collectors are rapidly crashing and scaling, the load-balancers' rings can diverge and traces get routed inconsistently.
- When the load-balancing collectors themselves scale out, there can be a short window in which traces received by a new load-balancer are routed incorrectly. The new instance may start accepting traffic before its DNS resolver has converged on the same set of backend collectors as the rest of the load-balancers.
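One way to shrink both windows is to tune how often the DNS resolver refreshes the backend list, so all load-balancer replicas converge on the same backend set more quickly. The values below are illustrative, not recommendations:

```yaml
exporters:
  loadbalancing:
    resolver:
      dns:
        hostname: backend-collectors.observability.svc.cluster.local
        port: 4317
        interval: 5s  # how often to re-resolve the backend list
        timeout: 1s   # per-lookup timeout
```

There is an inherent trade-off: shorter intervals narrow the inconsistency window but add DNS load, and no interval eliminates the window entirely.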
Overall, there are several steps we can take to design a resilient collection architecture around the load-balancing exporter. The patterns above can help build an observability ingestion system with zero-downtime scaling and deployments.