Tuning Jaeger’s performance
Jaeger was built from day 1 to be able to ingest huge amounts of data in a resilient way. To better utilize resources that might cause delays, such as storage or network communications, Jaeger buffers and batches data. When more spans are generated than Jaeger is able to safely process, spans might get dropped. However, the defaults might not fit all scenarios: for instance, agents running as a sidecar might have more memory constraints than agents running as a daemon in bare metal.
This post will help you fine-tune your individual Jaeger components to better match the needs of your particular deployment scenario.
Together, the sampler type and parameter specify how often traces should be “sampled”, i.e., recorded and sent to the Jaeger backend. For applications generating a large number of spans, setting the sampling type to `probabilistic` and the value to `0.001` (the default) will cause traces to be reported with a 1/1000th chance. Note that the sampling decision is made at the root span and propagated down to all child spans.
For applications with low to medium traffic, setting the sampling type to `const` and the value to `1` will cause all spans to be reported. Similarly, tracing can be disabled by setting the value to `0`, while context propagation will continue to work.
NOTE: Some clients support the setting `JAEGER_DISABLED` to completely disable the Jaeger Tracer. This is recommended only if the tracer is behaving in a way that causes problems to the instrumented application, because a disabled tracer will not propagate the context to the downstream services.
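The sampling behavior described above can be sketched in a few lines of Python. This is a simplified model, not the code of any Jaeger client: the names `make_sampler` and `start_trace` are hypothetical, and only the `const` and `probabilistic` sampler types mentioned in this post are modeled.

```python
import random


def make_sampler(sampler_type, param, rng):
    """Return a function that decides whether a new root span is sampled."""
    if sampler_type == "const":
        # param 1 samples every trace, param 0 samples none
        return lambda: param == 1
    if sampler_type == "probabilistic":
        # param is the probability of sampling a trace, e.g. 0.001
        return lambda: rng.random() < param
    raise ValueError(f"unsupported sampler type: {sampler_type}")


def start_trace(sampler, child_count):
    """Decide once at the root span and propagate the decision to all children."""
    sampled = sampler()
    return [sampled] * (1 + child_count)


rng = random.Random(42)  # seeded only to make the sketch repeatable
probabilistic = make_sampler("probabilistic", 0.001, rng)
decisions = [probabilistic() for _ in range(100_000)]
print(f"sampled {sum(decisions)} of {len(decisions)} traces")
```

With a probability of `0.001`, roughly one trace in a thousand is kept, and every span of a kept trace shares the decision made at the root.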
Most of the Jaeger clients, such as the Java, Go, and C# clients, buffer spans in memory before sending them in batches to the Jaeger Agent/Collector. The maximum size of this buffer is defined by the environment variable `JAEGER_REPORTER_MAX_QUEUE_SIZE` (default value: about `100` spans): the larger the size, the higher the potential memory consumption. When the instrumented application is generating a large number of spans, the queue might fill up, causing the client to discard new spans.
In common scenarios, the queue will be close to empty, as spans are flushed to the agent or collector at regular intervals, or whenever a concrete sender, such as the `UdpSender`, deems that a flush is needed.
NOTE: The detailed behavior of this queue is described in this GitHub issue.
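To make the trade-off concrete, here is a minimal sketch of a bounded in-memory queue that discards new spans once it is full, as described above. The class `BoundedSpanQueue` is hypothetical and much simpler than a real client reporter; the default of 100 mirrors the approximate default mentioned in this post.

```python
from collections import deque


class BoundedSpanQueue:
    """Hypothetical client-side span buffer that rejects spans when full."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self.spans = deque()
        self.dropped = 0

    def append(self, span):
        # A full queue discards the *new* span instead of growing
        if len(self.spans) >= self.max_size:
            self.dropped += 1
            return False
        self.spans.append(span)
        return True

    def flush(self):
        # Hand over everything buffered so far and empty the queue
        batch = list(self.spans)
        self.spans.clear()
        return batch


queue = BoundedSpanQueue(max_size=100)
accepted = sum(queue.append(f"span-{i}") for i in range(150))
batch = queue.flush()
print(accepted, queue.dropped, len(batch))  # prints: 100 50 100
```

A larger maximum size absorbs bigger bursts between flushes, at the cost of holding more spans in memory.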
The Java, Go, NodeJS, Python and C# clients allow the customization of the flush interval (default value: `1000` milliseconds, or 1 second) used by the reporters, such as the `RemoteReporter`, to trigger a `flush` operation, sending all in-memory spans to the agent or collector. The lower the flush interval, the more frequently flush operations happen. As most reporters will wait until enough data is in the queue, this setting forces a `flush` operation at periodic intervals, so that spans are sent to the backend in a timely fashion.
When the instrumented application is generating a large number of spans and the agent/collector is close to the application, the networking overhead might be low, justifying more frequent flush operations, that is, a lower flush interval. When the `HttpSender` is being used and the collector is not close to the application, the networking overhead of each flush might be high enough that a higher value for this property makes sense.
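The effect of the flush interval can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model under the assumption of a steady span rate and purely time-based flushes; the function name `flush_profile` and the rate of 500 spans per second are hypothetical.

```python
def flush_profile(spans_per_second, flush_interval_ms, duration_s=60):
    """Estimate the number of flushes and the spans sent per flush."""
    flushes = duration_s * 1000 // flush_interval_ms
    spans_per_flush = spans_per_second * flush_interval_ms / 1000
    return flushes, spans_per_flush


for interval_ms in (200, 1000, 5000):
    flushes, per_flush = flush_profile(500, interval_ms)
    print(f"{interval_ms} ms: {flushes} flushes/min, ~{per_flush:.0f} spans per flush")
```

A shorter interval means more network round trips carrying fewer spans each, while a longer interval means fewer round trips but more spans waiting in memory between flushes.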
Server queue sizes
The set of “server queue size” properties (for example, `processor.zipkin-compact.server-queue-size`) indicates the maximum number of span batches that the agent can accept and store in memory. It’s safe to assume that `jaeger-compact` is the most important processor in your agent setup, as it’s the only one available in most clients, such as the Java and Go clients.
The default value for each queue is `1000` span batches. Given that each span batch has up to 64KiB worth of spans, each queue can hold up to about 62.5MiB (1000 × 64KiB) worth of spans.
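The upper bound on memory follows directly from the two numbers above. With binary units, 1000 batches of 64 KiB come to about 62.5 MiB, in the same ballpark as the figure quoted above. The helper `queue_memory_bound` is a hypothetical name used only for this calculation.

```python
KIB = 1024
MIB = 1024 * KIB


def queue_memory_bound(queue_size=1000, max_batch_bytes=64 * KIB):
    """Worst-case bytes held by one agent server queue."""
    return queue_size * max_batch_bytes


bound = queue_memory_bound()
print(f"{bound / MIB:.1f} MiB")  # prints: 62.5 MiB
```

Since each processor has its own queue, the worst case for the whole agent is this bound multiplied by the number of processors in use.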
In typical scenarios, the queue will be close to empty (metric `jaeger_agent_thrift_udp_server_queue_size`), as span batches should be quickly picked up and processed by a worker. However, sudden spikes in the number of span batches submitted by clients might occur, causing the batches to be queued. When the queue is full, the older batches are overridden, causing spans to be discarded; this is also exposed as a metric.
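The overwrite-when-full behavior described above differs from a queue that rejects new items: here the oldest batches are the ones lost. A minimal sketch of that policy, using a hypothetical `OverwritingBatchQueue` class rather than the agent's actual implementation:

```python
from collections import deque


class OverwritingBatchQueue:
    """Hypothetical fixed-size queue that overwrites its oldest batch when full."""

    def __init__(self, max_size=1000):
        self.batches = deque(maxlen=max_size)
        self.overwritten = 0

    def put(self, batch):
        # Count the batch that is about to be pushed out
        if len(self.batches) == self.batches.maxlen:
            self.overwritten += 1
        # A deque with maxlen silently discards the oldest item when full
        self.batches.append(batch)

    def take(self):
        # Workers pick up the oldest remaining batch first
        return self.batches.popleft() if self.batches else None


q = OverwritingBatchQueue(max_size=3)
for i in range(5):
    q.put(f"batch-{i}")
print(list(q.batches), q.overwritten)  # prints: ['batch-2', 'batch-3', 'batch-4'] 2
```

During a spike, the most recent batches survive and the count of overwritten batches grows, which is the signal to raise the queue size or the number of workers.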
The set of “processor workers” properties (for example, `processor.zipkin-compact.workers`) indicates the number of parallel span batch processors to start. Each worker type has a default size of `10`. In general, span batches are processed as soon as they are placed in the server queue, and each batch blocks a worker until the whole packet is sent to the collector. For agents processing data from multiple clients, the number of workers should be increased. Given that the cost of each worker is low, a good rule of thumb is 10 workers per client with moderate traffic: since each span batch might contain up to 64KiB worth of spans, 10 workers are able to send about 640KiB concurrently to a collector.
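The rule of thumb above translates into a quick calculation. The helper names `agent_workers` and `in_flight_bytes` are hypothetical, and the example of three moderate-traffic clients is an assumption chosen for illustration.

```python
KIB = 1024


def agent_workers(moderate_clients, workers_per_client=10):
    """Rule of thumb from this post: 10 workers per moderate-traffic client."""
    return moderate_clients * workers_per_client


def in_flight_bytes(workers, max_batch_bytes=64 * KIB):
    """Upper bound on data being sent to the collector at the same time."""
    return workers * max_batch_bytes


workers = agent_workers(3)
print(workers, in_flight_bytes(workers) // KIB, "KiB")  # prints: 30 1920 KiB
```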
Similar to the Agent, the Collector is able to receive spans and place them in an internal queue for processing. This allows the collector to return immediately to the client/agent instead of waiting for the span to make its way to the storage.
The collector’s queue size setting (default: `2000`) dictates how many spans the queue should support. In the typical scenario, the queue will be close to empty, as enough workers should exist to pick up spans from the queue and send them to the storage. When the number of items in the queue (metric `jaeger_collector_queue_length`) is permanently high, it’s an indication that either the number of workers should be increased or that the storage cannot keep up with the volume of data that it’s receiving. When the queue is full, the older items in the queue are overridden, causing spans to be discarded; this is also exposed as a metric.
IMPORTANT: The queue size for the agent is about span batches, whereas the queue size for the collector is about spans.
Given that the queue should be close to empty most of the time, this setting should be as high as the memory available to the collector allows, to provide maximum protection against sudden traffic spikes. However, if your storage layer is under-provisioned and cannot keep up, even a large queue will quickly fill up and start dropping data.
Items from the span queue in the collector are picked up by workers. Each worker picks one span from the queue and persists it to the storage. The number of workers is configurable (default: `50`) and should be as high as needed to keep the queue length close to zero. The general rule is: the faster the backing storage, the lower the number of workers can be. Given that workers are relatively cheap, this number can be increased at will. As a general rule, one worker per 50 items in the queue should be sufficient when the storage is fast. With a queue size of `2000`, having about `40` workers should be sufficient. For slower storage mechanisms, this ratio should be adjusted accordingly, with fewer queue items per worker.
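The sizing rule above can be written as a one-line calculation. The function `collector_workers` is a hypothetical helper, and the value of 20 items per worker for slower storage is an assumption picked only to show the direction of the adjustment.

```python
import math


def collector_workers(queue_size, items_per_worker=50):
    """One worker per `items_per_worker` queue slots, rounded up."""
    return math.ceil(queue_size / items_per_worker)


print(collector_workers(2000))                       # prints: 40
print(collector_workers(2000, items_per_worker=20))  # prints: 100
```

Lowering the number of items per worker is how slower storage is accounted for: the same queue size then calls for more workers.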
Although performance tuning the individual components is important, the way Jaeger is deployed can be decisive in obtaining optimal performance.
Scale the collector up and down
Use the auto-scaling capabilities of your platform: the collector is nearly horizontally scalable, so more instances can be added and removed on demand. A good way to scale up and down is by checking the `jaeger_collector_queue_length` metric: add instances when the length is higher than 50% of the maximum size for extended periods of time. Another metric that can be taken into consideration is `jaeger_collector_in_queue_latency_bucket`, which is a histogram indicating how long spans have been waiting in the queue before a worker picks them up. When the queue latency gets higher over time, it’s a good indication that the number of workers should be increased or that the storage performance should be improved.
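The scale-up condition described above can be expressed as a small check over recent readings of the queue length. This is a sketch of the decision logic only: the function `should_scale_up` is hypothetical, and treating five consecutive readings as an "extended period" is an assumption that depends on how often the metric is scraped.

```python
def should_scale_up(queue_lengths, max_queue_size, threshold=0.5, min_samples=5):
    """True when the last `min_samples` readings all exceed the threshold."""
    if len(queue_lengths) < min_samples:
        return False  # not enough history to call it an extended period
    limit = threshold * max_queue_size
    return all(length > limit for length in queue_lengths[-min_samples:])


spike = [100, 1500, 200, 150, 120, 90]           # short burst, then recovery
sustained = [900, 1100, 1200, 1300, 1250, 1400]  # stays above 50% of 2000
print(should_scale_up(spike, 2000), should_scale_up(sustained, 2000))  # prints: False True
```

Requiring several consecutive readings keeps a single short spike from triggering a new instance.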
Make sure the storage can keep up
Each span is written to the storage by the collector using one worker, blocking it until the span has been stored. When the storage is too slow, too many workers might be blocked waiting on it, causing spans to be dropped. To help diagnose this situation, the histogram `jaeger_collector_save_latency_bucket` can be analyzed. Ideally, this histogram would show a nearly constant value. When the histogram shows that spans are taking longer and longer to be saved over time, it’s a good indication that your storage might need some attention.
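One way to check whether save latency is "nearly constant" or creeping upward is to fit a straight line through latency values sampled at regular intervals. The sketch below assumes those values (for example, a percentile derived from the histogram by your monitoring system) are already available as a list; the function names and the tolerance of 1 millisecond per sample are hypothetical.

```python
def latency_slope(samples):
    """Least-squares slope of latency over equally spaced samples."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def storage_is_degrading(samples, tolerance=1.0):
    """True when latency grows by more than `tolerance` per sample on average."""
    return latency_slope(samples) > tolerance


steady = [12, 11, 13, 12, 12, 11]   # milliseconds, roughly constant
growing = [10, 20, 30, 40]          # milliseconds, increasing steadily
print(storage_is_degrading(steady), storage_is_degrading(growing))  # prints: False True
```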
Place the agents close to your applications
The agent is meant to be placed on the same host as the instrumented application. This is typically accomplished by having one agent per bare-metal host for traditional applications, or one agent as a sidecar per application in container environments like Kubernetes. The sidecar approach helps spread the load handled by agents, with the additional advantage of allowing each agent to be tweaked individually, according to the application’s needs and importance.
Consider using Apache Kafka as the temporary storage
Jaeger can use Apache Kafka as a buffer between the collector and the actual backing storage (Elasticsearch, Apache Cassandra). This is ideal for cases where the traffic spikes are relatively frequent (prime time traffic) but the storage can eventually catch up once the traffic normalizes. For that, the storage should be set to `kafka` in the collector and the Jaeger Ingester component can be used, reading data from Kafka and writing it to the storage.