Sampling in Observability
You might come across mentions of sampling in the context of various observability signals, such as traces, events, or profiles.
Sampling, or downsampling, is the practice of reducing the rate at which a signal is collected. For example, if you trace only one request out of every 100, you are downsampling your tracing.
This is done for the following primary reasons:
- To reduce the cost of data collection.
- To reduce the size of the collected data.
- To reduce the size of transferred data.
Downsampling well is challenging, and most instrumentation libraries implement only trivial strategies. Before explaining how sampling strategies are chosen, I will briefly explain when sampling is not an ideal solution.
Above is an example comparison of various observability signals, plotting how expensive each signal is to collect and how large the collected data is. The blue area roughly represents what is feasible and scalable for our organization; anything outside of that region has to be downsampled.
When not to downsample?
How NOT to measure latency by Gil Tene is an excellent introduction if you are not familiar with collecting data to analyze the 90th, 95th, and 99th percentiles.
In cases where the statistical distribution is important, such as latency or error metrics, we don't sample. Instead, we aggregate the data in the process and transfer only the aggregated statistical summary. Even though there is a lot of public debate about metric collection being a tool of the past, it is hard to replace for this category of tasks.
Strategies such as generating metrics from events and traces in the process are often possible only with a sampling rate of 100%. In some cases, 100% sampling is feasible. In others, sampling can be applied only when transferring the data out of the process: even though we are not reducing the cost of data collection, we can reduce the size of the transferred data. For example, we can collect all traces to generate latency metrics, but upload only 1% of them to the tracing backend.
What do sampling APIs look like?
Instrumentation libraries often allow you to set your sampling strategy or rate. If the instrumentation library doesn't do it for you, you might need to implement sampling yourself.
Go’s runtime.SetCPUProfileRate is an example of a downsampling API. It allows you to set the CPU profiling rate.
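A minimal usage sketch, assuming you manage the profiling lifecycle yourself (the `withCPUProfileRate` helper is ours; only `runtime.SetCPUProfileRate` is the real API). Note that `runtime/pprof.StartCPUProfile` sets its own rate, so use one mechanism or the other, not both.

```go
package main

import (
	"fmt"
	"runtime"
)

// withCPUProfileRate runs f while the runtime collects CPU profile
// samples at roughly hz per second (the default is 100 Hz).
func withCPUProfileRate(hz int, f func()) {
	runtime.SetCPUProfileRate(hz)       // must be set before profiling starts
	defer runtime.SetCPUProfileRate(0)  // 0 turns profiling back off
	f()
}

func main() {
	withCPUProfileRate(500, func() {
		// ... the workload to be profiled ...
		fmt.Println("profiling at ~500 Hz")
	})
}
```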
Zipkin’s various samplers, such as the counting sampler, allow you to downsample your traces by setting a rate.
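A counting sampler deterministically keeps one decision out of every n. The sketch below shows the idea; the type and method names are illustrative, not Zipkin's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// countingSampler keeps 1 trace out of every n, deterministically.
type countingSampler struct {
	mu    sync.Mutex
	n     uint64 // sample 1 out of every n decisions
	count uint64
}

// Sample returns true for the first of every n calls.
func (s *countingSampler) Sample() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.n <= 1 {
		return true // rate of 1-in-1 (or unset) samples everything
	}
	s.count++
	return (s.count-1)%s.n == 0
}

func main() {
	s := &countingSampler{n: 100}
	kept := 0
	for i := 0; i < 1000; i++ {
		if s.Sample() {
			kept++
		}
	}
	fmt.Println("kept", kept, "of 1000") // kept 10 of 1000
}
```

Unlike a probabilistic sampler, the counting approach guarantees the exact rate over any window of n decisions.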
The sampling decision
The sampling decision is very specific to each system. Over time, we learned that there is no single easy solution to this problem, and subcomponents of the same system sometimes require different strategies.
- A low-traffic background job can collect signals for every task.
- A latency-sensitive handler might need to aggressively downsample when traffic is high.
- A handler might only want to sample if certain conditions are met.
It is wise to think of sampling as more than just a global setting. Users often want to use different strategies, or the same strategy with different configurations, for different parts of their code within the same process. It is also useful to allow users to implement their own sampling strategies.
Another wise decision is to make the sampling strategy dynamically configurable. When there is a production issue, you might want to tweak the sampling to better understand what is going on.
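Both points can be sketched together: a pluggable `Sampler` interface (so users can bring their own strategies per component), plus a probabilistic sampler whose rate can be swapped at runtime. The interface and names are illustrative, not from a specific library.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// Sampler is a pluggable decision point; different parts of the
// process can hold different implementations or configurations.
type Sampler interface {
	Sample() bool
}

// dynamicSampler samples with a probability that can be changed at
// runtime, e.g. raised while debugging a production incident.
type dynamicSampler struct {
	rate atomic.Value // holds a float64 in [0, 1]
}

func newDynamicSampler(rate float64) *dynamicSampler {
	s := &dynamicSampler{}
	s.rate.Store(rate)
	return s
}

func (s *dynamicSampler) Sample() bool {
	return rand.Float64() < s.rate.Load().(float64)
}

// SetRate could be wired to a config watcher or an admin endpoint.
func (s *dynamicSampler) SetRate(rate float64) {
	s.rate.Store(rate)
}

func main() {
	var s Sampler = newDynamicSampler(0) // start fully downsampled
	fmt.Println(s.Sample())              // false
	s.(*dynamicSampler).SetRate(1)       // incident: sample everything
	fmt.Println(s.Sample())              // true
}
```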
Propagating the sampling decision
In cases where the collected data tracks a system end-to-end and collection spans more than one process, as with distributed traces or events, you want to propagate the sampling decision.
Above, you see a trace for /messages. This endpoint makes four other calls. In order to have end-to-end traces, the sampling decision made at /messages is carried over to the other services as part of each call, so each of these services can use the sampling decision from its parent.
Propagation of the sampling decision is often done by putting the decision in a header of the outgoing request, from the parent to its children.
When collecting data is not expensive but transferring and storing it is, there are more flexible approaches. Data can be collected 100% of the time, and a filtering mechanism can later be applied to drop the uninteresting samples. This minimizes the transferred data while representing a more diverse set of samples. Having samples from edge cases in particular might be important for troubleshooting and debugging problems.
Another alternative is to downsample only the bulky parts of the data. For example, a trace span can easily grow in size because of its annotations. Rather than downsampling the span, you might downsample just the annotations. The rest of the data can still be used to generate latency metrics, dependency graphs, and more.
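A sketch of that idea: keep the cheap fields of every span (still usable for latency metrics and dependency graphs), but keep the annotations on only 1 out of every n spans. The `span` fields and helper name are illustrative.

```go
package main

import "fmt"

// span is a hypothetical span whose annotations are the bulky part.
type span struct {
	name        string
	durationMs  int64
	annotations []string
}

// trimAnnotations keeps every span but drops the annotations from
// all except 1 out of every n of them.
func trimAnnotations(spans []span, n int) []span {
	out := make([]span, len(spans))
	for i, s := range spans {
		out[i] = s
		if i%n != 0 {
			out[i].annotations = nil
		}
	}
	return out
}

func main() {
	spans := make([]span, 100)
	for i := range spans {
		spans[i] = span{
			name:        "/messages",
			durationMs:  5,
			annotations: []string{"query", "auth"},
		}
	}
	trimmed := trimAnnotations(spans, 100)
	kept := 0
	for _, s := range trimmed {
		if s.annotations != nil {
			kept++
		}
	}
	// 100 spans, 1 with annotations
	fmt.Println(len(trimmed), "spans,", kept, "with annotations")
}
```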
Don’t trust the sampling decision if it is propagated from an external system. A malicious actor could mount a DoS attack by making you sample every request.