Sampling, verbosity, and the case for (much) broader applications of distributed tracing

Ben Sigelman
Published in LightstepHQ
5 min read · Nov 17, 2021

Originally posted as this twitter thread.

0/ Every software org that’s tried to scale distributed tracing has probably wrestled with sampling.

And yet the standard approach to sampling is needlessly narrow and limiting! What if we step back and frame things in terms of use cases, queries, and verbosity?

1/ So, first things first: the only reason anyone cares about sampling is that distributed tracing can generate a vast amount of telemetry, and we need to be thoughtful about how we transmit, store, and analyze it.

2/ When we prototyped Dapper at Google in 2004, we realized we needed something to reduce costs (and overhead, too, since we stupidly created kernel contention by writing the data to local disk).

We did the simplest thing: uniform random sampling.
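As a hedged sketch (the rate and helper name are illustrative, not Dapper’s actual code), head-based uniform sampling is just a single coin flip at the root of the trace:

```python
import random

SAMPLE_RATE = 1.0 / 1024  # illustrative; real rates depend on traffic and cost targets

def should_sample_trace() -> bool:
    # Head-based decision: made once when the root span is created, then
    # propagated downstream so every service records (or skips) the same trace.
    return random.random() < SAMPLE_RATE
```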

3/ This was fine for a prototype, but I regret that we stopped there. And I really regret that we wrote it up in the Dapper paper and made it seem like the only reasonable answer to the cost and overhead problem we were solving in the first place.

4/ Really, if we’re concerned with costs, we must think about trace verbosity as well as sampling. Further, rather than framing sampling as a single decision per trace, we should think about it in terms of the query visibility of the underlying trace data.

Like this:

5/ First let’s consider “trace verbosity.”

The common misunderstanding is that distributed traces have some single “correct” size. I don’t just mean the number of services or spans in a trace: the size-per-span doesn’t have a “correct” or fixed value either.

6/ There is (a lot of!!!) benefit in recording nothing more than the path a trace takes through a distributed system — the core trace context. There is additional benefit in recording the transaction attrs (e.g., “customer ids”) and resource attrs (e.g., “container ids”).

7/ And there is yet more benefit — albeit along with added cost — in recording a verbose “micro-log” of events that take place within the span’s lifetime and context. This is a superset of a “normal” log and can be enormously useful when debugging (more later).
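To make those verbosity tiers concrete, here is a sketch using the OpenTelemetry Python API; the span name, attribute keys, and event payloads are invented for illustration:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # instrumentation name is illustrative

def handle_request(customer_id: str, container_id: str) -> None:
    # Tier 1: core trace context. Just creating the span records the path the
    # request takes through the system (trace id, span id, parent linkage).
    with tracer.start_as_current_span("handle_request") as span:

        # Tier 2: transaction attributes and resource attributes.
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("container.id", container_id)

        # Tier 3: a verbose "micro-log" of events scoped to this span's lifetime.
        span.add_event("cache.miss", {"cache.key": f"cart:{customer_id}"})
        span.add_event("db.query", {"db.statement": "SELECT ...", "db.rows": 42})
```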

8/ On the other axis, we have “query visibility.” This is where sampling comes into play, but here we expose three sampling rates, not one.

9/ The Dapper paper touches on having multiple sampling rates, too: there were 1/10th as many “historical” traces as “recent” traces that were actually visible to queries (and thus useful for, well, “anything at all”).
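A toy version of that second, collection-time tier (the Dapper paper describes hashing the trace id into [0, 1] and comparing it to a collection sampling coefficient; the fraction and hashing details below are illustrative):

```python
import hashlib

HISTORICAL_KEEP_FRACTION = 0.1  # illustrative: keep roughly 1/10th for historical queries

def keep_for_history(trace_id: str) -> bool:
    # Deterministic in the trace id, so every collector makes the same
    # decision and a trace is retained or dropped as a unit.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    z = int.from_bytes(digest[:8], "big") / 2**64  # scale the hash into [0, 1)
    return z < HISTORICAL_KEEP_FRACTION
```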

10/ More modern approaches (e.g., @LightstepHQ where I work) go much further in that the sampling is not random at all, but rather optimized for customer queries.

This non-uniform sampling is a game-changer, as data density is proportional to utility (rather than randomness).
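Lightstep’s implementation isn’t something I can reproduce here, but the general shape of query-aware, tail-based sampling is easy to sketch; the predicates, thresholds, and trace object below are all assumptions for illustration:

```python
import random

WATCHED_CUSTOMERS = {"acme-corp"}  # illustrative: customers someone is actively querying

def keep_trace(trace) -> bool:
    # Tail-based decision: made after the whole trace is assembled, so it can
    # consider the transaction's outcome, latency, and attributes rather than
    # flipping a coin up front. `trace` is a hypothetical assembled-trace object.
    if any(span.is_error for span in trace.spans):
        return True                         # errors are nearly always worth keeping
    if trace.duration_ms > 2_000:
        return True                         # latency outliers in the tail
    if any(span.attributes.get("customer.id") in WATCHED_CUSTOMERS
           for span in trace.spans):
        return True                         # data density follows query demand
    return random.random() < 0.001          # small uniform baseline for everything else
```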

11/ And what happens when we look at the cross-product of trace verbosity and query visibility?

… And what sorts of example use cases map to that matrix?

In my head it looks something like this:

12/ Even if we’re only using the core trace context data, we can satisfy use cases that are transformative from a resource management standpoint.

For example, Google saves $100Ms per year by rate-limiting multi-tenant storage systems based on trace context state aggregated in real time.
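I don’t know the internals of Google’s system, but the core idea (a tenant id rides along in the trace context, gets aggregated in-process in real time, and the aggregate drives a rate limiter) can be sketched with OpenTelemetry baggage; the key name, limit, and limiter logic are assumptions:

```python
from collections import Counter
from opentelemetry import baggage

# In-process, real-time aggregation keyed by a tenant id carried in the trace
# context (here: OpenTelemetry baggage). Note that no spans need to be exported
# for this use case; the context alone is enough.
requests_by_tenant = Counter()

def record_storage_request() -> None:
    tenant = baggage.get_baggage("tenant.id") or "unknown"  # key name is illustrative
    requests_by_tenant[tenant] += 1

def over_limit(tenant: str, limit_per_interval: int = 10_000) -> bool:
    # A production system would use sliding windows and cross-process
    # aggregation; this just shows the shape of the idea.
    return requests_by_tenant[tenant] > limit_per_interval
```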

13/ Or we can replace most “request metrics” instrumentation by performing local aggregations — and tail-sampling exemplars — of transactions flowing through a single process. We can even inspect detailed+verbose logging of slow transactions that are still in flight!
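As a sketch of that local aggregation, here is a hypothetical OpenTelemetry span processor that keeps per-operation latency sums in-process and remembers slow traces as exemplars; the threshold and exemplar policy are assumptions, not any vendor’s implementation:

```python
from collections import defaultdict
from opentelemetry.sdk.trace import SpanProcessor

class LocalRequestMetrics(SpanProcessor):
    """Aggregate request latencies in-process and keep slow traces as
    tail-sampled exemplars, instead of separate metrics instrumentation."""

    def __init__(self, exemplar_threshold_ms: float = 500.0):
        self.latency_sum_ms = defaultdict(float)
        self.request_count = defaultdict(int)
        self.exemplar_trace_ids = []
        self.exemplar_threshold_ms = exemplar_threshold_ms

    def on_end(self, span) -> None:
        duration_ms = (span.end_time - span.start_time) / 1e6  # timestamps are in ns
        self.latency_sum_ms[span.name] += duration_ms
        self.request_count[span.name] += 1
        if duration_ms > self.exemplar_threshold_ms:
            # Tail-sampled exemplar: remember the trace so it can be promoted
            # to full query visibility (including its verbose events).
            self.exemplar_trace_ids.append(span.context.trace_id)
```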

14/ Or looking in the top-right of the matrix, we can (do what some Lightstep customers have done and) dramatically (≥95%) reduce spend on conventional logging infrastructure.

Tracing can do a better job here from an ROI standpoint:
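As a rough illustration of the mechanics: route application logs onto the active span as events, so verbose logging inherits the trace’s context and its sampling decision instead of flowing into a separate logging pipeline. The handler below is a sketch using the standard library’s logging module and the OpenTelemetry API (OpenTelemetry also ships its own logging integrations; this is not them):

```python
import logging
from opentelemetry import trace

class SpanEventHandler(logging.Handler):
    """Illustrative handler: attach log records to the current span as events."""

    def emit(self, record: logging.LogRecord) -> None:
        span = trace.get_current_span()
        if span.is_recording():
            span.add_event(
                "log",
                {
                    "log.severity": record.levelname,
                    "log.message": record.getMessage(),
                },
            )

logging.getLogger().addHandler(SpanEventHandler())
```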

15/ Stepping back: when people think about tracing today, they’re almost exclusively focused on using span tags and trace context for latency and error analysis, all while controlling costs with basic uniform-random sampling.

It’s valuable, yet narrow.

16/ However, when tracing telemetry is incorporated into an observability and data platform built for dynamic sampling, cost management, and variable verbosity, we can realize the true value of the underlying telemetry.

PS/ This is the sort of innovation that keeps me excited year after year at @LightstepHQ – we are hiring across the entire org; DM me or visit https://lightstep.com/about/careers if you’re interested!

For more threads like this one, please follow me here on Medium or as el_bhs on twitter!
