Why Tracing Might Replace (Almost) All Logging

Ben Sigelman
Published in LightstepHQ, Apr 23, 2021

Originally posted as a Twitter thread.

0/ This is a thread about why tracing will gradually replace most logging, at least where distributed or cloud-native architectures are concerned. And we’re going to explore this through the lens of a relational data model.

It’s going to be fun!

Thread: 👇

1/ The best logging is always structured logging. That is, logging statements are most useful if they encode key:value pairs which can then be queried and analyzed in the aggregate.

(Even for plain, textual logs, NLP and stats can extract basic structure.)

2/ A structured log implicitly defines a relational table, with the keys for each attribute defining the columns, and the values for each log line defining rows in this (theoretical) table.

Like this:
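
As a rough illustration of 2/, here is a minimal Python sketch. The log lines and field names (event, customer_id, latency_ms, error) are invented for the example and are not taken from any real system. The columns are the union of the keys, and each log line becomes one row:

```python
import json

# Hypothetical structured log lines, one JSON object per line.
raw_lines = [
    '{"event": "checkout", "customer_id": "c42", "latency_ms": 120}',
    '{"event": "checkout", "customer_id": "c7", "latency_ms": 340, "error": "timeout"}',
]

records = [json.loads(line) for line in raw_lines]

# The columns of the (theoretical) table: every key seen, in first-seen order.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# Each log line is one row; attributes a line did not set become None (NULL).
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

Note that the first line never logged an error, so its error column is simply empty. Nobody declared this schema up front; the logging statements defined it.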

3/ And, naturally, there are a number of implicit columns in our table as well. Things like host, timestamp, etc:
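
A sketch of where those implicit columns tend to come from, assuming a hypothetical helper (with_implicit_columns) and a made-up service name. Real logging libraries do something similar on your behalf, though the exact fields vary:

```python
import socket
import time

def with_implicit_columns(record, service="checkout-service"):
    """Return a copy of record plus columns the logging layer fills in itself.

    The helper and the default service name are assumptions for illustration.
    """
    implicit = {
        "timestamp": time.time(),       # when the line was emitted
        "host": socket.gethostname(),   # where it was emitted
        "service": service,             # which service emitted it
    }
    # Explicit fields win if a caller sets the same key.
    return {**implicit, **record}

row = with_implicit_columns({"event": "checkout", "customer_id": "c42"})
print(row)
```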

4/ Now, to be clear, we’re talking about the “abstract idea” of relational tables here, and not actually inserting every log line into MySQL or similar; that would be a disaster at scale. :)

Just think of each line of logging instrumentation as a “table schema.”

5/ Once we realize this, we can write queries with most SQL niceties (WHERE filters, GROUP BY aggregations, etc).
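
To make the WHERE and GROUP BY point concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are all invented. This is only a toy to show the shape of the queries, not a suggestion to store production logs this way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checkout_log "
    "(host TEXT, customer_id TEXT, latency_ms INTEGER, error TEXT)"
)
# Hypothetical log lines, already parsed into rows.
conn.executemany(
    "INSERT INTO checkout_log VALUES (?, ?, ?, ?)",
    [
        ("web-1", "c42", 120, None),
        ("web-1", "c7", 340, "timeout"),
        ("web-2", "c42", 95, None),
        ("web-2", "c9", 410, "timeout"),
    ],
)

# WHERE filter: which requests were slow?
slow = conn.execute(
    "SELECT customer_id, latency_ms FROM checkout_log "
    "WHERE latency_ms > 300 ORDER BY latency_ms"
).fetchall()

# GROUP BY aggregation: request count and mean latency per host.
per_host = conn.execute(
    "SELECT host, COUNT(*), AVG(latency_ms) FROM checkout_log "
    "GROUP BY host ORDER BY host"
).fetchall()

print(slow)
print(per_host)
```

Every column here was logged by one service. That limitation is what the JOIN question is about.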

But what about “JOIN”? How does that work in logging systems? The long answer won’t fit here.

The short answer? “Poorly.” Bummer. :-/

6/ Why is it a bummer? Well, because when we’re instrumenting a microservice, by definition we only have access to data from that microservice!

What about version numbers of peer services? Or request customer_ids? Or downstream feature flags? Surely those could be relevant…

7/ But relevant or not, that data lives in other services. Which means it’s not there to log. What’s an eng to do??

Faced with this conundrum, engineers stuck with logs will inevitably/sadly hack something together rather than address the underlying structural issue. (😭)

8/ E.g., have you ever seen a customer_id painstakingly propagated across function and process boundaries just so someone can add it to instrumentation?

That’s an error-prone and expensive way of implementing log JOINs via app code (rather than automatically via tracing).

9/ When we implement JOIN manually in this way, we are taking on literally the hardest part of distributed tracing instrumentation (namely, “context propagation”) and trying to manage it via one-off hacks. It doesn’t end well. (TL;DR “use OpenTelemetry instead”)
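
For a feel of what “context propagation” means, here is a deliberately tiny sketch built on Python's standard contextvars module. It is not OpenTelemetry's API. The inject, extract, and log helpers and the "ctx-" header prefix are assumptions made up for this example. The point is that the context is set once at the edge and travels implicitly, rather than being threaded through every function signature:

```python
import contextvars

# Hypothetical per-request context; a real tracing library manages this for you.
_request_ctx = contextvars.ContextVar("request_ctx", default={})

def inject(headers):
    """Copy the current context into outgoing request headers."""
    for key, value in _request_ctx.get().items():
        headers[f"ctx-{key}"] = value
    return headers

def extract(headers):
    """Restore the context from incoming request headers."""
    prefix = "ctx-"
    ctx = {k[len(prefix):]: v for k, v in headers.items() if k.startswith(prefix)}
    _request_ctx.set(ctx)

def log(event, **fields):
    """Build a structured record that automatically includes the context."""
    return {**_request_ctx.get(), "event": event, **fields}

# "Service A" sets the context once, then makes an outgoing call.
_request_ctx.set({"customer_id": "c42", "trace_id": "t1"})
outgoing_headers = inject({})

# Simulate the process boundary: "service B" starts with an empty context.
_request_ctx.set({})
extract(outgoing_headers)

# No function in service B took customer_id as a parameter.
record = log("charge_card", amount_cents=1999)
print(record)
```

Getting this right across every framework, thread pool, and RPC library is the hard part, which is why leaning on a shared standard beats rolling your own.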

10/ So again, “that’s wasteful.” And ineffective.

The right way to solve this problem is to leverage distributed tracing to perform a much (much) more powerful JOIN.

Let’s imagine that your system looks like this:

11/ When a truly modern observability solution “assembles a trace,” it’s really executing a JOIN across the entire distributed transaction, and thus populating a wider and more powerful table: one with columns from every Span that participates in the trace.

Like this:
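
Here is a simplified sketch of trace assembly as a JOIN on trace_id. The span shape, service names, and attributes are hypothetical, and the sketch assumes one span per service per trace, which real traces do not guarantee:

```python
from collections import defaultdict

# Hypothetical spans, as they might arrive from two different services.
spans = [
    {"trace_id": "t1", "service": "frontend",
     "attrs": {"customer_id": "c42", "latency_ms": 480}},
    {"trace_id": "t1", "service": "payments",
     "attrs": {"version": "v2.3.1", "error": True}},
    {"trace_id": "t2", "service": "frontend",
     "attrs": {"customer_id": "c7", "latency_ms": 90}},
    {"trace_id": "t2", "service": "payments",
     "attrs": {"version": "v2.3.0", "error": False}},
]

def assemble_traces(spans):
    """JOIN spans on trace_id into one wide row per trace.

    Column names are prefixed with the service so attributes from
    different services cannot collide.
    """
    wide = defaultdict(dict)
    for span in spans:
        row = wide[span["trace_id"]]
        for key, value in span["attrs"].items():
            row[f'{span["service"]}.{key}'] = value
    return dict(wide)

table = assemble_traces(spans)
for trace_id, row in table.items():
    print(trace_id, row)
```

The frontend never knew the payments version, and payments never knew the frontend latency, yet both now sit in the same row.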

12/ Now, when people think about tracing, they tend to think about this giant table “one trace (or row) at a time.”

Imagine restricting a logging system to display only one log-line at a time. This is just as bad… perhaps worse. And yet it passes for “tracing.” :-/
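
By contrast, here is a sketch of querying the wide table in aggregate. The rows, column names, and versions are invented. It groups traces by the version of a downstream service and compares error rate and upstream latency:

```python
from collections import defaultdict

# Hypothetical wide rows, one per assembled trace.
wide_rows = [
    {"frontend.latency_ms": 480, "payments.version": "v2.3.1", "payments.error": True},
    {"frontend.latency_ms": 90, "payments.version": "v2.3.0", "payments.error": False},
    {"frontend.latency_ms": 510, "payments.version": "v2.3.1", "payments.error": True},
    {"frontend.latency_ms": 110, "payments.version": "v2.3.0", "payments.error": False},
]

def group_by(rows, column):
    """Bucket rows by the value of one column, like SQL GROUP BY."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return groups

summary = {}
for version, rows in group_by(wide_rows, "payments.version").items():
    summary[version] = {
        "traces": len(rows),
        "error_rate": sum(r["payments.error"] for r in rows) / len(rows),
        "avg_frontend_latency_ms": sum(r["frontend.latency_ms"] for r in rows) / len(rows),
    }

for version, stats in sorted(summary.items()):
    print(version, stats)
```

In this toy data, frontend latency lines up with a payments version that the frontend could never have logged itself. No single trace view would make that pattern obvious.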

13/ It’s really only in the past few years that observability technology has developed to the point that these massive, distributed tables can be hydrated both dynamically and in real time.

14/ And all of that data engineering is worth it! Because when the relational tables are as wide as your distributed system is deep, amazing things are possible — and I don’t see how logging will ever be able to catch up.

PS/ For example applications of these sorts of dynamic, relational tables, see any of the following (or play with http://lightstep.com/sandbox):

https://twitter.com/el_bhs/status/1364282343196827650… https://twitter.com/el_bhs/status/1227358990968877056… https://lightstep.com/blog/announcing-lightsteps-change-intelligence/

For more threads like this one, please follow me here on Medium or as el_bhs on Twitter!

Ben Sigelman
Co-founder and CEO at LightStep, Co-creator of @OpenTelemetry and @OpenTracing, built Dapper (Google’s tracing system).