Published in Dzero Labs

Unpacking Observability: Understanding Logs, Events, Traces, and Spans

Hawaiian Sunset. Image by Adri V

I’ve spent the last few weeks trying to wrap my head around Observability, consuming every book, article, and podcast that I could get my hands on. My most recent explorations have gotten me digging into OpenTelemetry. OpenTelemetry (or OTel for short) is an open-source framework for instrumenting code, and many of the major Observability vendors such as Datadog, Lightstep, and Honeycomb support it. It’s vendor-agnostic, so if you choose to switch Observability vendors, you won’t be royally screwed. I’m in charge of the Observability team at my current company, and my goal is to have the organization follow best practices around Observability. Among other things, this means steering the organization towards adopting OpenTelemetry for instrumentation.

Before jumping into using OpenTelemetry, it’s important to understand core concepts, such as Spans and Traces. But what about Logs? Where do these fit in? How about Events? Most of the literature I’ve read about Observability talks about wide Events and deep Traces; however, the OpenTelemetry docs don’t seem to put the same emphasis on Events. Was I missing something?

So, of course, I decided to do some digging, asking questions in the Observability community and reading all sorts of online docs (see references below) to try to understand things better. The purpose of this blog post is to explain the differences between Logs, Events, Spans, and Traces so that you can start digging into OpenTelemetry.

Logs

Logs are human-readable flat text files that are used by developers to capture useful data. Log messages occur at a single point in time (though not necessarily at every point in time).

Unfortunately, log formats aren’t standardized across languages or frameworks, and they can be hard to parse and challenging to query. It’s also hard to group related logs together.

Events

Events are structured logs. They follow a standardized format (JSON), and are waaaay easier to query.

Behold a sample log:

Source: The Path from Logs to Traces, by Alex Vondrak

And its Event counterpart:

Source: The Path from Logs to Traces, by Alex Vondrak
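In case the screenshots don’t render for you, here’s a minimal Python sketch of the same idea. To be clear, the field names (`user_id`, `item`, `duration_ms`) and the helper functions `format_log` and `format_event` are made up for illustration; they aren’t taken from Alex’s slides or from any particular logging library.

```python
import json
from datetime import datetime, timezone


def format_log(user_id: str, item: str, duration_ms: int) -> str:
    """Classic log line: free-form text, easy to read, hard to query."""
    return f"User {user_id} added {item} to cart in {duration_ms}ms"


def format_event(user_id: str, item: str, duration_ms: int) -> str:
    """Event: the same information, but every value lives in a named field."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": "item added to cart",
        "user_id": user_id,          # hypothetical field name
        "item": item,                # hypothetical field name
        "duration_ms": duration_ms,  # hypothetical field name
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(format_log("u-42", "socks", 120))
    print(format_event("u-42", "socks", 120))
```

With the log line, finding every slow request means writing a regex against the text. With the Event, it’s a filter on `duration_ms`.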

Spans

A Span represents a unit of work. It can be thought of as the work being done during an operation’s execution.

Logs represent occurrences at a specific point in time. Events aren’t that much more useful, other than being easier to read and query. The problem is that in isolation, Events don’t really tell a story. What if instead, we captured info for a given block of time (i.e. a time span)?

Suppose we had the scenario below:

Source: The Path from Logs to Traces, by Alex Vondrak

In the olden days, we’d have a log that looked like this:

Source: The Path from Logs to Traces, by Alex Vondrak

We have a span that looks like this:

Source: The Path from Logs to Traces, by Alex Vondrak

Um…letdown? Yeah…if you only had those 3 fields, it would for sure be a letdown. In order for the Span to be more useful to us, we need some additional information. In OpenTelemetry, we can also include the following metadata in our Spans:

  • Operation name: The name of the microservice being executed, or a function call
  • Start timestamp
  • End timestamp (or duration)
  • Attributes: (Optional) List of key-value pairs used for aggregation or for filtering trace data (e.g. customer identifier, process hostname). Used to describe and contextualize the work being done under a Span.
  • Events: (Optional) Time-stamped strings which are made up of a timestamp, a name, and (optional) Attributes. Used to describe and contextualize the work being done under a Span.
  • Parent ID: Unique identifier of the Span’s parent
  • Links: (Optional) References to other causally-related Spans

Now with the above metadata, we’ve got the proper context which helps us paint a picture of what happens during that operation.
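To make the list above a bit more concrete, here’s a rough sketch of a Span as a plain Python dataclass. This is not the OpenTelemetry SDK, just an illustration of the shape of the data. The class names `Span` and `SpanEvent`, the `add_event` helper, and the `span_id` field are my own assumptions for the example (a Parent ID implies every Span has an ID of its own to point at).

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanEvent:
    """A time-stamped annotation attached to a Span."""
    timestamp: float
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    """Illustrative Span record; NOT the real OpenTelemetry class."""
    name: str                          # operation name
    start_time: float                  # start timestamp (seconds)
    end_time: Optional[float] = None   # end timestamp, unset while running
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # None means this is a Root Span
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    links: list = field(default_factory=list)  # IDs of related Spans

    def add_event(self, name: str, attributes: Optional[dict] = None) -> None:
        """Record something that happened while the Span was running."""
        self.events.append(SpanEvent(time.time(), name, attributes or {}))

    @property
    def duration(self) -> Optional[float]:
        """Elapsed seconds, or None if the Span has not ended yet."""
        if self.end_time is None:
            return None
        return self.end_time - self.start_time


if __name__ == "__main__":
    root = Span(name="add-to-cart", start_time=time.time())
    child = Span(name="db.insert", start_time=time.time(),
                 parent_id=root.span_id,
                 attributes={"customer_id": "u-42"})
    child.add_event("cache miss", {"key": "sku-123"})
    child.end_time = time.time()
    print(child)
```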

Trace

Traces are also known as distributed traces. They traverse network, process, and security boundaries, to give you a holistic view of your system.

A Span is the basic building block of a Trace. A Trace is made up of a tree of Spans, starting with a Root Span (i.e. Span with no parent), which encapsulates the end-to-end time that it takes to accomplish a task. The Root Span represents a single logical operation, such as clicking a button to add an item to a shopping cart.
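Here’s a small sketch of how a flat list of Spans turns into that tree, assuming each Span carries its own ID and its parent’s ID. The Span names, timings, and the `build_trace` and `render` helpers are all hypothetical; real tracing backends do something along these lines, but with far more care.

```python
from collections import defaultdict

# Hypothetical Spans for one "add item to cart" click (times in ms).
SPANS = [
    {"span_id": "c", "parent_id": "a", "name": "cart-service", "start_ms": 60, "end_ms": 240},
    {"span_id": "a", "parent_id": None, "name": "POST /cart", "start_ms": 0, "end_ms": 250},
    {"span_id": "d", "parent_id": "c", "name": "db.insert", "start_ms": 80, "end_ms": 200},
    {"span_id": "b", "parent_id": "a", "name": "auth-service", "start_ms": 10, "end_ms": 50},
]


def build_trace(spans):
    """Return (root_span, children_by_parent_id) for a list of Spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span  # the Root Span is the one with no parent
        else:
            children[span["parent_id"]].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, children ordered by start time."""
    duration = span["end_ms"] - span["start_ms"]
    lines = [f"{'  ' * depth}{span['name']} ({duration}ms)"]
    for child in sorted(children[span["span_id"]], key=lambda s: s["start_ms"]):
        lines.extend(render(child, children, depth + 1))
    return lines


if __name__ == "__main__":
    root, children = build_trace(SPANS)
    print("\n".join(render(root, children)))
```

The Root Span’s duration covers the whole request, and each level of indentation is one parent-child hop, which is roughly what the waterfall views in tracing tools are drawing.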

Below are a few examples of trace visualizations using Lightstep and Honeycomb.

Example 1: Trace visualization in Lightstep

Source: The Path from Logs to Traces, by Alex Vondrak

Example 2: Trace visualization in Honeycomb

Source: The Path from Logs to Traces, by Alex Vondrak

Conclusion

In the world of Observability, Spans and Traces reign supreme. What we’ve learned:

  • Logs tell you about something at a particular point in time. They don’t have a standardized format, and are therefore hard to query.
  • Events are structured logs (JSON), and are easier to query.
  • Spans represent an operation. They paint a picture of what happened during the time in which that operation was executed, through contextual information such as associated Events and attributes.
  • A Root Span is a Span without a parent, and represents your high-level operation (e.g. clicking a button to add an item to a shopping cart).
  • Traces stitch all related spans (as a tree) together to tell you the whole story.

I shall now reward you with a picture of a calf.

Photo by Sean Nyatsine on Unsplash

Peace, love, and code.

Acknowledgements

I wanted to give a big shoutout to the Observability community on the Honeycomb Pollinators Slack. Folks there have been super responsive and patient with my many questions. I really appreciate it. Also, a shout-out to Alex Vondrak, who put together a great set of slides which clarified a LOT of this stuff for me.

I would also suggest that you reach out to other Observability user communities. I figure that it’s always good to get different points of view from the community! Datadog, for example, also has a Slack user community, and Lightstep has a Discord user community.

References & Resources
