TEMPLE: Six Pillars of Observability

Yuri Shkuro
8 min read · Sep 19, 2022


Valley of the Temples, Agrigento, AG, Italy. Photo by Dario Crisafulli on Unsplash.

In the past few years, much has been said and written about the “three pillars of observability”: metrics, logs, and traces. A Google search for the phrase brings up over 7,000 results, with almost every observability vendor having a blog post or an e-book on the topic. Recently, the term MELT started showing up, adding “events” to the mix as a distinct telemetry signal. In this post, I want to show that there are even more distinct types and introduce TEMPLE, which stands for traces, events, metrics, profiles, logs, and exceptions. I call them the six pillars of observability, partly to poke fun at the previous terms and acronyms, but also to make the case that these signals serve distinct use cases for observability of cloud-native systems. If I fail at the latter and you don’t buy my arguments, at least the TEMPLE acronym works much better with “pillars” 😉.

Attribution: One of my colleagues at Meta, Henry Bond, started using the acronym TEMPL in the internal documents. I added “Exceptions”, for completeness, and ended up with TEMPLE.

Six Pillars Explained

I will try to illustrate why I think these six telemetry types deserve to be treated as separate. That does not mean some of them cannot be supported by the same backend, but they differ in the following aspects:

  • How each telemetry type is produced
  • Which unique storage requirements they impose
  • How they are used in the user workflows

Even though the TEMPLE acronym implies a certain ordering of the signals, I do not ascribe any meaning to that other than to make up a pleasant word. For better continuity of the explanation, I will go through them in a different order.

Also, naming is hard. I will point out how, surprisingly, most of the terms we use in this space are ambiguous, and the boundaries between telemetry types are not as strict as they appear to be.

Metrics, the original pillar. Numerical measurements with attributes, which are easily aggregatable both spatially (along the attribute dimensions) and temporally (combining values into coarser time intervals). Metric aggregates remain highly accurate, which makes them great for monitoring, but aggregation loses the original level of detail, which makes metrics less suited for troubleshooting & investigations.
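
To make “measurements with attributes” concrete, here is a minimal sketch using the OpenTelemetry Python metrics API; the metric name and attribute keys are illustrative, and a backend would aggregate the counter spatially (summing across routes) or temporally (into per-minute rates).

```python
# Minimal sketch using the OpenTelemetry Python API (opentelemetry-api package).
# Metric name and attribute keys are illustrative, not prescribed.
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

# A counter with attributes ("dimensions") that the backend can aggregate
# spatially (e.g., sum across all routes) and temporally (e.g., per-minute rates).
request_counter = meter.create_counter(
    "http.server.requests",
    unit="1",
    description="Number of HTTP requests handled",
)

request_counter.add(1, {"route": "/checkout", "status_code": 200})
request_counter.add(1, {"route": "/checkout", "status_code": 500})
```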

In the context of cloud native applications, metrics usually refer to operational metrics of the software. Business metrics are actually a different category that is better captured via structured logs.

Logs, the ancient pillar (if you question which came first, “ancient” or “original”, take it up with Marvel). Logs are a confusing category, ranging from arbitrary printf-like statements (aka unstructured logs) to highly structured and even schematized events. When structured logs are schematized (cf. Schema-first Application Telemetry), they are often sent to different tables in the warehouse and used for analytics, including business analytics. Schema-free structured logs are what Honeycomb calls “arbitrarily-wide events”. That is a bit of a misnomer: each individual log record is not “arbitrarily wide”; in fact, it usually has a completely static shape at a given call site. But when we mix log records from different call sites, we can expect all kinds of shapes, making the resulting destination an “arbitrarily wide” table. Most modern logging backends can ingest structured logs and allow search and analytics on these arbitrary dimensions.
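
As a minimal sketch (not tied to any particular logging library), a schema-free structured log can be as simple as one JSON line per record; the field names here are illustrative.

```python
# Minimal sketch of schema-free structured logging as JSON lines.
# Field names are illustrative, not a prescribed schema.
import json
import sys
import time

def log_event(message, **fields):
    record = {"ts": time.time(), "message": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

# Each call site emits records with a fixed shape, but mixing call sites
# yields an "arbitrarily wide" set of columns on the backend.
log_event("order placed", order_id="o-123", amount_usd=42.5, currency="USD")
log_event("cache miss", cache="session", key_hash="ab12", latency_ms=3)
```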

Many logs are generated in response to a service processing specific input requests, i.e., they are request-scoped. We, as an industry, haven’t really figured out how best to deal with request-scoped logs. On one hand, they look like other logs and can be stored and analyzed in a similar fashion. On the other hand, distributed tracing is specifically designed as request-scoped logging, so these logs could be much more useful when viewed in the context of a distributed trace. I have met engineers whose teams chose to use tracing APIs exclusively to capture logs, so that those can always be visualized in the context of traces. A common approach to solve this is to capture the trace ID as a field in the logs and build cross-correlation between logging and tracing tools.
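
Here is a minimal sketch of that cross-correlation approach, assuming the OpenTelemetry Python API is available; the helper name and log fields are illustrative.

```python
# Minimal sketch: stamp each structured log record with the current trace ID
# so logging and tracing tools can be cross-correlated. Helper and field
# names are illustrative.
import json
import sys
import time

from opentelemetry import trace

def log_with_trace(message, **fields):
    ctx = trace.get_current_span().get_span_context()
    record = {
        "ts": time.time(),
        "message": message,
        # 32-hex-character W3C trace ID, or None when no span is active.
        "trace_id": format(ctx.trace_id, "032x") if ctx.trace_id else None,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")

log_with_trace("payment authorized", amount_usd=42.5)
```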

Traces, the “new cool kid on the block” pillar. The term tracing is quite overloaded (just look at Linux Tracing documentation). In the context of cloud native observability, tracing usually refers to distributed tracing, or a special form of structured logs that are request-scoped, or more generally workflow-centric. In contrast to plain structured logging, tracing captures not only the log records, but also causality between them, and once those records are stitched into a trace they represent the trajectory of a single request (or a workflow) through a distributed system end-to-end.
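
As a minimal sketch of how tracing captures both the records and the causality between them, here is what request-scoped spans might look like with the OpenTelemetry Python API; the span and attribute names are illustrative.

```python
# Minimal sketch of request-scoped spans with explicit parent/child causality,
# using the OpenTelemetry Python API. Span and attribute names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order):
    # Root span for this request; the nested spans below are causally linked
    # to it, and spans emitted by downstream services join the same trace via
    # context propagation.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order["id"])
        with tracer.start_as_current_span("reserve_inventory"):
            ...  # call the inventory service
        with tracer.start_as_current_span("charge_payment"):
            ...  # call the payment service
```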

Tracing opens up a realm of possibilities for reasoning about a system:

  • Unique monitoring capabilities, e.g., the end-to-end latency of messaging workflows, which is difficult to observe with any other telemetry type.
  • Debugging capabilities, in particular, root cause isolation. Traces may not always tell you why a system is misbehaving, but they often can narrow down which of the thousands of distributed components is at fault.
  • Resource usage attribution by different caller profiles or business lines.

Events, the misunderstood pillar. This is perhaps the worst-named category of telemetry signals because, strictly speaking, pretty much all telemetry is “events”. What people usually mean by this category is change events, i.e., events external to the observed system that cause some changes in that system. The most common examples are deployments of application code (and the corresponding code commits), configuration changes, experiments, DR-related traffic drains, perhaps auto-scaling events, etc.
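
As a minimal sketch, a change event is usually a small, reliably stored record; the emit_change_event helper and the field names below are hypothetical and only meant to illustrate the shape.

```python
# Minimal sketch of change-event records. The emit_change_event helper and
# field names are hypothetical; real systems attach these to deploy/config
# tooling rather than application code.
import json
import sys
from datetime import datetime, timezone

def emit_change_event(kind, **fields):
    event = {
        "time": datetime.now(timezone.utc).isoformat(),
        "kind": kind,  # e.g., "deployment", "config_change", "experiment"
        **fields,
    }
    sys.stdout.write(json.dumps(event) + "\n")

emit_change_event("deployment", service="checkout",
                  version="2022.09.1", commit="abc123", actor="ci-pipeline")
emit_change_event("config_change", service="checkout",
                  key="payment.timeout_ms", old=800, new=1200, actor="alice")
```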

There is no practical bound to what could be considered an event that affects the system. For instance, in the early days of Uber, Halloween was the night of the highest user traffic, such that SREs would spend the whole night in the “war room”, monitoring the system and firefighting (Big Bang Theory flashback: Howard: “guidance system for drunk people”, Raj: “they already had that, it’s called Uber”). As the business became more global, the impact of Halloween, a US-centric holiday, on the system traffic became less pronounced, but one can easily see how holidays, or other public events like sports games or concerts, can become factors that affect the behavior of a system and might be useful to show to the operators as part of the system’s observability.

One could reasonably ask: why can’t we treat events simply as structured logs? As far as the data shape in the storage, there is indeed not much difference. However, logs usually require less rigor from the backends capturing them: some level of data loss may be acceptable, and the pipelines are often set up to down-sample or throttle the logs. For example, if a bug is causing an application to log a certain error message, it’s likely that we’ll have many similar logs, so it’s not critical to guarantee that every one of them is stored. This is very different from the handling of change events, which should all be stored reliably, because if we miss the record of that one code deployment that caused the issue and needs to be rolled back, our outage investigation might take much longer. Similarly, when querying for logs, it’s usually sufficient to find some samples of a pattern, or to get aggregate statistics; there is not much emphasis on finding one very specific individual log record. But with change events, we’re looking for precisely such specific instances. Finally, change events are usually produced in much lower volumes than logs. These differences in requirements often lead to different designs and trade-offs in the logs and events backends.

Profiles, the geek pillar. Profiles are another category of telemetry that is tricky to define, although you would know one when you see it. Profiles are just being introduced as a new signal into the OpenTelemetry scope via OTEP-212, and even that document had a bit of a tough time defining what a profile is. Its latest definition is “a collection of stack traces with some metric associated with each stack trace, typically representing the number of times that stack trace was encountered”. Mouthful.
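
A minimal sketch of the data shape behind that definition: each entry is a call stack plus a sample count. The “folded stacks” text printed at the end is the format commonly consumed by flame-graph tooling; the frames and numbers here are made up.

```python
# Minimal sketch of a profile as "stack traces with an associated metric".
# Stack frames and sample counts are made up.
profile_samples = {
    ("main", "handle_request", "parse_json"): 412,
    ("main", "handle_request", "query_db"): 187,
    ("main", "gc_cycle"): 23,
}

# "Folded stacks" representation: one line per stack, frames joined by ';',
# followed by the sample count. Flame-graph tools consume this format.
for stack, count in profile_samples.items():
    print(";".join(stack), count)
```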

Many engineers encounter profiling tasks at some point, but in my experience most of them do not have to deal with profiles very often, unless they specialize in performance and efficiency optimizations. Profiling tools, as a result, tend to be somewhat esoteric, focusing on power users.

Profiles, unlike most other telemetry types, almost never require explicit instrumentation, instead relying on deeper integration with the runtimes to capture the call stacks. They often generate very large amounts of data, requiring specially designed backends.

Exceptions, the forgotten pillar. Finally, let’s not forget the exceptions. Remember the “Stacktrace or GTFO” comic? When I first came across it, I was working at an investment bank developing trading systems, and we had zero observability into the system running in production (we could get access to logs on the hosts, but only by going through a special permissions-escalation process, because … lawyers). So the comic resonated with me a lot at the time. But years later, my attitude changed to “couldn’t he just use Sentry or something?”

My first experience with Sentry was by accident. We had just integrated the Jaeger SDK into a Python framework that was widely used at Uber. The next morning I got a UBN ticket (“UnBreakNow”, i.e., high urgency) saying they were getting errors in production and the stack trace pointed to Jaeger SDK code. But instead of a stack trace the ticket had a link to Sentry, which was the open source system Uber deployed to capture and aggregate exceptions. I was blown away by the amount of information captured by Raven (Sentry’s SDK) besides the stack trace itself. The most useful was the ability to inspect the values of all local variables at every frame of the stack. That immediately revealed the root cause, which had to do with the handling of UTF-8 strings.

Exceptions are, strictly speaking, a specialized form of structured logs, although you may need much more structure than the typical structured logging API allows (like nested collections, etc.). The processing pipelines for exceptions are also pretty specialized: they often involve symbolication, fingerprinting, stack pruning, and deduplication. Finally, the UI for viewing this data is also highly customized to this data source. All these factors lead me to conclude that exceptions should really be treated as an independent telemetry type.
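
A minimal sketch of the kind of processing such a pipeline does, using only the Python standard library: extract the frames and compute a naive fingerprint for deduplication. Real systems like Sentry do far more (symbolication, frame pruning, grouping rules), so treat this purely as an illustration.

```python
# Minimal sketch of exception capture: extract stack frames and compute a
# naive fingerprint for grouping/deduplication. Not how Sentry actually
# fingerprints; purely illustrative.
import hashlib
import traceback

def capture_exception(exc):
    frames = traceback.extract_tb(exc.__traceback__)
    # Fingerprint on the exception type and code locations, ignoring the
    # message, so the same bug with different messages groups together.
    key = type(exc).__name__ + "|" + "|".join(
        f"{f.filename}:{f.name}:{f.lineno}" for f in frames
    )
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "fingerprint": hashlib.sha1(key.encode()).hexdigest(),
        "stack": [
            {"file": f.filename, "line": f.lineno, "func": f.name} for f in frames
        ],
    }

try:
    int("not a number")
except ValueError as e:
    print(capture_exception(e))
```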

Pillars are not Observability

Now that we have covered the six pillars, it’s worth remembering that pillars do not guarantee observability, which is defined, perhaps counterintuitively, as the ability to understand the system’s internal state from its outputs, not just to observe the outputs. These pillars are just different types of telemetry that can be produced, i.e., the raw data. To be effective in investigations, the observability platform needs to combine these signals into solutions for specific workflows. Even with free-form investigations, where you are “on your own” because all the guided investigation workflows failed, the platform can provide many features to assist you, such as understanding the metadata of the telemetry and allowing cross-telemetry correlations, or automating insights and pattern recognition. Pillars are what you build upon, not the end goal.

Takeaways

  1. Stop saying “three pillars”. There are more than three.
  2. Start saying TEMPLE, if you must name them.
  3. Don’t take it seriously. The boundaries are diffuse.
  4. Pillars ≠ observability, they are just data.


Yuri Shkuro

Software engineer. Angel investor & adviser. Creator of tracing platform Jaeger. Author of “Mastering Distributed Tracing”. https://shkuro.com/