Observability: Tracing vs Metering
Why I moved from tracing to activity-based resource metering
In 2008 I released the Probes Open API as part of JXInsight and, subsequently, OpenCore. In this article, I’ll explain three of the most important reasons for completely abandoning tracing after building and deploying the very first distributed tracing technology for the JVM in 2003.
The number one reason, as was typical at that time, was the performance unit cost of measurement. I can’t recall the actual numbers from that time, but we can compare the costs today of using OpenTelemetry and the Probes Open API. Most implementations of OpenTelemetry or OpenTracing have a unit cost of 2–8 microseconds for a start and stop of a trace (span) in the JVM. Here I’m excluding any parsing or transmission costs of a trace context. On egress (exit) traces, the unit cost can be much higher when the underlying tracing implementation also captures the thread call stack. Many vendors do so to fill in large gaps in their runtime observability.
The Satoris implementation of the Probes Open API has a unit cost of 0.1–0.2 microseconds when the execution of an instrumented method is measured. The cost drops to zero when the instrumentation is not measured. This is possible because the Probes Open API has been designed to defer, and in many cases completely eliminate, overhead costs.
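One general way to keep unmeasured instrumentation cheap is to hand back a shared no-op handle whenever a name is not being measured, so that path reads no clock and creates no object. The sketch below illustrates only that idea. It is not the Satoris implementation or the Probes Open API, and every name in it (SketchProbe, SketchProbes, enable, begin) is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical probe handle, invented for this sketch.
interface SketchProbe {
    void end();
}

final class SketchProbes {
    // Shared no-op handle: ending it does nothing.
    static final SketchProbe NOOP = () -> { };

    private static final Set<String> ENABLED = ConcurrentHashMap.newKeySet();
    private static final AtomicLong MEASURED = new AtomicLong();
    private static final AtomicLong TOTAL_NANOS = new AtomicLong();

    private SketchProbes() { }

    static void enable(String name) { ENABLED.add(name); }
    static void disable(String name) { ENABLED.remove(name); }
    static long measuredCount() { return MEASURED.get(); }
    static long totalNanos() { return TOTAL_NANOS.get(); }

    static SketchProbe begin(String name) {
        if (!ENABLED.contains(name)) {
            return NOOP; // not measured: no clock read, no new object
        }
        final long start = System.nanoTime();
        return () -> {
            // Measured path: record elapsed time when the activity ends.
            TOTAL_NANOS.addAndGet(System.nanoTime() - start);
            MEASURED.incrementAndGet();
        };
    }
}

final class ProbeSketchDemo {
    public static void main(String[] args) {
        SketchProbes.enable("orders.lookup"); // "orders.format" stays unmeasured
        for (String name : new String[] {"orders.lookup", "orders.format"}) {
            SketchProbe probe = SketchProbes.begin(name);
            try {
                Math.sqrt(name.length()); // stand-in for the instrumented method body
            } finally {
                probe.end();
            }
        }
        System.out.println("measured executions: " + SketchProbes.measuredCount());
    }
}
```

In this sketch an unmeasured name still pays for a set lookup. Getting to a true zero would require the instrumentation itself to be removed or optimized away, which the example does not attempt.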
The Probes Open API, designed around the concept of activity-based resource metering, targets the measurement of internal execution paths within a microservice or library. When performance is a concern, tracing is really only practical at the entry and exit points of microservices. The Probes Open API is comprehensive in what it can observe, whereas tracing is very conservative. The lower the overhead, the greater the amount and variety of path execution that can be measured without degrading performance.
Nearly all local and distributed tracing instrumentation libraries focused exclusively on measuring the wall clock time via start and stop calls in terms of milliseconds. Only recently was microsecond support added. It is not possible to add any additional first-class measures such as gc-time, and so on. Other possible measures must be collected by the client and added as annotations or attributes.
The Probes Open API, on the other hand, does not stipulate that the wall clock time even be metered. It allows the metering engine implementation to include one or more resources, each referred to in the API as a Meter and referenced via a Name. The inclusion of a resource is all done under the hood via configuration, the way it should be. Behind the Meter interface is a thread-specific counter. The Probes Open API also offers support for custom resources to be added.
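To make the idea of a resource as a named, thread-specific counter concrete, here is a minimal sketch. The types SketchMeter, ThreadCounterMeter, and SketchMeters, and the names "db.rows.read" and "clock.time", are all assumptions made for illustration. They are not the actual Meter and Name interfaces of the Probes Open API, whose signatures are not shown in this article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a metered resource: a counter that is read
// for the calling thread. Not the real Probes Open API interface.
interface SketchMeter {
    long value();
}

// A custom resource: the application advances the counter itself.
final class ThreadCounterMeter implements SketchMeter {
    private final ThreadLocal<long[]> counter =
            ThreadLocal.withInitial(() -> new long[1]);

    void add(long amount) { counter.get()[0] += amount; }

    @Override
    public long value() { return counter.get()[0]; }
}

// Hypothetical registry that resolves a meter by name.
final class SketchMeters {
    private static final Map<String, SketchMeter> BY_NAME = new ConcurrentHashMap<>();

    private SketchMeters() { }

    static void register(String name, SketchMeter meter) { BY_NAME.put(name, meter); }
    static SketchMeter lookup(String name) { return BY_NAME.get(name); }
}

final class MeterSketchDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadCounterMeter rows = new ThreadCounterMeter();
        SketchMeters.register("db.rows.read", rows);
        // Wall clock time is just one more meter that may be included or left out.
        SketchMeters.register("clock.time", System::nanoTime);

        SketchMeter meter = SketchMeters.lookup("db.rows.read");
        long before = meter.value();
        rows.add(42); // the activity consumes the resource
        System.out.println("rows read by this thread: " + (meter.value() - before));

        // A different thread has its own counter, which starts at zero.
        Thread other = new Thread(
                () -> System.out.println("other thread sees: " + meter.value()));
        other.start();
        other.join();
    }
}
```

Metering an activity then reduces to reading the counter before and after the activity on the same thread and taking the difference, whichever resource sits behind the name.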
The Probes Open API is locality agnostic in that it does not assume post-processing of measurement data is done on some central server following transmission over many stages in a distributed data pipeline. The measurement data, when metered by a Probe, can be introspected via the Reading interface from within the application runtime.
The metering engine can expose metering for one or more nested activities within a time window for a specific thread via the ChangeSet interfaces. This allows for an agent, application, or engine to employ various strategies and adaptive mechanisms to manage the measurement overhead at runtime automatically. This offers self-reflection of execution behavior at runtime alongside the standard Java class structure capabilities.
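As an illustration of what managing measurement overhead from within the runtime can look like, the sketch below stops metering an activity once its own locally held readings show that it consumes very little. It is a simplified assumption of one possible strategy, not a reproduction of the ChangeSet interfaces or of any Satoris extension. The class AdaptiveMetering and its parameters warmupCount and minAverage are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical adaptive metering sketch; names and thresholds are invented.
final class AdaptiveMetering {

    private static final class Stats {
        private long count;
        private long total;
        private volatile boolean disabled;

        boolean isDisabled() { return disabled; }

        synchronized void record(long delta, long warmupCount, long minAverage) {
            count++;
            total += delta;
            // Decide from data already held in the runtime, with no external server.
            if (count >= warmupCount && total / count < minAverage) {
                disabled = true;
            }
        }
    }

    private final Map<String, Stats> statsByName = new ConcurrentHashMap<>();
    private final LongSupplier meter; // resource counter, for example System::nanoTime
    private final long warmupCount;   // measurements to collect before deciding
    private final long minAverage;    // below this average, metering is switched off

    AdaptiveMetering(LongSupplier meter, long warmupCount, long minAverage) {
        this.meter = meter;
        this.warmupCount = warmupCount;
        this.minAverage = minAverage;
    }

    // Runs the activity, metering it unless its name has been switched off.
    void run(String name, Runnable activity) {
        Stats stats = statsByName.computeIfAbsent(name, n -> new Stats());
        if (stats.isDisabled()) {
            activity.run(); // unmetered path: no meter reads
            return;
        }
        long before = meter.getAsLong();
        try {
            activity.run();
        } finally {
            stats.record(meter.getAsLong() - before, warmupCount, minAverage);
        }
    }

    boolean isMetered(String name) {
        Stats stats = statsByName.get(name);
        return stats == null || !stats.isDisabled();
    }
}

final class AdaptiveMeteringDemo {
    public static void main(String[] args) {
        long[] units = new long[1]; // deterministic stand-in for a resource counter
        AdaptiveMetering metering = new AdaptiveMetering(() -> units[0], 3, 10);
        for (int i = 0; i < 5; i++) {
            metering.run("cheap.getter", () -> units[0] += 1);
            metering.run("costly.query", () -> units[0] += 100);
        }
        System.out.println("cheap.getter metered: " + metering.isMetered("cheap.getter"));
        System.out.println("costly.query metered: " + metering.isMetered("costly.query"));
    }
}
```

With a warm-up of three measurements and a minimum average of ten units, the cheap activity is switched off after its third execution while the costly one stays metered. The decision uses only data that never left the process.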
The design of the Probes Open API also enables activity metering to be mirrored to another process, where it is replayed in a nearly identical simulated environment for interception by extensions and plugins.
None of the tracing specifications or libraries have given much, if any, thought to the possibility of having the measurement data utilized by the application or extensions within the local runtime environment itself. The distributed nature of what is being observed, applications and services, is contradicted by the forced need to push data to a single server. Clearly, OpenTelemetry and OpenTracing are about hops (or spans) across processes. They should never be considered for local instrumentation, although that is not how things have played out in the tracing community, owing to three-pillared thinking promoted blindly and without much self-reflection.