Goodbye, Tracing. Hello, Metering!

A new foundation for application performance monitoring

Tracing Time

The first Open API I designed and developed for performance monitoring of Java applications was your basic run-of-the-mill distributed tracing version. To create a measurement trace, both start and stop were called to demarcate a block of code to be timed when executed. It used strings to identify a trace, with a string encoding convention for when a type and value needed to be specified. It was stack based, with a stack that bridged different Java processes. It always measured each pair of trace calls. It created an aggregated measured trace tree. It always incurred some heap allocation for a trace. And it initially only measured traces in terms of clock time duration in milliseconds. Tracing hasn’t progressed far, if at all, in the last 15 years. The same goes for monitoring.
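To make those characteristics concrete, here is a minimal sketch of a start/stop tracer of that general style. It is not the original API: the class name SimpleTracer, its methods and the "type:value" names used in main are all hypothetical, and the sketch is single-threaded and in-process only, so it does not attempt the cross-process stack described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical start/stop tracer; an illustration, not the original API. */
final class SimpleTracer {
    /** One node of the aggregated trace tree. */
    static final class Node {
        final String name;
        final Map<String, Node> children = new LinkedHashMap<>();
        long count;
        long totalMillis;

        Node(String name) { this.name = name; }
    }

    /** An open trace: the tree node it maps to and when it started. */
    private static final class Frame {
        final Node node;
        final long startedAt;

        Frame(Node node, long startedAt) {
            this.node = node;
            this.startedAt = startedAt;
        }
    }

    private final Node root = new Node("root");
    private final Deque<Frame> stack = new ArrayDeque<>();

    /** Opens a trace identified by a string such as "sql:select". */
    void start(String name) {
        Node parent = stack.isEmpty() ? root : stack.peek().node;
        Node node = parent.children.computeIfAbsent(name, Node::new);
        // Every trace is measured, and every trace allocates a frame.
        stack.push(new Frame(node, System.currentTimeMillis()));
    }

    /** Closes the innermost open trace, which must carry the same name. */
    void stop(String name) {
        Frame frame = stack.pop();
        if (!frame.node.name.equals(name)) {
            throw new IllegalStateException("unbalanced stop: " + name);
        }
        frame.node.count++;
        frame.node.totalMillis += System.currentTimeMillis() - frame.startedAt;
    }

    Node root() { return root; }

    public static void main(String[] args) {
        SimpleTracer tracer = new SimpleTracer();
        tracer.start("http:/checkout");
        tracer.start("sql:select");
        tracer.stop("sql:select");
        tracer.stop("http:/checkout");
        Node http = tracer.root().children.get("http:/checkout");
        System.out.println(http.name + " count=" + http.count);
    }
}
```

Even in this small form the constraints are visible: the tree and the stack are always maintained, the clock is always read, and the only measure is elapsed milliseconds.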

Rethinking Tracing

In 2007 I’d reached the limit of what I could achieve with tracing. I then spent the next year researching and developing a novel approach to application monitoring that would correct the mistake that is tracing when used as the measurement foundation for various monitoring requirements.

At the time the cloud was in its infancy, but I could already see that service metering was going to be a hot topic. I had previously pitched activity-based costing (ABC) of Java runtimes to BEA when they were pushing hard to make the JVM far more powerful and valuable. Why not activity-based metering? A meter could be a cost, a clock, a resource, anything!

Activity Metering

The Probes Open API was first published in 2008. In 2010 it was made available under the OpenCore project. After the metering engine was reengineered for improved performance and scalability, it was released as Satoris in 2014. Later that year I pushed the source code for the interfaces to GitHub.

Here are some of the more important reasons why the new metering interface proved to be one of my best technical design choices.

  • Overhead Control: The metering interface contract stipulates that a fired activity (a probe) may not, in fact, be metered. A metering engine can decide not to read any meters when begin and end calls are made on a particular probe. The decision to measure, collect meter readings, and create other data structures rests with the engine and its enabled extensions.
Instrumentation should only be concerned with the demarcation of an activity via the creation of a probe. It shouldn’t be concerned with whether an activity was measured or not. This allows a metering engine to adaptively control overhead based on the activity, the flow of activities and the situation itself.
  • Multiple Measures: The metering interface can collect one or more measures for a metered activity, a measure being a meter reading. The enablement of a particular meter within an engine is a concern external to the interface. It can be specified via configuration of the metering engine and/or the dynamic inspection of the environment. A meter can be a measure of time, monetary cost, or resource usage, local to a thread.
Instrumentation should never be concerned with what is being measured. Nearly all other interfaces assume one measure: clock time. Again, instrumentation should only be focused on the demarcation of an activity. The default configuration of Satoris enables only one meter, the clock time in microseconds, but it includes many additional built-in meters and supports custom meters via an extension point in the Probes Open API.
  • Minimal Data Collection: The data that’s collected when a probe is fired and metered is based solely on those additional metering extensions enabled within the engine. The interface to the metering engine does not require that a relatively expensive data structure, such as a call tree or call stack, be maintained and accessible to the instrumentation layer.
If need be, instrumentation should maintain its own stack. This ensures that a metering engine can run with the smallest amount of runtime overhead in cases where neither the application nor the engine needs such structures. With the out-of-the-box configuration of Satoris, only thread-specific aggregated meterings for probe names are created, updated and made accessible via the metering interface.
  • No Garbage Creation: The metering interface doesn’t require any heap allocation when a probe is fired and metered and its readings are collected, once an engine and runtime are fully warmed up.
There’s practically no extraneous heap allocation incurred by Satoris once all threads within the runtime have executed and metered an activity at least once. The only time Satoris will make additional heap allocations is in snapshotting. Every other tracing solution investigated continues to allocate with each trace.
  • Naming is Everything: The metering interface identifies an activity using a namespace, a composition of name parts, that’s treated as if it were a constant. Each name part can have one or more meta-labels associated with it. With activities identified via namespaces, divorced from class bytecode or metadata, it’s possible to record and play back, mirror and simulate, an entire episodic memory of a software execution with practically no impact on the execution of an engine or an extension.
Both Simz and Stenos can play back a recording of a software execution using namespaces, since the metering interface and its various defined extension points only identify and represent activities, e.g. a metered method, in this way.
  • Engines & Extensions: The metering interface is a collection of inner interfaces within just a single enclosing class used to bootstrap an implementation via a well-defined service provider interface (SPI). This makes it possible to switch in another engine designed specifically for a particular environment or to decorate and intercept calls to the interface.
Satoris offers more than 50 metering extensions that can be enabled via a simple configuration of the metering engine, without any change whatsoever to instrumentation. The design and development of the engine and the many extensions was greatly simplified by having an “interface only” Open API and SPI.
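The first two points in the list above, overhead control and multiple measures, can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the Probes Open API: the names SketchEngine, Probe, Meter, enable and disable are hypothetical, probe names are plain strings rather than namespaces, and the sketch is not thread-safe. It also allocates on every metered firing, so it makes no attempt at the garbage-free behaviour described for Satoris.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongSupplier;

/** Hypothetical engine: it alone decides whether and what to measure. */
final class SketchEngine {
    // Shared no-op handed out whenever the engine chooses not to meter.
    private static final Probe UNMETERED = () -> { };

    private final List<Meter> meters = new ArrayList<>();
    private final Set<String> disabled = new HashSet<>();

    /** Meter enablement is engine configuration, invisible to instrumentation. */
    void enable(Meter meter) { meters.add(meter); }

    /** Stops metering a probe name, e.g. after it proved too cheap to matter. */
    void disable(String probeName) { disabled.add(probeName); }

    /** Called by instrumentation to mark the start of an activity. */
    Probe begin(String probeName) {
        if (meters.isEmpty() || disabled.contains(probeName)) {
            return UNMETERED; // no meter is read for this firing
        }
        // Simplification: this array and the lambda allocate per metered firing.
        long[] starts = new long[meters.size()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = meters.get(i).source.getAsLong();
        }
        return () -> {
            for (int i = 0; i < starts.length; i++) {
                Meter meter = meters.get(i);
                meter.total += meter.source.getAsLong() - starts[i];
            }
        };
    }

    public static void main(String[] args) {
        SketchEngine engine = new SketchEngine();
        Meter clock = new Meter("clock.micros", () -> System.nanoTime() / 1_000);
        engine.enable(clock);

        Probe probe = engine.begin("shop.checkout"); // demarcation only
        probe.end();
        System.out.println(clock.name + " total = " + clock.total);
    }
}

/** All that instrumentation holds: a handle to close the activity. */
interface Probe {
    void end();
}

/** Hypothetical meter: a named source of readings plus a running total. */
final class Meter {
    final String name;
    final LongSupplier source;
    long total;

    Meter(String name, LongSupplier source) {
        this.name = name;
        this.source = source;
    }
}
```

The instrumented code in main never learns whether its probe was metered or which meters were read. Adding a second meter, or disabling a probe name, changes only the engine setup.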

Lost in a Forest

One of the biggest hurdles I found in moving customers away from tracing was developers believing that they couldn’t tune an application unless they understood all possible call paths within a codebase. Those new to software performance optimization see the call (trace) tree as synonymous with code profiling. Invariably, browsing such a tree is a waste of time and a good way of losing sight of the goal: minutes, possibly hours, are spent expanding and collapsing tree nodes, going pretty much nowhere. It’s an old-fashioned top-down approach that doesn’t scale in terms of code coverage and post-execution analysis.

[Figure: Satoris offers an optional tracking metering extension for the tree huggers amongst us.]

Even with adaptive metering, the default operating mode for Satoris, it’s hard not to be overwhelmed by the above call tree. Without this pruning of the probe population, a call tree for this Java application would simply be unmanageable.

The minimal data collection requirement of a metering engine is the counting of metered probes and the assignment of meter readings to both total and inherent (self) running totals — enough to identify the hotspots.
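As a rough illustration of that bookkeeping, the sketch below keeps a count, a total and an inherent (self) total per probe name for a single thread. It is an assumption-laden example rather than Satoris code: the class HotspotTotals and its methods are hypothetical, meter readings are passed in explicitly so the arithmetic is easy to follow, and a small stack of open probes is kept only so that a child's total can be subtracted from its parent's inherent total.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-thread aggregation of total and inherent meter readings. */
final class HotspotTotals {
    /** Running totals kept for one probe name. */
    static final class Stats {
        long count;
        long total;    // reading delta including nested probes
        long inherent; // reading delta excluding nested probes
    }

    /** An open probe: its start reading and the totals of its direct children. */
    private static final class Frame {
        final String name;
        final long start;
        long childTotal;

        Frame(String name, long start) {
            this.name = name;
            this.start = start;
        }
    }

    private final Map<String, Stats> stats = new HashMap<>();
    private final Deque<Frame> open = new ArrayDeque<>();

    void begin(String name, long reading) {
        open.push(new Frame(name, reading));
    }

    void end(long reading) {
        Frame frame = open.pop();
        long total = reading - frame.start;
        Stats s = stats.computeIfAbsent(frame.name, k -> new Stats());
        s.count++;
        s.total += total;
        s.inherent += total - frame.childTotal;
        Frame parent = open.peek();
        if (parent != null) {
            parent.childTotal += total; // charged to the parent's total only
        }
    }

    Stats stats(String name) { return stats.get(name); }

    public static void main(String[] args) {
        HotspotTotals totals = new HotspotTotals();
        totals.begin("service", 0);
        totals.begin("dao", 10);
        totals.end(40); // dao: total 30, inherent 30
        totals.end(50); // service: total 50, inherent 20
        Stats service = totals.stats("service");
        System.out.println("service total=" + service.total
                + " inherent=" + service.inherent); // prints "service total=50 inherent=20"
    }
}
```

Sorting probe names by inherent total is then enough to rank hotspots without ever materializing a call tree.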

[Figure: The Satoris hotspot metering extension adaptively disables and labels probes, leaving just the performance hotspots.]