Logging, metrics & distributed tracing — These are problems, not solutions!
No one who has ever needed to monitor and manage the performance and availability of a software service has responded with:
“I know what, I just need a distributed tracing solution to see how each actual service request is processed across multiple microservices and nodes”
“If I only had thousands, maybe millions, of metrics then I’d be able to better understand what it is this service does and from that effectively manage it”
“When we’ve routed all logs to one single logging service, only then will I have mastered how to best diagnose problems including application performance”
Logging, metric monitoring and distributed tracing aren’t solutions; they’re redefinitions, substitutions, of the original problem: that of observability.
It is all an act
For those responsible for monitoring, and to some degree managing, an application or service, the problem is one of seeing and then understanding software execution behavior and the resulting consumption of resources, whether in servicing client requests or in performing routine internal housekeeping chores such as data replication. Logs, metrics, and traces are merely poor proxies for, and observations of, the actual action performed by software. They are not what the software is or does.
Not everything that can be counted counts, and not everything that counts can be counted. (William Bruce Cameron, often attributed to Albert Einstein)
To achieve a real understanding of software for the purpose of managing and changing it, we need a deeper level of observation that is as close as possible to what it is software actually does. Software does not log. Developers write log calls to create an echo chamber for their own predefined questions and notifications, buried within the source code. Software does not count what it does. It just does what it does. Sadly, there’s no self-awareness achieved in counting. Developers write metric calls to expose the act of doing but not the actual doing itself. Software does not trace. Software might one day look to move around a distributed environment via the transference of code and context, but we’re not there yet and might well never get there with the growing adoption of serverless computing. Instead, developers write trace calls to delineate the entry and exit of execution flow moving through a system. A trace does a far better job than logging and metric monitoring, but it is still not what software does. Much like logging and metrics, call tracing is a predefined inquiry and data collection mechanism baked into software, not what software does when it executes.
Distributed tracing might initially sound like a sophisticated form of behavioral measurement, but in actual fact it is just the tracing of some flow at coarse-grained entry and exit points plus some propagation of tags. Distributed tracing is focused solely on flow across process boundaries at the system level: data and, to a lesser extent, execution coordination. We can observe message exchanges between agents (or actors), but we can never fully understand the execution context that underlies such productions, because this resides inside each of the software execution units. Observing such interactions and content transfer will not reveal the intent and context behind an action, which is paramount to deep understanding and effective problem diagnosis.
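To make that description concrete, here is a minimal sketch in Python of what "entry and exit points plus some propagation of tags" amounts to. Everything in it is hypothetical and for illustration only: the `span`, `inject` and `extract` helpers, the `x-trace-id` and `x-tags` header names, and the two toy services are assumptions, not the API of Zipkin, Dapper or any other real tracer.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # hypothetical in-memory collector of finished spans


@contextmanager
def span(name, context):
    """Record the entry and exit of one coarse-grained unit of work."""
    record = {
        "trace_id": context["trace_id"],
        "name": name,
        "tags": dict(context["tags"]),
        "start": time.monotonic(),
    }
    try:
        yield record
    finally:
        record["end"] = time.monotonic()
        SPANS.append(record)


def inject(context):
    """Serialize the trace context into headers for an outbound call."""
    tags = ",".join(f"{k}={v}" for k, v in context["tags"].items())
    return {"x-trace-id": context["trace_id"], "x-tags": tags}


def extract(headers):
    """Rebuild the trace context on the receiving side of the boundary."""
    pairs = [p for p in headers["x-tags"].split(",") if p]
    tags = dict(p.split("=", 1) for p in pairs)
    return {"trace_id": headers["x-trace-id"], "tags": tags}


def inventory_service(headers):
    context = extract(headers)
    with span("inventory.reserve", context):
        # The state and intent behind this computation never reach the span.
        remaining = 3 - 1
    return remaining


def order_service():
    context = {"trace_id": uuid.uuid4().hex, "tags": {"tenant": "acme"}}
    with span("order.place", context):
        # Only the identifier and tags cross the process boundary.
        return inventory_service(inject(context))


order_service()
```

After the call, `SPANS` holds two records sharing one trace identifier and the same tags. Nothing in them says why `inventory_service` computed what it did; that context stayed inside the function.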
All doing is knowing, all knowing is doing
Admittedly, I seem to be knocking the efforts of all those developing and selling application performance monitoring solutions based on the collection of logs and metrics, or those promoting open source distributed tracing projects such as Zipkin, which is based on Google’s Dapper. But I myself built the very first distributed tracing solution for the Java platform. I also developed the first and most professionally designed Metrics Open API for the Java platform, well before we had a plethora of open source projects offering similar collection capabilities. I built distributed tracing and metric monitoring solutions because at the time I thought they had something to offer in addition to the challenging area of code profiling.
But then one day, after a long period of frustration at not being able to reduce the expensive human aspect of application performance analysis to a level that was truly scalable, I took a step back and asked myself what exactly it was I was trying to see and understand that metrics and traces just could not capture, collect, converse or convey. Why did my mind need to reconstruct the situation out of various logs, traces and metrics in order to have that moment of brilliant insight that others seemed not to have? Why, at the end of the day, were those involved in troubleshooting exercises surprised at the resulting diagnosis? How could they not have seen this before, and why did it take so much time to uncover, in an effort that involved many different and disconnected tools, techniques and technicians? It was because we were never really seeing the application.
The only time I was able to see the application was inside my mind, and that is where the eventual solution to my problem lay waiting to be discovered. The tools and techniques employed by practically everyone in the industry only offered a tantalizing glimpse into the true nature of what it was that software did. They never truly captured the essence of execution (and action). These collection techniques were designed more for the human, in particular those instrumenting or building instrumentation tools, and less for the machine and the capture of software execution behavior.
After stepping away, it dawned on me that to fully capture what software does I must be able to reproduce the same behavior in another time and space, just as I had done within my mind, much like profiling a crime scene. If I could simulate the machine behavior in approximately the same manner, such that logging, counting, and tracing could be employed as post-execution inquiry techniques and not as anemic data collectors, then I would finally have managed to truly see and experience machine action at the level required for deep understanding and eventual control as well as self-adaptation. I needed to capture episodic memories and mirror (transmit) them to other machines, where they would be simulated (replayed) in near real time alongside the memories of all other concurrent machines in execution. In doing so I could record once and repeatedly play back such collective memories in various ways until I fully understood the nature of action within such software.
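The idea of recording once and asking questions later can be illustrated with a toy sketch in Python. It is only an assumption-laden illustration of the general record-and-replay principle, not a description of how any actual product works: the `record` and `replay` functions, the event tuples, and the `CallCounter` and `TraceLogger` observers are all hypothetical names invented for this example.

```python
from collections import Counter


def record(events, kind, name):
    """Append one raw execution event; no question is decided up front."""
    events.append((kind, name))


def handle_request(events, items):
    """A toy workload whose method entries and exits are recorded."""
    record(events, "enter", "handle_request")
    for _ in items:
        record(events, "enter", "price_item")
        record(events, "exit", "price_item")
    record(events, "exit", "handle_request")


def replay(events, observer):
    """Play the recorded episode back through any observer, any number of times."""
    for kind, name in events:
        observer(kind, name)


class CallCounter:
    """Counting applied after the fact, during playback."""

    def __init__(self):
        self.calls = Counter()

    def __call__(self, kind, name):
        if kind == "enter":
            self.calls[name] += 1


class TraceLogger:
    """Logging of the call tree applied after the fact, during playback."""

    def __init__(self):
        self.depth = 0
        self.lines = []

    def __call__(self, kind, name):
        if kind == "enter":
            self.lines.append("  " * self.depth + name)
            self.depth += 1
        else:
            self.depth -= 1


episode = []
handle_request(episode, ["a", "b"])  # record once

counter = CallCounter()
logger = TraceLogger()
replay(episode, counter)  # first question: how often was each method called?
replay(episode, logger)   # second question: what did the call tree look like?
```

The workload itself contains no counters and no log statements. Each new question becomes a new observer run against the same recorded episode, rather than a new instrumentation call baked into the source code.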
We don’t see things as they are. We see things as we are. — Anais Nin
The good regulator knows best
The most effective solution to the problem of monitoring systems is one that does not reduce or redefine the problem in terms of inquiry techniques masquerading as collectors of action. The solution is the software itself. The solution must be able to recreate the nature of software execution such that there is very little distinction between what is real and what is simulated within some playback. Developing such a solution comes with its own set of problems, but these are problems for the solution creator alone, not the user. In designing and developing Simz and Stenos I’ve overcome such challenges, which I intend to discuss in a series of follow-ups to this posting.