Thoughts on Observability
I hadn’t thought deeply about observability as a concept until recently.
Before this, I had practiced making the systems I work on more observable by adding logging in application code and setting up some dashboards that monitor network traffic and memory usage. I just never really thought about it as part of “making it observable.”
What’s interesting to me, as I scratch the surface of this topic, is realising that we software engineers are using the term incorrectly.
Observability as a term originated in the control theory of engineered processes and machines, where it is defined as:
“a measure of how well internal states of a system can be inferred from knowledge of its external outputs” (Wikipedia)
Let’s take the analogy of the black box server to see what observability would mean with this definition.
The black box
You can send HTTP requests to this black box and it will reply with an HTTP response. In your script, you can set up a timing function to record how long the black box takes to reply.
It takes 10 seconds to reply. Whoa. You have now observed that this system is slow.
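To make the timing function concrete, here is a minimal sketch in Python using only the standard library. The names `fetch_status` and `timed`, and the URL in the usage note, are my own placeholders rather than anything tied to a real service.

```python
import time
import urllib.request


def fetch_status(url, timeout=30.0):
    """Send one GET request and return the HTTP status code.

    Note: urlopen raises an exception for 4xx/5xx responses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # read the body so the download is part of what gets timed
        return response.status


def timed(call, *args):
    """Run call(*args) and return (result, seconds_elapsed)."""
    start = time.perf_counter()  # monotonic clock intended for measuring durations
    result = call(*args)
    return result, time.perf_counter() - start
```

Calling `timed(fetch_status, "http://localhost:8000/")` (a hypothetical address) would return the status code together with the number of seconds the black box took to reply.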
You could then tweak your script to send 10,000 requests in quick succession to see how much load this black box system can take, and record the responses you receive over time.
The responses start to come back much more slowly after 1,000 requests have been fired. You have now observed that this system either has rate limiting in place and is throttling its responses to your requests, or it has not been set up very well to handle the load from traffic spikes.
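The load experiment could be sketched roughly like this. Both function names, the baseline of 100 requests, and the "three times slower" threshold are assumptions I picked for illustration; `send` stands for whatever function fires one request at the black box.

```python
import statistics
import time


def fire_requests(send, count):
    """Call send() count times back to back; return each call's duration in seconds."""
    durations = []
    for _ in range(count):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations


def first_slowdown(durations, baseline_size=100, factor=3.0):
    """Return the index of the first request that took more than `factor`
    times the median of the first `baseline_size` requests, or None."""
    baseline = statistics.median(durations[:baseline_size])
    for index, seconds in enumerate(durations[baseline_size:], start=baseline_size):
        if seconds > factor * baseline:
            return index
    return None


# Synthetic durations standing in for a real run: fast at first, slow later.
example = [0.1] * 1000 + [1.0] * 5
print(first_slowdown(example))  # prints 1000, i.e. the 1,001st request
```

Even with a tidy number like this, all the script can tell us is when the slowdown started, not whether throttling or overload caused it.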
Notice that these are all things that are inferred from external outputs.
As outsiders poking this black box, we cannot tell why the system behaves in a certain way. We can’t, for example, ascertain which code path a request takes inside the system, or pinpoint the exact part that is slowing down its response to incoming requests.
Remembering we’re insiders
So here’s the thing: if we took that definition of observability from control theory and applied it to distributed software systems, then we’d be shortchanging ourselves.
Why?
Because as the software engineers of (parts of) a distributed system, we’re not actually dealing with a black box. We have read and write access to the source code!
The two scenarios I described with the black box above are examples of metrics.
In software engineering, metrics are only one of three parts of observability:
- metrics
- logs
- traces
Again, because we have access to the application source code, we can add sensible logs at the application level that get collected by a log visualisation tool like Splunk. This is like putting up waypoints along the code path that tell us the true internal state of the system as a request reaches the endpoint and moves through the system.
So logs are mostly about state at specific points in time.
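As a rough illustration of such waypoints, here is a sketch using Python's standard `logging` module. The `checkout` logger name, the `apply_discount` function, and the `WELCOME10` code are all made up for the example; how the lines get shipped to a tool like Splunk is a separate concern that I'm leaving out.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout")  # hypothetical service name


def apply_discount(order_id, total, code):
    """Hypothetical handler: each log line is a waypoint that records state."""
    logger.info("discount requested order_id=%s total=%.2f code=%s", order_id, total, code)
    if code != "WELCOME10":  # made-up discount code
        logger.warning("discount rejected order_id=%s code=%s", order_id, code)
        return total
    discounted = round(total * 0.9, 2)
    logger.info("discount applied order_id=%s new_total=%.2f", order_id, discounted)
    return discounted
```

Each line captures which branch the request took and what the relevant values were at that moment, which is exactly what we could not see from outside the box.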
Traces, on the other hand, are for piecing together a timeline of which operations were executed concurrently and in which sequence.
In short, in software engineering, we don’t observe the system from the outside and we shouldn’t. We have a hand in how the system behaves (at least at the application level; probably not the service-orchestration or operating system levels) and we can and should make the unobservable observable.