Health, Availability, Debuggability
Collecting various observability signals from our systems fosters a new conversation about how to classify those signals.
There is significant discussion about observability signals, and even strong advocacy for one signal over another. Metrics, events, logs, traces, or something else? To give the conversation some structure, it may be productive to break the signals down at a high level based on how we use them.
Observability enables three main high-level aspects: health checking, availability, and debuggability.
Even though I introduce this classification, a single signal can fit all three aspects. It is not ideal, but consider how people rely on logs: they watch logs to see whether a service is up, use logs to measure latency for certain operations, and dig into logs to see what’s going on when there is an issue. Logs can fit all three categories, but making them useful in each one requires distinct design and planning. And although the same signal can serve all three aspects, we may rely on a different signal for each because each signal type has its pros and cons.
Health checking is fundamentally critical for reliability. We previously wrote on what it means for a system to be healthy. Health describes whether a system exists and is running, and it can be extended to describe whether the system is available to accept work. “Is my server running?”, “Is my worker up?”, and “Is my server ready to accept requests?” are the questions health signals can answer.
Health signals are critical for orchestrators such as schedulers and load balancers. Because they are queried frequently, they have to be cheap to produce and easy to query.
One example of a health signal is a health check endpoint served from a long-running process (typically a server).
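A minimal sketch of such an endpoint, using only the Python standard library. The paths `/healthz` and `/readyz`, the `ready` flag, and the `start_health_server` helper are illustrative assumptions, not names from any particular orchestrator or framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical readiness flag: set it once the process can accept work.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process exists and can answer.
            self._reply(200, b"ok")
        elif self.path == "/readyz":
            # Readiness: the process is willing to accept requests.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Health checks are polled often, so skip per-request logging.
        pass


def start_health_server(port=8080):
    """Serve the health endpoints on a background thread; return the server."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Both handlers do no real work beyond reading a flag, which keeps the check cheap enough to be polled every few seconds. Calling `ready.set()` after initialization flips `/readyz` from 503 to 200.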
Health signals should be made available quickly at startup. Faster health checking means less overhead for rescheduling, autoscaling, and crash recovery.
Availability is a systematic way of telling whether a system is running as users expect. Availability is often quantified and measured by indicators (see SLIs as an example formal definition). Then, a target value or range of values (an SLO) is set for each indicator to represent what is acceptable.
A system is having an outage when the indicator values fall outside the target value or range for more than the tolerated share of user requests. An example SLO, “95% of requests will complete within 100ms,” triggers an alert when more than 5% of requests take longer than 100ms.
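The example SLO above can be checked with a few lines of code. This is a sketch under stated assumptions: `slo_violated` is a hypothetical helper, it evaluates a single window of raw latencies, and it treats a window with no traffic as healthy.

```python
def slo_violated(latencies_ms, threshold_ms=100.0, target=0.95):
    """Return True when fewer than `target` of requests finish within `threshold_ms`."""
    if not latencies_ms:
        # Assumption: no traffic in the window means nothing to alert on.
        return False
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms) < target


# 95 fast requests and 5 slow ones: exactly at the target, so no alert.
print(slo_violated([20.0] * 95 + [300.0] * 5))  # False
# One more slow request pushes the slow share above 5%.
print(slo_violated([20.0] * 94 + [300.0] * 6))  # True
```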
Availability signals are crucial for reliability because they answer the fundamental question of whether a system is working as users expect. This is why availability signals need to report every instance of a task, and why the instrumentation should be cheap enough that it never has to be downsampled. It is also why aggregated metrics are often the preferred availability signal: they are compact, and cheap to collect and transfer over the wire.
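To illustrate why aggregated metrics are compact, here is a sketch of a fixed-bucket latency histogram. The `LatencyHistogram` class and its bucket bounds are hypothetical choices for illustration; real metrics libraries offer similar, more complete primitives.

```python
import bisect


class LatencyHistogram:
    """Counts every request into a fixed set of latency buckets."""

    def __init__(self, bounds_ms=(10, 50, 100, 250, 500, 1000)):
        self.bounds_ms = list(bounds_ms)
        # One counter per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds_ms) + 1)

    def observe(self, latency_ms):
        # bisect_left places a value equal to a bound in that bound's bucket.
        self.counts[bisect.bisect_left(self.bounds_ms, latency_ms)] += 1

    def fraction_within(self, bound_ms):
        """Fraction of observations <= bound_ms (must be one of the bounds)."""
        idx = self.bounds_ms.index(bound_ms)
        total = sum(self.counts)
        if total == 0:
            return 1.0  # Assumption: an empty histogram counts as healthy.
        return sum(self.counts[: idx + 1]) / total
```

No matter how many requests are observed, the state stays at seven integers, so every request can be recorded and the totals shipped cheaply. The trade-off is resolution: questions can only be answered at the chosen bucket bounds.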
Debuggability comes into play mainly in troubleshooting scenarios. It is also a major aspect of having self-descriptive systems. The more debuggable a system is, the more you can learn about it without having to rely on documentation or read the codebase.
Debuggability signals can be considered more optional than health and availability signals because they come into play only after you have established health and availability. Think about signals like logs and distributed traces. They are often more verbose and more expensive to collect. Downsampling the collection is a common strategy for dealing with the cost in production. Dynamically turning collection up or down is also common: you might temporarily turn on some collection, investigate an issue, and turn it off again.
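Downsampling with a rate that can be changed at runtime might look like the sketch below. `DebugSampler` is a hypothetical helper; how the rate gets changed (an admin endpoint, a config reload, a feature flag) is left open.

```python
import random
import threading


class DebugSampler:
    """Decides per event whether to collect an expensive debug signal."""

    def __init__(self, rate=0.01, rng=None):
        self._lock = threading.Lock()
        self._rate = rate
        self._rng = rng or random.Random()

    def set_rate(self, rate):
        # 0.0 turns collection off, 1.0 collects everything.
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        with self._lock:
            self._rate = rate

    def should_collect(self):
        with self._lock:
            # random() returns a value in [0.0, 1.0).
            return self._rng.random() < self._rate


sampler = DebugSampler(rate=0.01)  # collect roughly 1% in steady state
sampler.set_rate(1.0)              # investigating an issue: collect everything
sampler.set_rate(0.01)             # done: back to the cheap default
```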
Regardless of how you utilize the signals available from your systems, it is healthier to think about their role in these high-level aspects. Then you can design toward health checking, availability, and debuggability based on the signals you can collect. A single signal might be useful in multiple categories, and you may also need to invest in multiple signals.