Observability

Shashi Bhushan
booleanbhushan
6 min read · Aug 28, 2021

This article sheds some light on the topic of observability and its constituent parts. It is aimed at app developers and DevOps engineers with some experience monitoring distributed systems, and I hope the thoughts put forth here will put your thinking cap on regarding how you can leverage observability for your applications.

Observability

Observability is the ability to measure the current state of a system based on the system's outputs, that is, the events the system emits.

Observability uses three types of telemetry data, namely metrics, logs and traces, to provide visibility into the system.

Observability is achieved when data is made available from within the system that you want to monitor. Observability is a property of the system; the act of collecting this data is called monitoring. You can increase the observability of a system by tweaking the telemetry data it outputs. The concept is nothing new; JMX, for instance, has been available for quite a long time and supplies tools for managing and monitoring JVM applications.
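For reference, exposing a value over JMX only takes a standard MBean. The sketch below is illustrative, not from the article; the RequestStats name and the ObjectName are assumptions, and the interface and class are shown together here although they would normally live in separate files:

```java
// RequestStatsMBean.java -- the management interface that JMX will expose.
public interface RequestStatsMBean {
    long getRequestCount();
}

// RequestStats.java -- registers itself with the platform MBean server.
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RequestStats implements RequestStatsMBean {
    private final AtomicLong requests = new AtomicLong();

    public void recordRequest() {
        requests.incrementAndGet();
    }

    @Override
    public long getRequestCount() {
        return requests.get();
    }

    public static void main(String[] args) throws Exception {
        RequestStats stats = new RequestStats();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The domain and key in the ObjectName are illustrative.
        server.registerMBean(stats, new ObjectName("com.example.app:type=RequestStats"));
        stats.recordRequest();
        // The attribute is now visible in JConsole / VisualVM under the MBeans tab.
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive so you can connect
    }
}
```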

This has become increasingly important in today's tech world, since application delivery has been shifting to a model of CI/CD, containerization, micro-services and distributed systems. Issues in one system can cascade into failures in other systems, so monitoring demands greater visibility into these distributed systems.

To add to this, even if a system does not make itself observable, you can still monitor it by calling an operation, say, every 10 seconds and recording the response time and status of the response. This is called "Synthetic Monitoring". Since this is external to the system, the reliability of such metrics depends on a lot of factors. For example, network latency on your end may increase the response time, but you won't know for certain how much time the request actually took to process on the server side (and how much was spent in transit between your machine and the server). All you see is the response time on your end. Hence, it's always preferable that the system provides its own metrics for observability.
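To make the idea concrete, here is a minimal sketch of such an external probe in plain Java, using the JDK's built-in HttpClient; the URL, the 10-second interval and the 5-second timeout are arbitrary choices for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// A minimal synthetic-monitoring probe: it calls an endpoint on a fixed
// schedule and records what the *client* observes (status + round-trip time).
public class SyntheticProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/health")) // illustrative URL
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        while (true) {
            long start = System.nanoTime();
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                // elapsedMs includes network transit, so it only approximates
                // the time actually spent on the server side.
                System.out.printf("status=%d responseTimeMs=%d%n",
                        response.statusCode(), elapsedMs);
            } catch (Exception e) {
                System.out.printf("probe failed: %s%n", e.getMessage());
            }
            Thread.sleep(Duration.ofSeconds(10).toMillis());
        }
    }
}
```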

Let’s discuss the three pillars of observability.

Logging

The focus of logging is on recording events. Each log is a textual record of an event and includes the time and status of the event, along with a payload that provides more contextual information. Logs are stateless with regard to each other, and they are typically the first place you look when something goes wrong in the system.

Logs can be plain text (a timestamp and a message) or structured (for example JSON, which adds metadata such as the current values in the MDC and the event status, like info or error, alongside the log message). Structured logs are easier to query since we can do faceted search on them. A typical example of what we could log is the response time. An event message with the response time would suffice as a log, but note that it does not tell us its relation to other similar requests; it does not convey whether this is a typical response time or a slow one. We need metrics for that.
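As a rough illustration, a structured log event carrying the response time could be produced with SLF4J and the MDC along these lines. The CheckoutHandler class and the field names are hypothetical, and the JSON output itself would come from however the logging backend (for example Logback with a JSON encoder) is configured:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Emits one event per request with the response time attached. With a JSON
// encoder configured in the logging backend, the MDC entries become
// queryable fields on the structured log line.
public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);

    void handle(String requestId) {
        long start = System.currentTimeMillis();
        MDC.put("requestId", requestId);    // contextual metadata, not part of the message
        try {
            // ... do the actual work ...
            long elapsed = System.currentTimeMillis() - start;
            MDC.put("responseTimeMs", String.valueOf(elapsed));
            log.info("checkout completed"); // event message; status and fields come from the level and MDC
        } catch (Exception e) {
            log.error("checkout failed", e);
        } finally {
            MDC.clear();
        }
    }
}
```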

Metrics

Metrics are numerical values collected over an interval of time; they help us measure the performance characteristics of the system. Unlike logs, metrics are always structured, since we usually perform aggregations on them, such as percentiles or averages. Because metrics focus on aggregating events, a single event does not carry much significance in a metric. Note that logging is focused on single events, whereas metrics are not.

In Java, you could use Micrometer for metrics instrumentation. It provides a simple facade over the instrumentation client of your choice (think of it like SLF4J, but for instrumentation). An example of a counter could be code like this.
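A minimal sketch of such a counter, using Micrometer's Counter builder and a SimpleMeterRegistry; the meter name, description and tag below are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CounterExample {
    public static void main(String[] args) {
        // In a real application the registry would be backed by InfluxDB,
        // Prometheus, etc.; SimpleMeterRegistry keeps the sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();

        Counter ordersReceived = Counter.builder("orders.received")
                .description("Number of orders received")
                .tag("region", "us-east")
                .register(registry);

        ordersReceived.increment();         // one event
        ordersReceived.increment(3.0);      // or several at once

        System.out.println(ordersReceived.count()); // 4.0
    }
}
```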

Creating a Counter in Micrometer

You could also use Micrometer to record the execution time of a method, from when it starts until it exits (normally or otherwise).
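A minimal sketch of this with Micrometer's Timer.Sample, which captures the duration even when the method exits with an exception; the meter name and the processOrder method are illustrative:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;

public class TimerExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("orders.processing.time")
                .description("Time taken to process an order")
                .register(registry);

        // The sample is stopped in a finally block, so the duration is recorded
        // whether processOrder() returns normally or throws.
        Timer.Sample sample = Timer.start(registry);
        try {
            processOrder();
        } finally {
            sample.stop(timer);
        }

        System.out.println(timer.mean(TimeUnit.MILLISECONDS));
    }

    static void processOrder() {
        // ... work being measured ...
    }
}
```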

The result would look something like the screenshot below. I’m using InfluxDB to store the metrics. We could also add visualization on top of these metrics using Grafana, Chronograf or other similar tools. The code snippet shown is from this file.

Metrics records in InfluxDB

As seen in the screenshot, the metrics always carry some context as well. In this example, the metric is histogram data instrumenting the response time (mean, sum, upper bound, etc.) of a particular method over a particular time interval.

Metrics are used to identify “trends” in your system. Using these metrics, you can set expectations for your system under normal circumstances and add alerting mechanisms on top of them. For example, if the typical response time of your APIs is 90 ms, but for the past half an hour the 90th percentile of API requests has been taking more than, say, 500 ms, you could have an automated alert sent to your team based on this information.
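As a sketch of where such a percentile would come from on the instrumentation side, Micrometer can publish the 90th percentile for a timer. The meter name, the recorded durations and the 500 ms threshold below are illustrative, and in practice the alerting rule would live in the monitoring stack (Grafana alerts, Kapacitor, etc.) rather than in application code:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.HistogramSnapshot;
import io.micrometer.core.instrument.distribution.ValueAtPercentile;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class LatencyAlertSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Publish the 90th percentile for this timer.
        Timer apiTimer = Timer.builder("api.response.time")
                .publishPercentiles(0.9)
                .register(registry);

        apiTimer.record(Duration.ofMillis(85));
        apiTimer.record(Duration.ofMillis(95));
        apiTimer.record(Duration.ofMillis(600));

        HistogramSnapshot snapshot = apiTimer.takeSnapshot();
        for (ValueAtPercentile vap : snapshot.percentileValues()) {
            // Only the 90th percentile was published above.
            double p90Ms = vap.value(TimeUnit.MILLISECONDS);
            if (p90Ms > 500) {
                System.out.println("ALERT: p90 latency is " + p90Ms + " ms");
            }
        }
    }
}
```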

Tracing

Tracing has a request-scoped focus, which means it traces the end-to-end journey of a request, from when it enters your system to when the output leaves it. The span is the primary building block of a distributed tracing system. A trace is made up of multiple spans, each span representing an individual unit of work done by a micro-service; the ensemble of these spans forms the trace, which represents the operations performed by the micro-services (along with timestamps and other metadata) as the request moves through the distributed system.

Metrics can only tell you that there’s a problem, and sifting through logs takes time and is akin to finding a needle in a haystack. Hence, tracing becomes all the more important in distributed systems.

Tracing is implemented by generating a unique ID (usually called the Trace ID) at the entry point of the request, which does not change throughout the lifecycle of the request. This unique ID is passed (along with other contextual information) to each micro-service that the request touches, and each micro-service logs the unique ID along with the timestamp and status (success/error etc.) of the operation it performed. You can then aggregate this information using log-aggregation tools.
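A rough sketch of that entry-point behaviour, assuming a Jakarta Servlet environment and SLF4J's MDC; the X-Trace-Id header name is an assumption for illustration (real systems typically follow a standard such as W3C Trace Context):

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

// Reads the incoming trace ID, or creates one if this service is the entry
// point, and puts it in the MDC so every log line written while handling
// this request carries it.
public class TraceIdFilter implements Filter {
    private static final String TRACE_HEADER = "X-Trace-Id"; // illustrative header name

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String traceId = httpRequest.getHeader(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            traceId = UUID.randomUUID().toString(); // this service is the entry point
        }
        MDC.put("traceId", traceId);
        try {
            chain.doFilter(request, response);
            // Any call this service makes to a downstream service must forward
            // the same traceId as a header on the outgoing request.
        } finally {
            MDC.remove("traceId");
        }
    }
}
```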

The image below shows an example of how a distributed tracing system should ideally work. The request goes through different parts of the system, and there’s a causal ordering in the request workflow. Tracing helps us visualize whether a single request fans out into multiple requests and helps us understand when one failure results in another failure in upstream parts of the system.

For example, if the Resource Service throws an exception when fetching a resource, does it cause an exception in the Billing Service as well? If not, what type of exception was it in the Resource Service (a client error like unauthorized access, or a server error)? Was the client billed for the request or not? Tracing helps us answer these questions.

Bottom Line

Even though new technologies like micro-services and containerization have made shipping new versions of our applications easier, troubleshooting those applications has become that much more complex. In distributed systems, the root cause of a problem is harder to detect than in monolithic applications because of the interactions between different parts of the system.

On the brighter side, these individual parts of the system produce a lot of telemetry data and thus give you an opportunity to gain more insight into the distributed system and its performance through monitoring and observability.

As a concrete example, I was recently creating an in-memory caching mechanism with a time-based eviction policy. Someone then suggested that, as an enhancement, I could also add metrics for cache hits, misses and expiries to my solution. This got me thinking about what I, as a developer, could contribute to an application to increase its observability. This blog is the result of that thought process. These small enhancements that we add will eventually form the backbone of the observability stack in our applications.
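A hedged sketch of what that instrumentation could look like; the InstrumentedCache class, the meter names and the TTL handling below are hypothetical, not the actual cache from that project:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A simple in-memory cache with time-based eviction that counts hits,
// misses and expiries with Micrometer.
public class InstrumentedCache<K, V> {
    private final Map<K, CacheEntry<V>> store = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Counter hits;
    private final Counter misses;
    private final Counter expiries;

    public InstrumentedCache(long ttlMillis, MeterRegistry registry) {
        this.ttlMillis = ttlMillis;
        this.hits = registry.counter("cache.hits");
        this.misses = registry.counter("cache.misses");
        this.expiries = registry.counter("cache.expiries");
    }

    public void put(K key, V value) {
        store.put(key, new CacheEntry<>(value, System.currentTimeMillis() + ttlMillis));
    }

    public V get(K key) {
        CacheEntry<V> entry = store.get(key);
        if (entry == null) {
            misses.increment();          // never cached (or already evicted)
            return null;
        }
        if (entry.expiresAt() < System.currentTimeMillis()) {
            store.remove(key);
            expiries.increment();        // cached but past its TTL
            return null;
        }
        hits.increment();
        return entry.value();
    }

    private record CacheEntry<V>(V value, long expiresAt) {}
}
```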

I sincerely hope that this article gives you a perspective on observability that is more in line with what we work on day to day, rather than just a bird’s-eye view of the whole picture. Let me know if you have any questions on the topic. Thanks for taking the time to read this.

Additional Reading

Adrian Cole: Observability 3 ways

Understanding Distributed Tracing

Observability vs Monitoring
