Observability in Distributed Systems: Logs, Metrics, and Traces

Published in Big Data Processing, Sep 15, 2022

Observability is the ability to measure the internal states of a system by examining its outputs. Logs, metrics, and traces are considered the three pillars of observability in distributed systems. Observability matters most when troubleshooting production systems that have deviated from their intended state. Monitoring and alerts built on the observed data help us act quickly when that happens. An observable system gives us, in real time, the information we need to answer questions about it, and lets us navigate from effect to cause whenever the system develops a fault.

Let's take a look at the three pillars of observability:

1. Logs

Logs are the easiest of the three to generate. A log line is just a string, a blob of JSON, or a set of typed key-value pairs, which makes it easy to represent almost any data as a log line. For observability, logs convey what the applications are doing at any given time; in other words, logs should tell the story of the main system flows. That goal should determine what you log and, more importantly, what you don't log. By analyzing the logs, we can troubleshoot our code and identify where and why an error occurred.
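To make the "blob of JSON or typed key-value pairs" idea concrete, here is a minimal sketch using Python's standard logging module. The logger name `checkout`, the `context` attribute, and the field names are assumptions chosen for illustration; they are not prescribed by any logging standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute supplied through `extra=`;
        # its key-value pairs are merged into the log line.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger("checkout")
logger.info("order placed", extra={"context": {"order_id": "A-1001", "items": 3}})
```

Because every line is a self-describing JSON object, a log pipeline can filter on fields such as `order_id` without parsing free-form text.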

2. Metrics

A metric is a numeric value measured over intervals of time, which can be used to determine how the system behaves over time. Unlike a log, which records a specific event, a metric is a measured value derived from system performance.

Metrics make troubleshooting easier because we can correlate them across components to get a holistic view of system health and performance. Since metrics are numbers, they are optimized for storage, processing, and retrieval. This enables longer retention of metrics data as well as easier querying. Older metric data can also be aggregated to a coarser frequency, such as daily or weekly.
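The aggregation into a daily frequency mentioned above can be sketched as follows. This is only an illustration: the function name `downsample_daily`, the choice of the mean as the aggregate, and the sample latency values are all assumptions, and real metrics stores usually offer their own rollup mechanisms.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def downsample_daily(samples):
    """Average (unix_timestamp, value) samples into one point per UTC day."""
    buckets = defaultdict(list)
    for ts, value in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        buckets[day].append(value)
    return {day.isoformat(): mean(values) for day, values in sorted(buckets.items())}


# Hypothetical latency samples in milliseconds.
samples = [
    (1663200000, 120.0),  # 2022-09-15 00:00:00 UTC
    (1663203600, 80.0),   # 2022-09-15 01:00:00 UTC
    (1663286400, 100.0),  # 2022-09-16 00:00:00 UTC
]
print(downsample_daily(samples))
```

The mean hides spikes, so a rollup that also keeps the maximum or a percentile may be a better fit for latency-style metrics.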

Metrics data can also be used to build monitoring dashboards and to drive alerts.
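As a rough sketch of metric-based alerting, the class below fires when the error rate over a sliding window of recent requests crosses a threshold. The class name `ErrorRateAlert`, the window size, and the threshold are hypothetical values picked for the example.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int, threshold: float):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one request outcome and return True if the alert should fire."""
        self.outcomes.append(is_error)
        # Wait for a full window so a single early failure does not trigger an alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold


alert = ErrorRateAlert(window=4, threshold=0.5)
fired = [alert.record(outcome) for outcome in [False, True, True, True, False]]
print(fired)
```

Evaluating the rate over a window rather than per request is one simple way to keep alerts from reacting to isolated failures.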

3. Traces

Tracing helps us understand the entire lifecycle of a request as it crosses multiple systems. A single trace provides visibility into both the path a request traverses and the structure of the request. Although logs and metrics may be enough to understand the behavior and performance of an individual system, traces are an essential pillar of observability because they provide context for the other two.
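The path and structure of a request can be modeled as a tree of spans that share one trace ID. The sketch below is a simplified, in-process illustration; `Tracer`, `Span`, and the span names are hypothetical, and in a real distributed system the trace ID and parent span ID would have to be propagated between services, for example in request headers.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None


class Tracer:
    """Collect finished spans and track the currently active span."""

    def __init__(self):
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.monotonic()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("checkout"):
    with tracer.span("reserve-inventory"):
        pass
    with tracer.span("charge-card"):
        pass

for s in tracer.finished:
    print(s.name, "parent:", s.parent_id, "trace:", s.trace_id)
```

The parent links give the structure of the request, while the start and end times of each span show where the time was spent along its path.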

There is always some overlap between logs, metrics, and traces, so one might ask whether we need all three. The choice of observability tools should be based on the complexity and fault tolerance of the system. If the system is simple and tolerant, basic metrics alone might be enough. But if the system is complex, with multiple interacting microservices, we might need all three. Together, metrics, logs, and traces make observation and troubleshooting easier.

References:

O'Reilly: Distributed Systems Observability by Cindy Sridharan
https://www.baeldung.com/distributed-systems-observability
https://iamondemand.com/blog/the-3-pillars-of-system-observability-logs-metrics-and-tracing/

Written by Sruthi Sree Kumar