Observability: Logs vs Traces vs Metrics

umang goel
Dec 9, 2021


Software systems have become an integral part of every organisation. As an organisation evolves, its software systems evolve too and become more complex. In a distributed system, where multiple components cooperate to perform a single task, monitoring becomes harder. Monitoring involves checking the health of the system, identifying application issues, tracking the complete end-to-end flow of a request, and so on. Different components may rely on different monitoring tools and alerting mechanisms to discover, identify and debug issues.

Observability refers to the mechanisms used not only to trace issues in a system but also to monitor and track its overall correctness, whether it is a monolith or a microservice-based system.

The terms logging, metrics and traces are often used interchangeably when talking about observability, but each works in its own way and produces a different outcome. Using these tools does not by itself guarantee an observable system, but a good understanding and use of them helps in building better systems.

Failures in distributed systems rarely happen because of a single event generated by one specific component. A failure usually involves multiple triggers from one or more loosely related components. Looking only at the point of failure may therefore not reveal the cause. To reach the root cause we need to:

  1. Analyse the symptom at a granular level.
  2. Track the request lifecycle across the various components in the system.
  3. Analyse the interactions between the various components.

In this article we will discuss what logs, metrics and traces are and what role they play in monitoring a system.

LOGS

A log is an immutable record of a discrete event that happened in a system at some point in time during the request lifecycle. A log usually includes a timestamp and a context payload for the event, and can be emitted in various forms such as plain text, a structured format like JSON, or binary.

Logs are simple to generate, and most libraries provide mechanisms to push log events to a centralised logging system such as ELK or Sumo Logic. Logs help in debugging at a very granular level and come in handy for getting detailed insight into a single event, which percentiles and averages cannot provide.

Although generating logs can be as easy as writing a print statement, there are certain things to keep in mind to use logs to the fullest.

  1. Excessive logging can have an adverse effect: it increases storage costs and reduces the search efficiency of the logs. So it is important to decide what data to log and what to leave out.
  2. Logs should contain all the contextual information needed to provide meaningful insight into the event.
  3. Logging should be asynchronous and should not block the request processing flow.
  4. When pushing data to logs, make sure that critical or sensitive data is either not pushed or is properly masked first.
  5. Most event log data is relevant only for a short period of time, unlike analytical data that is needed over the long run, so logs can have an archiving policy in place to improve the efficiency of the logging system.
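
Points 2 to 4 above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a production setup: the `SENSITIVE_KEYS` set, the `context` field and the in-memory `ListHandler` (a stand-in for a shipper to a central log store) are all assumptions made for the example.

```python
import json
import logging
import logging.handlers
import queue

# Hypothetical field names that must never reach the log store in clear text.
SENSITIVE_KEYS = {"password", "card_number"}


def mask_sensitive(payload):
    """Return a copy of the payload with sensitive values masked."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in payload.items()}


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with timestamp and context."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "context": mask_sensitive(getattr(record, "context", {})),
        }
        return json.dumps(entry)


class ListHandler(logging.Handler):
    """Stand-in for a central log sink; keeps formatted lines in memory."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def build_async_logger(name, sink):
    """Request threads only enqueue records; a listener thread writes them."""
    log_queue = queue.Queue()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    return logger, listener


sink = ListHandler()
sink.setFormatter(JsonFormatter())
logger, listener = build_async_logger("orders", sink)

logger.info(
    "payment accepted",
    extra={"context": {"order_id": "o-1", "card_number": "4111-0000"}},
)
listener.stop()  # drains the queue before returning
print(sink.lines[0])
```

The request path only pays the cost of putting a record on a queue; formatting, masking and shipping happen on the listener thread.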

METRICS

Logs are not usually used to gain insight into system performance, to detect anomalies or for business analytics, because log data is mostly short-lived and is most valuable for a short time after the event has occurred. Hence the need for another monitoring mechanism: metrics.

Metrics are numeric representations of data, such as counts, percentiles or averages, measured over an interval of time. They help in monitoring the health of a system holistically.

Metrics can be used to detect anomalies in the system, analyse its behaviour over different intervals of time, find historical trends and so on. Since most of the data is stored as counts and numbers, it can be optimised for storage, processing, retrieval and querying. Numeric values also take much less storage space, so the data can be kept for a longer period of time and used to monitor historical trends and build dashboards that show the health of the system as a whole.
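
As a rough illustration of how raw events shrink into a few numbers per interval, the sketch below groups latency samples into fixed windows and keeps only a count, an average and a 95th percentile for each. The sample data, the 60-second window and the nearest-rank percentile are assumptions chosen for the example; real metric systems use their own aggregation schemes.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def aggregate(samples, interval_s=60):
    """Group (timestamp_s, latency_ms) samples into fixed-size windows."""
    windows = defaultdict(list)
    for ts, latency in samples:
        windows[ts - ts % interval_s].append(latency)
    return {
        start: {
            "count": len(values),
            "avg_ms": mean(values),
            "p95_ms": percentile(values, 95),
        }
        for start, values in sorted(windows.items())
    }


# Hypothetical request latencies: (seconds since start, latency in ms).
samples = [(0, 10), (10, 12), (20, 11), (30, 500), (65, 20), (70, 22)]
summary = aggregate(samples)
for window_start, stats in summary.items():
    print(window_start, stats)
```

In the first window the single 500 ms request pulls the average far above the typical latency while the other three requests stay near 10 ms, which is why percentiles and averages are usually read together.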

A metric is usually composed of the following fields:

  1. Metric name
  2. Timestamp
  3. Labels
  4. Value (the numeric measurement itself)

Unlike logs, metrics can be used to generate alerts efficiently, because running a query against a time-series database is much faster than running a query over logs stored in a distributed system like Elasticsearch. When generating metrics, it is important to keep in mind that a large variety of label values increases storage and querying overheads, since each distinct combination of labels is typically tracked as a separate time series.
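
The effect of labels on storage can be sketched as follows. The `MetricSample` class, the metric name and the label names are hypothetical; the sample carries a numeric `value` alongside the name, timestamp and labels so that it has a measurement to record. The sketch assumes, as many time-series databases do, that every distinct combination of name and labels becomes its own series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    name: str
    timestamp: int
    labels: tuple  # sorted (key, value) pairs so equal label sets compare equal
    value: float


def make_sample(name, timestamp, value, **labels):
    """Build a sample with labels normalised into a sorted tuple."""
    return MetricSample(name, timestamp, tuple(sorted(labels.items())), value)


def series_count(samples):
    """Number of distinct time series: one per (name, label set)."""
    return len({(s.name, s.labels) for s in samples})


# Low cardinality: a handful of bounded label values.
low = [
    make_sample("http_requests_total", t, 1.0, service="cart", status="200")
    for t in range(100)
]

# High cardinality: an unbounded label such as a user ID.
high = [
    make_sample("http_requests_total", t, 1.0, service="cart", user_id=f"u{t}")
    for t in range(100)
]

print(series_count(low), series_count(high))
```

The same 100 samples collapse into a single series in the first case and fan out into 100 series in the second, which is the overhead the paragraph above warns about.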

TRACING

Although metrics and logs provide good insight into a system, they are limited to a single system or service, and it becomes hard to understand anything beyond what is happening inside that one service. This creates the use case for another powerful tool: tracing.

Tracing gives the capability to monitor the fate of a request during its lifecycle across various components in a system.

Distributed tracing is a technique that addresses the problem of bringing visibility into the lifetime of a request across several systems.

Note: One might argue that the request lifecycle can also be traced by aggregating logs and metrics using some unique identifier across system boundaries. But this increases the complexity of the system and, even when used optimally, gives complete insight into each system only in its own silo and nothing more.

Tracing helps teams identify the path of a request through various services and understand its behaviour at various junctures in the flow. The motivation behind tracing is to identify specific points in an application, proxy, middleware, library, database, cache etc. in order to find:

  1. Any hops in the request path across system boundaries.
  2. Forks in the execution flow.

A trace is used to identify the amount of work done at each layer. A trace is a directed acyclic graph (DAG) composed of spans, which are small units of work; the edges between spans are called references.

In the above diagram the request starts at src and ends at dest. It takes an overall time of 500 ms to get from src to dest. Expanding the flow shows how the 500 ms was spent across the spans: 50 ms to reach B, 400 ms to reach C and another 50 ms to reach dest. In this way a trace can be expanded to get in-depth detail of each span and to analyse the complete request flow end to end, which makes it possible to pinpoint where latency or resource utilisation increased.
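
The src to dest example can be modelled in a few lines. The `Span` class and its field names are assumptions for this sketch and do not follow any particular tracing library; the durations are the ones from the example (50 ms, 400 ms, 50 ms).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    operation: str
    start_ms: int
    duration_ms: int
    parent_id: Optional[str] = None  # reference to the parent span


# One root span for the whole request and one child span per hop.
trace = [
    Span("s0", "src -> dest", 0, 500),
    Span("s1", "src -> B", 0, 50, parent_id="s0"),
    Span("s2", "B -> C", 50, 400, parent_id="s0"),
    Span("s3", "C -> dest", 450, 50, parent_id="s0"),
]


def children_of(spans, parent_id):
    """Spans whose reference points at the given parent."""
    return [s for s in spans if s.parent_id == parent_id]


def slowest_child(spans, parent_id):
    """The child span that accounts for the most time."""
    return max(children_of(spans, parent_id), key=lambda s: s.duration_ms)


root = next(s for s in trace if s.parent_id is None)
hops = children_of(trace, root.span_id)
total_ms = sum(s.duration_ms for s in hops)
bottleneck = slowest_child(trace, root.span_id)

print(total_ms)              # 500
print(bottleneck.operation)  # B -> C
```

Expanding the root span into its children immediately shows that the hop from B to C accounts for 400 of the 500 ms.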

Zipkin and Jaeger are two of the most popular OpenTracing-compliant open source distributed tracing solutions. (OpenTracing is a vendor-neutral specification, with accompanying instrumentation libraries, for distributed tracing APIs.)

Introducing tracing into an existing system can be a tedious task, because every service involved in a request must participate in generating the tracing information; a miss at even one service breaks the whole request path. Tracing your own applications may also not always be sufficient, as there may be several third-party components in the flow, such as databases and streams, which need additional instrumentation.
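
Why a single non-participating service breaks the path can be shown with a toy simulation. The `x-trace-id` header name, the service names and the `call_chain` helper are hypothetical; real tracers propagate a richer context, but the failure mode is the same.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name for this sketch


def with_trace(headers):
    """Reuse the incoming trace ID, or start a new trace if none arrived."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def call_chain(services):
    """Simulate a request passing through services in order.

    Each service is (name, forwards_headers). Returns the (name, trace_id)
    pairs that each service would report to the tracing backend.
    """
    recorded = []
    headers = {}
    for name, forwards_headers in services:
        headers = with_trace(headers)
        recorded.append((name, headers[TRACE_HEADER]))
        if not forwards_headers:
            headers = {}  # context is dropped on the outgoing call
    return recorded


healthy = call_chain([("gateway", True), ("orders", True), ("payments", True)])
broken = call_chain([("gateway", True), ("orders", False), ("payments", True)])

print(len({trace_id for _, trace_id in healthy}))  # 1
print(len({trace_id for _, trace_id in broken}))   # 2
```

In the broken chain the payments service starts a fresh trace, so the backend sees two unrelated fragments instead of one end-to-end request.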

Unlike logs and metrics, the amount of data stored for tracing is small, as traces are heavily sampled to reduce runtime overhead and storage costs.
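
One common way to sample, sketched here as an assumption rather than as the method any particular tracer uses, is to derive the decision from a hash of the trace ID. Every service that sees the same trace ID then makes the same keep or drop decision, so a sampled trace stays complete. The `should_sample` function and the 10% rate are illustrative.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all traces.

    The first 8 bytes of a SHA-256 digest are read as an integer in
    [0, 2**64) and compared against a threshold derived from the rate.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


# Hypothetical trace IDs, sampled at 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, no coordination between services is needed to agree on which traces to keep.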

Conclusion

Although logs, metrics and traces are unique in nature and serve different purposes, they complement each other. Used in combination, they provide deep insight into the whole distributed system and help teams build much better systems.

Please share any feedback, clarifications or improvements in the comments section. If you would like to discuss a design topic, please add it in the comments section as well.

Happy learning…
