Introduction to Observability

Matei Tiloiu
Beekeeper Technology Blog
7 min read · Feb 6, 2024

Writing software has always been a very abstract process. You have an idea, spawn some requirements, do some brainstorming, then use whatever language you think works best for you, and the next thing you know, voila, your “Flappy Bird” replica which is “totally different” from the original takes life.
But unlike more tangible disciplines, software development doesn’t give you the same empirical feedback that your application is performing correctly. To mitigate this, solutions have started bubbling up. Some of these solutions are structural: principles and design patterns meant to be applied while you code. But even assuming these are applied properly, there is still a lack of transparency over how your service behaves while it is running.
In this article, you will be given a high-level overview of logging, metrics, and tracing, some of the tools software engineers use to gain more transparency into the state of their service at runtime.

Logging
Everyone has done this at the beginning of their programming life; it’s an instinctive approach, reliable for how little it costs to implement. The advantage is immediate: you gain visibility into parts of your application at runtime. Without logging, the only moment you know that something is wrong in your application is when an exception is thrown and the application fails, and the only feedback you get comes from inspecting the code over and over again.

Why do we log?
The developer can easily use logs to debug a service by producing a statement about anything they want.
Does that mean that you can go ahead and start logging everything, like a reckless child? No.
Logging has a performance cost. Most logging frameworks are asynchronous, separating the writing from the main service thread, but if you stress this with countless logs, it is going to negatively impact the performance of your application.
This is a slippery slope; we need to be more focused when using logs. As we saw above, logs can be used when debugging our service, but debugging is a verbose process and prints out all the nitty-gritty details that would just add a lot of noise in the production environment. Is it necessary for my production environment to contain logs that I would write there for debugging purposes? Probably not. In production, we are interested in identifying behaviour that should not happen, adding context to user behaviour, or documenting the reason why the service died.
One practice that is used to contextualise your logs and mitigate performance issues is to declare the level they belong to. There are various logging levels used for different purposes:

  • DEBUG: Very detailed information only used when debugging.
  • INFO: General information about anything you would want to record: request information, database calls, audits, access management, and performance.
  • WARN: Potentially harmful scenarios that would not stop the execution of the service, like the usage of a function that will be deprecated in the future.
  • ERROR: Self-explanatory: something happened that should not have. Events that trigger this type of log should be linked with alerting so that the right people can be notified accordingly.
  • FATAL: Usually an error that leads to the termination of the application.
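To make these levels concrete, here is a minimal sketch using SLF4J, a common Java logging facade (the EmailService class and its behaviour are hypothetical):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EmailService {

    private static final Logger log = LoggerFactory.getLogger(EmailService.class);

    void sendEmail(String recipient) {
        // DEBUG: nitty-gritty detail, typically disabled in production.
        log.debug("Building email payload for {}", recipient);
        try {
            // ... actually send the email ...
            // INFO: normal, auditable behaviour.
            log.info("Email sent to {}", recipient);
        } catch (RuntimeException e) {
            // ERROR: something happened that should not have; link to alerting.
            log.error("Failed to send email to {}", recipient, e);
        }
    }
}

Because the level is part of each statement, you can run production at INFO and switch to DEBUG only while investigating an issue, which also limits the performance cost discussed above.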

Once you arm your service with properly structured logs, you will have the answer to the questions of “what is happening” and “why is it happening”.
But those questions are not sufficient.
Nowadays, the traffic on the internet is huge, and your service might be reached by millions of users. This means that you have to optimise your services to accommodate such traffic.
How do you measure this traffic and identify bottlenecks in your system? Say you are interested in knowing the execution time of one of your backend methods; how would you do that with logs? It would look like this:

void methodIWantToMeasure() {
    var startTime = System.currentTimeMillis();
    // ... the actual work we want to measure ...
    var endTime = System.currentTimeMillis();
    log.info("Method executed in: " + (endTime - startTime) + " ms");
}

This looks horrible. The problem with this approach is that the same free-form structure which gives logs their strength is also their limitation. If you are measuring the execution time of a method, you expect a unit of measurement, something like seconds or milliseconds. Even if we could structure our logs to reflect that information (at the expense of clarity), we would still have to write extra logic to parse it. On top of that, using logging for benchmarking can quickly deteriorate the performance of your service.

We need to explore another solution for this use case.

Metrics
The way to move forward is simple: we need to have data that is measurable.
In our previous example, we are interested in the name of the method and the time it takes for the method to complete, in milliseconds. The measurement would look something like:

{methodIWantToMeasure, 100, ms}

If you keep measuring the execution of this method, you will build up a relevant dataset that was previously inaccessible to you. For example, if we take these measurements over the last 24 hours, we can calculate the average method execution time within that time frame, or the time under which 99% of the executions of that method complete (a.k.a. the P99 value).
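As a back-of-the-envelope illustration of those two aggregations, here is a naive in-memory sketch (a real metrics backend stores and aggregates samples far more efficiently):

import java.util.List;

public class LatencyStats {

    // Average of raw duration samples, in milliseconds.
    static double averageMs(List<Long> samplesMs) {
        return samplesMs.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    // P99: the value under which 99% of the sorted samples fall.
    static long p99Ms(List<Long> samplesMs) {
        var sorted = samplesMs.stream().sorted().toList();
        int index = (int) Math.ceil(0.99 * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }

    public static void main(String[] args) {
        var samples = List.of(90L, 100L, 110L, 950L);
        System.out.println(averageMs(samples)); // 312.5
        System.out.println(p99Ms(samples));     // 950
    }
}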
There is a variety of measurements you can do in your service, giving you access to more nuanced analytics:

  • A counter that is incremented every time an action is performed, for example, an email being sent. This is useful if your service is expected to send a certain number of emails per day and that number suddenly drops to 0, or if something happens and your service starts sending too many emails (see the sketch after this list).
  • Latency of HTTP, Service, or Database calls.
  • The error rate of your application.
  • Memory usage of your application.
  • A percentile: a value that describes the distribution of your observations. The P99 value mentioned above is the value below which 99% of the observations fall.
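To show how little code this takes, here is a minimal sketch of the email counter from the first bullet, using Micrometer (one of the tools linked at the end of this article; the meter name and email scenario are illustrative):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class EmailMetrics {

    private final Counter emailsSent;

    EmailMetrics(MeterRegistry registry) {
        // A monotonically increasing counter; rates and sudden drops to 0
        // can then be derived and alerted on by the metrics backend.
        this.emailsSent = Counter.builder("emails.sent")
                .description("Number of emails sent by this service")
                .register(registry);
    }

    void onEmailSent() {
        emailsSent.increment();
    }

    public static void main(String[] args) {
        // SimpleMeterRegistry keeps metrics in memory; a real service would
        // use a registry that exports to a backend such as Prometheus.
        EmailMetrics metrics = new EmailMetrics(new SimpleMeterRegistry());
        metrics.onEmailSent();
    }
}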

You can take these metrics and create:

  • Visualisation dashboards: Easily accessible, they give you insights into the availability, reliability, and performance of your service.
  • Alerting: Imagine the P99 latency increasing over time; we would want to be notified of that increase and respond accordingly in case there is something wrong with our services.
[Image: an example metrics dashboard. Look at all the pretty colours.]

Instrumenting your service with metrics is a relatively lightweight exercise. If you use Quarkus, for example, it’s just a matter of appropriately annotating the methods you want to measure.

import org.eclipse.microprofile.metrics.MetricUnits;
import org.eclipse.microprofile.metrics.annotation.Timed;

@Timed(
    name = "myTimer",
    description = "A measure of how long it takes to foo.",
    unit = MetricUnits.MILLISECONDS)
void foo() {
    // hacking NASA
}

As you can see, it’s a significantly less invasive addition compared to writing the logic yourself.
If metrics are used properly, they give you an insight into “how is my application performing”. Armed with logs and metrics, you already have a pretty good toolkit in your arsenal to ensure that your system is operating well.
However…
Modern applications operate at a vast scale and with high traffic. In response, a new mentality and architecture has evolved that (if applied properly) is meant to scale better, isolate faults, deliver results faster, and much more.
This architecture is called microservices architecture.
In the microservices ecosystem, your application is divided into smaller services that operate independently but are orchestrated to work together. Because of this, it becomes increasingly complicated to trace a fault and have visibility over the system.
So we need another solution, one that gives us visibility over where the action is happening.

Traces
A trace is the journey a request takes across a service, and it can be broken down into multiple spans, each of which can represent a method, API, or database call. It is a visualisation tool that you can use to see if, for example, a specific method is taking too long or if you have too many database or method calls. The advantage of having traces as a visualisation tool is that once your request needs to cross from the context of your service into a different one, you can propagate the context along with it.
This way, when your call spans different microservices, you will see the journey of your request across your distributed system and can highlight potential microservices which might be a bottleneck.
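To make spans concrete, here is a minimal sketch using the OpenTelemetry API (linked at the end of this article; the service and span names are illustrative):

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutService {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    void processOrder() {
        Span span = tracer.spanBuilder("processOrder").startSpan();
        // Making the span current attaches it to the thread-local context, so
        // downstream calls (HTTP clients, database drivers) can pick it up and
        // propagate it to the next service.
        try (Scope scope = span.makeCurrent()) {
            // ... call the payment service, write to the database, etc. ...
        } finally {
            span.end();
        }
    }
}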

In such a visualisation, the capillarity of your request, that is, how it fans out across your services, becomes visible to you.

The drawback of using traces is that they are computationally heavy. Luckily, most of the tools we use to instrument our services with traces give you the option to control the rate at which traces are sampled. For example, you can choose to only trace 1 in 10, 50, or 100 requests to a method to ensure the system’s performance doesn’t degrade.
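With OpenTelemetry’s SDK, for example, such a sampling rate can be set on the tracer provider; the 10% ratio below is just an illustration:

import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class TracingSetup {

    static SdkTracerProvider buildTracerProvider() {
        return SdkTracerProvider.builder()
                // Keep roughly 1 in 10 traces; the decision is derived from the
                // trace ID, so all spans of one trace are kept or dropped together.
                .setSampler(Sampler.traceIdRatioBased(0.1))
                .build();
    }
}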

Observability
We’ve seen how logs show you “what is happening”, metrics show you “how is it performing”, and tracing shows you “where is it happening”. When you put all of these tools together, that’s when you get observability.
I hope the gains of implementing observability in your project are clear by now; however, there is still a small observation to be made. It might appear to you that observability is a monitoring solution, but that is an incorrect interpretation, and unfortunately a very widespread one. While all the tools that make it up can be used as monitoring solutions, observability is a mentality and a technical solution that shines best when trying to prevent and debug problems in your application.
Try using observability before pushing your changes for review. As annoying as it can be to delay your delivery, I can guarantee it will be even more annoying once your new feature creates a memory leak and wakes someone up in the middle of the night. And if you are pressured into delivering something which is broken, keep in mind that observability can also be used to surface accountability.

Some concrete tools that can help you orient yourself around what we discussed in this article:

Logs: https://logz.io/

Metrics: https://micrometer.io/

Traces: https://opentelemetry.io/docs/instrumentation/java/automatic/

