We No Longer Monitor: We Observe

Anselmo Abadía
Published in Flux IT Thoughts · Aug 21, 2020


In the last five years, the way we build software has changed drastically, and the evolution of certain practices and how they complement one another is clear. Development teams became multidisciplinary to provide comprehensive solutions. Systems became more distributed, breaking away from the classic monolithic architecture and shifting towards microservices or serverless architectures. Infrastructure moved from on-premises data centers to cloud services with automatic, on-demand provisioning. And the DevOps culture emerged, bringing a different way to face the challenges of pushing an application to production.

This new complexity brought along changes in how we keep track of our systems’ health. Understanding what happens in one big application is very different from understanding what happens in an app made up of 30 microservices, each of which can scale independently. That demands exerting control from several different perspectives at once. We no longer monitor: now we observe. The concept of “observability” has emerged.

So, What Is Observability?

We can define it as a measure of how well the internal states of a system can be inferred from its external outputs. A system is observable if its current state can be determined in a finite time period through its outputs alone.

IT infrastructure consists of hardware and software components that automatically keep records of every activity in the system: application events, database operations, security controls, and so on. The idea is to process those records to learn the state of the system.

The observability concept can be divided into three main topics: event logs, metrics, and traces. Each one of them addresses a different need.

Observability Elements

An event log is a record that describes something that happened in a system. Event logs are usually timestamped and contain a severity level plus a message that gives more context on the event. They are typically stored in files or emitted directly to the process’s standard output. They provide a complete and precise record of discrete events, including additional metadata on the state of the system at the time the event occurred. They used to be free-form text, but the trend now is to produce structured logs, so they can later be processed easily.
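To make this concrete, a structured log event could look something like the sample below (the field names are illustrative, not any particular standard):

```json
{
  "timestamp": "2020-08-21T14:03:22.512Z",
  "level": "ERROR",
  "service": "checkout-service",
  "message": "payment authorization failed",
  "request_id": "c1a2b3d4",
  "user_id": "u-4815"
}
```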

A metric, on the other hand, is a numeric value measured over a period of time. Unlike an event log, which records a specific event, a metric is a measured value derived from the system’s behavior. Metrics usually capture indicators such as request counts, available memory, service availability, etc. They also allow us to build charts, define thresholds, and trigger alerts when a situation becomes abnormal.
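As a minimal sketch of what this looks like in practice, here’s how a couple of metrics could be exposed with the Prometheus Python client (the metric names and port are made up for the example):

```python
from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical application metrics; Prometheus scrapes them periodically
# from the /metrics endpoint this process exposes.
REQUESTS_TOTAL = Counter(
    "app_requests_total", "Total HTTP requests", ["endpoint", "status"]
)
MEMORY_AVAILABLE = Gauge("app_memory_available_bytes", "Available memory in bytes")

def handle_request(endpoint: str) -> None:
    # ... do the actual work, then record the outcome ...
    REQUESTS_TOTAL.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
```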

A trace is a representation of a series of causally related, distributed events that take place across a network. Those events don’t have to happen in a single application, but they must be part of the same request flow. A trace can be presented as a list of event records taken from the different systems that took part in executing the request. In the microservices world, it is paramount to understand the call flow among services in response to a client request.
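As an illustration, this is roughly what instrumenting a request flow looks like with the OpenTelemetry Python API (the span and service names are invented, and the exporter setup that would ship spans to Jaeger or Zipkin is omitted):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def charge_card(order): ...    # placeholder business logic
def reserve_stock(order): ...  # placeholder business logic

def place_order(order):
    # Each span records one step of the request; spans created in other
    # services for the same request join the same trace through context
    # propagated in the call's headers.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order["id"])
        with tracer.start_as_current_span("charge_card"):
            charge_card(order)
        with tracer.start_as_current_span("reserve_stock"):
            reserve_stock(order)
```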

Why Should We Use Observability?

Because we need:

  • To detect and solve problems in an early and proactive way to avoid risks at the production level.
  • To implement changes safely while the whole environment is monitored.
  • To provide information to make adjustments to the applications and offer enhanced performance and user experience.
  • To provide information to optimize resource allocation.

How Are Metrics, Logs, and Traceability Related?

Each pillar has a precise goal and serves as a complement to understand what is going on in our system.

Metrics show us the big picture of the system’s general condition. They are the first thing we look at: they are usually organized in dashboards and let us quickly check the most important variables, such as service availability, server load, and the number of active users, among others.

But those metrics can show anomalous behavior, and this is where the other two pillars come into play: traceability and logs. They are our “double click” for understanding the underlying problem. Logs give us the fine details of a system, and traces let us pinpoint problems in a specific request’s flow.

In this context, it is key to have attributes that link the three elements, so we can easily move from a metric to the corresponding logs and the traced request. Those attributes must be named consistently across all three sources for this to work.
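One common way to do this (a minimal sketch, assuming OpenTelemetry is already set up as in the earlier example) is to inject the current trace id into every log record, so a dashboard can jump from a trace straight to its logs:

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Copies the current trace id onto each log record (illustrative sketch)."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
)
logger = logging.getLogger("checkout")
logger.addFilter(TraceContextFilter())
logger.addHandler(handler)
```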

Where Do We Start?

Choosing a Stack

It’s key to understand what our infrastructure is like to choose the right tool. At the end of the day, most of the time it’s a matter of costs.

Prometheus, Grafana: it’s the stack chosen by the open-source and Kubernetes communities. It is extremely solid and agile for handling metrics, and it’s the most widely used product for that purpose.

ELK or EFK: it’s the most versatile option when it comes to integrating and exploiting logs, although its full potential is only unlocked with the premium versions.

Datadog: it’s a SaaS that covers the three observability pillars in a single tool. It’s easy to integrate and its functionality evolves quickly, although at times it doesn’t reach the power that Grafana or Kibana offer.

Jaeger and Zipkin: these are tools focused on traceability. Both are open source, free, and widely used in microservices contexts.

Homogenizing Logs

It’s pointless to go to the trouble of installing a tool to collect, group, and manage logs if we later realize that the logs we generate don’t add any value. Writing quality logs is fundamental to later understanding what is going on with the application. Since those logs will be indexed, it’s key to send plenty of context along with the main message: date, time, information about the request, the user, the server executing the operation, etc.

Moreover, to simplify their processing, it’s advisable to generate logs in a structured way, for example using the JSON format. That speeds up parsing and indexing.
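As a minimal sketch with Python’s standard logging module (the context field names are just an example), structured logging can be as simple as a custom formatter:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line (illustrative sketch)."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via `extra=` shows up as attributes on the record;
        # pick up the ones this example cares about.
        for field in ("request_id", "user_id", "host"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"request_id": "c1a2b3d4", "user_id": "u-4815"})
```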

Defining Availability Metrics

If we want to quickly understand when our system gets into an undesired state, I suggest defining metrics that describe our application’s availability. This is where the functional aspect comes into play: sometimes metrics that count incoming requests are enough, but ideally we should track the availability of each service we offer individually and compare it against historical values.
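A sketch of what that could look like with the Prometheus client (hypothetical metric names): every handler increments a total counter and, on failure, a failure counter, and availability is then derived as a ratio over time.

```python
from prometheus_client import Counter

# Hypothetical per-service counters; availability per service can then be
# charted as 1 - (failure rate / request rate) and compared to past weeks.
SERVICE_REQUESTS = Counter(
    "service_requests_total", "All requests, per service", ["service"]
)
SERVICE_FAILURES = Counter(
    "service_failures_total", "Failed requests, per service", ["service"]
)

def observe(service: str, ok: bool) -> None:
    SERVICE_REQUESTS.labels(service=service).inc()
    if not ok:
        SERVICE_FAILURES.labels(service=service).inc()
```

In Prometheus, the availability of one service over the last five minutes could then be queried with something like `1 - rate(service_failures_total{service="payments"}[5m]) / rate(service_requests_total{service="payments"}[5m])`.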

Defining Alerts

Needless to say, we can’t spend the whole day staring at a dashboard to figure out whether things are going well or not. That is why we define alerts on anomalous values of a metric over a certain period. Nowadays it’s quite common to send these notifications through a channel like Slack and have the whole team keep an eye on them, very much in the spirit of the DevOps culture.
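In a real setup the alert rule would live in Prometheus Alertmanager or Grafana; purely as an illustration of the idea, here is a toy check that pushes a Slack notification through an incoming webhook (the URL and threshold are made up):

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # hypothetical webhook

def check_and_alert(error_rate: float, threshold: float = 0.05) -> None:
    """Naive threshold check; real alerting belongs in Alertmanager/Grafana."""
    if error_rate > threshold:
        requests.post(
            SLACK_WEBHOOK,
            json={"text": f"Error rate {error_rate:.1%} is above {threshold:.0%}"},
        )
```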

I think it’s important to highlight the cultural change in how we develop software. Developers should be aware of the full life cycle of the products they build, and that also includes understanding and knowing how their software is running.

Some people say that we’ve stopped talking about “monitoring” because it was a super boring Ops task and that they named it “Observability” instead because it sounds more “geek”.

Beyond the joke, I believe that observability or monitoring (whatever you call it) and its relationship with developers is crucial. It creates a constantly enriched feedback flow that shows which of the things they do work well and which don’t, so devs are always the first to know what is going on when a problem appears. It empowers them and makes them part of what they produce at all times.

Tools have evolved to the point where each player can have a detailed breakdown of the piece of software that belongs to them, and observability can be decentralized so that different aspects are analyzed at the same time. At the end of the day, having a good observability scheme and reacting to undesired situations as soon as possible has a direct impact on the quality of the service we provide.

