Observability

Mariane Coelho · Loft · Sep 8, 2020


Introduction

Before we define our tools and decide which to use for what, it's necessary to understand the difference between observability and monitoring. Observability consists in making information about the system available, for example, by creating logs in the application. Monitoring is the passive collection of that observable information about the system: if the application exposes logs, monitoring is what collects them.

Once this difference is defined, it's necessary to understand what data we need to collect, why we should collect it and where to collect it.

The WHAT

When you start working with an application, it is difficult to know which data is best to collect. You could collect everything about the application, but when an error appears it will be hard to know where to look to mitigate it. So you can start with the USE and RED methods to understand the application better and, from that, decide which data you should collect.

The USE method stands for Utilization, Saturation and Errors. It applies to hardware resources such as the network, hard disk, RAM and CPU. With this method you collect the utilization of the resource, that is, the average time the resource was busy servicing work; the saturation of the resource, that is, the degree to which it has extra work it can't service, often queued; and the count of errors. It's important to know that a low average utilization can still lead to saturation because of peaks.
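
To make this concrete, here is a minimal sketch of a USE-style snapshot in Python, assuming the psutil library is available; the exact fields and the use of the load average as a saturation signal are illustrative choices, not a prescription.

```python
# USE-style snapshot of a few hardware resources with psutil (an assumed dependency).
import psutil

def use_snapshot():
    load_1m, _, _ = psutil.getloadavg()                    # runnable/queued work signal
    net = psutil.net_io_counters()
    return {
        "cpu": {
            "utilization_pct": psutil.cpu_percent(interval=1),    # time the CPU was busy
            "saturation": max(0.0, load_1m - psutil.cpu_count()),  # load beyond capacity
        },
        "memory": {
            "utilization_pct": psutil.virtual_memory().percent,
            "saturation_swap_used_bytes": psutil.swap_memory().used,
        },
        "network": {
            "errors": net.errin + net.errout,                      # error count
        },
    }

if __name__ == "__main__":
    print(use_snapshot())
```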

The RED method stands for Rate, Errors and Duration. While the USE method applies to hardware, this one applies to services. With this method you collect the rate of requests, that is, the number of requests per second; the number of requests that result in errors; and the duration of those requests, that is, the amount of time they take.
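
A sketch of RED instrumentation in Python, assuming the prometheus_client library; the metric names and the wrapper function are illustrative:

```python
# RED metrics for a service endpoint with the Prometheus Python client (assumed dependency).
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Requests received", ["endpoint"])         # Rate
ERRORS = Counter("http_request_errors_total", "Requests that failed", ["endpoint"])  # Errors
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])  # Duration

def handle(endpoint, handler, *args, **kwargs):
    """Wrap a request handler so every call updates the three RED metrics."""
    REQUESTS.labels(endpoint=endpoint).inc()
    start = time.perf_counter()
    try:
        return handler(*args, **kwargs)
    except Exception:
        ERRORS.labels(endpoint=endpoint).inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.perf_counter() - start)
```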

The WHERE

Some good examples of where observability could be applied:

  • Logging requests and responses for specific webpage hits or API endpoints (if you're instrumenting an existing application, make a priority-driven list of specific pages or endpoints and instrument them in order of importance).
  • Measuring and logging all calls to external services and APIs, e.g. calls (or queries) to the database, cache, search service, etc. (a sketch follows this list).
  • Measuring and logging job scheduling and the corresponding execution, e.g. cron jobs.
  • Measuring significant business and functional events, such as users being created or transactions like payments and sales.
  • Measuring methods and functions that read from and write to databases and caches.
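
For instance, a call to an external dependency can be wrapped so that its duration and outcome are always logged. This is only a sketch; query_db below is a placeholder for your real client:

```python
# Hypothetical instrumentation of an external call (a database query here),
# combining timing with structured logging.
import logging
import time

logger = logging.getLogger("observability.example")

def timed_call(name, func, *args, **kwargs):
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        logger.info("%s succeeded in %.3fs", name, time.perf_counter() - start)
        return result
    except Exception:
        logger.exception("%s failed after %.3fs", name, time.perf_counter() - start)
        raise

# Usage (query_db stands in for your real database client):
# users = timed_call("db.query_users", query_db, "SELECT * FROM users")
```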

The WHY

A few reasons that make ‘observing the software system’ a good idea are:

  1. It enables the engineering teams to identify and diagnose faults, failures, and crashes.
  2. It enables the engineering teams to measure and analyze the operational performance of the system.
  3. It enables the engineering teams to measure and analyze the business performance and success of the system or its component(s).

Thus avoiding serious business and operational risks.

Pillars of Observability

Once you understand what, where and why to make your service observable, it's important to understand the three pillars that make it more observable.

Logging

Usually, the first step you take to make a service observable is logging. Logs describe the events happening in your system over time. Whatever the format, a log is a timestamp with a payload of some context. It's good practice to use log levels. The levels are debug, info, warning, error and fatal.

  1. Debug: describes in detail everything happening in your service. Use it only when diagnosing problems, that is, when you are debugging. Don't use this level in production. First, you shouldn't need it, since you shouldn't debug there (you might in staging, but not in production). Second, you can run into storage issues, because it makes your logs grow very fast. Third, it makes indexing take much longer, so consulting your logs will be slow.
  2. Info: corresponds to normal application behavior and milestones. It can be enabled in production, but that's usually not a good idea: it mostly shows the skeleton of what is happening, so you wouldn't pay much attention to it. Typically you enable this level in production when an application is new, so you can learn how it behaves, and disable it as soon as you have.
  3. Warning: indicates that you might have a problem and have detected an unusual situation. It's unexpected, but no real harm was done, and it's not known whether the issue will persist or recur. These are worth investigating eventually.
  4. Error: indicates that a serious issue is happening, but the application can still continue running. This requires someone's attention as soon as possible.
  5. Fatal: indicates a serious issue, like error, but in this case the application will abort, probably to prevent some kind of corruption or another serious problem. This log requires immediate attention.

These levels work as a hierarchy: if you set the log level to debug, all the other levels will appear too. The list is in order, so if you set it to error, only error and fatal logs will be shown. In other words, use the lowest level you want to receive logs about as the log level for your application. For production environments, that's usually warning.
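
As a small illustration with Python's standard logging module (where fatal is called CRITICAL), setting the level to WARNING keeps warning, error and critical records and drops the rest:

```python
# Log-level hierarchy with Python's standard logging module.
import logging

logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("payments")

log.debug("raw request payload: %s", {"amount": 10})    # dropped at WARNING level
log.info("payment flow started")                        # dropped at WARNING level
log.warning("retrying payment gateway call")            # kept
log.error("payment gateway returned an error")          # kept
log.critical("payment service cannot start, aborting")  # kept (the "fatal" level)
```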

Logs are immutable, detailed and extensive files. Because of that, they are very expensive to process and ship in most cases. If you are running multiple servers, you need to aggregate the logs carefully in a central location; otherwise it can be extremely hard to check each of them.

Metrics

Metrics are a numeric representation of data measured over intervals of time. They are optimized for storage and processing of the data, as they are just numbers aggregated over intervals of time. Metrics can harness the power of mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time in the present and future.

Since numbers are optimized for storage, we can enable longer retention of data and easier querying. This makes metrics perfectly suited to building dashboards that reflect historical trends. Metrics also allow for gradual reduction of data resolution. After a certain period of time, data can be aggregated into daily or weekly frequency.
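
As a rough sketch of that resolution reduction, assuming metrics stored as (timestamp, value) pairs, fine-grained samples can be rolled up into daily averages:

```python
# Downsample per-minute metric samples into daily averages.
# Samples are (unix_timestamp, value) pairs; this data layout is an illustrative assumption.
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

def downsample_daily(samples):
    buckets = defaultdict(list)
    for ts, value in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        buckets[day].append(value)
    return {day: mean(values) for day, values in sorted(buckets.items())}
```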

Metrics help define service-level indicators (SLIs), service-level objectives (SLOs), and service-level agreements (SLAs). These tools help you know whether your system is reliable, available and useful. They should reflect your business objectives and help you make choices to improve the service.

Service Level Agreement (SLA)

An SLA normally involves a promise to someone using your service that it will meet some goal over a certain period, and if it fails to do so, some kind of penalty will be paid. For example, your SLA specifies that your system will be available 99.95% of the time; if it isn't, you have to pay a fine.

At Loft, we don't use SLAs, just OKRs to make agreements. OKR stands for Objectives and Key Results. OKRs are a tool used by individuals, teams, and companies to set goals that maximize alignment and transparency when pursuing ambitious targets. You can use SLAs and OKRs at the same time. For the example above, say your application is 50% available; you can create an OKR that says “Make the application's availability grow from 50% to 99.95%”.

Service Level Objective (SLO)

An SLO is the precise numerical target for system availability, that is, the precise numerical target inside the SLA. Any discussion you have in the future about whether the system is running sufficiently reliably, and what design or architectural changes you should make to it, must be framed in terms of your system continuing to meet this SLO. For the SLA example above, which specifies that your system will be available 99.95% of the time, your SLO is likely 99.95% uptime.

Service Level Indicator (SLI)

While the SLA and SLO involve a goal and whether or not it's being met, the SLI indicates what is actually going on with your system. An SLI measures compliance with an SLO. In the example above, your SLI is the actual measurement of your uptime. To stay in compliance with your SLA, the SLI needs to meet or exceed the promises made in that document.
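
Putting the three together in a tiny sketch (the request counts are made up, and the 99.95% target mirrors the example above):

```python
# An availability SLI computed from request counts and checked against the SLO.
def availability_sli(successful_requests: int, total_requests: int) -> float:
    return 100.0 * successful_requests / total_requests if total_requests else 100.0

SLO_TARGET = 99.95  # the numerical objective; the SLA adds penalties on top of it

sli = availability_sli(successful_requests=999_620, total_requests=1_000_000)
print(f"SLI = {sli:.3f}%, SLO met: {sli >= SLO_TARGET}")
```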

Tracing

A trace is made up of spans, each representing an execution of code. Tracing is about understanding the path of data as it propagates through the components of our application. A trace is a representation of logs; the data structure of a trace looks almost like that of an event log. When you combine the spans from different services of a distributed system, you can see the end-to-end flow of an execution path. With traces, you can tell which part of the code in the system is taking more time to process inputs. Traces are most useful in distributed systems, where it's hard to connect user calls.

Tracing works like this: system A receives a request. Since you want to trace this request, system A adds a parameter to the request's headers; let's call this parameter id. Now you can identify the request by this id. When system A calls another service, let's say system B, it sends the same id in the header. If more systems need to be called, they all receive the same id. Finally, once the response goes back to the user, you have a log with this id for each service the request went through. Then you just need to aggregate the logs to get the trace of your request.
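
A minimal sketch of that propagation, assuming Flask for system A and the requests library for the call to system B; the X-Request-ID header name and the downstream URL are illustrative:

```python
# Propagate a trace id from system A to system B via an HTTP header.
import uuid
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # Reuse the incoming id if present, otherwise start a new trace.
    trace_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    app.logger.info("checkout started trace_id=%s", trace_id)

    # Forward the same id to system B so its logs can be aggregated by trace_id.
    response = requests.get(
        "http://system-b.internal/payments",  # illustrative downstream service
        headers={"X-Request-ID": trace_id},
        timeout=5,
    )
    return {"trace_id": trace_id, "payment_status": response.status_code}
```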

Monitoring

When you make your system observable you want to do something with its information. That's where monitoring comes in. Monitoring exists to detect errors or anomalies, and when you have an error, it helps you find its root cause. It's important to understand that monitoring only aims to provide a broad view of anything that can usually be measured based on what you can specify in external configuration.

With this you can visualize, analyze and set up alerts for your application. The first step in monitoring your application is to visualize all the available data. At Loft, we use three tools: New Relic, Grafana and Rollbar. Using these tools you can analyze the behavior of your application, and they help you debug any errors that come up. After you visualize and analyze your service, you can add alerts to it.

Alerts

Alerts are very important for preventing the application from crashing and for detecting any abnormality. When you create an alert, keep in mind that you need a plan for what should be done when the alarm is triggered. It's not good practice to have an alert without an action plan, so avoid that. For example, if you have an alert that tells you a specific page is slow, you need to know what to do when it fires. In this case, the action plan could be to check the tracing to see which service is slow, and then check the logs to find out what happened.

You can create an alert for a metric. For example, say you measure the percentage of requests that return 500. You can create an alert that tells you when this percentage exceeds the acceptable percentage, usually defined by the SLO. If this happens, you should go find out which page is returning 500 and fix it.
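
A sketch of such a check, with a made-up threshold and a placeholder notify_on_call function standing in for your real alerting tool:

```python
# Alert when the 5xx error rate goes above the acceptable threshold (values are illustrative).
ERROR_RATE_THRESHOLD_PCT = 0.05  # e.g. derived from a 99.95% availability SLO

def notify_on_call(message: str) -> None:
    # Placeholder: in practice this would page someone or post to your alerting tool.
    print("ALERT:", message)

def check_error_rate(responses_5xx: int, total_requests: int) -> None:
    error_rate = 100.0 * responses_5xx / total_requests if total_requests else 0.0
    if error_rate > ERROR_RATE_THRESHOLD_PCT:
        notify_on_call(f"5xx rate at {error_rate:.2f}% (threshold {ERROR_RATE_THRESHOLD_PCT}%)")
```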

Alerts are classified in two levels: Notification and On Call. The notification level means that an intervention is necessary, but not right away: when you see it, you fix it as soon as you finish what you are doing. For the On Call level, the intervention is necessary right away. The latter is the most urgent type of alert and should receive special treatment, being escalated to a pager so it urgently gets human attention.


Finally, follow the link to our open positions and join us! https://jobs.lever.co/loft
