Observability, Availability & DORA’s Research Program

Hamza Faouri

Published in

Alteos Tech Blog

5 min read · Feb 17, 2021


Observability and Availability

Observability has become a popular term, and the practice behind it is essential to deploying and running distributed systems. There are several definitions for it:

For example, in control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

Another definition is more specific to distributed systems, where observability is a tooling or a technical solution that allows teams to debug their system actively by exploring properties and patterns not defined in advance.

The term was mentioned in a blog post about Twitter in 2013. As a general concept, observability comprises a few practices that enable teams to improve the visibility of distributed systems and help uncover issues when customers notice them or, ideally, before then.

The idea of observability is also implemented as a framework in OpenTelemetry, a CNCF project.

Observability and monitoring practices help uncover and address availability problems in distributed systems. They provide indicators of service unavailability and degradation, detect outages and bugs, identify long-term trends for scalability and capacity planning, and expose the side effects of changes.

https://www.devops-research.com/research.html

The DevOps Research

The DevOps Research and Assessment (DORA) program at Google has shown that high performance in software development and delivery can be achieved by improving technical and cultural capabilities.

This has benefits on multiple levels: these capabilities predict or influence outcomes such as engineer burnout, service availability, and deployment pain.

According to the study referenced above:

“Monitoring and observability are one of a set of capabilities that drive higher software delivery and organizational performance. The research states that installing a tool is not enough to achieve the objectives, but tools can help or hinder the effort. Monitoring systems should not be confined to a single individual or team within an organization.
Empowering all developers to be proficient with monitoring helps develop a culture of data-driven decision making and improves overall system debuggability, reducing outages.”

Observability Factors in High Performance

An important factor that impacts the performance of teams is the time to restore (TTR), which is the average time it takes to recover from a product or system failure.

A key contributor to TTR is the ability to rapidly understand what broke and identify the quickest path to restoring service (which may not involve immediately remediating the underlying problems).
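As a rough illustration of the TTR definition above, the following Python sketch averages the time between a failure starting and service being restored. The incident timestamps and the function name `mean_time_to_restore` are hypothetical and only serve the example.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average duration between failure start and service restoration."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (restored - started for started, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (failure started, service restored)
incidents = [
    (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 9, 30)),     # 30 minutes
    (datetime(2021, 1, 19, 14, 0), datetime(2021, 1, 19, 15, 30)),  # 90 minutes
]

ttr = mean_time_to_restore(incidents)
print(ttr)  # 1:00:00
```

Tracking this number over time is what makes it useful: a falling average suggests that failures are being understood and resolved faster.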

Tools

At Alteos, we embarked on a journey to define how our monitoring tools and systems would help us improve our TTR and make data-driven decisions about distributed systems design and rolling out changes.

Many factors matter when choosing a monitoring tool in the age of cloud-native applications and containerized workloads.

For pull-based tools such as Prometheus, the way that service discovery works is important. Running services in containers with ever-changing addresses and DNS names makes it challenging to monitor the atomic unit of an application (the pod). Similar logic applies to dynamically provisioned EC2 instances, e.g., spot instances.
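To make the service discovery point concrete, here is a minimal Python sketch of a single pull cycle that looks up its targets again before every scrape. It is not how Prometheus is implemented; `scrape_cycle`, `fake_discover`, and `fake_fetch` are hypothetical stand-ins for a scraper, an orchestrator API, and a metrics endpoint.

```python
from typing import Callable, Dict, List


def scrape_cycle(
    discover: Callable[[], List[str]],
    fetch: Callable[[str], float],
) -> Dict[str, float]:
    """Run one pull cycle: re-resolve targets, then scrape each one."""
    results: Dict[str, float] = {}
    for target in discover():  # targets are looked up fresh on every cycle
        try:
            results[target] = fetch(target)
        except ConnectionError:
            # A target that vanished between discovery and scrape is skipped.
            continue
    return results


# Hypothetical pod addresses as an orchestrator might report them.
pods = ["10.0.0.5:8080", "10.0.0.6:8080"]


def fake_discover() -> List[str]:
    return list(pods)


def fake_fetch(target: str) -> float:
    if target not in pods:
        raise ConnectionError(target)
    return 1.0


first = scrape_cycle(fake_discover, fake_fetch)
pods[:] = ["10.0.0.9:8080"]  # pods were rescheduled and got new addresses
second = scrape_cycle(fake_discover, fake_fetch)
```

Because the target list is never cached across cycles, the second scrape follows the rescheduled pod without any configuration change.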

Another important factor is that the monitoring system itself should be highly available. There is no point in using an unreliable tool, since it cannot provide trustworthy insights into the reliability of the rest of your systems and services. That being said, one must still plan for the monitoring system failing and make sure that there is a proper response to that failure (monitoring the monitoring).
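One common way to "monitor the monitoring" is a heartbeat check: the monitoring system sends a periodic signal to an independent watchdog, and silence is treated as failure. The sketch below assumes this pattern; the class name `HeartbeatWatchdog` and the 120-second threshold are illustrative choices, not something prescribed by any particular tool.

```python
from typing import Optional


class HeartbeatWatchdog:
    """Flags the monitoring system as down when its heartbeat goes quiet."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.last_beat: Optional[float] = None

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the monitoring system."""
        self.last_beat = now

    def is_stale(self, now: float) -> bool:
        """Return True if no heartbeat arrived within the allowed window."""
        if self.last_beat is None:
            return True
        return now - self.last_beat > self.max_silence_seconds


watchdog = HeartbeatWatchdog(max_silence_seconds=120)
watchdog.beat(now=1000.0)

healthy = not watchdog.is_stale(now=1060.0)  # 60 s of silence: still fine
down = watchdog.is_stale(now=1200.0)         # 200 s of silence: raise an alert
```

The watchdog should run outside the monitoring stack it observes, otherwise both can fail together.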

*Prometheus Brings Fire* by Heinrich Friedrich Füger.

Storing time-series data in a time-series database (TSDB) is not a trivial task, and it is critical for historical analysis. As of today, Prometheus has an “opinionated way” of running a highly available setup to persist the data. That is, you run three instances of Prometheus with three isolated data stores, and then rely fully on other components to deduplicate the data later on.

This approach has drawbacks, not least that deduplication is left to other components. If it is not suitable for you, look for alternatives that store Prometheus data with proper read/write capabilities, using the remote_write configuration of Prometheus (which takes the URL of the remote endpoint).
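The deduplication step mentioned above can be pictured with a small Python sketch that merges the output of several replicas and keeps one sample per series and timestamp. It assumes that replicas produce samples on identical timestamps, which real Prometheus replicas generally do not, so production systems use more tolerant strategies. The `Sample` type and the `deduplicate` function are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, int, float]  # (series id, timestamp in ms, value)


def deduplicate(replicas: Iterable[List[Sample]]) -> List[Sample]:
    """Merge replica outputs, keeping the first sample per (series, timestamp)."""
    seen: Dict[Tuple[str, int], float] = {}
    for samples in replicas:
        for series, ts, value in samples:
            # setdefault keeps the value from the first replica that reported it
            seen.setdefault((series, ts), value)
    return sorted((series, ts, value) for (series, ts), value in seen.items())


series = "up{job='api'}"
replica_a = [(series, 1000, 1.0), (series, 2000, 1.0)]  # missed the scrape at 3000
replica_b = [(series, 1000, 1.0), (series, 3000, 0.0)]  # missed the scrape at 2000

merged = deduplicate([replica_a, replica_b])
```

The merged result contains each timestamp once, and gaps in one replica are filled from the other, which is the benefit of running redundant instances in the first place.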

Thanos, for example, is one option we considered. It appears to complement Prometheus with all of the data availability features it lacks. However, the problem we faced with this solution is that the complexity skyrockets. Take a look at an architecture diagram here:

Another option for a highly available data store is Cortex, an open-source project that provides a highly available read and write store for Prometheus. The Cortex project decided to rely on other data stores that are already available and battle-tested. For example, you can choose to write time-series data to S3 or DynamoDB via Cortex itself. But then again, we found that this lands us in a complexity trap and, on top of that, storing data in DynamoDB or S3 is costly (either many read/write operations or high capacity settings for tables).

Here is a diagram representing the Cortex components:

Eventually, we found a solution that offers a simple and robust TSDB implementation while still being cost-effective. The project is called VictoriaMetrics, and it is written in Go.

The solution is based on a shared-nothing architecture. You can find out more about this awesome project on its homepage. Here is a look at the architecture of the tool:
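In code terms, shared-nothing means that each storage node owns its own data and never coordinates with its peers; a routing layer decides which node receives each series. The Python sketch below illustrates that idea with simple modulo hashing. It is not VictoriaMetrics' actual routing code, and `StorageNode` and `pick_node` are hypothetical names.

```python
import hashlib
from typing import Dict, List


class StorageNode:
    """Owns its own data and never talks to other storage nodes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.samples: Dict[str, List[float]] = {}

    def write(self, series: str, value: float) -> None:
        self.samples.setdefault(series, []).append(value)


def pick_node(series: str, nodes: List[StorageNode]) -> StorageNode:
    """Route a series to one node using a stable hash of its name."""
    digest = hashlib.sha256(series.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = [StorageNode("node-0"), StorageNode("node-1"), StorageNode("node-2")]
for series, value in [("cpu_usage", 0.4), ("cpu_usage", 0.5), ("mem_usage", 0.7)]:
    pick_node(series, nodes).write(series, value)

owner = pick_node("cpu_usage", nodes)
```

Since the hash is stable, every sample of a given series lands on the same node, and adding capacity is a matter of adding nodes rather than coordinating shared state.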

Conclusion

Observability and monitoring give engineering teams the capabilities to sharpen their competitive edge and to improve many performance metrics, for both the team and its systems.

Many challenges and considerations need to be anticipated when choosing monitoring tools, since availability and cost matter to most companies.

About the Author:

Currently leading the platform engineering team at Alteos, implementing mission-critical cloud infrastructure, responsible for security best practices, and taming the chaos of distributed systems.