Observability — Tooling Decision Guide

Some essential aspects to consider before choosing a tool.

Tiago Dias Generoso
Published in Cloud Native Daily
9 min read · Mar 2

As I mentioned in my previous article comparing Monitoring with Observability, it is challenging to transform an old monitoring environment into an Observability (O11y) solution because the differences are enormous. If the old tools cannot provide what we need, the only way forward is to introduce new ones.

At that point, you need to decide which Observability solution will satisfy your needs. In this article, I will cover some essential aspects to make your life easier in this phase. We can split the evaluation into two phases: Understanding the Scenario and Technical Assessment.

Observability is a “socio-technical” area; people should work closely with technical stuff, side by side; without this closeness, it will be impossible to understand holistically what is going on. Without a good culture where everybody works toward the same target without the need to blame people, you will have difficulties implementing Observability.
— Tiago Dias Generoso

Because of that, we should put significant effort into understanding the company, its objectives, its culture, and its infrastructure; this is the most important step toward successfully deciding which solution to use.

💡 For instance, traditional observability methods might fall short when dealing with microservices; this is where OTel distributed tracing comes into play.

Understanding the Scenario

At this point, I aim to show some important things you should learn about the company that will impact your analysis.

Why do they need Observability? What results are expected?
We need to understand why the company wants an Observability solution: to reduce outages and improve reliability, or to reduce operational costs.

It is also crucial to understand whether the tool should be integrated with other solutions, such as a Domain Agnostic AIOps tool, Application Resource Management (ARM), or FinOps tools to control costs and improve sustainability.

Is the company under specific policies? This is another crucial evaluation when deciding which tool to use; you need to be sure the tool is compatible with regulations and policies such as the International Traffic in Arms Regulations (ITAR), the Health Insurance Portability and Accountability Act (HIPAA), and the General Data Protection Regulation (GDPR). In addition, some of those policies may require an on-premise solution, so a tool that does not provide one should be ruled out.

Security Policy Compliance

How do they want to operate the tool? How is the company organized?

You should consider this carefully to avoid problems during implementation. You should understand whether a single team will centralize the tool management, whether multiple teams need to manage monitoring by themselves with access isolated to their own applications, and whether they use methodologies such as SRE, DevOps, or FinOps.

This will influence the choice of tool because, depending on the result of this assessment, the tool should support multi-tenancy or fine-grained Role-Based Access Control (RBAC). Believe me, most of the top solutions don’t have good RBAC capabilities, which prevents you from adapting the tool’s operations to the way you need.

The image below shows two important capabilities to have if you need to distribute the administration. The right side is a single tenant that gives different teams access, with different permissions, to a piece of infrastructure or a specific application. The left side is a multi-tenant environment where you can manage the solution centrally while creating isolated tenants, which gives more deployment flexibility.

RBAC and Multi-Tenancy
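
To make the difference between the two models concrete, here is a minimal sketch of how fine-grained RBAC inside a tenant and isolation between tenants could be modeled. It is not any vendor’s API: the role names, scope strings (such as "app:checkout"), team names, and the `can` helper are all hypothetical and only illustrate the kind of questions you should be able to answer with the tool you evaluate.

```python
from dataclasses import dataclass, field

# Hypothetical role names and the actions each role permits.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "editor": {"read", "edit_alerts"},
    "admin": {"read", "edit_alerts", "manage_access"},
}


@dataclass(frozen=True)
class Grant:
    team: str   # who receives the permission
    scope: str  # a piece of infrastructure or an application (hypothetical naming)
    role: str   # key into ROLE_ACTIONS


@dataclass
class Tenant:
    name: str
    grants: list[Grant] = field(default_factory=list)

    def allows(self, team: str, scope: str, action: str) -> bool:
        """Fine-grained RBAC: a team may act only on scopes it was granted."""
        return any(
            g.team == team and g.scope == scope and action in ROLE_ACTIONS[g.role]
            for g in self.grants
        )


def can(tenants: dict[str, Tenant], tenant: str, team: str, scope: str, action: str) -> bool:
    """Multi-tenancy: a request is evaluated only inside the named tenant."""
    t = tenants.get(tenant)
    return t is not None and t.allows(team, scope, action)


tenants = {
    "retail": Tenant("retail", [
        Grant("checkout-team", "app:checkout", "editor"),
        Grant("checkout-team", "infra:k8s-prod", "viewer"),
        Grant("platform-team", "infra:k8s-prod", "admin"),
    ]),
    "banking": Tenant("banking", [
        Grant("payments-team", "app:payments", "admin"),
    ]),
}

print(can(tenants, "retail", "checkout-team", "app:checkout", "edit_alerts"))    # True
print(can(tenants, "retail", "checkout-team", "infra:k8s-prod", "edit_alerts"))  # False
print(can(tenants, "banking", "checkout-team", "app:checkout", "read"))          # False
```

If a tool cannot express something equivalent to the second check (a team that can read shared infrastructure but not change its alerts), it probably lacks the fine-grained RBAC discussed above.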

Technical Assessment

With an excellent understanding of the company strategy, you can start the evaluation at the technical level. I will cover some important criteria I have used, but you can add more or remove some of the ones I picked; everything will depend on your scenario.

Note: I will not cover Observability concepts here; if you need to learn more to understand some of the evaluations, please look at my other articles, “Observability Concepts you should know” and “Observability Concepts you should know Part 2”.

Billing Model
Unless you decide to use an open-source backend solution, it will cost you, and generally it is expensive enough that it is vital to understand how the vendors will bill you. I have seen many billing models, but I will explain the main four:

  • Charge per event: you are charged based on the number of events the tool generates.
  • Charge per active service: you are charged based on the number of services the tool monitors, regardless of whether they use replicas or clusters.
  • Charge per host: you are charged based on the number of hosts being monitored; some vendors define ‘host units’ to specify what they consider a host, for example, based on RAM size.
  • Charge per data ingestion: you are charged based on the amount of data collected by the tool, generally measured in gigabytes (GB).

Keep your eyes open for extra hidden costs, such as data retention costs.

Technical Support
Depending on the size of the company where you are implementing the tool, and the type of project and targets you have, technical support can be decisive for the project’s success.

Work to understand how good the vendor is at providing basic support: the SLAs for tickets, the support channels available, and whether there is a dedicated account team to help you with big problems. Also check whether they offer premium support; everything will depend on your needs.

Another critical evaluation is understanding whether they offer expert services, such as a dedicated engineer for a specific period. This can be important when you have a big project and need more people to complete it by the target date.

Instrumentation
Instrumentation can be manual or automatic, and vendors use either multiple agents or a single-agent (“one agent”) concept. You need to understand how the vendor instruments the applications.

Automatic instrumentation can speed up your implementation when you have to instrument many applications and do not have a team with the knowledge to instrument them manually. On the other hand, it limits the technologies you can instrument to the vendor’s supported list, so it is crucial to check whether the vendor supports OpenTelemetry and custom metrics for the cases where automatic instrumentation is not available for a specific technology.

One agent can be helpful because the agent will detect what we have on the nodes and instrument everything, but if you need to control what gets instrumented, instrumenting everything by default can cause extra workload. On the other hand, multiple agents will cause additional workload due to the need to install each of them.

OpenTelemetry Support
OpenTelemetry (OTel) is an essential item in all Observability projects; it standardizes data collection, allowing you to choose just the backend, which reduces vendor lock-in and improves the quality of the data collected from applications.

I explain OTel in more detail in this other article: Observability Concepts you should know Part 2.

Tools can support OTel in two ways: by ingesting data in the OpenTelemetry Protocol (OTLP) format or through their own OTel exporter.

You should evaluate whether the tool can manage the data collected by OTel the same way it manages other data, because some solutions have limitations when handling OTel data.

On-Premise Option
I am a big fan of using SaaS solutions as much as possible; I can’t see many advantages to having on-premise solutions, so I only consider on-premise solutions where we can’t use SaaS.

In situations where SaaS solutions are not allowed, for reasons such as security policies, this evaluation is crucial. That is why it is always important to ask the security team whether the company has any security policies that prevent you from using SaaS solutions for monitoring.

If so, you should find a solution that offers on-premise installation.

Integration
The flow below shows some integrations you may need to deliver a complete solution, including integrations with IT Operations Management (ITOM), IT Service Management (ITSM), Application Resource Management (ARM), AI for Operations (AIOps), and notification and automation tools.

If you need to integrate with existing solutions or plan to do so, you must evaluate how well the solution manages the integrations.

Monitoring Mechanisms
I like to split the mechanisms into Application Performance Monitoring (APM), Synthetic Monitoring, and Real User Monitoring; I explained the differences in this other article: Observability Concepts you should know.

Once you understand each of them, you can evaluate whether you need all of them and compare each one among vendors. If you need all three, be sure the vendor can offer them.

Infrastructure Monitoring
Some solutions can provide wonderful features for application monitoring, but if you have a traditional environment, they may not provide what you need regarding infrastructure monitoring.

If you have just Cloud Native applications, I don’t think you need to worry about it. Still, if you are in a hybrid scenario, you should check whether the tool can support your old infrastructure, is compatible with all the technologies you are using, and so on.

If they can support your old infrastructure, you can avoid having a big headache trying to customize, adapt, or use two different tools to monitor your environment.

Comparison

To help you start the evaluation, I created the table below, where you can rank the solutions based on the criteria you consider the most important.

As you can see in the table, I split the importance into four levels (High, Medium, Low, and Optional) and assigned points to each: a High criterion is worth 4 points, and an Optional one is worth only 1 point.

Example Table to compare the features
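
The same ranking can be done in a few lines of code if you prefer a script to a spreadsheet. This is only a sketch of the scoring idea: the article states 4 points for High and 1 for Optional, so the 3 and 2 used for Medium and Low are my assumption for the levels in between, and the criteria and tool names are hypothetical examples.

```python
# Points per importance level. High = 4 and Optional = 1 come from the text;
# Medium = 3 and Low = 2 are assumed values for the levels in between.
WEIGHTS = {"High": 4, "Medium": 3, "Low": 2, "Optional": 1}

# Hypothetical criteria and how important each one is for your scenario.
CRITERIA = {
    "OpenTelemetry support": "High",
    "Fine-grained RBAC": "High",
    "On-premise option": "Medium",
    "Synthetic monitoring": "Low",
    "Dedicated engineer": "Optional",
}

# Hypothetical tools and the criteria each one satisfies.
TOOLS = {
    "Tool A": {"OpenTelemetry support", "Fine-grained RBAC", "Synthetic monitoring"},
    "Tool B": {"OpenTelemetry support", "On-premise option", "Dedicated engineer"},
    "Tool C": set(CRITERIA),  # satisfies everything
}


def score(supported: set) -> int:
    """Sum the points of every criterion the tool satisfies."""
    return sum(
        WEIGHTS[importance]
        for criterion, importance in CRITERIA.items()
        if criterion in supported
    )


def rank(tools: dict) -> list:
    """Return (tool, score) pairs, best score first."""
    scored = [(name, score(supported)) for name, supported in tools.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, points in rank(TOOLS):
    print(f"{name}: {points}")
# Tool C: 14
# Tool A: 10
# Tool B: 8
```

The total is only a guide: a tool with a high score that misses one of your must-have criteria should still be discarded.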

Conclusion

Deciding which tool to use is a challenging task, but it can be decisive for the success of your observability project; a wrong decision can lead you to spend more effort than you planned, cause frustration, and cost you money.

On the other hand, there is no single right or wrong answer to all your questions; the most critical thing I recommend is avoiding solutions that cannot satisfy some of your biggest priorities, for example, security.

Because tooling comparison is a labor-intensive activity, try to compare only a few tools by eliminating some early, for example, by ruling out the ones outside your budget at the beginning.
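
This early elimination step can be sketched as a simple filter on hard constraints, applied before any detailed comparison. The candidate names, costs, budget, and compliance sets below are hypothetical placeholders; the point is only that a tool failing a must-have is dropped together with the reason why.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    monthly_cost: float    # estimated from the vendor's billing model
    on_premise: bool       # does the vendor offer an on-premise installation?
    compliance: frozenset  # regulations the vendor states it supports


@dataclass(frozen=True)
class Requirements:
    monthly_budget: float
    needs_on_premise: bool
    compliance: frozenset


def rejection_reasons(c: Candidate, req: Requirements) -> list:
    """List every hard constraint the candidate fails (empty means it passes)."""
    reasons = []
    if c.monthly_cost > req.monthly_budget:
        reasons.append("over budget")
    if req.needs_on_premise and not c.on_premise:
        reasons.append("no on-premise option")
    missing = req.compliance - c.compliance
    if missing:
        reasons.append("missing compliance: " + ", ".join(sorted(missing)))
    return reasons


def shortlist(candidates: list, req: Requirements) -> tuple:
    """Split candidates into those kept and those rejected with reasons."""
    kept, rejected = [], {}
    for c in candidates:
        reasons = rejection_reasons(c, req)
        if reasons:
            rejected[c.name] = reasons
        else:
            kept.append(c.name)
    return kept, rejected


# Hypothetical requirements and candidates.
req = Requirements(10_000.0, True, frozenset({"GDPR", "HIPAA"}))
candidates = [
    Candidate("Tool A", 8_000.0, True, frozenset({"GDPR", "HIPAA", "ITAR"})),
    Candidate("Tool B", 12_000.0, True, frozenset({"GDPR", "HIPAA"})),
    Candidate("Tool C", 6_000.0, False, frozenset({"GDPR"})),
]

kept, rejected = shortlist(candidates, req)
print(kept)      # ['Tool A']
print(rejected)
```

Only the tools that survive this filter need the detailed scoring and, later, a Proof of Concept.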

Most vendors can offer you a Proof of Concept (POC), which I highly recommend doing, but only after the paper assessment and just for the tools on your shortlist, because it will consume time.

I know there are many other things we could include in this guide; I included the ones I consider the most important and will keep the list open, but I am confident this guide will support your decisions.

Tiago Dias Generoso is a Distinguished IT Architect | Senior SRE | Master Inventor based in Pocos de Caldas, Brazil. The above article is personal and does not necessarily represent the employer’s positions, strategies or opinions.

Distinguished IT Architect | Senior SRE specialized in Observability with 20+ years of experience helping organizations strategize complex IT solutions. Kyndryl