Defining Inventa's Observability Strategy

Marcos Gomes · Published in Building Inventa · May 25, 2023 · 6 min read

When I arrived at Inventa, one of my primary responsibilities was to set up monitoring for our applications and infrastructure so we could better understand how they behaved.

At that point, Inventa was about a year old and we were transitioning from a serverless Lambda architecture to a microservices architecture on Kubernetes (EKS), and we had no strategy for observability, alerts, or logs. You can read more in How we developed our deployment model.

So we started investigating a solution that was modern, that would allow us to scale, and whose price wouldn’t become a pain for the business as we grew.

Based on these requirements, we structured an RFC, and the result narrowed things down to three possible options.

Zabbix: Simple and easy to implement, with an agent running on each instance, but the tool was originally created to monitor networks and has only evolved over time to keep up with highly dynamic systems built on containers and microservices. Since we were starting from scratch, we understood that something more robust and modern could better meet our needs in the short, medium, and long term, so we ruled it out immediately.

We then did a more in-depth comparison between Prometheus and Elasticsearch, analyzing the following aspects:

Purpose:

  • Prometheus: a real-time monitoring tool for distributed systems, designed to collect, store, and query metrics in real time.
  • Elasticsearch: a search and data analysis engine that can be used to store, search, and analyze structured and unstructured data at scale.

Storage of Data:

  • Prometheus: stores data in a time series format.
  • Elasticsearch: uses an inverted index engine to store data.

This means that Prometheus is optimized for storing and querying time series data, while Elasticsearch is more suitable for search and unstructured data.

Scalability:

  • Prometheus: is highly scalable and designed to perform well in distributed environments.
  • Elasticsearch: is also scalable, but requires more configuration and tuning to perform well in distributed environments.

Queries and Analysis:

  • Prometheus: optimized for querying and analyzing real-time metrics.
  • Elasticsearch: more suitable for searching and analyzing unstructured data.

With this, we understood that the ideal strategy for our scenario, in the medium and long term, would be to use both: application and infrastructure metrics would be collected and stored with Prometheus, while event logs and application logs would be queried with Elasticsearch.
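
To make the metrics half of that split concrete, here is a minimal sketch of how a JVM service can expose Prometheus-format metrics with Micrometer so that a Prometheus server can scrape them. The metric name, port, and endpoint are illustrative assumptions, not a description of our actual services.

```kotlin
import com.sun.net.httpserver.HttpServer
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry
import java.net.InetSocketAddress

fun main() {
    // Registry that keeps metrics in memory and renders them in the Prometheus text format.
    val registry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

    // Illustrative business counter; Micrometer renders it with Prometheus naming conventions.
    val ordersProcessed = registry.counter("orders.processed")
    ordersProcessed.increment()

    // Expose /metrics so a Prometheus server can scrape it.
    val server = HttpServer.create(InetSocketAddress(8080), 0)
    server.createContext("/metrics") { exchange ->
        val body = registry.scrape().toByteArray()
        exchange.sendResponseHeaders(200, body.size.toLong())
        exchange.responseBody.use { it.write(body) }
    }
    server.start()
}
```

A scrape job pointed at /metrics then stores each sample as a time series, which is exactly the data shape Prometheus is optimized for.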

However, we faced a significant challenge: we only had one engineer available to do the entire implementation. It was a large project where we wanted to cover the basic pillars of observability, which are logs, metrics, and tracing.

During the RFC presentation, one of our lead engineers blurted out in the virtual room:

Egberto: Do you guys know about Grafana Cloud?

Although no immediate decisions were made, an action point was added to explore the potential benefits of this solution.

Here is what we identified in that study. Roughly speaking, when you create an account on Grafana Cloud, which is a SaaS observability platform, the following services are automatically provisioned:

  • Grafana: Customizable data visualization platform that allows users to create, explore, and share dashboards and panels in real time. It supports multiple data sources, including databases, cloud services, and APIs, and offers a range of features for monitoring and analyzing systems like Kubernetes, Instances, Synthetic monitoring, Usage insights, Network usage, and more.
  • Grafana Loki: A service used for log ingestion and querying. Configuration is done through an agent provided by Grafana Cloud upon account creation.
  • Grafana Prometheus: A powerful open-source time series database and monitoring and alerting toolkit that enables users to collect and analyze metrics from various systems in real time, like our apps running in Kubernetes.
  • Grafana Tempo: On the server side, it is an easy-to-use, highly scalable open-source distributed tracing backend, deeply integrated with Grafana, Prometheus, and Loki. It can ingest common open-source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry, which is what we use. On the application side, we integrated an OpenTelemetry-based auto-instrumentation agent for Java to generate tracing spans (there is a short span sketch just after this list).
  • Grafana Alerts: A customized version of Alertmanager, tightly integrated with the services mentioned above.
  • Grafana Incidents: Also integrated with the services mentioned above, for incident management in a format similar to PagerDuty; like the rest, it is already available and configured when you create an account on Grafana Cloud.
  • Synthetic Monitoring: Lets us monitor how a specific endpoint is performing, with the option to configure logins, headers, tokens, and more. For example, we use it to monitor our blog, which is hosted outside our engineering stack.
  • Grafana OnCall: Also integrated with the services above. When an alert becomes an incident, it can be escalated to the responsible team to act on it, and at the same time we get a unified place where all engineers can track the status of the entire incident cycle. In the free version we can register up to 5 incidents.
  • Grafana Data sources: Grafana has plugins to integrate with virtually any data source, from CloudWatch, Elasticsearch, Kubernetes, and databases to Sentry, Windows, GitHub, and more.

Amazing, isn’t it?
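
To make the tracing piece more concrete: the Java auto-instrumentation agent creates most spans for us, but the OpenTelemetry API can also wrap a specific business operation in a manual span. The sketch below is illustrative only; the tracer name, span name, and attributes are made up, and it assumes the agent (or another SDK setup) has already configured the exporter.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.StatusCode

fun reserveStock(sku: String, quantity: Int) {
    // With the auto-instrumentation agent attached, GlobalOpenTelemetry is already wired
    // to export spans over OTLP (in this setup, towards Grafana Tempo).
    val tracer = GlobalOpenTelemetry.getTracer("inventa.inventory") // illustrative scope name

    val span = tracer.spanBuilder("reserve-stock").startSpan()
    try {
        span.makeCurrent().use {
            span.setAttribute("sku", sku)
            span.setAttribute("quantity", quantity.toLong())
            // ... business logic would run here, with child spans parented to this one ...
        }
    } catch (e: Exception) {
        span.recordException(e)
        span.setStatus(StatusCode.ERROR)
        throw e
    } finally {
        span.end()
    }
}
```

Spans created this way show up in Tempo next to the ones the agent generates automatically, so a single trace can cover both the standard HTTP and database instrumentation and hand-picked business steps.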

Based on this new information, we decided to adopt Grafana Cloud as our observability stack, and then we got down to work:

By leveraging Grafana Cloud’s pre-existing infrastructure as a SaaS model, we were able to streamline our configuration process and focus on creating processes and automation to effectively utilize their dashboard and visualization service, alerting and notifications service, and data source and integrations service.

But we know that not everything is perfect, right? Let’s not forget that only one engineer was working on this project. With any new tool we adopt there is a learning curve, which is normal, and during this journey we got to know an incredible customer success (CS) team, with amazing and delightful service always helping us!

If you’ve made it this far, you’re likely curious about the cost of using Grafana Cloud. Grafana Cloud pricing is determined by four dimensions:

  • Active metrics: the number of metrics generated by the clusters and applications and ingested by Grafana Prometheus.
  • Logs: the volume of logs, in GB, ingested by Grafana Loki.
  • Tracing: the volume of traces, in GB, ingested by Grafana Tempo.
  • Users: as it is a SaaS with a customized UI and authentication, we also pay per active user per month.

After the applications were created and deployed on the new infrastructure, costs grew as expected, but we identified an opportunity to improve our spend on large-scale custom metrics: each application sends thousands of metrics by default, and that would significantly increase the cost of Grafana Cloud. So we adopted a strategy of collecting these custom metrics from our Kotlin applications with a local Prometheus that we deploy and manage ourselves, and later we extended this strategy to infrastructure metrics as well.
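
Our main cost lever was the self-managed Prometheus described above, but, as an illustration of a complementary application-side option, Micrometer meter filters can trim what each Kotlin service exposes in the first place. The metric name prefixes below are hypothetical:

```kotlin
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry

fun buildRegistry(): PrometheusMeterRegistry {
    val registry = PrometheusMeterRegistry(PrometheusConfig.DEFAULT)

    registry.config()
        // Explicitly keep a small set of high-value business metrics.
        .meterFilter(MeterFilter.acceptNameStartsWith("orders"))
        // Drop whole metric families we never chart, so they are not exposed or scraped at all.
        .meterFilter(MeterFilter.denyNameStartsWith("jvm.gc.pause"))
        .meterFilter(MeterFilter.denyNameStartsWith("http.client"))

    return registry
}
```

Filters are evaluated in order, so an explicit accept earlier in the chain keeps a metric even if a broader deny comes later; anything that matches no filter is still exposed as usual.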

Today, an overview of our observability architecture would look like this:

As we were migrating from Lambda to Kubernetes, we still needed to see the logs from the Lambdas. In Grafana Cloud this is easy, because we can integrate plenty of data source providers (CloudWatch, for example) as Grafana plugins.

With Grafana Cloud working in our favor to abstract all the infrastructure work (creating, configuring, and maintaining it), we only had to configure integrations and create processes and automation. It took us a little over two months to create scalable ingestion processes for logs, metrics, tracing, incidents, and on-call across the organization.

This is what our days look like here at Inventa. How about joining us? We are looking for engineers who enjoy state-of-the-art platform and application development and want to help build a reliable, observable, and scalable product at Inventa! Hit here to apply and, if you have any questions, don’t hesitate to ping us on LinkedIn.
