Challenges and Solutions of Observability in Cloud Native Applications

João Sousa
Published in xgeeks
Feb 21, 2023

Cloud native is transforming the way applications are designed and developed by building on cloud computing infrastructure and services. Cloud native applications are characterised by a microservices architecture, which makes them flexible, scalable, and self-healing.

At xgeeks, we have a team of more than 30 engineers working on a multi-tenant, multi-cloud Kubernetes solution for a leading automotive company. The project includes CI/CD, database as a service, and observability, all built with cloud-native approaches.

Observability is a critical component of modern systems management: it gives organisations a comprehensive view of system performance and behaviour and lets them resolve issues quickly. However, centralised observability in cloud native environments has been a persistent challenge.

In this article, we’ll examine the challenges of observability in cloud native environments, introduce the solution offered by OpenTelemetry, and show how it can be used to achieve centralised observability. Join us as we explore the innovative technologies and practices shaping the future of cloud native development!

Setting up observability is hard

Setting up observability is hard, no question about it. As an additional feature of the product we are developing, our goal is to offer customers easy access to telemetry data through a centralised portal. To achieve this, we use tools like Fluent Bit, Prometheus, and OpenSearch to collect, transform, and load data.

Teams typically build observability for their systems and applications from a variety of proprietary tools and solutions. This is challenging because each service may need different tools and methods to collect and report data, which makes it difficult to gain a comprehensive view of the overall performance and behaviour of a system.

These monitoring tools collect and report metrics, logs, and traces, but they have different data models and schemas, different ways of instrumenting the code, and different ways of displaying the data. As a result, teams struggle to correlate data from different sources and to quickly identify and troubleshoot issues.

Our initial solution, built on the tools above, has two major issues:

1. Non-standardised data

To feed a centralised observability platform, customers need to implement /metrics endpoints and output logs in a specific format. This is demanding for customers and even for us as software developers: we rely on each customer’s application language, frameworks, and libraries, none of which we control, to produce the data.

2. Hard to correlate data

The ability to correlate the three observability data streams is very important when developing a cloud native system. Logging, monitoring, and tracing are the three pillars of observability; if they are not correlated, it is very hard to cross-reference information between them. Our approach had that problem because each stream of data was treated differently. Non-standardised data makes correlation even harder.

The problems mentioned above are keeping us from improving our product regarding observability features for customers. We need to find a solution that merges the three pillars (logging, monitoring, and tracing) into a single braid.
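To make the first issue more concrete, here is a minimal sketch of what a customer application has to produce for a /metrics endpoint: a response body in the Prometheus text exposition format. The `Counter` class, the `render` function, and the metric name are hypothetical and only for illustration; a real application would normally use a Prometheus client library for its language, and this sketch does not escape special characters in label values.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    """A monotonically increasing metric with one value per label set."""
    name: str
    help_text: str
    # Maps a sorted tuple of (label, value) pairs to the current count.
    values: dict = field(default_factory=dict)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0.0) + amount


def render(counter: Counter) -> str:
    """Render one counter in the Prometheus text exposition format."""
    lines = [
        f"# HELP {counter.name} {counter.help_text}",
        f"# TYPE {counter.name} counter",
    ]
    for key, value in sorted(counter.values.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in key)
        # Only emit the {...} selector when the sample has labels.
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{counter.name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: this text is what GET /metrics would return.
requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")

print(render(requests_total), end="")
```

Every customer team has to get this format, the metric names, and the label conventions right in its own language and framework, which is where the lack of standardisation starts to hurt.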

Introducing OpenTelemetry

OpenTelemetry is an open standard for collecting and reporting telemetry data that aims to solve these challenges. It provides a standard way to collect data from a wide range of systems, services, and applications, regardless of the technology, vendor, or programming language. This is possible because OpenTelemetry defines a consistent data model for telemetry data and provides libraries and SDKs for various programming languages that can be used to instrument the code and collect data.

With OpenTelemetry, organisations can standardise their metrics, logs, and traces by using a common schema, format, and protocol. This makes it possible to analyse data from various sources using the same tools and processes and to correlate data from different sources to gain a more comprehensive understanding of the system’s overall performance.
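The value of a common schema is easiest to see in a small example. The sketch below assumes that spans and logs have already been normalised so that both carry a `trace_id` field; the record shapes, field names, and sample values are made up for illustration and are not the OpenTelemetry data model itself.

```python
from collections import defaultdict

# Hypothetical records that already share a common shape: every signal
# carries the same "trace_id" and "service" keys.
spans = [
    {"trace_id": "a1", "service": "checkout", "name": "POST /orders", "duration_ms": 840},
    {"trace_id": "b2", "service": "checkout", "name": "GET /cart", "duration_ms": 35},
]
logs = [
    {"trace_id": "a1", "service": "payments", "level": "ERROR", "message": "card gateway timeout"},
    {"trace_id": "a1", "service": "checkout", "level": "WARN", "message": "retrying payment"},
    {"trace_id": "b2", "service": "checkout", "level": "INFO", "message": "cart loaded"},
]


def correlate(spans, logs):
    """Attach to each span the logs that share its trace_id."""
    logs_by_trace = defaultdict(list)
    for record in logs:
        logs_by_trace[record["trace_id"]].append(record)
    return [
        {**span, "logs": logs_by_trace.get(span["trace_id"], [])}
        for span in spans
    ]


def slow_traces_with_errors(correlated, threshold_ms=500):
    """Return trace IDs that were slow and logged at least one error."""
    return [
        item["trace_id"]
        for item in correlated
        if item["duration_ms"] > threshold_ms
        and any(log["level"] == "ERROR" for log in item["logs"])
    ]


correlated = correlate(spans, logs)
print(slow_traces_with_errors(correlated))  # prints ['a1']
```

Because both signals share one key, a question that spans two pillars ("which slow requests also logged an error?") becomes a simple join instead of a manual search across tools.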

Single braid and multiple integrations

When we started investigating OpenTelemetry, we found that it would address the major obstacles hindering the improvement of our product. Many well-known APM platforms, such as Datadog, take an approach similar to OpenTelemetry’s, offering language-specific proprietary SDKs that automatically transmit telemetry data to their services.

OpenTelemetry can be divided into two main parts:

1. Instrumentation

OpenTelemetry provides official SDKs for the most popular languages. By including the SDK in their code base, developers can start instrumenting their applications. For languages like Java and PHP, which rely on a runtime, it is even possible to use OpenTelemetry without changing the code, through an OpenTelemetry add-on or extension. Disclaimer: some SDKs are still in development and do not yet collect every type of data.

2. Collector

The collector component is a central hub responsible for receiving and processing telemetry data from various sources, such as applications or libraries. It performs tasks such as data validation, aggregation, and routing before forwarding the data to a back-end for storage or analysis. The OpenTelemetry Collector is composed of three kinds of components: receivers, processors, and exporters.
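The receiver, processor, and exporter split can be sketched in a few lines. This is a simplified model of the idea, not the Collector's implementation or configuration: the function names, the record fields, and the two example processors (`validate` and `redact`) are assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]


def receiver(raw_records: Iterable[Record]) -> Iterable[Record]:
    """Receiver: accept records from applications (here, an in-memory list)."""
    yield from raw_records


def validate(record: Record) -> Optional[Record]:
    """Processor: drop records that lack the fields later stages rely on."""
    if "signal" not in record or "service" not in record:
        return None
    return record


def redact(record: Record) -> Optional[Record]:
    """Processor: return a copy of the record without a sensitive field."""
    return {k: v for k, v in record.items() if k != "password"}


def run_pipeline(records, processors, exporters):
    """Pass each record through every processor, then route it by signal type."""
    for record in receiver(records):
        for process in processors:
            record = process(record)
            if record is None:
                break  # a processor dropped the record
        else:
            # Exporter: forward the record to the back-end for its signal.
            exporter = exporters.get(record["signal"])
            if exporter is not None:
                exporter(record)


# Two lists stand in for real storage back-ends.
metrics_backend, logs_backend = [], []
exporters = {"metric": metrics_backend.append, "log": logs_backend.append}

incoming = [
    {"signal": "metric", "service": "checkout", "name": "http_requests_total", "value": 3},
    {"signal": "log", "service": "checkout", "message": "login ok", "password": "hunter2"},
    {"message": "malformed record"},
]
run_pipeline(incoming, [validate, redact], exporters)
print(len(metrics_backend), len(logs_backend))  # prints 1 1
print(logs_backend[0])
```

Keeping the three roles separate means a new back-end only needs a new exporter, and a new rule only needs a new processor; the applications sending data do not change.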

The 5 Commandments of OpenTelemetry

After gathering OpenTelemetry best practices, we built a framework for applying OpenTelemetry to our applications. The 5 Commandments of OpenTelemetry outline a five-step process for achieving centralised observability: instrumenting applications, collecting data from them, storing it in a central location, analysing it, and correlating data from different sources.

1. Instrumentation

The first step is to instrument the applications with OpenTelemetry libraries and/or SDKs. These libraries and SDKs provide a consistent way to collect metrics, logs, and traces from the applications, regardless of the technology or vendor. We created custom documentation so that our clients can adapt their applications to use the OpenTelemetry SDKs.

2. Collection

Once the applications are instrumented, the next step is to collect the data using an OpenTelemetry collector or agent. The collector or agent receives the data from the applications and sends it to a central location for storage and analysis. Once telemetry data arrives at the collector, we enrich it with data from other sources.

3. Storage

The collected data is then stored in a central location, such as a time series database, log management system, or monitoring platform. This allows teams to access and analyse the data from a single location.

4. Analysis

Once the data is stored, users can analyse it with OpenTelemetry-compatible tools. These tools provide a consistent way to visualise, search, and analyse the data, regardless of the technology or vendor. Tools like OpenSearch and Prometheus can be used for this purpose.

5. Correlation

With the data from different services stored in a central location, we can now correlate data from different sources to gain a more comprehensive understanding of the applications’ past and current behaviour.

By following these steps, we can achieve centralised observability with OpenTelemetry. We get a consistent way to collect and report data from various systems and services, and we can analyse that data with a single set of tools. Together, this gives us a comprehensive understanding of the performance and behaviour of a system and lets us quickly identify and troubleshoot issues.
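The collection step mentions enriching telemetry once it reaches the collector. A minimal sketch of that idea follows, assuming each record carries a `namespace` field and that extra metadata is available in a lookup table; the table contents, field names, and the `enrich` function are hypothetical and do not describe our actual setup.

```python
# Hypothetical lookup table; in practice this could come from the cluster
# API or a tenant registry.
NAMESPACE_METADATA = {
    "shop-prod": {"tenant": "acme", "cloud": "aws", "environment": "production"},
}


def enrich(record: dict, metadata: dict = NAMESPACE_METADATA) -> dict:
    """Return a copy of the record with attributes for its namespace added.

    Attributes already set by the application win over looked-up values.
    """
    extra = metadata.get(record.get("namespace"), {})
    attributes = {**extra, **record.get("attributes", {})}
    return {**record, "attributes": attributes}


record = {
    "namespace": "shop-prod",
    "message": "order created",
    "attributes": {"environment": "staging-override"},
}
print(enrich(record)["attributes"])
```

Enriching centrally means applications do not need to know which tenant or cloud they run in, yet every stored record can still be filtered and correlated by those attributes.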

Final Thoughts

It’s worth noting that OpenTelemetry is still a relatively new, evolving standard, but it has already been adopted by many companies and organisations and has the support of major industry players. The OpenTelemetry community is constantly improving the standard and adding features, such as support for new languages, protocols, and platforms, which will make it even more powerful and easier to use.

We still have a lot to learn about centralised observability, but an important aspect of OpenTelemetry is that it’s designed to work with other observability tools and standards, such as OpenSearch and Prometheus, which makes it easy to integrate with existing observability solutions like ours. This allows us to use the best tool for each job while still being able to correlate data from different sources and quickly identify and troubleshoot issues.

Centralised observability is a critical aspect of modern systems management, and OpenTelemetry provides a powerful way to achieve it and, with it, to improve the reliability and performance of applications.

At xgeeks, we are used to implementing observability features on complex projects, and this project is another opportunity to grow and learn while improving a system that is live and used by real customers.

Don’t worry, the second part is on the way! It will describe the struggles and the goals achieved during the process of implementing these concepts. As you know, sharing is caring 😎

If you enjoy working on large-scale projects with global impact and if you like a real challenge, feel free to reach out to us at xgeeks! We are growing our team and you might be the next one to join this group of talented people 😉

Check out our social media channels if you want to get a sneak peek of life at xgeeks! See you soon!
