Observability patterns for enterprise software systems

Chanaka Fernando
Solution Architecture Patterns
5 min read · Jan 13, 2023


How to use open source technologies to implement observability for enterprises

Introduction

Enterprise software systems are complex in nature, and failures in such systems are unavoidable. Rather than trying to build a system that never fails, which is extremely difficult and costly, the pragmatic approach is to take the measures needed to respond to and resolve failures once they occur. The term “observability” was not invented by software developers.

Observability is a term borrowed from control theory, where it describes a property of a system: the degree to which its internal state can be inferred from its external outputs.

We can reuse the same definition for enterprise software systems. In such systems, the external outputs used to infer the internal state are

  • Logs
  • Metrics
  • Traces

Logs are the most common way for enterprise applications to record their outputs. They are usually written to files, retained for a given period, and rotated once that period elapses.
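
As a concrete illustration, here is a minimal Python sketch of such a rotating log file, using only the standard library; the file name, size limit, and backup count are arbitrary choices for the example.

```python
import logging
from logging.handlers import RotatingFileHandler

# Rotate the file once it reaches ~10 MB, keeping 5 old copies.
# File name, size limit, and backup count are illustrative choices.
handler = RotatingFileHandler("app.log", maxBytes=10_000_000, backupCount=5)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("order 42 accepted")             # an informational entry
logger.error("payment failed for order 42")  # an error entry
```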

Metrics are numerical measurements about the application, collected over time and used to analyze its behavior.
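
For example, a request counter and a latency histogram can be exposed for scraping. The sketch below uses the prometheus_client library as one common option; the metric names, labels, and port are illustrative, not prescribed.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names and labels are illustrative choices.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(status="200").inc()        # count the completed request

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for a scraper such as Prometheus
    while True:
        handle_request()
```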

Traces are useful for understanding what is going on in the system at a fine-grained level. Typically, traces capture every request that passes through the system.

Implementing Observability

Developers usually do not pay enough attention to implementing observability because of the additional effort it requires during development. Under time pressure and other constraints, they tend to defer it as a post-release task or a to-do item that never gets completed, until a critical production issue occurs and the support engineers struggle to find the root cause. At that point it is too late, and a considerable amount of time is wasted because the system lacks observability.

The four main steps in implementing observability are

  • Instrumentation
  • Correlation
  • Automation
  • Insights and Predictions

Instrumentation is the first step in implementing observability: the applications need to generate the required telemetry data from the source code so that data collectors can receive and aggregate it for further analysis.

Once the data is collected and aggregated, there needs to be a mechanism to correlate different telemetry entries in order to troubleshoot issues and identify the root cause. It is important to use a common approach across applications, such as a shared correlation identifier, so that events coming from different applications can be tied together.
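
One minimal sketch of that idea: a correlation ID is generated at the edge, attached to every log entry, and forwarded to downstream services in an HTTP header. The X-Correlation-ID header name is a widely used convention rather than a standard, and the downstream URL is hypothetical.

```python
import json
import urllib.request
import uuid

def new_correlation_id() -> str:
    return str(uuid.uuid4())

def log_event(correlation_id: str, message: str) -> None:
    # Every entry carries the same ID, so entries emitted by different
    # applications can later be joined in the log analyzer.
    print(json.dumps({"correlation_id": correlation_id, "message": message}))

def call_downstream(correlation_id: str, url: str) -> None:
    # Forward the ID so the downstream service logs with the same value.
    req = urllib.request.Request(url, headers={"X-Correlation-ID": correlation_id})
    urllib.request.urlopen(req)

cid = new_correlation_id()
log_event(cid, "order received")
# call_downstream(cid, "http://inventory.internal/reserve")  # hypothetical service
```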

It is impossible to allocate resources to look at each and every telemetry event and make decisions based on them. Instead, we need to automate the analysis of these events as much as possible, so that only the events that require attention trigger human interaction.
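
For illustration only, the filtering idea can be sketched as follows: each event is scored by severity, and only those above a threshold are escalated to a human channel. The webhook URL and the threshold are hypothetical.

```python
import json
import urllib.request

SEVERITY = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}
ALERT_THRESHOLD = SEVERITY["error"]            # escalate errors and above
WEBHOOK_URL = "https://chat.example.com/hook"  # hypothetical alerting endpoint

def process_event(event: dict) -> None:
    if SEVERITY.get(event["level"], 0) < ALERT_THRESHOLD:
        return  # low severity: handled automatically, no human needed
    body = json.dumps({"text": f"[{event['level']}] {event['message']}"}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # page a human only for high-severity events

process_event({"level": "info", "message": "cache warmed"})  # silently ignored
# process_event({"level": "critical", "message": "DB unreachable"})  # would alert
```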

In addition to responding to critical events, we can also use observability data to analyze user behavior and generate insights and predictions, so that business decisions can be made to improve the performance of the system as well as the business.

Observability with logs

Logs are the most popular approach to troubleshooting issues in enterprise applications. Through instrumentation, applications emit different types of log entries, such as errors, warnings, debug messages, and informational details, into log files. The figure below depicts a typical pattern for using logs for observability.

Figure: Implementing Observability with Logs

As depicted in the preceding figure, different types of applications expose their internal state as external outputs in the form of log entries. These logs are read by an agent running alongside the application, whose task is to publish the entries to a log collector; the collector aggregates and pre-processes them for further analysis by the log analyzer. Through the log analyzer, users can either query the logs directly with a query language or use an external dashboard component to explore them. Some popular log analytics stacks are listed below, followed by a short sketch of what the emission side looks like in code.

  • ELK (Elasticsearch, Logstash, Kibana)
  • Grafana Labs (Promtail, Loki, and Grafana)
  • Splunk
  • New Relic
  • Sumo Logic
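
Whichever stack is chosen, the emission side looks roughly the same: the application writes structured (often JSON) log lines to a file, and the agent tails and ships them. Below is a minimal sketch using only the Python standard library; the field names are illustrative and not required by any of the tools above.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, easy for agents to parse."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.FileHandler("service.log")  # the file the shipping agent tails
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.warning("retrying payment gateway call")
```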

We can use the ELK stack to implement the above-mentioned observability pattern as depicted in the figure below.

Figure: Log-based observability with ELK stack

As depicted in the preceding figure, the Beats agents running alongside the application read the log files and push the log entries to Logstash, which acts as the log aggregator. Logstash then forwards the aggregated logs to Elasticsearch, which stores and indexes them and provides search and analysis over the data. Elasticsearch is a powerful tool that can handle structured, unstructured, numerical, textual, and geospatial data, so users can derive valuable insights and contextual information from it. Kibana lets users interactively explore, visualize, and share insights, and monitor the system through visual dashboards.
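
Once the logs are indexed, they can be explored through Kibana or queried programmatically. The sketch below uses the official elasticsearch Python client (v8-style API); the endpoint, index pattern, and query fields are illustrative assumptions about the deployment.

```python
from elasticsearch import Elasticsearch

# Endpoint and index pattern are illustrative; adjust to your deployment.
es = Elasticsearch("http://localhost:9200")

# Fetch recent error-level entries from the last hour.
resp = es.search(
    index="logs-*",
    query={
        "bool": {
            "must": [{"match": {"level": "ERROR"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    size=20,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("message"))
```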

Observability with Traces

Traces are another way of implementing observability for enterprise applications. Here we publish detailed information about the data flowing through the applications using a common standard such as OpenTelemetry (which superseded OpenTracing).

Jaeger is an open source distributed tracing platform used to implement observability solutions for modern cloud-native applications. It helps operations teams capture traces across multiple applications and use that information to troubleshoot issues and improve the performance of the system. The figure below depicts how to use Jaeger to implement observability with traces.

Figure: Observability with Tracing using Jaeger and OpenTelemetry

The preceding figure depicts a use case where an application uses an OpenTelemetry-based SDK to instrument the code and publish telemetry data as spans to the Jaeger collector, which validates and transforms the data before storing it in a storage backend. The storage backend can be an in-memory store, Elasticsearch, Kafka, or a database. Once the data is stored, the Jaeger query component executes search and query operations to retrieve traces, which are then visualized in the Jaeger UI.
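
A minimal sketch of the instrumentation side, using the OpenTelemetry Python SDK and exporting spans over OTLP/gRPC to a Jaeger collector; it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed, and the service and span names are illustrative. Port 4317 is Jaeger's default OTLP/gRPC port.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Name this service as it should appear in the Jaeger UI (illustrative).
provider = TracerProvider(resource=Resource.create({"service.name": "orders"}))
# Jaeger accepts OTLP directly; 4317 is its default OTLP/gRPC port.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Spans published here flow to the collector, the storage backend,
# and finally the Jaeger UI, as in the figure above.
with tracer.start_as_current_span("handle-order") as span:
    span.set_attribute("order.id", 42)  # becomes searchable in the Jaeger UI
    with tracer.start_as_current_span("charge-card"):
        pass  # stand-in for the actual work
```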

Learn more

You can find more details on implementing observability for enterprise software systems in the book “Solution Architecture Patterns for Enterprise”. If you are interested specifically in observability, the book has a dedicated chapter on the topic.
