Approaching Observability from a Domain-Oriented Perspective

Mario Bittencourt
SSENSE-TECH
Jan 21, 2022 · 8 min read

Observability is an important but often neglected area of application development. With the popularity of distributed systems, its importance has grown even more, and the techniques we knew have had to evolve to follow suit.

In this article I will go over the importance of observability, present a not-so-commonly-used domain-oriented approach, and discuss the various identifiers (IDs) associated with it, so you can connect the execution of your application even when it is distributed across various services.

Observability

In my experience, there are at least two situations you will encounter with any system you develop — or interact with:

  • It will not always provide you with the expected results
  • It will need to evolve over time to be better at delivering the current use cases

A key aspect in handling both cases is that you must be able to tell what is happening within the system, from both a technical and non-technical point of view.

A famous quote attributed to Peter Drucker says “What gets measured gets managed”, which reinforces the importance of making the systems we develop, independently of the architecture choice, observable.

Traditionally, observability is known as the capability of inferring the internal state of a system based on its external outputs. In the software context, observability commonly relies on 3 pieces of information to achieve that:

  1. Logs

Expose what has happened at a given point in time, with additional information (context).

  2. Metrics

A specific value measured over an interval of time, associated with a Key Performance Indicator (KPI).

  3. Traces

Capture the end-to-end execution path a request took in a (distributed) system.

While the first two are likely not new to many of us, traces are a relatively new addition to our vocabulary due to the recent popularity of the microservice approach. So let’s discuss what makes traces a must-have.

Figure 1. Logs, Metrics and Traces.
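To make these concrete, here is a rough sketch of the shape each signal could take, as simplified, vendor-neutral TypeScript objects (none of the field names come from a specific tool):

// Simplified, illustrative shapes for the three signal types.
const logEntry = {
  timestamp: "2022-01-21T10:15:00Z",
  level: "info",
  message: "Stock reservation succeeded",
  context: { sku: "ABC-123", quantity: 2 }, // the context that gives the event meaning
};

const metric = {
  name: "inventory_reservation_succeeded", // tied to a KPI
  value: 42,
  intervalSeconds: 60, // measured over an interval of time
};

const traceSpan = {
  traceId: "377360bb-3ec6-4122-9021", // shared by every service that handles the request
  spanId: "a1b2c3d4",                 // unique to this hop
  service: "inventory",
  operation: "reserve",
  durationMs: 12,
};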

Challenges of a Distributed System

In order to properly value observability in a distributed system, let's first look at the monolithic architecture and establish a baseline.

In such an architecture, one use case execution is typically handled by a single process on a single compute node. That means the state of the system can be observed by collecting the logs and metrics from this single process, and even by inspecting data in its persistence layer.

Figure 2. Single node emitting logs and metrics.

With a distributed system, we had to give up all the previous guarantees and embrace a situation where one use case execution can span multiple nodes/services and even use completely different transport mechanisms: from synchronous API calls to asynchronous messaging, and even electronic documents deposited on file servers.

Figure 3. Multiple nodes and different integration mechanisms are used.

When you try to understand the state of your system, especially while investigating problems or potential points of improvement, you are faced with a disconnect as each execution hops from service to service.

When you have a handful of services involved this may be just an annoyance, but as you grow and end up with dozens or even hundreds of services, this becomes a major issue.

To address this disconnect we will have to leverage a way to link these seemingly unrelated executions, in the form of distributed tracing.

Introducing Distributed Tracing

Distributed tracing is the ability to follow the execution path as a use case execution transitions between nodes and contexts.

Figure 4. Fictitious trace showing the execution path through the various services involved.

There are various proprietary solutions that offer instrumentation capabilities, such as Datadog and New Relic, and at least one open-source option, OpenTelemetry. Staying true to this article's purpose, instead of focusing on a specific implementation or vendor solution, I will present the general concept that you can follow, and later on decide which solution to use.

In its simplest form, your application needs to pass along information that allows the service collecting the telemetry data to correlate the executions as part of a single logical view, or trace. For all intents and purposes, this will be some form of identifier (ID) being passed along.

Figure 5. Using a common identifier in logs and metrics.

It is important to be consistent and factor in, as much as possible, all the different ways your application will communicate (a combined sketch follows this list), such as:

  • HTTP API calls

Pass this ID in the headers of the call you make to the dependent services.

CorrelationId: 377360bb-3ec6-4122-9021

When integrating with third-party services, verify if they support the equivalent of pass-through parameters that can be sent back in case those services perform some sort of webhook call to you.

  • Messaging

Pass this ID as part of the metadata of the message you want to send. Some technologies, such as SQS, allow you to define message attributes, otherwise you can include it as part of the payload.

  • Document-Based

Pass the ID with the file, in the filename or as part of the content itself, in a similar way to what was done for messaging.

order-123445-377360bb-3ec6-4122-9021.xml

If you establish this standard inside your ecosystem, you will be able to navigate from the request’s inception all the way until the end.
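As a minimal sketch of the first two channels, assuming a Node.js 18+ service (for built-in fetch) and the AWS SDK v3 SQS client; the URL, queue variable, and payload names are all illustrative:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// Forwards the same correlation ID on every outbound interaction.
async function notifyDownstream(correlationId: string, orderId: string): Promise<void> {
  // HTTP API call: pass the ID as a header.
  await fetch("https://inventory.example.internal/reservations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      CorrelationId: correlationId,
    },
    body: JSON.stringify({ orderId }),
  });

  // Messaging (SQS): pass the ID as a message attribute, keeping the payload clean.
  const sqs = new SQSClient({});
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.INVENTORY_QUEUE_URL ?? "",
      MessageBody: JSON.stringify({ orderId }),
      MessageAttributes: {
        CorrelationId: { DataType: "String", StringValue: correlationId },
      },
    }),
  );
}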

Many IDs

Now that we have established how you are going to correlate the services, two aspects remain: what to call this ID and how to generate it.

For the ID generation, I would use UUID or ULID, for their distributed nature and improbable collision properties. For a deeper understanding of both formats and their differences, I recommend reading this article.
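For example, in Node.js a UUID generator comes built in, while a ULID requires a small library (the ulid npm package is one common choice, used here as an assumption):

import { randomUUID } from "node:crypto"; // built-in UUID v4 generator
import { ulid } from "ulid";              // third-party ULID implementation

const correlationId = randomUUID(); // e.g. "8a6e0804-2bd0-4672-b79d-d97027f9071a"
const executionId = ulid();         // e.g. "01ARZ3NDEKTSV4RRFFQ69G5FAV", sortable by creation time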

“There are only two hard things in Computer Science: cache invalidation and naming things.” — Phil Karlton.

When it comes to addressing observability, I have been proposing the following convention:

  • CorrelationID

Generated at the starting point of a use case execution and passed along the chain. If not provided, a given service generates it.

If the client needs to repeat the call for the same execution, due to a transient error, the same correlation ID would be passed.

  • ExecutionID

Generated by each service as it gets invoked. It is independent of the CorrelationID, and multiple calls to the same service will generate different ExecutionIDs, even for the same CorrelationID.

It is important to highlight that you should use neither of the aforementioned IDs for any sort of application logic, such as keys for retrieving previously saved states for the use case. This falls under the concept of a session and should have its dedicated identifier and secure handling.
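As a hedged sketch of this convention in an Express-style HTTP service (the header name follows the earlier examples; the middleware itself is illustrative):

import { randomUUID } from "node:crypto";
import type { Request, Response, NextFunction } from "express";

// Reuses the incoming CorrelationID when present; always mints a fresh ExecutionID.
function observabilityIds(req: Request, res: Response, next: NextFunction): void {
  const correlationId = req.header("CorrelationId") ?? randomUUID();
  const executionId = randomUUID(); // new on every invocation, even for retries of the same CorrelationID

  // Make both IDs available to downstream handlers, loggers, and metrics.
  res.locals.correlationId = correlationId;
  res.locals.executionId = executionId;
  next();
}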

Domain-Oriented Approach

Now that we’ve covered why observability is important and the foundations of doing so in a distributed environment, I would like to discuss the approach of doing so with business goals in mind.

Let’s take a simple example of an inventory service that provides a feature to allow you to reserve a given number of products. Because of its importance to the business, some KPIs have been created to track how successful this is.

A simple implementation could look like the following sketch.
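This is a minimal TypeScript sketch, assuming illustrative Logger, Metrics, and InventoryRepository interfaces (none of these names come from a specific library):

interface Logger {
  info(message: string, context?: object): void;
  warn(message: string, context?: object): void;
}

interface Metrics {
  increment(name: string): void;
}

interface InventoryRepository {
  getStock(sku: string): Promise<number>;
  setStock(sku: string, quantity: number): Promise<void>;
}

class Inventory {
  constructor(
    private readonly repository: InventoryRepository,
    private readonly logger: Logger,
    private readonly metrics: Metrics,
  ) {}

  async reserve(sku: string, quantity: number): Promise<void> {
    this.logger.info("Stock reservation attempted", { sku, quantity });
    this.metrics.increment("inventory_reservation_attempted");

    const stock = await this.repository.getStock(sku);
    if (stock < quantity) {
      this.logger.warn("Stock reservation failed: insufficient stock", { sku, quantity, stock });
      this.metrics.increment("inventory_reservation_failed");
      throw new Error(`Insufficient stock for ${sku}`);
    }

    await this.repository.setStock(sku, stock - quantity); // naive, non-atomic reservation
    this.logger.info("Stock reservation succeeded", { sku, quantity });
    this.metrics.increment("inventory_reservation_succeeded");
  }
}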

If we overlook the naive implementation of the actual stock reservation, this example has some drawbacks:

  1. The code to log/metric ratio is low

We have roughly as many instrumentation lines of code as we have actual reservation-related code.

  2. We need to know too much about instrumentation details

The code references things such as the metric name that should be used and the exact context contents we need to pass.

Even if you consider doing some traditional refactoring, like making the stock a higher-order concept instead of a primitive type, you would just push part of the problem forward.

An alternative approach would be to capture all those instrumentation calls in a way that is specific to what we are doing, such as performing Inventory operations, while reducing the cognitive load on the developer.

This is where the concept of a Domain Probe can be helpful. Let’s look at an updated implementation:
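Continuing the same sketch, Inventory now depends on an InventoryProbe instead of the logger and metrics directly (the names remain illustrative):

class Inventory {
  constructor(
    private readonly repository: InventoryRepository,
    private readonly probe: InventoryProbe,
  ) {}

  async reserve(sku: string, quantity: number): Promise<void> {
    this.probe.reservationAttempted(sku, quantity);

    const stock = await this.repository.getStock(sku);
    if (stock < quantity) {
      this.probe.reservationFailedDueToInsufficientStock(sku, quantity, stock);
      throw new Error(`Insufficient stock for ${sku}`);
    }

    await this.repository.setStock(sku, stock - quantity); // still the naive reservation
    this.probe.reservationSucceeded(sku, quantity);
  }
}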

And the probe definition would be:
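Again as a sketch, wrapping the same illustrative Logger and Metrics interfaces from the first example:

class InventoryProbe {
  constructor(
    private readonly logger: Logger,
    private readonly metrics: Metrics,
  ) {}

  reservationAttempted(sku: string, quantity: number): void {
    this.logger.info("Stock reservation attempted", { sku, quantity });
    this.metrics.increment("inventory_reservation_attempted");
  }

  reservationFailedDueToInsufficientStock(sku: string, quantity: number, available: number): void {
    this.logger.warn("Stock reservation failed: insufficient stock", { sku, quantity, available });
    this.metrics.increment("inventory_reservation_failed");
  }

  reservationSucceeded(sku: string, quantity: number): void {
    this.logger.info("Stock reservation succeeded", { sku, quantity });
    this.metrics.increment("inventory_reservation_succeeded");
  }
}

Notice that the metric keys and log messages now live only inside the probe, while the business code speaks purely in terms of reservation events.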

The new implementation has some benefits when compared with the original one:

  • Groups the instrumentation concerns into a specific place

The InventoryProbe contains all instrumentation-specific code required for our Inventory needs. So, if you need to know what is instrumented for Inventory, you can look in that one place.

  • Instrumentation can be changed more easily

If we need to change how to instrument a certain log or metric, we can do it in one place, potentially without changing the business code.

  • Explicit intent

In our Inventory::reserve example, all instrumentation calls are named after business concepts, making it easier to follow which aspects we care about.

  • Reduced cognitive load

As a developer, I no longer need to know the logger and metric specifics, such as the KPI name/key used, the type of context it expects, and even the associated message to be used.

  • Reduced non-business code

Even in this contrived example, we were able to write less code in the Inventory class to achieve the expected observability goals.

A potentially controversial benefit of this approach, at least for me, is that it should make it more "difficult" to add logs to your application. By difficult I mean that you should dedicate more time to thinking about whether a given occurrence should be logged, instead of adding too many log calls as a just-in-case practice.

Conclusion

Like many aspects of software development, observability is a complex topic, even if you exclude discussions on which vendor or tool should be used to capture, manipulate, and display the instrumentation data you decide to collect.

Over time, its scope has expanded from lower-level concerns, such as CPU and memory in a monolithic application, toward a KPI-driven and distributed nature.

In order to handle all of that, besides the actual vendor selection challenges, you should establish a system-wide standard for correlation ID generation and consider approaching this from the perspective of the domain.

Even if you don’t adopt Domain-Driven Design, framing those logs and metrics gathering calls under a domain probe and using a business language behind them will ultimately improve how you manage your services.

Editorial reviews by Deanna Chow & Pablo Martinez.

Want to work with us? Click here to see all open positions at SSENSE!
