AI & Observability

The Future of Distributed Tracing

Aston Whiteling
Another Integration Blog
11 min read · Jun 11, 2024


This article is a more detailed write-up of a talk I originally delivered at TrailblazerDX in San Francisco on the 7th of March 2024 — a version of which you can watch below.

MuleSoft’s OTel Trace Exporter & Distributed Experience for both CloudHub 2.0 and RTF are GA today! Read the announcement here.

The Future of Distributed Tracing

Distributed Tracing has the potential to shift the paradigm within monitoring and observability.

So why does it continue to lack mainstream adoption?

The underlying OpenTelemetry project holds the #2 spot in the Cloud Native Computing Foundation’s portfolio, right behind Kubernetes, a technology that’s barreling towards omnipresence across the IT industry. Despite that, businesses attempting to integrate Distributed Tracing into their core observability strategy are few and far between.

This wasn’t always the case; when the project gained traction circa 2017, there was an enormous amount of buzz — so what happened?

Distributed Tracing is part of the OpenTelemetry CNCF project | source

In short, Distributed Tracing is a technology with certain barriers to entry, and those barriers make adoption prohibitive for teams running complex microservice architectures.

This reality somewhat derailed the project’s transition from hyped concept to mainstream adoption.

The tide is once again turning though, thanks in large part to a broader paradigm shift affecting the global technology industry — the advent of AI. Every aspect of IT is being revolutionized and monitoring is no exception.

Before I get into exactly how AI is poised to impact the realm of observability, I’d first like to lay the foundations for this topic by defining what Distributed Tracing actually is, why it’s important, and what problem it intends to solve.

What is Distributed Tracing?

I find Distributed Tracing is best illustrated via a plumbing analogy.

Think of your technology stack as a pipe network, with the ingress of data representing water as it flows around the system.

Now imagine your eCommerce system stops working; this is the equivalent of a pipe rupturing. The metrics swing red and errors pour out of the logs.

Distributed Tracing is like adding dye to the water; it allows you to understand how data flows across the system in a holistic sense. This gives you a better understanding of the network as a whole — where the bottlenecks are forming and how they can cascade into a system rupture.

In reality, Distributed Tracing is the practice of attaching a trace context to the headers of every request that moves across your application network. This context is propagated from app to app; it’s maintained as metadata at every stop and for the request as a whole, and that end-to-end record is what we refer to as a distributed trace.
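To make the mechanics concrete, here’s a minimal sketch of that header propagation using the OpenTelemetry Python API. It assumes the opentelemetry-sdk package is installed; the service and span names are purely illustrative and not tied to any particular vendor.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Register a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative service name

with tracer.start_as_current_span("place-order"):
    headers = {}
    inject(headers)  # writes the W3C `traceparent` header for the active span
    print(headers)   # e.g. {'traceparent': '00-<trace_id>-<span_id>-01'}
    # A downstream service extracts this header and continues the same trace,
    # which is how every hop gets stitched into one distributed trace.
```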


Why is Distributed Tracing important?

The paradigm shift distributed tracing represents begins to manifest when you accept this:

Using a request as the unit of measure when evaluating network health is inherently more valuable than relying on events.

Event evaluation is driven by metrics and logs, and these are subjective tools; get five executives in a room and ask them what the metrics say, and they’ll give you five different answers — all of which coincidentally shift the blame away from their department. Logs can be subjective too; they’re written by individual developers and their root meaning can be just as hotly debated among engineers.

More often than not, they’re also tied to disconnected subsets of applications; this makes it hard to see the forest for the trees when you’re plotting strategic decisions to improve your application architecture.

By indexing around a request instead, you get much closer to the actual end user experience — with an interconnected view of how a request interacts and progresses through your entire network of applications.

Even so, event-driven incident tools such as logs and metrics are still the dominant force in the observability industry.

Logs and metrics have been around a lot longer than tracing, which explains their dominance — to a degree.

But I think there’s been an industry shift that’s led to an over-indexing around metrics as a catchall tool for technical challenges.

Personally? I blame DORA.

No, not that DORA.

The DevOps Research & Assessment group.

Even though it’s not really their fault.

You see, a few members of the DORA research group authored a seminal work within the field of DevOps: Accelerate.

This text kick-started the modern DevOps revolution. It outlined four core metrics for evaluating a successful DevOps process: deployment frequency, lead time for changes, time to restore service, and change failure rate. These are now collectively known as the DORA metrics.

When something as complex as DevOps gets reduced to an easily digestible set of measurements, without fail the industry rallies behind the concept; everyone wants to buy into a secret formula that demystifies and distills complexity into a set of metrics.

And not to take anything away from DORA or Accelerate; it’s seminal research for a reason.

But implementing DevOps is not easy. Tracking four metrics isn’t going to change that fact.

It’s unfortunate, then, that because of frameworks like the DORA metrics, this attitude of indexing around metrics has become so pervasive.

The issue with this approach is that it only provides a broad-stroke snapshot of performance. A metrics view is relevant to the operational processes that feed into your strategic goals, but it lacks the technical depth required for true root problem analysis.

Logs are the inverse; they provide the technical depth needed to troubleshoot, but that insight is easily siloed within a specific subset of applications and lacks the breadth of context that metrics bring to your overarching monitoring strategy.

The true power of tracing is that it provides an additional context layer on top of these approaches, marrying the technical depth of logs with the strategic understanding of metrics to create a much richer view of your technology stack.

With this understanding in mind, we can now answer the question:

What problem does Distributed Tracing solve?

“Ask any developer what the most painful moments in their life are and they’ll talk about time spent debugging a Sev-1 error in production with what seemed like a few hundred people breathing down their neck.”

Samyukktha via Medium

I think this quote summarizes well the problem that Distributed Tracing tackles. When something goes really wrong, Sev-1 production error wrong, the only thing that matters is how quickly you can resolve the issue.

The traditional model of either “check the logs and then use the metrics to figure out how broad the problem is” or “the metrics are going red, go check the logs and find out which apps are breaking” leaves out some pretty important context; namely, where the issue is stemming from and which of the involved parties is best suited to provide a fix.

The context layer that Distributed Tracing adds is the ability to scan your entire application network and immediately understand the nuance of the issue; you can then follow standard procedure and use logs and metrics more pointedly to monitor and solve any issues. When the dust settles, you can take a step back, examine your network holistically, and decide how to implement operational or strategic process change to stop such issues from recurring.

As more and more businesses mature in their digital journey, a tool that offers this degree of observability becomes incredibly important.

Modern application networks are becoming exponentially more complex due to the sheer scale and technical nuance involved in linking together so many applications. Take the example below: a visualization of the 1,600 microservices run by Monzo, a UK-based bank.

Source: The Register, “How does Monzo keep 1,600 microservices spinning?”

When your application architecture looks like this, traditional monitoring & observability practices need to be re-evaluated — is it any wonder that 73% of companies are reporting that their mean-time-to-restore service is greater than an hour?

I’ve spent the past thousand words or so waxing lyrical about Distributed Tracing; so much so that you may be scratching your head wondering “well if Distributed Tracing is so great, why ISN’T this technology seeing widespread use?”

And that’s because it isn’t all sunshine and rainbows; tracing has its own challenges, and they create a barrier to entry that can often seem insurmountable for those who want to get started.

The barriers to adopting Distributed Tracing — and how AI is going to remove them.

The Configuration Issue

The biggest challenge facing IT teams who want to get started with tracing is setting up their application network to parse traces. Easier said than done when application architectures such as the Monzo example above are becoming the norm.

Distributed Tracing can’t be done in half-measures; the value is the holistic view, so setting up only a few apps to handle the trace metadata defeats the point. The IT investment required to get tracing set up can be extreme when you’re running a complex architecture, which is a real deterrent for the technology.

Starting with a subset of applications is one workaround, but with Distributed Tracing the true value is derived from the economies of scale your whole application network represents.

Allow me a moment of shameless self-promotion when I say: this isn’t an issue if you deploy your applications on MuleSoft’s CloudHub 2.0 or RTF. We’ve already done the heavy lifting setup-wise, so the moment you deploy an app on either platform, it’ll be immediately configured to parse trace data. So if you want my advice? Best practice is to run everything via MuleSoft.

Having said that, I’m a realistic Product Marketer; I know IT teams have a diverse set of applications spread across a variety of different systems. There’s a solution on the horizon for you too: auto-instrumentation tools. These are lightweight agents that can be deployed into your app network and move from node to node performing the configuration work for you. OpenTelemetry already offers a rudimentary version of such tooling, and it’s not hard to foresee that, with the augmentation of AI, these agents will almost entirely eliminate this large investment barrier for Distributed Tracing.
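For a sense of what that existing OpenTelemetry tooling looks like today, here’s a hedged Python sketch: an instrumentor patches a library so its calls emit spans without any changes to application logic. It assumes the opentelemetry-sdk and opentelemetry-instrumentation-requests packages are installed; the URL is just an example.

```python
import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal SDK setup: print finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# One call instruments the whole `requests` library: outgoing HTTP calls now
# produce client spans and carry trace headers automatically.
RequestsInstrumentor().instrument()

requests.get("https://example.com")  # traced with no change to this call
```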

The Sampling Issue

Modern application networks are composed of hundreds, if not thousands, of technologies which all communicate with each other in a more-or-less constant fashion. That adds up to a lot of requests; so many, in fact, that sampling is currently a necessary evil if Distributed Tracing is to avoid serious performance and storage overheads.

Industry standards dictate that a range of 10–15% of all requests be traced. You then analyze that sample, determine your so-called “traces of interest” and use these as the basis of your monitoring and observability practices.
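In OpenTelemetry’s Python SDK, that kind of head-based sampling is typically configured with a ratio sampler. A minimal sketch follows; the 10% ratio is just an example value from the range above.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; child spans follow their parent's decision,
# so a given trace is either recorded end-to-end or dropped end-to-end.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```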

To reiterate, the value of distributed tracing derives from economies of scale in having as complete a picture as possible; sampling reduces that picture and thus devalues the practice of tracing.

If you think about it, there’s almost no need to store 99% of traces; the only traces that are valuable are the ones attached to requests where issues are occurring. Unfortunately, such traces are the proverbial needle in a haystack of requests.

If only there was a cutting edge, widely popular technology purpose built for large data models…


I’m of course alluding to large language models, or LLMs.

LLMs are the perfect tool for combing through large datasets. If one were configured to work with trace data, you could use natural language to examine specific queries of interest. In the example sketched below, a developer troubleshooting an error could ask an LLM which queries had above-average latency and which threw errors, then narrow in on the intersection of those sets to pinpoint the exact traces required for root-cause analysis.
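As a toy stand-in for that workflow, here’s the kind of filtering such an assistant might translate a natural-language question into; the trace records and field names below are entirely made up.

```python
from statistics import mean

# Hypothetical, simplified trace records (real traces carry far more metadata).
traces = [
    {"trace_id": "a1", "latency_ms": 120, "error": False},
    {"trace_id": "b2", "latency_ms": 980, "error": True},
    {"trace_id": "c3", "latency_ms": 450, "error": False},
    {"trace_id": "d4", "latency_ms": 1430, "error": True},
]

avg_latency = mean(t["latency_ms"] for t in traces)

# "Which traces were slower than average AND threw an error?" The intersection
# is the shortlist for root-cause analysis.
traces_of_interest = [
    t for t in traces if t["latency_ms"] > avg_latency and t["error"]
]
print([t["trace_id"] for t in traces_of_interest])  # -> ['b2', 'd4']
```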

Needless to say, the impact of such a tool for developers trying to fix any SEV-1 errors in a hurry would be immense. With the current rate of LLM advancement, this functionality should be available in the next decade or so.

The Trace Enrichment Issue

Storing the status parameters of every request as it moves between the applications in your network can create an inordinate amount of storage strain. As such, the parameters currently stored with each trace tend to be limited, usually encompassing latency and exception data alongside details of the request payload.
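For reference, here’s a minimal sketch of how that kind of data is typically attached to a span with the OpenTelemetry Python API today; the attribute names and values are illustrative, and latency falls out of the span’s own start and end timestamps.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace import Status, StatusCode

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("order-service")  # illustrative service name

with tracer.start_as_current_span("process-order") as span:
    # Payload details recorded as span attributes.
    span.set_attribute("order.id", "ORD-1042")
    span.set_attribute("http.request.body.size", 2048)
    try:
        raise ValueError("inventory service unavailable")  # simulated failure
    except ValueError as exc:
        span.record_exception(exc)                 # exception data on the trace
        span.set_status(Status(StatusCode.ERROR))  # mark the span as errored
```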

I’m going to return to economies of scale for the third and final time in order to really drill this point home: the power of Distributed Tracing is directly correlated to how vividly a trace can paint a picture of your application network. The more data available in the trace, the more accurately it can reflect reality.

What if traces could be composed of every data field required for root-problem analysis: logs, metrics, events, states, you name it? That would make traces the ultimate debugging tool.

This isn’t possible right now, but it’s fast entering the realm of possibility.

AI & Observability

Much as a trace does, I’d like to take a moment to paint you a picture, one of a not-too-distant future.

You’re an IT Ops admin who wants to configure your company’s network of 2,000+ applications to parse a trace in the header of each request. You spend maybe thirty minutes setting up your auto-instrumentation tooling, which then deploys an agent that completes the config in a matter of hours.

Your IT org can now start pinning enriched trace headers to every single request transmitted across your technology stack, capturing the logs, metrics, events, states, payload details, and latency at every step.

The phone rings; it’s the VP of Sales. She says the current platform stability is jeopardizing millions in ARR. “Don’t sweat it,” you say, as you boot up your observability copilot and start plying it with questions about system performance, the apps involved, and the teams responsible. It sends you back a list of the affected services and who to involve, and asks if you’d like to automatically create a tiger team on Slack to fix the issue.

“Yes. Yes I would,” you think to yourself as you click a button that spins up a Slack channel, pulls in the involved parties, and generates a text guide detailing the required fix.

You get a full eight hours of sleep.

Your company closes the deal.

Everybody is happy.

In this utopian future, the technology I outlined is essentially all a developer would ever need to debug any software issue.

Distributed Tracing is the axis around which this reality is being built.

So why not get started today?

