Implementing distributed tracing with Azure’s Application Insights

Benney Au
Technology @ Prospa
4 min read · Apr 2, 2020

As software engineers, understanding our distributed systems is essential to enhancing them, maintaining them, and delivering business value. By implementing tracing, distributed systems become easier to debug and diagnose.

This article is a guide to implementing tracing using Microsoft Azure’s Application Insights. Tracing here refers to writing events from related systems such that an end-to-end transaction log can be reconstructed. My previous article describes the technical details behind how Application Insights makes this possible.

Distributed systems present challenges that make diagnosing issues harder:

  • Several systems participate in serving a single user request.
  • A concurrent system interleaves trace events from different requests.

Monitoring the system becomes more difficult. Finding all the logs associated with one request is a nightmare.

How to implement distributed tracing using Application Insights

Distributed tracing, as implemented by Application Insights, addresses these problems:

  • It enables constructing a hierarchy of trace events, even across process boundaries.
  • Log messages can be tagged with contextual information, making them easier to query.

To record an event that occurs in the same process as a child operation, call StartOperation and give it a name.

public void RenderEmail(string emailType, EmailViewModel emailViewModel)
{
    using var operation = _telemetryClient.StartOperation<DependencyTelemetry>($"Render email {emailType}");
    operation.Telemetry.Type = "Email";
    operation.Telemetry.Properties["EmailType"] = emailType;
    var body = _razorViewToStringRenderer.RenderViewToString(emailViewModel);
}

Note: the above code example uses using declarations, which are new in C# 8.

When this method is called inside an ASP.NET Core API request, Application Insights constructs a hierarchy.

An end-to-end transaction showing two horizontal bars stacked on top of each other. The top bar is labelled “POST /request-email” and its length represents the time taken. A shorter bar beneath it represents the time consumed by the child operation “Render email”.

Traces can also be correlated across process boundaries. If two systems collaborate in a publisher / subscriber relationship, their relationship can be inferred. This is illustrated below.

  • System A receives an HTTP POST request.
  • System A then places a message on a queue.
  • System B is a worker that processes the message.

An end-to-end transaction trace with three horizontal bars stacked on top of each other. The first two bars are labelled System A and the third bar is labelled System B. This illustrates how traces from related systems can be collected.

Note: when using Azure Service Bus with Azure App Services, this hierarchy is created automatically if both apps have Application Insights configured.

To achieve this in code, the message producer (System A) needs to write the current System.Diagnostics.Activity.Id somewhere in the message. On the message consumer side (System B), the following code needs to be added.

public async Task Process(T message, CancellationToken cancellationToken)
{
    using var operation = _telemetryClient.StartOperation<RequestTelemetry>(
        new Activity("Process message").SetParentId(message.ParentId));
    try
    {
        await _inner.Process(message, cancellationToken);
        operation.Telemetry.Success = true;
    }
    catch (Exception)
    {
        operation.Telemetry.Success = false;
        throw;
    }
}
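For completeness, the producer side (System A) might be sketched as follows. The EmailMessage type, its ParentId property, and the _queueClient field are illustrative assumptions, not part of any SDK; the essential step is copying Activity.Current?.Id onto the outgoing message.

```csharp
// Sketch of the producer (System A). The message carries the current
// Activity id so the consumer can attach its own trace as a child.
public class EmailMessage
{
    public string ParentId { get; set; } // hypothetical property read by the consumer
    public string UserId { get; set; }
}

public async Task Publish(EmailMessage message)
{
    // ASP.NET Core sets Activity.Current for the incoming request
    message.ParentId = System.Diagnostics.Activity.Current?.Id;
    await _queueClient.SendAsync(message); // hypothetical queue client
}
```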

If the background worker is triggered by a timer, then no logical parent exists. In this scenario, only an operation name is needed.

using var operation = _telemetryClient.StartOperation<RequestTelemetry>("Process Timer Job");

Note: When using Azure WebJobs / Azure Functions, this code is not necessary. The framework will automatically wrap each function execution with an operation.

The above describes how a hierarchy of trace events can be created using System.Diagnostics.Activity and the StartOperation method. This helps to find all trace events involved in the context of a single user request.

Next, tagging trace events is demonstrated. Tags help engineers query related trace events across different user requests: relevant information from the operation context is added to every log message. Consider the above example, where a user calls POST /message. Tags can help engineers identify:

  • what other requests this user made
  • what the performance of all requests made by this user was
  • how many different users are making requests to the POST /message endpoint

The simplest way to tag log messages with relevant contextual information is the System.Diagnostics.Activity.AddTag() method. This provides a logging-framework-agnostic mechanism in .NET Core to enrich logs.

First, add AddTag to the code:

System.Diagnostics.Activity.Current?.AddTag("UserId", message.UserId);

// any log message generated inside this method
// will be tagged with UserId
var result = await _myService.Save(message);
if (!result.Success)
{
    // this warning will automatically be tagged with UserId
    _logger.Warning("unable to save data");
    return this.Succeed();
}

Finally, a telemetry initializer needs to be registered to tell Application Insights to add activity tags to custom dimensions. This has been helpfully implemented in the NuGet package Prospa.Extensions.ApplicationInsights. Once set up, all traces performed in the context of a user’s request will be tagged with that user’s ID. This makes answering the above questions on user impact possible.
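If you would rather not take a package dependency, a minimal initializer along these lines would do the same job. This is a sketch, not the package’s actual implementation; it uses the standard ITelemetryInitializer extension point from the Application Insights SDK.

```csharp
using System.Diagnostics;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Copies tags from the current Activity onto every telemetry item,
// so they show up as custom dimensions in Application Insights.
public class ActivityTagsTelemetryInitializer : ITelemetryInitializer
{
    public void Initialize(ITelemetry telemetry)
    {
        var activity = Activity.Current;
        if (activity == null || !(telemetry is ISupportProperties properties))
        {
            return;
        }

        foreach (var tag in activity.Tags)
        {
            // don't overwrite dimensions that were set explicitly
            if (!properties.Properties.ContainsKey(tag.Key))
            {
                properties.Properties[tag.Key] = tag.Value;
            }
        }
    }
}
```

Register it in Startup.ConfigureServices with services.AddSingleton<ITelemetryInitializer, ActivityTagsTelemetryInitializer>(); Application Insights will then apply it to every telemetry item.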

Real use cases

This section describes some real use cases that distributed tracing and properly tagged logs enable:

  • When there is an outage of a particular service, such as a database, the system will log a number of errors. Since all error messages are tagged with the user ID, it is trivial to assess customer impact: an engineer needs only to perform a distinct query on the error logs.
  • A distributed system has intermittent performance issues. The distributed traces illustrated above show timing information that can be examined to pinpoint the issue. N+1 database requests or a particularly slow API will immediately become apparent.
  • A daily batch job is executed: how is it performing? With operations logged in the way described above, it is easy to understand:
      • how long jobs take to run
      • how many messages are processed each day
      • whether there are intermittent socket exceptions caused by inappropriate `HttpClient` usage

Conclusion

This article has described techniques to make diagnosing distributed systems simpler.

  • To create a hierarchy of logs, wrap your code in a using declaration and call the StartOperation extension method from Application Insights; pass the parent operation id if one is available.
  • To tag log messages, instrument your code using the System.Diagnostics.Activity.AddTag() method. Then register a TelemetryInitializer to copy the tags into the custom dimensions.
