Distributed Tracing Deep Dive for Eventhub Triggered Azure Function in App Insights

Shervyna Ruan
The Startup
Published in
5 min readNov 10, 2020
End-to-end Transaction Details

In this blog post, I will explain the correlation logic with a simple system that consists of 2 azure functions and 2 event hubs.

If you take a look at the end-to-end transaction detail page, you will notice that App Insights is capable of identifying the correlation of each event from our first function to the second. This blog post will explain how App Insights is able to correlate the events with azure function executions. Basically, App Insights is able to derive the correlation from the “Diagnostic-Id” property of each event that carries the correlation context from a function over to another function.

The two images below are the properties of each telemetry that were collected by default. This will help you better understand the second section. There are two different cases here, and I will explain this in detail in the second section. To help you understand the correlation more easily, I marked the same numbers with the same color.

1. When function processed only one event in that invocation:

2. When function processed a batch of events in that invocation:

Explaining how the distributed tracing works from source code

Before you read the rest of the article, you need to know what an “Activity” is. It is basically a class in C# that stores the contexts and helps you achieve distributed tracing. Activities are stored using AsyncLocal so that you can access this activity from anywhere in the call graph. App Insights relies heavily on Activities and it translates Activities into telemetries to work with.

https://docs.microsoft.com/en-us/azure/azure-monitor/app/custom-operations-tracking#applicationinsights-operations-vs-systemdiagnosticsactivity

An Eventhub Event:

{
"Body": //your event,
"Properties":{},
"SystemProperties":{
"x-opt-sequence-number":4,
"x-opt-offset":"568",
"x-opt-enqueued-time":"2020-10-29T09:42:25.55Z"
}
}

When DiagnosticListener is enabled, Eventhub client will inject the current Activity Id as “Diagnostic-Id” into the event property before it sends the event.

If “Diagnostic-Id” exists in the event properties, those “Diagnostic-Id” will be used to create a list of activities called “links”. In my first function, the events do not have “Diagnostic-Id” in the properties, so no links were created.

Screenshot 1 (in Microsoft.Azure.WebJobs.Extensions.Eventhubs)

If no links were created, the “StartOperation” method in the telemetry client will be called without passing in the activities.

Screenshot 2 (in Microsoft.Azure.WebJobs.Logging.ApplicationInsights)

When StartOperation is called without passing in an activity, it creates a new activity within the method:

new Activity("Microsoft.ApplicationInsights.OperationContext");

StartOperation will call activity.start(). Activity.start() then calls the “GenerateW3CId” method that will generate the traceId and spanId for that activity. If a parentId is set for an activity, it will extract the traceId from the parentId. This time, the parentId is null, so it will generate random traceId and spanId.

Screenshot 3 (in System.Diagnostics.DiagnosticSource)

The Id of activity is generated by traceId and spanId according to the W3C TraceContext.

Screenshot 4 (in System.Diagnostics.DiagnosticSource)

For request telemetry, the operation Id is set to the traceId of the current activity, and the request id is set to the spanId.

Screenshot 5 (in application insight dotnet sdk)

Similarly, traces and other dependency telemetries will get the information that it needs from the current activity’s contexts.

When the first function calls sendAsync to send the events to the second Eventhub, this time, the Eventhub client injects the current activity id to the event properties as “Diagnostic-Id”, in order to pass the distributed tracing contexts to the next function.

link to sendasync definition in EventHubClient.cs

(P.S. ActivityIdPropertyName = “Diagnostic-Id”)

Screenshot 6 (in Microsoft.Azure.EventHubs)
{
"Body": //your event,
"Properties":{
"Diagnostic-Id":"00-4d43caac0301954dbc76e5a786f6739a-6784fcb30fba654c-00",
},
"SystemProperties":{
"x-opt-sequence-number":72,
"x-opt-offset":"4294975936",
"x-opt-enqueued-time":"2020-11-04T16:46:46.955Z"}
}
}

This time we have Diagnostic-id in the event properties. Azure Function Eventhub Extension loops through all events and creates a list of new activities(links) where the parentIds are set to the value of each Diagnostic-Ids.

Same as screenshot 1 (in Microsoft.Azure.WebJobs.Extensions.Eventhubs)

There are 2 cases here when activities(links) are not null:

Same as Screenshot 2 (in Microsoft.Azure.WebJobs.Logging.ApplicationInsights)
  1. There is only one activity(link): This means that the second function only processed one message. It will call the StartOperation by passing in the activity. In App Insights, you will see that the first function and second function share the same operation id in this case.
  2. There are multiple activities(links): This means that the second function processed a batch of messages. In this case, it will create a new activity without using any of the activities, because each event could be processed by different functions invocations and can have different trace ids. So instead of using any of the links, it will only append the activities to the “_MS.link” property, which is the “Operation links” that you see in App Insights. The first and second function will then have different operation ids, but App Insights will be able to correlate them and show the end to end transaction results by using the operation links.
//_MS.Link[
{"operation_Id":"5eca8b153632494ba00f619d6877b134","id":"d4c1279b6e7b7c47"},{"operation_Id":"ff28988d0776b44f9ca93352da126047","id":"bf4fa4855d161141"}
]

In the first case where the activity was passed into StartOperation as a parameter, it will generate traceId based on parentId, which was previously assigned to the value of Diagnostic-Id. Since parentId is not null, TrySetTraceIdFromParent() will extract the traceId from parentId and return true. SpanId will still be generated randomly. This is why both function’s telemetry can share the same operation Id in the first case.

Screenshot 7 (in System.Diagnotics.DiagnosticSource)
Screenshot 8 (in System.Diagnotics.DiagnosticSource)

Related blog: Calculating End-to-end Latency for Eventhub-triggered Azure Functions with App Insights

--

--