Tracing allocations with EventPipe part 3 — Tracing without dotnet-trace
In the first post, I explained how to use EventPipe to detect and locate heap allocations in a target application. In the second post, I described a way to parse the output nettrace file with EventPipeEventSource without relying on TraceLog. At this point, the solution is still based on two steps:
- Trace the target application with dotnet-trace. This step collects runtime and rundown provider events, along with custom trace events which add valuable business context to the allocations.
- Parse the output trace file with EventPipeEventSource in “push mode” to extract the trace events. Using “push mode” opens the door to real-time tracing, because the same API can consume a live event stream from a running process, as sketched below.
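As a quick reminder, “push mode” parsing looks roughly like this. It is a minimal sketch with a placeholder file name and placeholder handlers, not the exact code of the previous post:

```csharp
using System;
using System.IO;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;

// Open the nettrace file as a plain stream. The same code works with any Stream,
// including the live event stream of a running EventPipe session.
using var stream = File.OpenRead("trace.nettrace");
using var source = new EventPipeEventSource(stream);

// Subscribe to the CLR allocation events (push mode: callbacks are invoked as events are read).
var clr = new ClrTraceEventParser(source);
clr.GCSampledObjectAllocation += data => Console.WriteLine($"Allocation of type id {data.TypeID}");

// Custom trace events can be handled through the dynamic parser.
source.Dynamic.All += traceEvent => { /* business context events */ };

source.Process(); // blocks and pushes events to the callbacks until the stream ends
```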
In this post, I will find out whether I can finally do real-time tracing by connecting the EventPipeEventSource to a running EventPipe session instead of a trace file.
The problem with DiagnosticsClient
DiagnosticsClient is one of the core types of the Microsoft.Diagnostics.NETCore.Client package (the Diagnostics client library). DiagnosticsClient is required to start an EventPipe session on a target process:
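In its simplest form, this looks something like the following sketch; the PID and the provider list are placeholders:

```csharp
using System.Diagnostics.Tracing;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing.Parsers;

int targetProcessId = 12345; // hypothetical PID of an already running process

var providers = new[]
{
    new EventPipeProvider(
        "Microsoft-Windows-DotNETRuntime",
        EventLevel.Verbose,
        (long)(ClrTraceEventParser.Keywords.GC | ClrTraceEventParser.Keywords.GCAllObjectAllocation))
};

var client = new DiagnosticsClient(targetProcessId);

// Starts an EventPipe session on the target process; session.EventStream exposes the nettrace data.
using EventPipeSession session = client.StartEventPipeSession(providers, requestRundown: true);
```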
The problem here is that the only public constructor of DiagnosticsClient takes a process ID as a required parameter, so the process must be started beforehand. Yet the GCAllObjectAllocation keyword is only effective if it is enabled before the application starts. Until now, I worked around this apparent deadlock by using dotnet-trace, but dotnet-trace cannot be used for real-time tracing.
Getting a process ID without starting a process
One of my colleagues, @Kuinox_, pointed me to the CreateProcessForLaunch debugging function. This function allows creating a suspended process, getting its process ID, and resuming it later. It can be easily invoked in .NET using the convenient ClrDebug package.
The idea seemed interesting, so I gave it a try:
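Roughly, the experiment looked like the following sketch, here using plain P/Invoke against the dbgshim exports (the ClrDebug package wraps the same functions); the command line and provider list are placeholders:

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Runtime.InteropServices;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing.Parsers;

// Create the target process in a suspended state, so that its process ID is known before it runs.
DbgShim.CreateProcessForLaunch("TargetApp.exe", true, IntPtr.Zero, null, out var pid, out var resumeHandle);

var client = new DiagnosticsClient((int)pid);
var providers = new[]
{
    new EventPipeProvider(
        "Microsoft-Windows-DotNETRuntime",
        EventLevel.Verbose,
        (long)ClrTraceEventParser.Keywords.GCAllObjectAllocation)
};

// Dead-end: these two calls cannot be ordered in a way that works (see below).
DbgShim.ResumeProcess(resumeHandle);                          // resume the target process
using var session = client.StartEventPipeSession(providers); // start the EventPipe session

DbgShim.CloseResumeHandle(resumeHandle);

// P/Invoke declarations for the native dbgshim exports
// (shipped with the runtime, or via the Microsoft.Diagnostics.DbgShim package).
static class DbgShim
{
    [DllImport("dbgshim", CharSet = CharSet.Unicode)]
    public static extern int CreateProcessForLaunch(
        string lpCommandLine, bool bSuspendProcess, IntPtr lpEnvironment,
        string? lpCurrentDirectory, out uint processId, out IntPtr resumeHandle);

    [DllImport("dbgshim")]
    public static extern int ResumeProcess(IntPtr resumeHandle);

    [DllImport("dbgshim")]
    public static extern int CloseResumeHandle(IntPtr resumeHandle);
}
```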
However, it turned out to be another dead-end:
- If the process is resumed before the EventPipe session is started, the GCAllObjectAllocation keyword has no effect, because it gets enabled after the .NET runtime has started.
- If the EventPipe session is started before the process is resumed, the call hangs, because the DiagnosticsClient is unable to communicate with the suspended process.
CreateProcessForLaunch is not effective because the target process is suspended too early. For our objective, we need a cooperatively suspended CLR.
So, I decided to read the dotnet-trace source code (in the .NET diagnostics repository) to understand the appropriate workflow.
How dotnet-trace solves the DiagnosticsClient problem
After the challenging review of TraceLog in the previous post, the review of dotnet-trace was clearly more enjoyable. The most interesting lines are located in two files:
- CollectCommand.cs: It is the entry point of the dotnet-trace collect command.
- ReversedServerHelpers.cs: It contains a set of helpers responsible for starting the child process and creating the DiagnosticsClient.
The review additionally revealed that dotnet-trace uses internal members of the Microsoft.Diagnostics.NETCore.Client library, thanks to a convenient <InternalsVisibleTo Include="dotnet-trace" /> in the library's csproj.
in the library csproj. I would have called that cheating if I had not planned to apply similar tricks.
Here is how dotnet-trace is able to start tracing from the application startup:
1. It starts a diagnostics server on a dedicated named IPC channel.
2. It starts the target application with a specific environment variable, DOTNET_DiagnosticPorts, set to the IPC channel name.
This environment variable sets up a diagnostic port, which is an IPC channel that can be used to exchange diagnostic messages with the target application CLR. By default, the CLR will open a connection to the specified channel, which is why the diagnostics server was started first. The CLR will also suspend itself and wait for a resume IPC command.
3. It waits until the target application CLR connects to the IPC channel.
4. It creates a DiagnosticsClient using the internal, IPC-based constructor.
5. It starts the EventPipe session with the requested providers.
6. It resumes the target application CLR.
Under the hood, the last two steps send specific commands to the target application CLR over the IPC channel: step 5 sends a CollectTracing2 command, while step 6 sends a ResumeRuntime command.
Copying the dotnet-trace workflow
The workflow might seem hard to reproduce outside dotnet-trace, but a large part of it is implemented in the Microsoft.Diagnostics.NETCore.Client library, although it often depends on internal types and internal methods. For example, the control of the diagnostics server is implemented in the ReversedDiagnosticsServer.
Like in the previous post, I decided to use the IgnoresAccessChecksToGenerator package to access the internal members of the Diagnostics client library. Here is the implementation of the workflow:
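In outline, a simplified version looks like this; the numbered comments match the dotnet-trace steps listed above. The application path, timeout, and provider list are placeholders, and the internal members of the Diagnostics client library (ReversedDiagnosticsServer, IpcEndpointInfo, the IpcEndpoint-based DiagnosticsClient constructor) are assumed to be accessible through IgnoresAccessChecksToGenerator:

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Tracing;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;

// 1. Start a diagnostics server on a dedicated IPC channel.
//    (On Windows this is a named pipe name; on Linux/macOS it should be a full path to a Unix domain socket.)
var ipcChannelName = $"tracing-demo-{Guid.NewGuid():N}";
var server = new ReversedDiagnosticsServer(ipcChannelName); // internal type
server.Start();

// 2. Start the target application with DOTNET_DiagnosticPorts pointing to that channel.
//    By default the CLR connects to the channel, suspends itself, and waits for a resume command.
var startInfo = new ProcessStartInfo("TargetApp.exe") { UseShellExecute = false };
startInfo.Environment["DOTNET_DiagnosticPorts"] = ipcChannelName;
var process = Process.Start(startInfo)!;

// 3. Wait until the target application CLR connects to the channel.
IpcEndpointInfo endpointInfo = server.Accept(TimeSpan.FromSeconds(15));

// 4. Create a DiagnosticsClient using the internal, IPC-based constructor.
var client = new DiagnosticsClient(endpointInfo.Endpoint);

// 5. Start the EventPipe session (this sends the CollectTracing2 command).
var providers = new[]
{
    new EventPipeProvider(
        "Microsoft-Windows-DotNETRuntime",
        EventLevel.Verbose,
        (long)(ClrTraceEventParser.Keywords.GC | ClrTraceEventParser.Keywords.GCAllObjectAllocation))
};
using EventPipeSession session = client.StartEventPipeSession(providers);

// 6. Resume the target application CLR (this sends the ResumeRuntime command).
client.ResumeRuntime();

// The session can now be parsed in real time, as described in the previous post.
using var source = new EventPipeEventSource(session.EventStream);
source.Dynamic.All += traceEvent => { /* handle runtime and custom events */ };
source.Process();
```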
Connecting everything together
At this point, I finally had a way to achieve real-time tracing on a target application with the GCAllObjectAllocation keyword enabled. Moreover, by using the code from the previous article, I could parse the trace events in real time without TraceLog.
As a reminder, here are the steps for the initial version (V1):
- Trace the target application with dotnet-trace.
- Convert the output nettrace file to the ETLX format.
- Parse the ETLX using TraceLog.
The updated version (V2) is made of a single step where tracing and parsing are performed in parallel without any trace file I/O.
You can find example programs for both versions in my EventPipePlayground repository:
Back to my replay-based allocation detection
Of course, my goal was not to create sample programs but to improve my real allocation detector. If you remember the first post of the series, I created an offline allocation detector which replays the event-log files of my event-sourced business components. I applied the new workflow to this program.
Here are the durations for the processing of a daily event-log file:
- V1: ~00:05:59
  - Tracing phase duration: 00:04:56 (generates a 1.8GB nettrace file)
  - Trace file conversion duration: 00:00:58 (generates a 4.9GB ETLX file)
  - Parsing duration: 00:00:05
- V2: ~00:04:53 (-18%)
So, with the tracing and parsing being done in parallel, V2 basically saves the trace file conversion time. Additionally, I suspect that the conversion phase of V1 was I/O bound, so it might have performed even worse in the target execution environments:
- On-prem, because we run the replays on a single multi-core machine where file I/O would have been a bottleneck.
- On AWS, because the I/O performance of our workers is not as good as our on-prem servers and workstations.
One last optimization
The event-store replay engine that is used for the allocation detection has two modes:
- Synchronous: The event-store events are read and deserialized synchronously during the replay. The same event-store event instance is reused. It is relatively slow, but it generates almost no allocation.
- Asynchronous: The event-store events are read and deserialized asynchronously while the replay is running. A fixed number of preallocated event-store events are used. It is faster but it generates a few allocations, mostly from the async-await machinery.
Because my goal was to identify heap allocations in the business components, I initially decided to use the synchronous mode to reduce the allocation noise from the replay engine itself.
However, I ended up adding a custom event source which identifies the start, the end, and the thread ID of the event-store event processing. This made it straightforward to discard the replay engine allocations, because they occurred outside the start/end boundaries or on a different thread.
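For illustration, such an event source can be as simple as the following sketch; the source name, event IDs, and method names are made up:

```csharp
using System.Diagnostics.Tracing;

// Brackets the processing of each event-store event. The thread ID is captured automatically
// in the metadata of every trace event, so allocations outside the Start/Stop window or on
// another thread can be attributed to the replay engine and discarded.
[EventSource(Name = "Demo-Replay")]
public sealed class ReplayEventSource : EventSource
{
    public static readonly ReplayEventSource Log = new();

    [Event(1)]
    public void ProcessingStart(long eventId) => WriteEvent(1, eventId);

    [Event(2)]
    public void ProcessingStop(long eventId) => WriteEvent(2, eventId);
}

// Usage in the replay loop:
// ReplayEventSource.Log.ProcessingStart(eventStoreEvent.Id);
// businessComponent.Handle(eventStoreEvent);
// ReplayEventSource.Log.ProcessingStop(eventStoreEvent.Id);
```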
So, I switched back to the asynchronous replay mode. It reduced the processing duration from 00:04:53 to 00:04:32 (-7%).
Final words
First, on the functional side, I am pleased that we finally have a solution to automatically detect and report heap allocations in our performance-critical components. We already had low-level automated allocation detection tests as part of our continuous integration pipeline, but for application-level allocation detection we mostly relied on garbage collection monitoring and manual profiling to find the source of the allocations, which could be problematic when the garbage collections only occurred on a specific production instance. Now we can run offline tracing using UAT or production event-store files. The result is not only very precise thanks to GCAllObjectAllocation, but it also reports the event-log event ID, which the developer can use to run and debug the code locally if the stack trace is not sufficient to understand the problem.
Secondly, I think that spending time studying the .NET diagnostics components was clearly worth it. Even though I probably went a bit too far into analyzing the low-level implementations and protocols of the API, I strongly believe that the .NET diagnostics components are a very valuable part of the .NET ecosystem, with a very large range of use cases.
Thirdly, I discovered that there are a lot of articles and documentation pages on the subject. They are a bit scattered, so I would suggest starting with the TraceEvent Library Programmers Guide, then the Diagnostics Client Library Instructions, and then browsing the other markdown documentation files from both repositories. Additionally, you could check out Christophe Nasarre's articles on the subject. However, be careful: the documentation is often shared between ETW and EventPipe, and parts of it might be slightly inaccurate for EventPipe.
Lastly, I strongly advise using small programs to experiment with provider keywords and parsing options. I initially wasted a lot of time with a slow feedback loop by making changes directly in my real program. Also, you should start with a minimal set of provider keywords and only add more once you understand their effects, instead of cargo-cult copying a list of providers from another project or code sample, like I did at first.
That is it for this series of posts. Stay tuned for more performance-related content!
Many thanks to Christophe Nasarre and @romainverdier for the reviews.