Tracing allocations with EventPipe part 1 — dotnet-trace and TraceLog

Olivier Coanet
6 min read · Jul 24, 2024

Context

Part of my work at ABC Arbitrage consists of developing our trading platform, which contains multiple high-performance applications running on .NET. These applications are not allowed to make any heap allocations after startup, to prevent latency spikes caused by GC pauses. Even though we have low-level automated tests to detect allocations at the class level, we also need to detect allocations at the application level.

Automated low-level allocation tests have the same weakness as unit tests: they make assumptions about how the types are used and how they interact with each other. If the assumptions are wrong, the tests may pass even though the code is incorrect (and, in our case, generates undetected heap allocations).

Profiling the production applications to detect allocations is out of the question due to the performance impact. Profiling the applications in the UAT environment is a good way to detect allocations, but it suffers from two issues:

  • Trading signals are very complex to emulate, and although our UAT environment is very similar to the production environment, there are slight differences, so some code paths might not be executed in the UAT environment (but every code path is unit tested, of course).
  • Our UAT environment uses a legacy, in-process virtual market component that generates heap allocations, which causes noise in our profiling. We also have an out-of-process virtual market mode that could be helpful, but right now it is harder to configure, and we use the in-process mode most of the time. To be honest, we should probably fix these issues during our team hack days.

I decided to explore another path. Our trading components are event-sourced, or more precisely command-sourced if the distinction is important to you. So, it is possible to take the event-store file and reprocess it offline, which effectively runs the exact production code paths of the business components. We internally call this reprocessing a replay. We simply need to run the replays with a profiler to detect allocations!

Note that this offline profiling will only detect allocations in our event-sourced business components, not in the technical infrastructure components. However, our technical components are quite stable, while the business components are under active development by multiple teams, so focusing on the business components has a lot of value. It is also much easier to write automated allocation tests for the technical components, and a better time investment given their stability.

Furthermore, our event-store files are uploaded to Amazon S3, and we already have the infrastructure to automatically run replays on AWS Batch. This means we can run replays of the previous day's event-store files for all our trading applications in a few minutes. We just need to implement a profiling mode!

Collecting traces

We use a mix of Windows and Linux in our trading platform. Our trading components run on both operating systems. Our on-prem replay executions run on Windows, while our cloud-based replay executions run on Linux. So, we needed a cross-platform profiling API. It turns out EventPipe is the perfect tool for the job.

EventPipe lets you capture trace events generated by event providers. The .NET runtime already includes multiple event providers that can help you analyze garbage collections, allocations, lock contention, exceptions, and many other technical aspects of your programs. You can choose the types of events you want to consume by enabling keywords on a provider: each keyword activates a specific set of events.

If you already know the trace events you are interested in, it can be challenging to find the related provider keywords. Christophe Nasarre pointed me to this file, which can be used as an index from keywords to event types. For our use case, we simply need to enable the following keywords of the runtime provider:

  • GCAllObjectAllocation to get allocations with call stacks from the GCSampledObjectAllocationTraceData events.
  • GCHeapAndTypeNames and Type to resolve the allocated object types from the GCBulkTypeTraceData events.

I discovered later that call stack method resolution depends on another provider that is automatically enabled by dotnet-trace; more on that in the next post.

Note that GCAllObjectAllocation is a combination of the GCSampledObjectAllocationHigh and GCSampledObjectAllocationLow keywords. It generates a lot of events and has a heavy impact on the target application's performance. For most profiling scenarios, enabling GCAllObjectAllocation is clearly not recommended. But it is perfectly fine for replay-based allocation detection: we need to track all allocations, and the application slowdown is not a big issue.

However, the EventPipe API has a slight problem:

  • We need to detect all allocations by enabling the GCAllObjectAllocation keyword, but this keyword must be enabled before the application starts.
  • The only way to create an EventPipe session is by creating a DiagnosticsClient with a process ID, which implies that the application must be started first (see the sketch below).
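
For reference, a minimal sketch of the DiagnosticsClient approach, using Microsoft.Diagnostics.NETCore.Client and the keyword constants from the TraceEvent library (the process ID is a placeholder):

    using System.Diagnostics.Tracing;
    using Microsoft.Diagnostics.NETCore.Client;
    using Microsoft.Diagnostics.Tracing.Parsers;

    // The client can only target an already running process, but the
    // GCAllObjectAllocation keyword must be enabled before the process starts.
    var processId = 12345; // placeholder PID
    var client = new DiagnosticsClient(processId);

    var providers = new[]
    {
        new EventPipeProvider(
            "Microsoft-Windows-DotNETRuntime",
            EventLevel.Verbose,
            (long)(ClrTraceEventParser.Keywords.GCAllObjectAllocation
                   | ClrTraceEventParser.Keywords.GCHeapAndTypeNames
                   | ClrTraceEventParser.Keywords.Type)),
    };

    using var session = client.StartEventPipeSession(providers);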

This puzzle can be solved by tracing the application with dotnet-trace instead of using the DiagnosticsClient directly. dotnet-trace uses internal code to start applications with a diagnostic port, which is the only way to collect traces from startup (and to enable keywords before the application starts).

Here is the command line to run dotnet-trace for our use case:
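
The keyword mask 0x3280000 below combines GCAllObjectAllocation (0x2200000), GCHeapAndTypeNames (0x1000000), and Type (0x80000), 5 is the verbose level, and the replay application name and arguments are placeholders:

    dotnet-trace collect \
        --providers Microsoft-Windows-DotNETRuntime:0x3280000:5 \
        --output replay.nettrace \
        -- dotnet MyReplayApp.dll --replay yesterday.events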

Parsing traces

EventPipe will generate a nettrace file. This file can be directly processed using EventPipeEventSource, but there is a more convenient option, particularly if you need to extract call stacks. The TraceEvent library includes TraceLog, a trace file parser which exposes a rich object model on top of trace files. TraceLog is based on a specific file format, ETLX, and the nettrace files must be converted to this format before processing.

EventPipeEventSource has a push model, where you subscribe to events and get a callback when they arrive. TraceLog supports both push and pull models.

The reference code to get allocations from our nettrace file would be:
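
Something along these lines, assuming the trace file is named replay.nettrace:

    using System;
    using Microsoft.Diagnostics.Tracing;
    using Microsoft.Diagnostics.Tracing.Etlx;

    // Convert the nettrace file to ETLX, then open it with TraceLog.
    var etlxPath = TraceLog.CreateFromEventPipeDataFile("replay.nettrace");
    using var traceLog = new TraceLog(etlxPath);

    // TraceLog also exposes a push model through an event source.
    var source = traceLog.Events.GetSource();

    source.Clr.GCSampledObjectAllocation += data =>
    {
        // CallStack() returns null when no stack was captured for the event.
        var callStack = data.CallStack();
        if (callStack == null)
            return;

        Console.WriteLine($"Allocation of TypeID 0x{data.TypeID:X}");
        for (var frame = callStack; frame != null; frame = frame.Caller)
            Console.WriteLine($"    {frame.CodeAddress.FullMethodName}");
    };

    source.Process();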

This code uses the nifty CallStack extension method, which is only available for events created by TraceLog and which is the main reason to use TraceLog here.

In order to get the allocated type names from the allocation event's TypeID, we can use the GCBulkTypeTraceData events to build a type map:
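
A sketch of this pattern; the handler must be registered on the same source as above, before calling Process():

    using System.Collections.Generic;

    var typeNames = new Dictionary<ulong, string>();

    // GCBulkType events are exposed as TypeBulkType by the CLR parser.
    // Each event carries a batch of (TypeID, TypeName) pairs.
    source.Clr.TypeBulkType += data =>
    {
        for (var i = 0; i < data.Count; i++)
        {
            var value = data.Values(i);
            typeNames[(ulong)value.TypeID] = value.TypeName;
        }
    };

    // The allocation handler can then resolve the type name with:
    //     typeNames.TryGetValue((ulong)data.TypeID, out var typeName);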

I discovered this pattern in one of the excellent posts from Christophe Nasarre.

Improving analysis

On the one hand, the replay program generates many heap allocations during the startup phase. The replay engine also generates a few heap allocations during execution, primarily from the event-store event deserialization. It is possible to ignore allocations with specific call stacks, but it might be fragile and break when the replay engine changes.

On the other hand, it would be nice to identify the ID of the event-store events that generated the allocations, to allow developers to debug the code if the call stack is not sufficient to understand the allocation.

To solve both issues, I added a custom event provider in the replay engine. This provider publishes trace events when the processing of an event-store event starts and stops. The trace events include the event-store event ID. The allocations that occur outside of the start-stop boundaries can be ignored, and the ones that occur within the boundaries can be linked to the event-store event ID.

Publishing custom trace events is very simple; you just need to define and use an EventSource:
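
A sketch of what such an EventSource can look like; the provider, event, and parameter names are illustrative, not our actual replay engine code:

    using System.Diagnostics.Tracing;

    [EventSource(Name = "AbcArbitrage-Replay")]
    public sealed class ReplayEventSource : EventSource
    {
        public static readonly ReplayEventSource Log = new ReplayEventSource();

        [Event(1)]
        public void ProcessingStart(long eventStoreEventId) => WriteEvent(1, eventStoreEventId);

        [Event(2)]
        public void ProcessingStop(long eventStoreEventId) => WriteEvent(2, eventStoreEventId);
    }

    // In the replay engine, around the processing of each event-store event:
    //     ReplayEventSource.Log.ProcessingStart(eventStoreEventId);
    //     // ... process the event-store event ...
    //     ReplayEventSource.Log.ProcessingStop(eventStoreEventId);

The custom provider also has to be added to the dotnet-trace --providers argument so that its events end up in the nettrace file.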

The trace events can then be directly read using TraceLog:
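
A sketch using the Dynamic parser on the same source as before; the provider, event, and payload names match the hypothetical EventSource above:

    // The event-store event currently being processed, or null outside of
    // the start/stop boundaries.
    long? currentEventId = null;

    source.Dynamic.All += data =>
    {
        if (data.ProviderName != "AbcArbitrage-Replay")
            return;

        if (data.EventName == "ProcessingStart")
            currentEventId = (long)data.PayloadByName("eventStoreEventId");
        else if (data.EventName == "ProcessingStop")
            currentEventId = null;
    };

    // The allocation handler can then ignore allocations when currentEventId
    // is null, and link the others to the corresponding event-store event.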

It is also possible to write a TraceEventParser to create custom event types, but it is probably not useful when you have only a few very simple custom events.

Closing

Using dotnet-trace, TraceLog, and a custom EventSource, we have an effective solution to automatically detect heap allocations in our business components. Our internal replay engine is perfect for profiling our code with GCAllObjectAllocation enabled, without impacting the performance of the production services.

Yet, I feel slightly disappointed. I see a few missed opportunities here:

  • The profiling process generates large trace files, because there are a lot of heap allocations during the initialization phase, and also because the event-store event deserialization generates heap allocations. It would be better to do “real-time” tracing and drop irrelevant allocations on the fly.
  • The usage of TraceLog requires converting the nettrace file to ETLX, which increases the required disk space.

Also, I am quite curious, and I wonder how TraceLog is able to extract and resolve call stacks from the events. Can I just open the original nettrace file and parse it manually, without using TraceLog? Where are the call stacks stored? And how is address resolution implemented?

Well, I know the answers now, but that’s a story for the next post!

Many thanks to @romainverdier and @Lucas_Trz for the reviews.
