Build your own .NET CPU profiler in C#

Christophe Nasarre
Dec 8, 2020 · 6 min read

The last series was describing how to get details about your .NET application allocation patterns in C#.

  1. Get a sampling of .NET application allocations
  2. A simple way to get the call stack
  3. Getting the call stack by hand

It is now time to do the same but for the CPU consumption of your .NET applications.

Thanks you Mr Windows Kernel!

Under Windows, the kernel ETW provider allows you to get notified every milli-second with the call stack of all threads running on a core. Without any surprise, it is easy with TraceEvent to listen to these events. As explained in an old posts, you simply need to create a session, enable providers and listen to the right event.

For sampled CPU profiling, I’m using the TraceLogEventSource to wrap the event source and automatically get the stack frames symbol resolution:

You need to enable three providers:

  • Kernel: get the profiling event every milli-second and be notified when a dll gets loaded by a process to let TraceEvent manage the symbols
  • Clr: get JIT events describing managed method details
  • ClrRundown: get already JITted methods details

The code to handle the event is really simple:

I’m only interested in profiling a given process (hence the check on process id) and events with a call stack. The callstack is returned by the extension method CallStack() (see the previous post for more details). The main processing is done by the MergeCallStack() method. But before looking at the only complicated part, it is time to discuss a useful tip.

Tip: use ETLx Luke!

Like the previous posts about memory profiling, my goal is to demonstrate how to monitor applications as they run. However when you monitor an application CPU consumption, you would like to avoid any noisy neighbor that could highjack some cores. So minimizing the work of your profiling code is always a good idea. In addition, it could also be valuable to record the events and analyze them later. Microsoft Perfview is the open source tool that I’m using the most to dig into CPU consumption. So the solution is to simply record the events and generate an .etlx file for Perfview.

The first code change is small: the session is created with a filename.

I’m using a naming convention that contains the process ID I want to monitor so it will be easy to remember when I will analyze the recording in Perfview:

The second step to generate the .etlx file is a one liner:

The ConversionLog TraceLogOptions property is expecting a TextWriter to log all possible messages related to symbols resolution.

The parsing of kernel profiling samples is done on the TraceLog in a more manual way by selecting the events based on TaskGuid corresponding to the kernel profiling task and the OpCode:

How to “merge” call stacks

In both live and file based implementations, I end up merging call stacks by calling the MergeCallStack() method. Instead of jumping directly into the C# code, I prefer to describe what I’m expecting from “merging“ call stacks.

If you think about what frames (i.e. method call) would appear at the beginning all these threads call stacks, it seems obvious that they should start with the same code: either the main thread startup, timer/thread pool initialization or custom thread bootstrap. In case of server applications, the same request processing calls would lead to specific handlers or controllers code. Each time a common group of frames appears in different call stacks, it would be more readable to see them as different branches starting from the same trunk like in Visual Studio Parallel Stack panel.

In order to build a “visual” representation, I have to count the number of time each frame appears at the same place in the recorded call stacks. My data structure looks like a tree where each node contains the current frame, the sampling count (as node or as leaf) and a list of different child frames corresponding to the different execution branches:

Each frame contains both the address and the method signature that have been extracted from the callstack retrieved from the events:

The MergedSymbolicStack.AddStack() method is doing the real merging. The idea of merging call stacks is to start from the bottom and if the frame has already been seen (at this position), increment its sampling count. If not, remember it before incrementing the count. Look at the next frame and do the same match/remember + increment up to the top of the stack.

Here is an animation of what it would look like on a piece of paper (like the one I wrote down before starting to write the C# implementation :^)

Here is the corresponding C# code to merge a stack (i.e. an array of frames)

Last but not least, the constructors of the class reflect how to (1) create the root instance and (2) each node in the tree:

The code to render the merged stack

is not that complicated because everything is already in the tree of frames.

The IRenderer interface implementations are simply changing foreground color depending on what kind of information to display:

I have used the same “Visitor” pattern for the pstack tool/extension for WinDBG.

Not for Admin only

I always thought that I needed to be a member of the Administrator group and running elevated to be allowed to start a kernel profiling session. Well… This is in fact not the case! You have to dig into the documentation for configuring and starting a SystemTraceProvider session to read the following note:

If you want a non-administrators or a non-TCB process to be able to start a profiling trace session using the SystemTraceProvider on behalf of third party applications, then you need to grant the user profile privilege and then add this user to both the session GUID (created for the logger session) and the system trace provider GUID to enable the system trace provider. For more information, see the EventAccessControl function.

Long story short, you need a user to be part of the Performance Log Users group (makes sense) or grant her the TRACELOG_ACCESS_REALTIME permission. Obviously, you need an administrator account to do both but this can be done once on a machine by your IT in a secure way.

I wrapped a managed implementation of the corresponding code to add the permission in a ProfilingPermission class that hides all the P/Invoke and weird marshalling stuff to the native Windows API. Simply pass a user name to EnableProfileUser() and it should work just fine.

You are now ready to profile your application memory allocation patterns and CPU consumption!

Criteo R&D Blog

Tech stories from the R&D team

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store