Understand why .net core GC keywords are enabled

LEOCADIE Grégory
Criteo Tech Blog
Published in
5 min readJun 24, 2020

… with a debugger

Photo by Jon Tyson on Unsplash

In his article, Kevin explained how he inspected a live process to check if GC verbose events were enabled or not. I asked him about the starting point of his investigation so I can fix the issue. He showed me the following stripped out perf output:

As you can see in the above output, the payload (i.e. getting the type name of the allocated object) of the AllocationTick event is constructed before the event is sent.

I see two issues here:

  • As Kevin mentioned, this can have a performance impact on the application because computing the payload could be expensive
  • But more importantly, I did not configure the CLR to emit Verbose events so the AllocationTick event should not even be emitted!!!

First, let’s have a look at our architecture to understand how we collect CLR events.

Architecture

At Criteo, our .Net Core applications run on Linux in a Mesos Container. In this container, we deploy a tool alongside the application: MetricsCollector. This tool retrieves the CLR events from the main application, build metrics from those events, and sends those metrics to graphite servers.

We also build and use custom CLR. This way, our applications benefit from fixes we had to write but that is not accepted yet. We are running on version 3.1 that includes the EventPipe feature to retrieve CLR events.

There are 2 places where we instruct the CLR to emit events:

  • In the MetricsCollector, we instruct the CLR to emit the events though EventPipes

As you can see, we want CLR informational events about the GC, Contention, and Exceptions. As a startup, the MetricsCollector sends this configuration to the CLR through IPC.

  • We also set the environment variable COMPlus_EnableEventLog to 1. This activates CLR events emission but through the LTTng tracing framework. This setup is very important because it allows us to investigate ad-hoc performance issues in production using the perfcollect.sh script. It generates a file that we analyze in Perfview on Windows.

Now what?

As Kevin mentioned, the verbose level of CLR events was stored in the enableLevels array. I used his script to confirm the value and I needed another way to check this value (trust but verify😜)

He also mentioned that we (Criteo) had trouble investigating issues with LLDB. But I was feeling playful and I installed a standalone version of LLDB in my container to start the investigation. I got the PID of my application, attached LLDB to the corresponding process and looked into the content of the enabledLevels array

And we can confirm that the value is set to Verbose (mixed feelings, happy because I managed to confirm the issue and disappointed because it worked without any problem)

But how was this variable set?

I followed the code path in the runtime source code and ended up in the UpdateGCEventStatus method in the gcenv.ee.cpp file

The line above checks if GCJoin_V2 event is activated either through LTTng or through EventPipe (GCJoin_V2 is a verbose event)

The code of those 2 functions is generated at a compiled time by genLttngProvide.py and genEventPipe.py. I built the release/3.1 branch on our build machine to be able to inspect the generated files.

Once compiled, I started by looking at the source of EventPipeEventEnabledGCJoin_V2 function (bin/obj/Linux.x64.Release/src/vm/eventing/eventpipe/eventpipe/dotnetruntime.cpp).

EventPipeEventGCJoin_V2 is a global variable of type EventPipeEvent and declared in the same file. The source code of the EventPipeEvent class can be found in src/vm/eventpipeevent.cpp file:

So, back in LLDB to get the value of m_enableMask field and check if the event is enabled:

So GCJoin_V2 event is clearly not activated via EventPipes.

Now, let’s have a look at the EventXplatEnabledGCJoin_V2 function implementation (bin/obj/Linux.x64.Release/src/pal/src/eventprovider/lttngprovider/lttng/eventprovdotnetruntime.cpp)

The fact that XPlatEventLoggerIsEnabled does not take any parameter is weird because it means that enabling the GCJoin event is decided globally not based on that particular event or even any specific keyword!

The implementation of XplatEventLoggerIsEnabled is found in src/vm/eventtrace.cpp:

Look into src/inc/eventtracebase.h for the code that does the check:

The IsEventLoggingEnabled function returns the COMPlus_EnableEventLog environment variable value. As we set the value to 1, which is cast to the true boolean value. In the end, the EventXplatEnabledGCJoin_V2 function returns true, which sets the level of GC events to verbose.

Removing the environment variable could be a solution, but it’s not an acceptable solution for us as we regularly investigate issues in production using the perfcollect.sh script. This script collects CPU samples and also CLR events sent through LTTng. If this environment variable is not set, we won’t get any CLR events.

But something was bothering me: it should be possible to decide if an event should be emitted based on smaller granularity than a global setting that applies to all events. I spent a lot of time playing around with LTTng and I knew for a fact that the function tracepoint_enabled checks if a tracepoint is enabled (in our case, GCJoin_V2 tracepoint). I was wondering why this function was not used at all.

I had a second look at the eventprovdotnetruntime.cpp file that contains the source code of EventXplatEnbledGCJoin_V2 function and found this compilation directive:

The sample above means that “if tracepoint_enabled does not exist, replace the calls to tracepoint_enabled by XplatEventLoggerIsEnabled”: the name granularity is lost…

I looked into the LTTng include files on the build machine and did not find tracepoint_enabled definition. I discovered that this function was introduced in version 2.6 of lttng-ust library and on the build machine, only version 2.4 was installed.

How did we fix it?

At Criteo, we use the 2.10 version of LTTng on Centos 7.2 Linux distribution. It’s safe enough to update to 2.10 version on the build machine. I followed this documentation (for Red Hat Enterprise Linux 7) to install LTTng 2.10 and built the runtime anew.

When done, I started first by looking at the implementation of EventXplatEnabledGCJoin_V2 in bin/obj/Linux.x64.Release/src/pal/src/eventprovider/lttngprovider/lttng/eventprovdotnetruntime.cpp

So now we get the expected event granularity!

And inspecting the enabledLevels variable using LLDB confirmed the validity of the fix:

Interested in joining the crowd? Check out our current openings:

https://careers.criteo.com/working-in-R&D

--

--