Investigating .NET Out of Memory Exceptions Using Sysinternals ProcDump for Linux
A few weeks ago, I was helping Microsoft support identify the root cause of a customer's .NET application running into the infamous out of memory exception. The customer's workload was running in a containerized environment with restricted memory limits, which caused the out of memory exception to surface.
Here we’ll take a look at how the escalation engineers used Sysinternals ProcDump for Linux to resolve the customer's problem.
A note on security
A core dump is a static snapshot of all the memory content of a target process. As such, it can contain critically sensitive data including passwords, social security numbers, credit card numbers, encryption keys, etc. Please make sure you have the right processes in place to protect the confidentiality and integrity of core dumps.
What diagnostics data do I collect?
As is often the case with memory-related issues, the escalation engineers attempted to get a core dump when the out of memory exception occurred. A core dump is a super helpful file that contains the memory contents of the target process at the time of creation. Since we are looking at an out of memory exception, the core dump should be able to show where excessive memory is being consumed.
There is a plethora of ways to collect core dumps for .NET, and what they all have in common is that it is extremely difficult to generate a core dump at the exact point of an out of memory condition. After all, creating a core dump typically requires allocating some of the very memory that is already in short supply.
Sysinternals ProcDump for Linux to the rescue!
Using the existing dotnet-dump tool requires manual, on-demand execution. In production workloads where problems (such as an out of memory exception) happen at seemingly inconsistent times, chances are slim that an engineer will be able to spot that memory is increasing, connect to the problematic node and run dotnet-dump before the out of memory exception occurs and the target process terminates. Fortunately, Sysinternals ProcDump for Linux solves that problem by monitoring the target process and automatically generating a core dump when a certain event triggers. ProcDump supports a wide range of events for all kinds of interesting diagnostics scenarios. In this case, what we are interested in is generating a core dump when memory is getting closer to the max limit. Better yet, if we can tell ProcDump to generate a number of core dumps at different thresholds as memory approaches the max limit, we can compare the core dumps and see which type of memory is increasing in size.
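For comparison, an on-demand capture with dotnet-dump looks roughly like the following (the PID is just a placeholder), which is exactly the kind of manual step that is hard to time right before an out of memory exception:
$ dotnet-dump collect -p <PID>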
Installing ProcDump
To install ProcDump, head on over to the ProcDump for Linux GitHub repository and follow the installation instructions there.
For example, on Ubuntu 22.04 you would run the following to install ProcDump:
wget -q https://packages.microsoft.com/config/ubuntu/$(lsb_release -rs)/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
sudo dpkg -i packages-microsoft-prod.deb
sudo apt-get update
sudo apt-get install procdump
Running ProcDump
Since ProcDump generates a core dump (using ptrace) which contains the memory contents of the target process, we have to run it using sudo.
$ sudo ./procdump
...
...
...
Capture Usage:
procdump [-n Count]
[-s Seconds]
[-c|-cl CPU_Usage]
[-m|-ml Commit_Usage1[,Commit_Usage2...]]
[-gcm [<GCGeneration>: | LOH: | POH:]Memory_Usage1[,Memory_Usage2...]]
[-gcgen Generation]
[-tc Thread_Threshold]
[-fc FileDescriptor_Threshold]
[-sig Signal_Number]
[-e]
[-f Include_Filter,...]
[-pf Polling_Frequency]
[-o]
[-log]
{
{{[-w] Process_Name | [-pgid] PID} [Dump_File | Dump_Folder]}
}
As you can see, there are a number of different events we can use to tell ProcDump to generate a core dump when they occur. In our case, we wanted to generate a series of core dumps at different .NET memory thresholds so that we could investigate the memory increase. Let’s run the following ProcDump command:
$ sudo ./procdump -gcm 500,700,900 100
...
...
...
Process: TestWebApi (395501)
CPU Threshold: n/a
.NET Memory Threshold: >= 500 MB,700 MB,900 MB
Thread Threshold: n/a
File Descriptor Threshold: n/a
Signal: n/a
Exception monitor n/a
GC Generation 2008
Polling Interval (ms): 1000
Threshold (s): 10
Number of Dumps: 3
Output directory: .
[13:49:48 - INFO]: Starting monitor for process TestWebApi (395501)
Here we told ProcDump to monitor the process with PID 100 (you can also use -w for process names). We also told ProcDump to wait until the .NET memory of the process exceeds the following thresholds: 500, 700 and 900 MB using the -gcm switch. Once each threshold has been reached, a core dump will be generated.
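The -gcm switch is just one of the triggers listed in the usage output above. As a rough illustration (the process name and threshold values here are made up), you could instead trigger on overall commit usage with -m, or on first-chance exceptions with -e and an exception name filter via -f:
$ sudo ./procdump -m 500,700,900 TestWebApi
$ sudo ./procdump -e -f System.InvalidOperationException TestWebApi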
How do we know when the core dumps have been generated? In this case the customer already had monitoring in place that notified them when the process terminated. Since our memory thresholds are set below the max limit, the core dumps will be generated before the process reaches an out of memory condition and hence before process termination. Below we can see the 3 core dumps that were generated:
Polling Interval (ms): 1000
Threshold (s): 10
Number of Dumps: 3
Output directory: .
[14:20:43 - INFO]: Starting monitor for process TestWebApi (395998)
[14:20:48 - INFO]: Core dump generated: ./TestWebApi_1_gc_size_2023-09-03_14:20:47
[14:20:54 - INFO]: Core dump generated: ./TestWebApi_2_gc_size_2023-09-03_14:20:53
[14:21:00 - INFO]: Core dump generated: ./TestWebApi_3_gc_size_2023-09-03_14:20:59
[14:21:00 - INFO]: Stopping monitor for process TestWebApi (395998)
How do I analyze the core dumps?
There are a number of tools you can use (dotnet-dump, LLDB, WinDBG) to analyze core dumps and what they all have in common is that they use the same .NET debugging helper called SOS. SOS contains a number of commands that are super helpful when debugging .NET applications. Regardless of which tool you use, the same SOS commands will be available in your debug session.
SOS is not a call for help, nor is the name inspired by the ABBA song. In the early days of .NET (think 1.0) Microsoft had an internal helper called Strike. It turns out that our customers really wanted to use that helper as well. Before making it public some of the commands were removed (ones that only made sense for engineers working on .NET). Since this was a slightly scaled down version of Strike, it was called Son of Strike (SOS). As a matter of fact, with .NET being open source, if you are lucky, you might still see references to Son of Strike in the source code.
Let’s get back to our debug session. At this point, we’ve been able to get 3 core dumps at .NET memory thresholds 500, 700 and 900MB. Let’s use the dotnet-dump tool to open the dumps for analysis.
Installing dotnet-dump
dotnet-dump is a tool that can be used to analyze .NET core dumps. To install dotnet-dump, run:
$ dotnet tool install -g dotnet-dump
You can invoke the tool using the following command: dotnet-dump
Tool 'dotnet-dump' (version '7.0.442301') was successfully installed.
Analyze the core dumps using dotnet-dump
Once dotnet-dump is installed we can use the following command to analyze the first dump in our series:
$ dotnet-dump analyze TestWebApi_1_gc_size_2023-09-03_14:20:47
Loading core dump: TestWebApi_1_gc_size_2023-09-03_14:20:47 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
>
The > prompt is now waiting for you to type an SOS command that you want to run. To get an idea of the available commands you can type help.
When it comes to .NET memory, the dumpheap command is our friend. If you have a core dump of a process that is consuming a lot of memory, a great first step is to run:
> dumpheap -stat
Statistics:
MT Count TotalSize Class Name
...
...
...
7fbb5a058080 79 34,812 System.Int32[]
7fbb5a090da8 24 43,790 System.Char[]
7fbb59fab110 108 105,600 System.Object[]
7fbb5a05d2e0 1,702 156,788 System.String
7fbb5b0c6ed0 36 920,416 System.Byte[][]
55b3b0d76a50 57,188 2,104,808 Free
7fbb5ac52988 54,972 550,305,086 System.Byte[]
The dumpheap -stat command will show you a list of .NET types that are currently on the .NET heaps. The columns are:
MT — Method Table, which uniquely identifies the type.
Count — Number of instances of the type.
TotalSize — Total size of all instances of the type.
Class Name — Name of the type.
Furthermore, dumpheap is nice enough to sort by TotalSize, which means that the biggest consumers should be at the bottom of the list.
Here we can see that we have 550 MB of System.Byte[]. Is that the culprit that eventually gets us to an out of memory condition? Luckily, ProcDump generated 3 dumps for us (each at an increasing level of memory consumption) and we can open the other 2 and check the size using the same command:
# 2nd core dump
7fbb5ac52988 85,076 852,067,582 System.Byte[]
# 3rd core dump
7fbb5ac52988 110,076 1,102,667,582 System.Byte[]
Sure enough, System.Byte[] seems to be the culprit as it keeps growing.
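As a side note, if you only want to track a single type across the dumps, dumpheap can also filter the statistics by a (partial) type name, which keeps the comparison focused on the suspect:
> dumpheap -stat -type System.Byte[]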
Knowing System.Byte[] is suspect, how do we go about actually finding out why it's growing in a seemingly unbounded fashion? If the code base was trivial, perhaps we could get lucky and spot where in the code we use System.Byte[]. However, in a production workload the codebase will likely be fairly complex and a simple code review is not enough. One might think that if we could find the call stack that led up to the allocation we could get closer. Unfortunately, call stacks are not recorded as part of core dumps.
Fortunately, there is a construct almost as good as call stacks, and it's called roots. The .NET garbage collector has to keep track of which objects reference other objects to be able to do its job. If we could somehow find out what the roots are for one of the System.Byte[] instances, we would stand a better chance of identifying where in the code base it is used and therefore the (no pun intended) root cause. In order to run the gcroot command we first have to find an instance of a System.Byte[]. We can list individual instances of a particular type by using dumpheap -mt <method table>. The method table can be found in the first column of the dumpheap -stat output that we ran earlier. For our System.Byte[] method table, that looks like this:
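> dumpheap -mt 7fbb5ac52988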
...
...
...
7fb9068fb900 7fbb5ac52988 10,024
7fb9068fe040 7fbb5ac52988 10,024
7fb906900780 7fbb5ac52988 10,024
7fb906902ec0 7fbb5ac52988 10,024
7fb906905600 7fbb5ac52988 10,024
7fb906907d40 7fbb5ac52988 10,024
...
...
...
Please note that there can be a fairly large number of individual instances depending on memory consumption.
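If the list is overwhelming, dumpheap can also filter by object size; for example, something like the following should limit the output to instances of at least 10,000 bytes (the size value here is just an illustration):
> dumpheap -mt 7fbb5ac52988 -min 10000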
The first column in the output is the pointer to the object instance. We can take the object instance pointer and feed it to gcroot:
> gcroot 7fb906933fc0
Caching GC roots, this may take a while.
Subsequent runs of this command will be faster.
...
...
...
-> 7fb7440fd8a8 Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware
-> 7fb7440fae50 Microsoft.AspNetCore.Http.RequestDelegate
-> 7fb7440fac38 Microsoft.AspNetCore.Routing.EndpointRoutingMiddleware
-> 7fb7440f7f78 Microsoft.AspNetCore.Http.RequestDelegate
-> 7fb7440f7f00 Microsoft.AspNetCore.Routing.EndpointMiddleware
-> 7fb7440f4198 Microsoft.AspNetCore.Routing.RouteOptions
-> 7fb744045da8 System.Collections.ObjectModel.ObservableCollection<Microsoft.AspNetCore.Routing.EndpointDataSource>
-> 7fb744045de0 System.Collections.Generic.List<Microsoft.AspNetCore.Routing.EndpointDataSource>
-> 7fb7440f4f80 Microsoft.AspNetCore.Routing.EndpointDataSource[]
-> 7fb7440cf9b0 Microsoft.AspNetCore.Routing.ModelEndpointDataSource
-> 7fb7440cf9c8 System.Collections.Generic.List<Microsoft.AspNetCore.Routing.DefaultEndpointConventionBuilder>
-> 7fb7440da070 Microsoft.AspNetCore.Routing.DefaultEndpointConventionBuilder[]
-> 7fb7440d6248 Microsoft.AspNetCore.Routing.DefaultEndpointConventionBuilder
-> 7fb7440d60d8 Microsoft.AspNetCore.Routing.RouteEndpointBuilder
-> 7fb7440d6078 Microsoft.AspNetCore.Http.RequestDelegate
-> 7fb7440d47c8 Microsoft.AspNetCore.Http.RequestDelegateFactory+<>c__DisplayClass36_0
-> 7fb7440d44f0 System.Action
-> 7fb74400f700 Program+<>c__DisplayClass0_0
-> 7fb7440b22e8 ContosoBinaryCache
-> 7fb7440b2300 System.Collections.Generic.List<System.Byte[]>
-> 7fb9540df0e0 System.Byte[][]
-> 7fb906933fc0 System.Byte[]
In the output above, we notice that gcroot did find a root, and each line represents a type that holds a reference to the type below it (and so on). The type we asked about is System.Byte[] and can be seen at the bottom of the reference chain. The application is an ASP.NET application and as such has a bunch of ASP.NET specific references near the top of the output. The more interesting part comes closer to the bottom, where we can see that there is a ContosoBinaryCache type that holds a reference to a List<System.Byte[]>, followed by System.Byte[][] and finally our System.Byte[] instance. Since I know ContosoBinaryCache is in our codebase, I can now focus my attention on that part of it and identify the culprit of the memory exhaustion. In this case, it turned out to be a simple case of an unbounded cache.
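If you want to dig a little further in the dump itself before going back to the code, you can also inspect the fields of any object in the reference chain with the dumpobj command, for instance the List<System.Byte[]> held by the cache (the address is taken from the gcroot output above):
> dumpobj 7fb7440b2300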
Reach out!
Sysinternals ProcDump for Linux is a powerful process monitoring tool that can generate core dumps based on a wide range of events.
We are always looking to add support for new events and would love to hear from you if there are events you would be interested in, or if you have any other feedback.
You can find us on the ProcDump for Linux GitHub repository.