Optimizing garbage collection in a high load .NET service

Max Nalsky
Dec 12, 2019 · 16 min read

Pyrus is used daily by several thousand organizations worldwide. The service’s responsiveness is an important competitive advantage, as it directly affects user experience. Our key performance metric is “percentage of slow queries.” One day we noticed that our application servers tend to freeze up for about 1000 ms every other minute. During these pauses several dozen queries piled up, and customers occasionally observed random delays in UI response times. In this post we search out the reasons for this erratic behavior, and eliminate the bottlenecks in our service caused by the garbage collector.

Modern programming languages can be divided into two groups. In languages like C/C++ or Rust, memory is managed manually, so programmers spend more time on coding, managing object life cycles, and debugging. Memory-related bugs are some of the nastiest and most difficult to debug, so most development today is done in languages with automatic memory management, such as Java, C#, Python, Ruby, Go, PHP, JavaScript, and so on. Programmers gain a productivity boost, trading full control over memory for unpredictable pauses introduced by the garbage collector (GC) whenever it decides to step in. These pauses may be negligible in small programs, but as the number of objects increases, along with the rate of object creation, garbage collection starts to add considerably to the program’s running time.

Pyrus web servers run on the .NET platform, which offers automatic memory management. Most garbage collections are “stop-the-world” ones: they suspend all threads in the app. Even the so-called background GCs pause all threads, though very briefly. While the threads are blocked, the server isn’t processing queries, so the ones already there freeze up, and new ones are queued. As a result, queries being processed at the moment the GC kicks in are processed more slowly, and the processing of the queries right behind them in line slows down, too. All of this affects the “percentage of slow queries” metric.

Armed with a copy of Konrad Kokosa’s book Pro .NET Memory Management, we began looking into the problem.


We began profiling the application servers with the PerfView utility, which is designed specifically for profiling .NET apps. Based on the Event Tracing for Windows (ETW) mechanism, it is minimally invasive in terms of the app’s performance degradation under profiling. You can easily use PerfView on a live production server. You can also control what types of events and what kind of information you collect: if you collect nothing, the impact on app performance is zero. Another upside is that PerfView doesn’t require you to recompile or restart your app.

Let’s run a PerfView trace with the parameter /GCCollectOnly (trace duration 90 minutes). We are collecting only GC events, so there’s a minimal effect on performance. Now let’s take a look at the Memory Group / GCStats trace report; inside let’s examine the summary of GC events:

Right away, we see several interesting indicators:

  • The average pause duration for Gen 2 collection is 700 milliseconds, while the maximum pause is about a second. This number represents the interval during which all threads in the .NET app stop. Each query being processed will be affected by this pause.
  • The number of collections in Gen 2 is comparable to that in Gen 1, and only somewhat lower than in Gen 0.
  • In the Induced column we see 53 collections in Gen 2. An induced collection is the result of an explicit call to GC.Collect(). We didn’t find any invocations of this method in our code, so the culprit must be one of the libraries our app uses.

Let’s talk about the number of garbage collections. The idea of separating objects by life cycle is based on the generational hypothesis: most objects die quickly, while survivors tend to live for a long time (in other words, there are few objects with a medium lifespan). The .NET garbage collector expects the program to adhere to this pattern and works best in that case: there should be far fewer garbage collections in Gen 2 than in Gen 0. So, to optimize for the garbage collector, we have to build our app to conform to the generational hypothesis. Objects must either die quickly, without surviving into the oldest generation; or else survive into the oldest generation and live there permanently. This assumption holds for other languages with generational garbage collectors, for example, Java.
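The promotion path through the generations is easy to observe directly with GC.GetGeneration. A minimal sketch (the 0 → 1 → 2 sequence shown holds for the standard GC; a concurrent background collection in flight could in principle shift the timing):

```csharp
using System;

class GenerationDemo
{
    static void Main()
    {
        var obj = new object();
        Console.WriteLine(GC.GetGeneration(obj)); // freshly allocated: 0

        GC.Collect(); // obj survives the collection and is promoted
        Console.WriteLine(GC.GetGeneration(obj)); // 1

        GC.Collect(); // survives again
        Console.WriteLine(GC.GetGeneration(obj)); // 2

        GC.KeepAlive(obj); // keep obj rooted through the collections above
    }
}
```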

Another chart from the GCStats report shows us some other interesting data:

Here we see cases where the app tries to create a large object (in the .NET Framework, objects larger than 85,000 bytes are created in LOH — Large Object Heap), and has to wait for the Gen 2 collection happening simultaneously in the background to complete. These allocator pauses aren’t as critical as the garbage collector pauses, because they only affect one thread. Before, we used .NET Framework version 4.6.1; Microsoft improved the garbage collector in version 4.7.1, and it now allows allocating memory from the Large Object Heap during a background Gen 2 collection: https://docs.microsoft.com/ru-ru/dotnet/framework/whats-new/#common-language-runtime-clr

So, we updated to the then-latest version 4.7.2.

Gen 2 Collections

Why do we have so many Gen 2 collections? The first hypothesis is a memory leak. To test it, let’s look at the size of Gen 2 (we set up monitoring of the appropriate performance counters in Zabbix, our monitoring tool). The graph of the Gen 2 size for two Pyrus servers shows that the size increases at first (mostly due to warming up the caches), but then flattens out (the large dips on the graph reflect a scheduled service restart during release deployment):

This means there are no noticeable memory leaks, so many of the Gen 2 collections must have occurred for another reason. The next hypothesis is excessive memory traffic: many objects get promoted into Gen 2 and die there. To locate these objects, PerfView has the /GCOnly setting. In the trace report, let’s look at Gen 2 Object Deaths (Coarse Sampling) Stacks, which shows the objects that died in Gen 2, along with the call stacks pointing to the source code locations where these objects were created. Here is the result:

When we drill down the row, we see the call stacks of code locations where objects that survive to Gen 2 are created. They include:

  • System.Byte[]. Looking inside, we see that more than half of them are buffers used for JSON serialization:
  • Slot[System.Int32][] (part of the HashSet implementation), System.Int32[], and so on. This is our own code, which calculates caches for the client — forms, catalogs, contacts, etc. This data is specific to the current user; it is prepared on the server and sent to the browser or mobile app to be cached there for fast UX:

It’s noteworthy that both the JSON buffers and the cache calculation buffers are temporary objects with a lifespan of a single query. So why do they survive into Gen 2? Note that all of these objects are quite large. Since they’re over 85,000 bytes, memory for them is allocated in the Large Object Heap, which is only collected together with Gen 2.
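The threshold is easy to verify: GC.GetGeneration reports objects on the Large Object Heap as belonging to the oldest generation. A quick sketch:

```csharp
using System;

class LohThresholdDemo
{
    static void Main()
    {
        // Well below the 85,000-byte threshold: a normal Gen 0 allocation.
        var small = new byte[80_000];

        // At or above the threshold (the array header counts toward the size,
        // so the exact cutoff in elements is slightly below 85,000): allocated
        // on the Large Object Heap, which is reported as the oldest generation.
        var large = new byte[85_000];

        Console.WriteLine(GC.GetGeneration(small)); // 0
        Console.WriteLine(GC.GetGeneration(large)); // 2
    }
}
```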

To double-check, let’s look at the PerfView /GCOnly results, GC Heap Alloc Ignore Free (Coarse Sampling) Stacks. Here, we see a LargeObject row, in which PerfView groups large object allocations; inside it we see the same arrays that we have seen in the previous analysis. We thus confirm the main reason for our GC problems: we are creating too many temporary large objects.

Changes in Pyrus

Based on the measurements, we have identified two problematic areas we need to address. Both areas relate to large objects: client cache calculations, and JSON serialization. There are a few ways to remedy this:

  • The easiest way is to not create large objects in the first place. For example, if a large buffer B is used in data transformations sequence A → B → C, you can sometimes combine the transformations, get rid of object B, and turn it into A → C. This is the simplest and most effective method, but its applicability is sometimes limited for code readability reasons.
  • Object pooling. This is a well-known technique: rather than constantly creating and discarding new objects, which puts pressure on the garbage collector, we maintain a collection of unused objects. In the simplest implementation, when we need a new object, we take it from the pool; only if the pool is empty do we create a new one. When we no longer need the object, we return it to the pool. A good example of this approach is ArrayPool in .NET Core, which is also available in the .NET Framework as part of the NuGet package System.Buffers.
  • Use small objects instead of large ones. By making objects small, we can make the app allocate temporary objects in Gen 0 rather than in LOH. So the stress on the garbage collector is moved from Gen 2 collections to Gen 0 and Gen 1 collections, and this is precisely the scenario for which generational GC is optimized.
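To make the pooling option concrete, here is what the rent/return pattern with ArrayPool<T> looks like (an illustrative sketch, not our production code):

```csharp
using System;
using System.Buffers; // NuGet package System.Buffers on the .NET Framework

class ArrayPoolExample
{
    static void Main()
    {
        ArrayPool<byte> pool = ArrayPool<byte>.Shared;

        // Rent may return a larger array than requested;
        // only the first 100,000 bytes are ours to use here.
        byte[] buffer = pool.Rent(100_000);
        try
        {
            // ... fill and use the buffer for the duration of one query ...
            Console.WriteLine(buffer.Length >= 100_000); // True
        }
        finally
        {
            // Return the buffer so the next query reuses it instead of
            // allocating a fresh large array on the LOH.
            pool.Return(buffer);
        }
    }
}
```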

Let’s look at our two large object cases in more detail.

Client Cache Evaluation

The Pyrus web app and mobile apps cache data that is accessible to users (projects, forms, users, etc.). The caches are calculated on the server, then transferred to the client. They are different for each user, because they depend on that user’s privileges. They are also updated pretty often, for example when the user gets access to a different form template or other Pyrus object.

So, a large number of client cache evaluations regularly takes place on the server, creating many temporary objects with a short lifespan. If the user is part of a large organization, they might have access to many objects, so the client caches for that user will also be large. That is precisely why we saw memory being allocated for large temporary objects in the Large Object Heap.

Let’s analyze the options proposed for getting rid of large object creation:

  1. Completely get rid of large objects. This approach is not applicable, because the data processing algorithms use, among other things, sorting and aggregation, which require temporary buffers.
  2. Using an object pool. This approach has certain complications too:
  • The diversity of collections used, and of the types of elements they contain: HashSet, List, and Array. The collections contain Int32, Int64, and other different data types. Each type used needs its own pool, which will furthermore have to maintain collections of different sizes.
  • The collection has a complex life cycle. To get the benefits of pooling, we’ll have to return objects to the pool after using them. This is easy to do when objects are created and discarded in a single method, class, or somewhat close in the code. Our case is a bit more challenging, because many large objects travel between methods, are stored in data structures, then transferred to other structures, and so on.
  • Implementation complexity. There is an ArrayPool implementation available from Microsoft, but we also need List and HashSet. We have not found a suitable library on the web, so we would have to do the implementation ourselves.

  3. Using small objects. A large object can be broken up into several smaller pieces that will not overload the Large Object Heap. They will be created in Gen 0, and get promoted to Gen 1 and Gen 2 in the standard way. We hope they will not survive to Gen 2, but will instead be collected by the garbage collector in Gen 0 or, at the latest, in Gen 1. The advantage of this approach is that it requires minimal changes to the existing code. Complications:

  • Implementation. We have not found appropriate libraries, so we would have to code our own. The lack of libraries is understandable, as “collections that don’t overload the Large Object Heap” is probably a pretty rare use case.

We have decided to go with the third option: implementing a List and HashSet that won’t overload the LOH.

Chunked List

Our ChunkedList<T> implements standard interfaces, including IList<T>, so only minimal changes to the existing code are required. We use the Json.NET library for serialization, and it can serialize any collection that implements IEnumerable<T>:

The standard List<T> has the following fields: an array for storing elements, and the number of elements actually stored. ChunkedList<T> contains an array of arrays of elements, the number of completely filled arrays, and the number of elements stored in the last array. Each of the arrays of elements takes up less than 85,000 bytes:
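A simplified sketch of that layout (illustrative, not our production code: only a few members are shown, and the real class implements the full IList<T>):

```csharp
using System.Collections.Generic;

public class ChunkedList<T>
{
    // 4,096 elements keeps a chunk well under 85,000 bytes for typical
    // element types, so every chunk lands in Gen 0 instead of the LOH.
    private const int ChunkLength = 4096;

    private readonly List<T[]> _chunks = new List<T[]>();
    private int _countInLastChunk; // elements used in the last chunk

    public int Count =>
        _chunks.Count == 0 ? 0 : (_chunks.Count - 1) * ChunkLength + _countInLastChunk;

    public void Add(T item)
    {
        // Start a fresh chunk when there are none, or the last one is full.
        if (_chunks.Count == 0 || _countInLastChunk == ChunkLength)
        {
            _chunks.Add(new T[ChunkLength]);
            _countInLastChunk = 0;
        }
        _chunks[_chunks.Count - 1][_countInLastChunk++] = item;
    }

    public T this[int index]
    {
        get => _chunks[index / ChunkLength][index % ChunkLength];
        set => _chunks[index / ChunkLength][index % ChunkLength] = value;
    }
}
```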

Since the structure of ChunkedList<T> is a bit complex, we made sure it was properly covered by unit tests. Any operation must be tested in at least two modes: “small,” when the whole list fits into one chunk under 85,000 bytes, and “large,” when it is split into more than one chunk. For methods that change the size (e.g. Add), there are even more scenarios: “small” → “small”; “small” → “large”; “large” → “large”; “large” → “small”. Here we get many edge cases, which unit tests are good at sorting out.

The good news is that we don’t have to implement all the methods of the IList interface, but only those that are actually used in our project. Also, coding unit tests is pretty straightforward: ChunkedList<T> should behave the same as List<T>. In other words, we can structure all the tests like this: create List<T> and ChunkedList<T>, conduct the same operations on both, and compare the results.

We measured performance with the help of the BenchmarkDotNet library, to make sure we weren’t slowing down our code too much while moving from List<T> to ChunkedList<T>. Let’s test, for example, the addition of entities to the list:
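Our benchmark had roughly this shape (sketched here with the standard List<T>; the names are illustrative, and the ChunkedList<T> case is the same method with the type swapped). The [MemoryDiagnoser] attribute is what produces the allocation and GC-per-generation columns discussed below:

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class ListAddBenchmark
{
    // The two element counts discussed below: one chunk vs. several.
    [Params(500, 50_000)]
    public int N;

    [Benchmark(Baseline = true)]
    public List<int> StandardList()
    {
        var list = new List<int>();
        for (int i = 0; i < N; i++)
            list.Add(i);
        return list;
    }

    // The ChunkedList<int> benchmark is identical with the type swapped:
    // [Benchmark] public ChunkedList<int> Chunked() { ... }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ListAddBenchmark>();
}
```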

And the same test, using List<T> for comparison. Here are the results after adding 500 entities (everything fits in one array)¹:

Here’s what we see after adding 50000 entities (broken up into several arrays):

¹A detailed description of the columns:

The Mean column shows the average time it took the test to complete. You can see that our implementation is only 2–2.5 times slower than the standard one. Considering that in production code list operations comprise only a small part of all the calculations performed, this difference is tolerable. The Gen 2/1k op column (the number of Gen 2 collections per 1,000 runs of the test), however, shows that we have achieved our goal: with a large number of elements, ChunkedList does not create garbage in Gen 2, which is what we were after.

Chunked Hash Set

Similarly, ChunkedHashSet<T> implements the ISet<T> interface. When implementing ChunkedHashSet<T>, we reused the chunking logic already implemented in ChunkedList. To do that, we took the implementation of HashSet<T> from the .NET Reference Source, available under the MIT license, and replaced the arrays in it with ChunkedLists.

Let’s use the same trick for the unit tests that we used for the lists: compare the behavior of ChunkedHashSet<T> with the reference HashSet<T>.

Finally, the performance tests. We use the set union operation pretty often, so that’s what we’re going to test:

And the exact same test for the standard HashSet. The first test is for small sets:

And the second test for large sets, which caused problems with the Large Object Heap:

The results are similar to the list results: ChunkedHashSet is 2–2.5 times slower, but with large sets it puts two orders of magnitude less pressure on Gen 2.

JSON Serialization

The internal and public APIs of the Pyrus application server use different serialization mechanisms. We discovered that large objects were created when third-party developers deployed bots that used the public API. The problem manifested itself as public API usage started to grow and the number of queries, along with the data volume in server responses, increased. Our internal API was not susceptible to this problem.

So we set out to optimize the public API. From our prior experience with the internal API, we know that it’s best to stream a response to the client, without creating an intermediate buffer.

Upon close inspection of the public API, we found out that during serialization we were creating a temporary buffer (“content” is an array of bytes that contains JSON in the UTF-8 encoding):
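The shape of the problem, sketched (our actual serialization code is more involved; this only illustrates where “content” comes from):

```csharp
using System.Text;

class ResponseSerialization
{
    // The whole JSON reply is materialized into one byte[] before sending.
    public static byte[] ToContentBuffer(string json)
    {
        // For a large reply this single buffer easily exceeds 85,000 bytes,
        // so it is allocated on the Large Object Heap.
        byte[] content = Encoding.UTF8.GetBytes(json);
        return content;
    }
}
```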

Let’s track down where “content” is used. For historical reasons, the public API is based on WCF, in which the standard format for queries and replies is XML. In our case, the XML reply contains one element named “Binary”; inside it is JSON encoded in Base64:
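Schematically, the reply envelope looks like this (the payload is a placeholder, not real data):

```xml
<Binary>…Base64 of the UTF-8 JSON reply…</Binary>
```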

Notice that the temporary buffer isn’t really necessary here. JSON can be written directly into the XmlWriter that WCF provides us, encoding it in Base64 on the fly. This is a good example of the first way to deal with large object allocation: not allocating at all.

Here, Base64Writer is a simple wrapper around XmlWriter: it derives from the abstract Stream class and writes into the XmlWriter in Base64 encoding. Of the whole Stream contract, we only have to implement one method, Write, which gets called by the StreamWriter:
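A sketch of such a wrapper (the class name matches the article; the body is a minimal illustrative version built on XmlWriter.WriteBase64):

```csharp
using System;
using System.IO;
using System.Xml;

// A write-only Stream that forwards every buffer to the XmlWriter as Base64,
// so no intermediate byte[] holding the full response ever exists.
public class Base64Writer : Stream
{
    private readonly XmlWriter _writer;

    public Base64Writer(XmlWriter writer) => _writer = writer;

    // The only method the serialization pipeline actually needs.
    public override void Write(byte[] buffer, int offset, int count) =>
        _writer.WriteBase64(buffer, offset, count);

    public override bool CanWrite => true;
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override void Flush() { }

    // Members a write-only, forward-only stream does not support.
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override int Read(byte[] buffer, int offset, int count) =>
        throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) =>
        throw new NotSupportedException();
    public override void SetLength(long value) =>
        throw new NotSupportedException();
}
```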

Induced GC

Let’s try to figure out the source of these mysterious induced garbage collections. We rechecked our code for GC.Collect calls multiple times, to no avail. We were able to capture these events in PerfView, but the call stack is not too informative (the event is DotNETRuntime/GC/Triggered):

There’s a small catch: the RecycleLimitMonitor.RaiseRecycleLimitEvent call before the induced GC. Let’s look through the call stack for the RaiseRecycleLimitEvent:

The method names are pretty self-explanatory:

  • RecycleLimitMonitor.RecycleLimitMonitorSingleton constructor creates a timer that calls PBytesMonitorThread at certain intervals.
  • PBytesMonitorThread collects memory usage statistics, and under some circumstances, calls CollectInfrequently.
  • CollectInfrequently calls AlertProxyMonitors, receives a bool, and calls GC.Collect() if the result is true. It also tracks the time since the last GC call, to avoid calling it too frequently.
  • AlertProxyMonitors goes through the list of IIS web apps launched, obtains the appropriate RecycleLimitMonitor object for each one, and calls RaiseRecycleLimitEvent.
  • RaiseRecycleLimitEvent obtains the list of IObserver<RecycleLimitInfo> handlers. The handlers receive RecycleLimitInfo as a parameter, in which they can set the RequestGC flag; it propagates back to CollectInfrequently, triggering an induced garbage collection.

Further investigation shows that the IObserver<RecycleLimitInfo> handlers are added in the RecycleLimitMonitor.Subscribe() method, which is called from the AspNetMemoryMonitor.Subscribe() method. Also, a default IObserver<RecycleLimitInfo> handler (the RecycleLimitObserver class) is automatically added in the AspNetMemoryMonitor class; it cleans ASP.NET caches and can also trigger a garbage collection.

The mystery of Induced GC is almost solved. We’ve yet to figure out the reason why this GC is called. RecycleLimitMonitor monitors IIS memory use (specifically, the Private Bytes number), and when its usage approaches a certain threshold it starts to fire the RaiseRecycleLimitEvent event. The value of AspNetMemoryMonitor.ProcessPrivateBytesLimit is used as the memory threshold, and it’s calculated as follows:

  • if Private Memory Limit (KB) is set for the Application Pool in IIS, then the value (in kilobytes) is taken from there;
  • otherwise, for 64-bit systems, 60% of physical memory is taken (the logic for 32-bit systems is more complex).

The conclusion of this investigation: ASP.NET approaches its memory usage limit and starts regularly triggering GCs. No value was set for Private Memory Limit (KB), so ASP.NET was capped at 60% of physical memory. The problem was masked by the fact that Windows Task Manager showed a lot of free memory, so it seemed like there was plenty. We raised the value of Private Memory Limit (KB) in the Application Pool settings in IIS to 80% of physical memory, which encourages ASP.NET to use more of the available memory. We also added monitoring of the .NET CLR Memory / # Induced GC performance counter, so we won’t miss the next time ASP.NET decides it is approaching the limit of available memory.

Repeat measurements

Let’s look at what happened with the GC after all these changes. Let’s start with PerfView /GCCollectOnly (trace duration is 1 hour), GCStats report:

We can see that Gen 2 collections now happen two orders of magnitude less often than Gen 0 and Gen 1. The duration of these collections has also decreased. There are no more induced GCs at all. Let’s look at the list of Gen 2 GCs:

In the Gen column we can see that all Gen 2 collections have become background (2B means Gen 2, Background). That means a large part of the GC’s work is done without suspending the app, and all threads are blocked only briefly (the Pause MSec column). Let’s look at the pauses that happen when large objects are created:

We can see that the number of these pauses when large objects are being created has significantly decreased.


This effort helped us understand what performance counters work best for keeping track of memory usage and garbage collection. We constantly monitor these counters, and when the event is triggered, we investigate the root cause to figure out whether it was a bug in the app or a spike in regular user activity:

Thanks to the changes described in this article, we have been able to significantly lower the number and duration of Gen 2 GCs. We found the reason for the induced GCs, and got rid of them. The frequency of Gen 0 and Gen 1 GCs increased tenfold (from 1 per minute to 10 per minute), but their average duration decreased (from ~200 ms to ~60 ms). The maximum duration of Gen 0 and Gen 1 GCs has decreased, though not as much. Gen 2 GCs have become faster, and long pauses of up to 1000 ms have almost completely gone away.

As for our key metric “slow queries percentage,” after all the changes we made, it dropped by 40%. Not bad for our users.

Thanks to Dan Veksler, Nick Mikheev, Dmitry Grigoriev, and Sergey Kulakov for helping craft this piece.
