Optimizing garbage collection in a high-load .NET service
--
Pyrus is used daily by several thousand organizations worldwide. The service’s responsiveness is an important competitive advantage, as it directly affects user experience. Our key performance metric is “percentage of slow queries.” One day we noticed that our application servers tended to freeze for about 1000 ms every other minute. During these pauses several dozen queries piled up, and customers occasionally saw random delays in UI response times. In this post we track down the causes of this erratic behavior and eliminate the bottlenecks that the garbage collector caused in our service.
Modern programming languages can be divided into two groups. In languages like C/C++ or Rust, memory is managed manually, so programmers spend more time on coding, managing object life cycles, and debugging. Memory-related bugs are some of the nastiest and most difficult to debug, so most development today is done in languages with automatic memory management, such as Java, C#, Python, Ruby, Go, PHP, and JavaScript. Programmers gain a productivity boost, but they trade full control over memory for unpredictable pauses introduced by the garbage collector (GC) whenever it decides to step in. These pauses may be negligible in small programs, but as the number of live objects and the rate of object creation grow, garbage collection starts to add considerably to the program's running time.
Pyrus web servers run on the .NET platform, which offers automatic memory management. Most garbage collections are “stop-the-world” ones: they suspend all threads in the app. So-called background GCs actually pause all threads too, but only very briefly. While the threads are blocked, the server isn’t processing queries, so those already in progress freeze and new ones are queued. As a result, the queries that were being processed when the GC started take longer to complete, and the queries waiting in line behind them are delayed as well. All of this affects the “percentage of slow queries” metric.
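As a rough illustration of how one might start observing this behavior, the sketch below prints the GC configuration of the current process and counts how many collections of each generation happen while some code runs. It is a minimal example, not the instrumentation Pyrus uses: the `GcProbe` class, the `CollectionCounts` helper, and the allocation loop are hypothetical and exist only for demonstration. `GC.CollectionCount`, `GC.MaxGeneration`, and `GCSettings` are standard .NET APIs.

```csharp
using System;
using System.Runtime;

public static class GcProbe
{
    // Returns the number of collections observed so far for each
    // generation; the array index is the generation number.
    public static int[] CollectionCounts()
    {
        var counts = new int[GC.MaxGeneration + 1];
        for (int gen = 0; gen <= GC.MaxGeneration; gen++)
        {
            counts[gen] = GC.CollectionCount(gen);
        }
        return counts;
    }

    public static void Main()
    {
        // Which GC flavor and latency mode the process is running with.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);

        int[] before = CollectionCounts();

        // Hypothetical workload (an assumption for this sketch):
        // allocate many short-lived buffers to provoke collections.
        long checksum = 0;
        for (int i = 0; i < 1000000; i++)
        {
            var buffer = new byte[1024];
            buffer[0] = (byte)i;
            checksum += buffer[0];
        }

        int[] after = CollectionCounts();
        for (int gen = 0; gen < after.Length; gen++)
        {
            Console.WriteLine("gen" + gen + " collections during workload: "
                + (after[gen] - before[gen]));
        }
        Console.WriteLine("checksum: " + checksum);
    }
}
```

Collection counts alone say nothing about how long each pause lasted, so a sketch like this only shows how often the collector runs. Measuring pause durations requires tracing tools or runtime events.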
Armed with a copy of Konrad Kokosa’s book Pro .NET Memory Management, we began to look into the problem.