Optimizing garbage collection in a high-load .NET service
Pyrus is used daily by several thousand organizations worldwide. The service’s responsiveness is an important competitive advantage, as it directly affects user experience. Our key performance metric is the “percentage of slow queries.” One day we noticed that our application servers tended to freeze for about 1000 ms every other minute. During these pauses several dozen queries piled up, and customers occasionally saw random delays in UI response times. In this post we track down the causes of this erratic behavior and eliminate the bottlenecks that the garbage collector created in our service.
Pyrus web servers run on the .NET platform, which offers automatic memory management. Most garbage collections are “stop-the-world” ones: they suspend all threads in the app. So-called background GCs pause all threads too, but only very briefly. While the threads are blocked, the server isn’t processing queries, so queries that are already in flight stall and new ones are queued. As a result, queries that were being processed when the GC started take longer to complete, and the queries queued right behind them are delayed as well. All of this affects the “percentage of slow queries” metric.
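To get a first impression of how often the collector actually runs, one can read the counters that the runtime already exposes. The snippet below is a minimal sketch, not the code we used in Pyrus: it assumes a modern .NET console project with top-level statements, and the helper names SnapshotCollectionCounts and AllocateGarbage, as well as the workload sizes, are hypothetical and exist only for illustration. It prints the GC flavor the process is running with and the number of collections per generation that a burst of short-lived allocations triggers.

```csharp
using System;
using System.Runtime;

// Report which GC flavor and latency mode the process is running with.
Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, latency mode: {GCSettings.LatencyMode}");

// Snapshot per-generation collection counts before the workload.
int[] before = SnapshotCollectionCounts();

// Hypothetical workload: many short-lived 1 KB arrays (about 200 MB in total).
AllocateGarbage(iterations: 200_000, bytesPerItem: 1024);

// Snapshot again and print how many collections each generation went through.
int[] after = SnapshotCollectionCounts();
for (int gen = 0; gen < after.Length; gen++)
    Console.WriteLine($"gen{gen} collections during workload: {after[gen] - before[gen]}");

// Reads the cumulative collection count of every generation (0..GC.MaxGeneration).
static int[] SnapshotCollectionCounts()
{
    var counts = new int[GC.MaxGeneration + 1];
    for (int gen = 0; gen <= GC.MaxGeneration; gen++)
        counts[gen] = GC.CollectionCount(gen);
    return counts;
}

// Allocates throwaway buffers; the checksum keeps the allocations from being optimized away.
static long AllocateGarbage(int iterations, int bytesPerItem)
{
    long checksum = 0;
    for (int i = 0; i < iterations; i++)
    {
        var buffer = new byte[bytesPerItem];
        buffer[0] = (byte)i;
        checksum += buffer[0];
    }
    return checksum;
}
```

The counters only say how often each generation was collected, not how long the threads were suspended, so they are a starting point for the investigation rather than a measurement of pause time.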
Armed with a copy of Konrad Kokosa’s book Pro .NET Memory Management, we set out to investigate the problem.