Saving Prod on a Wednesday with Two Lines of Code

Julio Gutierrez
7 min read · Mar 1, 2024


This is a story of how Woo and I, both software engineers on the IDEAs team at Microsoft, saved a production web service with a two-line code change that reduced memory consumption by 5GB.

Introduction

At first, on-call engineers received low-severity alerts about the service’s health. The issues were transient, and our metrics showed that the service had seemingly “self-healed”. Those few alerts soon turned into an avalanche that kept on-call engineers awake throughout the night as machines started hitting 100% memory usage. We observed that our partners’ outgoing reliability to our service had dropped below three 9s and that various processes were failing.

I think this image from one data center captures the situation. All nodes in all the data centers we used showed similar symptoms.

Zooming in at an individual machine level during a smaller time window revealed that machines had 100% memory usage.

Average available memory in the East Asia data center by VM (Left). One VM’s average available memory in the East Asia data center (Right).

Our partner’s outgoing reliability to our service dropped below three 9s at times.

Our partner’s outgoing reliability to our service.

Woo and I decided we needed a memory dump to find opportunities for a quick hotfix change that could significantly reduce memory usage. The alternative was to scale up — but that would take days if not weeks as we would need to rebuild clusters.

There was also no guarantee that scaling up would fix the issue permanently. How much longer until we hit 100% memory usage again? It could have been another week or a month.

The condition that drove machines to 100% memory usage appeared transient: locally rare, but common across the fleet. The plan was to wait for telemetry to report a machine with high memory usage and then take the memory dump in the moment, that is, while the machine was still serving production traffic.

Getting the Memory Dump

After remoting in, I saw that the machine had high memory usage and that 26GB (out of this VM’s 28GB) was used by the service. Taking the memory dump produced a 30GB file.

What I saw after remoting into the machine.
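The post doesn’t hinge on which tool captured the dump, but for readers who want to do the same: Sysinternals ProcDump is one common way to take a full user-mode dump of a live process without stopping it, where -ma writes a dump that includes all process memory. The PID and file name below are placeholders, not values from this incident.

procdump -ma <pid> frontend_full.dmp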

There is no customer content in these dumps — just a bunch of GUIDs and other metadata. Regardless, all standard Microsoft privacy and security practices were followed before analyzing the memory dump.

Analyzing the Memory Dump

I opened the dump with WinDBG and got to work. I ran the !DumpHeap command with -stat to get per-type counts and total sizes for the objects in memory. The first surprise in this dump was the 109 million System.String objects that accounted for 9GB of memory.

> !DumpHeap /d -stat 
MT Count TotalSize Class Name
...
00007fffd802c2c8 109658818 9992216730 System.String

Iterating over the strings revealed that many were 98 bytes in size.

> !DumpHeap /d -mt 00007fffd802c2c8
...
000001c6e88db718 00007fffd802c2c8 98
000001c6e88db798 00007fffd802c2c8 98
000001c6e88db800 00007fffd802c2c8 98
000001c6e88db868 00007fffd802c2c8 98
000001c6e88db8d0 00007fffd802c2c8 98
000001c6e88db938 00007fffd802c2c8 98
000001c6e88db9a0 00007fffd802c2c8 98
000001c6e88dba08 00007fffd802c2c8 98
000001c6e88dba70 00007fffd802c2c8 98
000001c6e88dbad8 00007fffd802c2c8 98
000001c6e88dbb40 00007fffd802c2c8 98
000001c6e88dbbc0 00007fffd802c2c8 98
000001c6e88dbc28 00007fffd802c2c8 98
000001c6e88dbc90 00007fffd802c2c8 98
000001c6e88dbcf8 00007fffd802c2c8 98

And sampling through some of these strings showed they were stringified GUIDs:

0:000> !DumpObj /d 000001c6e88c0e58
Name: System.String
...
Size: 98(0x62) bytes
...
String: 4C1C3980-80B6-4715-9D39-C75CC5347B8D

(I replaced this sample GUID from the memory dump with a random GUID.) I then used !gcroot to figure out which object was holding on to these System.String objects.

0:000> !gcroot 000001c6e88c0e58
...
-> 000001c943c1d520 Microsoft.ServiceA.FrontEnd.Models.DeleteFeedDataHolder
-> 000001c943c1d548 System.Collections.Generic.HashSet`1[[System.Object, mscorlib]]
-> 000001d1c25f1020 System.Collections.Generic.HashSet`1+Slot[[System.Object, mscorlib]][]
-> 000001c6e88c0e58 System.String

And there I saw that the DeleteFeedDataHolder object owned a HashSet that pointed to these strings. Checking the HashSet, I saw that it held 87M items:

0:000> !DumpObj /d 000001c943c1d548
Name: System.Collections.Generic.HashSet`1[[System.Object, mscorlib]]
...
Fields:
MT Field Offset Type VT Attr Value Name
00007fffd802ebb0 40004ce 8 System.Int32[] 0 instance 000001cc625f1020 m_buckets
00007fffd66a9808 40004cf 10 ...non, mscorlib]][] 0 instance 000001d1c25f1020 m_slots
00007fffd802ec18 40004d0 28 System.Int32 1 instance 87179071 m_count
00007fffd802ec18 40004d1 2c System.Int32 1 instance 87179071 m_lastIndex

I also used !objsize on the HashSet and saw that the data structure itself took up more than 2GB of memory. This is separate from the 9GB of System.String objects.
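The !objsize output itself isn’t reproduced here; the invocation is simply the SOS command run against the HashSet’s address from the !gcroot output above:

0:000> !objsize 000001c943c1d548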

There are 87M items in the HashSet, each corresponding to a System.String, and each System.String takes up 98 bytes of memory. 87M * 98 bytes is more than 8GB of memory!

I consulted with Woo, who knew from his domain expertise in this scenario that the data source guaranteed these strings to be GUIDs. Woo decided to parse these strings into objects of type System.Guid, the hypothesis being that a GUID would take up significantly less memory.

But how much memory savings would this hotfix provide?

Assessing the Impact of the Potential Hotfix

Each character in a System.String object takes up two bytes, because the .NET Framework runtime encodes System.String objects as UTF-16. 36 characters in the GUID * 2 bytes per character = 72 bytes. On top of that, there are 26 bytes of metadata supporting an object of type System.String (e.g., the 8-byte pointer to the .NET type). The stringified GUID held only 36 bytes’ worth of ASCII characters, including the dashes, yet took up about 2.7x that in memory.

Each object of type System.Guid has a size of 32 bytes, according to the memory dump. A GUID only requires 16 bytes of data, but because these GUIDs are stored as object references (boxed), .NET adds 16 more bytes of object overhead.

0:000> !dumpobj /d 000001c745214120
Name: System.Guid
MethodTable: 00007fffd8017310
EEClass: 00007fffd79bec98
Size: 32(0x20) bytes
File: C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007fffd802ec18 4000586 8 System.Int32 1 instance 1594435533 _a
00007fffd8016730 4000587 c System.Int16 1 instance 1534 _b
00007fffd8016730 4000588 e System.Int16 1 instance 11381 _c
00007fffd802e2d0 4000589 10 System.Byte 1 instance 129 _d
00007fffd802e2d0 400058a 11 System.Byte 1 instance 164 _e
00007fffd802e2d0 400058b 12 System.Byte 1 instance 52 _f
00007fffd802e2d0 400058c 13 System.Byte 1 instance 76 _g
00007fffd802e2d0 400058d 14 System.Byte 1 instance 69 _h
00007fffd802e2d0 400058e 15 System.Byte 1 instance 74 _i
00007fffd802e2d0 400058f 16 System.Byte 1 instance 163 _j
00007fffd802e2d0 4000590 17 System.Byte 1 instance 85 _k
00007fffd8017310 4000585 278 System.Guid 1 shared static Empty
>> Domain:Value 000001c6c48f5690:000001cac52012c0 <<

(As before, the GUID values above were modified from the original dump.) Going from 98-byte strings to 32-byte boxed GUIDs means we should expect a memory reduction closer to 3x, or a drop from about 8GB to 2.6GB. That is still a significant reduction of approximately 5.5GB of memory.
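Here is that back-of-envelope arithmetic as a small C# sketch. The item count comes from the HashSet’s m_count in the dump; the 98-byte and 32-byte sizes are the per-object figures discussed above.

using System;

// Rough estimate of memory held by the HashSet's contents, before and after the fix.
// The constants are taken from the memory dump discussed above.
const long ItemCount = 87_179_071;    // m_count of the HashSet
const long GuidStringBytes = 98;      // 36 UTF-16 chars (72 bytes) + ~26 bytes of string object overhead
const long BoxedGuidBytes = 32;       // 16 bytes of GUID data + 16 bytes of object overhead when boxed

long before = ItemCount * GuidStringBytes;  // ≈ 8.5GB of System.String objects
long after = ItemCount * BoxedGuidBytes;    // ≈ 2.8GB of boxed System.Guid objects
Console.WriteLine($"Before: {before / 1e9:F1}GB, after: {after / 1e9:F1}GB, saved: {(before - after) / 1e9:F1}GB");

The numbers land within rounding distance of the roughly 3x reduction and ~5.5GB of savings estimated above.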

With this confirmation Woo set out to implement the change.

Code Change

Woo investigated the code base and figured out this simple change. Here’s the change in a static method named AddToListHelper. The addition of Guid.TryParse() is two lines of code if you don’t count the braces.

// Try adding the value to a list that the holder can use
public static bool AddToListHelper(IList<object> list, IdType idType, string idValue)
{
    . . .
    if (Guid.TryParse(idValue, out var idValueGuid))
    {
        list.Add(idValueGuid);
    }
    . . .
}
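The elided parts of AddToListHelper aren’t shown here, so the fallback behavior below is an assumption rather than the team’s actual code, but it captures the core idea: Guid.TryParse never throws, so values that happen not to be well-formed GUIDs can still be stored as strings.

using System;

// A minimal sketch of the parse-or-keep idea (hypothetical helper, not the elided code):
// store the compact Guid when the value parses, otherwise keep the original string.
static object ToCompactId(string idValue) =>
    Guid.TryParse(idValue, out var idValueGuid) ? (object)idValueGuid : idValue;

Console.WriteLine(ToCompactId("4C1C3980-80B6-4715-9D39-C75CC5347B8D").GetType()); // System.Guid
Console.WriteLine(ToCompactId("not-a-guid").GetType());                           // System.String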

The DeleteFeedDataHolder class maintains the HashSet as revealed in the memory dump.

public class DeleteFeedDataHolder
{
    private readonly IdType idType;
    private readonly HashSet<object> idsPerEntity;

    public DeleteFeedDataHolder(long timestamp, IList<object> idsPerEntity, IdType idType)
    {
        this.Timestamp = timestamp;
        this.idType = idType;
        . . .
        this.idsPerEntity = new HashSet<object>(idsPerEntity);
        . . .
    }
}

The constructor of this class accepts the list built with AddToListHelper and uses it to construct the HashSet. Chad, another IDEAs software engineer who reviewed the PR after we deployed the change, pointed out that this HashSet constructor copies its elements (HashSet&lt;T&gt; Constructor (System.Collections.Generic) | Microsoft Learn), so there is a moment when duplicate objects sit in memory before the garbage collector marks and sweeps away the initial list. That’s another great change for the future.
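To make that future change concrete, here is one possible shape for it, sketched under the assumption that the call sites could pass the holder the raw ID strings directly; it is a hypothetical refactoring, not something we shipped. Adding parsed IDs straight into the HashSet avoids materializing an intermediate IList<object> that the garbage collector later has to clean up.

using System;
using System.Collections.Generic;

// Hypothetical follow-up: build the HashSet directly so there is never a second,
// duplicate copy of the IDs waiting to be collected. IdType is the team's existing enum.
public class DeleteFeedDataHolder
{
    private readonly IdType idType;
    private readonly HashSet<object> idsPerEntity;

    public long Timestamp { get; }

    public DeleteFeedDataHolder(long timestamp, IEnumerable<string> idValues, IdType idType)
    {
        this.Timestamp = timestamp;
        this.idType = idType;
        this.idsPerEntity = new HashSet<object>();

        foreach (var idValue in idValues)
        {
            // Same parse-or-keep idea as AddToListHelper, applied inline.
            this.idsPerEntity.Add(Guid.TryParse(idValue, out var guid) ? (object)guid : idValue);
        }
    }
}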

Prod is Saved

After deployment, available memory increased by 5GB as expected.

This was a great relief! Recall that the periods of 100% memory only show in this plot after zooming in to smaller time windows.

Max CPU also stopped hitting 100%. Flame graphs captured while a machine was at 100% CPU showed the CPU busy spinning on locks while waiting for managed threads to queue further tasks. We assume the kernel was also busy copying memory pages to and from the swap file, which contributed to CPU usage as well.

CPUs stopped hitting 100% usage.
For the curious, these were some of the hot stacks in the flame graph when CPU usage was at 100%.

More importantly, the partner team’s outgoing QoS metrics to our service showed increased reliability and more consistent latency numbers.

Partner’s view of reliability and latency improvements.

The service’s outgoing calls to Redis also saw a reduction in P95 latencies.

P95 Latency reduction in outgoing calls to Redis.

These are the outgoing reliability improvements from the service to Redis cache.

(Left) Improvements to the reliability of incoming calls to the service. (Right) Improvements to the reliability of outgoing calls to Redis.

This small change took about a day to investigate, implement, and deploy globally. It gave us some time back in this fiscal quarter to focus on new experiences and products that customers will love.

The plan is to include this technical debt as part of next quarter’s planning process. That way we can properly address the underlying technical issues, maintain stellar QoS, and set expectations with leadership on where the team will invest time in the coming quarter.
