Postmortem: Comparison Site Memory Leak

Roger Creyke
3 min read · Dec 4, 2018


Take only what you need, and make sure you give it back when you are done.

Executive Summary

Recently our comparison site loaded but was unable to display any search result data or summary price data, instead showing users a preloader. The root cause was a problem in a code library supplied by Microsoft on which we depend. The same problem has affected many other consumers of this library, with similar consequences. The resolution was therefore to upgrade the library.

Business Impact

Users of the comparison site were unable to compare prices for sports fixtures during this time and the site was unresponsive.

Downtime

Cumulative Downtime: 39 minutes
Cumulative Partial Downtime: 0 minutes

Critical Timeline (UTC)

2018–12–03 20:30 - First “Find My Bet” failure occurred
2018–12–03 20:40 - Identified by engineering team
2018–12–03 20:45 - Investigation began
2018–12–03 21:09 - Live issue resolved by web application restart

Technical Details

A memory leak (where memory is allocated but not released after use) was observed when comparing the executing process on the live environment to that on the staging environment: the live process’s private memory was sitting at 6.3 times that of the staging process.
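
The exact tooling behind that comparison isn’t shown here. As a rough illustration only, the same kind of private-memory snapshot can be taken with System.Diagnostics; the process name "w3wp" below is an assumption about an IIS-hosted site, not a detail from this incident.

```csharp
using System;
using System.Diagnostics;

class MemorySnapshot
{
    static void Main()
    {
        // "w3wp" is a hypothetical process name (IIS worker process); substitute the real one.
        foreach (var process in Process.GetProcessesByName("w3wp"))
        {
            process.Refresh(); // make sure cached counters are up to date
            double privateMb = process.PrivateMemorySize64 / (1024.0 * 1024.0);
            Console.WriteLine($"PID {process.Id}: private memory {privateMb:F2} MB");
        }
    }
}
```

Taking the same snapshot on live and staging is what makes a ratio such as the 6.3× above visible.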

POSTs to Filters/Search were returning an HTTP 500 error, with a System.OutOfMemoryException being thrown internally, indicating that the live application’s available memory had been exhausted.

Live Environment

Staging Environment

Profiling with dotMemory on a local build highlighted a significant allocation of roughly 1.5 MB per search request, although a GC pass appeared to reclaim most of this.
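
The per-request figure came from dotMemory itself; as a crude in-code approximation of “allocated per operation but mostly reclaimed after a GC pass”, something like the sketch below can be used. The loop allocating byte arrays is a hypothetical stand-in for a real search request, which is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

class AllocationProbe
{
    // Rough approximation: managed bytes still retained by one operation after a forced GC.
    static long RetainedBytesAfter(Action operation)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        operation();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        return after - before; // bytes that survived a full collection
    }

    static void Main()
    {
        long retained = RetainedBytesAfter(() =>
        {
            // Hypothetical stand-in for a single search request's temporary allocations.
            var temp = new List<byte[]>();
            for (int i = 0; i < 100; i++) temp.Add(new byte[16 * 1024]);
        });
        Console.WriteLine($"Retained after GC: {retained} bytes");
    }
}
```

If most of what a request allocates is reclaimed, the retained figure stays small; memory that keeps climbing between forced collections is what points at a leak.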

The local process was left running idle overnight and total memory crept from 150.17 MB to 259.29 MB over a 14-hour period with no user interaction with the site. The process therefore appears to have an idle memory leak.

Overnight run with 5.0.2

This memory leak issue was identified as a potential candidate because we rely heavily on the assembly in question: Microsoft.Azure.Search 5.0.2.

The root dependency in question was Microsoft.Rest.ClientRuntime 2.3.13, which we decompiled. It was adding multiple event handler registrations for individual instances of RetryDelegatingHandler, preventing the GC pass from cleaning them up (a simplified sketch of the pattern follows the screenshots below).

Old implementation (left)

New implementation (right)
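
The decompiled before and after implementations are shown above as screenshots. As a generic, simplified sketch (the names below are illustrative, not the actual ClientRuntime source), the problem is the familiar .NET event-subscription leak: a long-lived publisher holds a delegate to every subscriber that registers with it, so each handler instance stays reachable and the GC cannot collect it.

```csharp
using System;

// Illustrative names only; this is not the decompiled Microsoft.Rest.ClientRuntime code.
class RetryPolicy
{
    // Long-lived publisher: every subscription keeps its subscriber reachable.
    public event EventHandler Retrying;

    public void SignalRetry() => Retrying?.Invoke(this, EventArgs.Empty);
}

class HandlerExample
{
    private readonly RetryPolicy _policy;

    public HandlerExample(RetryPolicy policy)
    {
        _policy = policy;
        // Leaky pattern: subscribe on every construction and never unsubscribe,
        // so the shared policy accumulates a reference to each handler instance.
        _policy.Retrying += OnRetrying;
    }

    private void OnRetrying(object sender, EventArgs e)
    {
        // retry bookkeeping would go here
    }

    // The fix amounts to unsubscribing once the handler is finished with
    // (or not re-registering for every request in the first place).
    public void Detach() => _policy.Retrying -= OnRetrying;
}
```

With the fixed library version, instances registered this way no longer accumulate, so a GC pass can reclaim them.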

The product was updated locally to Microsoft.Azure.Search 5.0.3 (known to have resolved this issue) and exhibited improvements.

Overnight run with 5.0.3

A pull request was issued to get this update into the main code-base and resolve the primary source of the memory leak. It was deployed the following day, as the leak was manageable in the meantime and a hotfix was deemed unnecessary.

Future Mitigations

We highlighted the need to monitor search successes on the site with our watchdog. We also highlighted the need to monitor memory usage over time, which we believe we can do more efficiently once the site is running in a container on Kubernetes. We agreed that non-trivial releases need to soak on staging for an as-yet-undetermined number of hours before release to live, so we can validate that there are at least no obvious idle memory leaks.
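
Our watchdog’s actual checks aren’t detailed in this post; a minimal sketch of the kind of probe we have in mind is below. The URL and endpoint are hypothetical placeholders, and memory-over-time would more likely come from platform metrics (for example container metrics on Kubernetes) than from the probe itself.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Minimal watchdog sketch: periodically POST a search and record whether it succeeded.
// "https://example.com/Filters/Search" is a placeholder, not the real site URL.
class SearchWatchdog
{
    private static readonly HttpClient Http = new HttpClient();

    static async Task Main()
    {
        while (true)
        {
            try
            {
                var response = await Http.PostAsync(
                    "https://example.com/Filters/Search",
                    new StringContent("{}", Encoding.UTF8, "application/json"));

                Console.WriteLine($"{DateTime.UtcNow:o} search status={(int)response.StatusCode}");
            }
            catch (HttpRequestException ex)
            {
                Console.WriteLine($"{DateTime.UtcNow:o} search failed: {ex.Message}");
            }

            await Task.Delay(TimeSpan.FromMinutes(5));
        }
    }
}
```

A probe like this would surface Filters/Search 500s within one polling interval of them starting.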
