The case of iOS OOM Crashes at Compass

Context

Bastien Falcou
Compass True North
16 min read · Oct 29, 2019


The Compass mobile team has been growing dramatically in the last few months. With the rapid growth of the codebase inevitably came an inflation of crashes. We quickly realized that it was time to bring the efforts dedicated to the quality of our product to the next level.

We started looking for the best crash reporting tools and integrated them with an incident response platform. We strengthened our processes around quality by setting up an on-call rotation and planning more regular memory management quality checks. The percentage of crash-free users improved. However, despite our best efforts, our customers were still reporting crashes.

We started to wonder: are there some types of crashes that do not get caught by our crash reporting system?

We quickly realized the existence of a “black box” around out-of-memory crashes. We had no idea how many times they happened, what triggered them, how many users were impacted, and how much the user experience of our app was suffering. And perhaps it wasn’t even so bad after all—but we needed confidence.

What is an OOM crash?

Before diving deeper into the topic, it is important to understand the basics of memory management.

What is memory management?

Memory refers to all mechanisms involved in storing information on your device. Your iPhone has two main ways of storing data: (i) the hard drive, or disk, which persists data even when the phone is powered off; (ii) the Random Access Memory, or RAM, whose contents are lost when the device is powered off.

When you run an app on your device, the system claims a chunk of the RAM, called the heap, and assigns it to the app. This is the place where all of your reference type instances will live while the app is running. Memory management is the process of managing heap memory throughout the life cycle of these objects, making sure that they are freed when they are no longer needed so the memory can be reused.

Managing the heap memory is important because some objects can be very large and our apps get only a limited amount of memory from the system. Running low on memory will cause an iOS app to run slower and eventually to be killed by the system (i.e. crash). Nowadays, it is less and less common to see a RAM overload since our devices are getting increasingly powerful. It is nonetheless always important to remain a good memory citizen.

Note: because value types are generally not allocated on the heap but instead statically on the stack, it is usually a good idea to rely on struct and enum whenever possible. Take this advice with caution: it is no longer true when we put a reference type inside a value type. The referenced object will still live on the heap, making memory leaks possible again.
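A minimal illustration of this caveat (the types are hypothetical):

```swift
import Foundation

final class Attachment {           // reference type, allocated on the heap
    let data: Data
    init(data: Data) { self.data = data }
}

struct Message {                   // value type...
    let text: String               // stored inline, no heap allocation needed
    let attachment: Attachment     // ...but this reference still points into the heap
}
```

Copying a `Message` copies the reference, not the `Attachment` itself, so the same heap object can end up retained from many places.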

Out-of-memory crash

An OOM crash occurs whenever an app is killed by the system because it used too much RAM and the OS decided to reclaim the memory for other processes. It can happen both when the app is in the foreground and when it is in the background.

Reducing or preventing OOM crashes is not just good for your app. It is also about being a respectful app citizen even when your app is in the background. A user will be more likely to manually kill all of their backgrounded apps if the phone is slower overall or the battery is draining faster than usual.

How much RAM do I get for my app?

While Apple has never disclosed any specific "RAM budget" allocated to your app, this StackOverflow post can give us an idea of healthy memory boundaries not to cross.

They used a small utility that allocates as much memory as possible in order to trigger a crash, and records the amount of memory in use at that point. According to the following chart, staying under 50% of the device's total RAM seems to be a good rule of thumb:

What causes an OOM crash?

There are multiple reasons that can cause the heap memory to become bloated and lead to an OOM crash:

Retain cycles

This was the main cause of OOM crashes in the Compass app. In fact, this cause is so vast and impactful that it deserved its own separate article.

Caching

Caching can be a vital asset to developers who are dealing with frequently accessed objects that require significant memory or computation time. Although it provides enormous benefits in terms of performance, caching can use very large amounts of memory. It is possible to cache so many objects that there is no RAM left for your application or others, potentially forcing the system to terminate them.

Note that images are a common example of large cached objects responsible for memory overloads.

Images

Image rendering can be expensive in terms of memory. The process is split into 2 phases: Decoding and Rendering.

The Decoding Phase is the transformation of image data (data buffer) into information that can be interpreted by the display hardware (image buffer). This includes color and transparency for each pixel.

The Rendering Phase is when the hardware consumes the image buffer and actually “paints” it on screen.

Let's take an example. You want to display 350 × 350 pixel photos in a UICollectionView. The photos were taken by an iPhone X (12 MP). The data representation of such a photo is loaded into memory in the image buffer, whose size is calculated by multiplying the number of bytes per pixel by the width and height of the image.

The photo is 3024 × 4032 pixels, and the Color Space is RGB with the Display P3 Color Profile. This color profile takes 16 bits per channel, which across four channels amounts to 64 bits, or 8 bytes, per pixel. Therefore, such a photo loaded with UIImage(named:) will take roughly 3024 × 4032 × 8 bytes ≈ 93.02 MB. And this is just for one image, among other images in a collection view.
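The arithmetic, spelled out as a back-of-the-envelope estimate:

```swift
// Rough image buffer estimate: width × height × bytes per pixel.
// Wide-color (Display P3) content uses 16 bits per channel,
// i.e. 64 bits = 8 bytes per pixel across 4 channels.
let width = 3024
let height = 4032
let bytesPerPixel = 8
let bufferSizeBytes = width * height * bytesPerPixel
let bufferSizeMB = Double(bufferSizeBytes) / (1024 * 1024)
// bufferSizeMB ≈ 93.02
```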

One common case where this might overuse memory is on iPad. Because iPads have a large screen and often a high resolution, the natural reflex is to display more images at higher quality. Yet some iPads actually have less memory and weaker specs than iPhones. This extra imagery can ultimately lead to the system terminating your app.

Increase Visibility

We have faced multiple situations in the past few months where our QA team encountered a crash that never resurfaced in our crash reporting board. This led us to wonder: how much visibility do we actually have into crashes in our app? How many crashes do we not know about?

Crash Reporting Tool

At Compass, we use the crash reporting service Crashlytics. For context, Crashlytics was acquired by Google in early 2018, which integrated it into its Firebase suite of tools. Previously owned by Twitter, it was part of the Fabric platform, which Google plans to shut down in March 2020.

While Fabric released a feature that detects OOM crashes in August 2016, it was never migrated to Crashlytics after the acquisition, preventing us from opening the OOM black box. At the time of writing, the Google team has no intention of adding this feature back into Crashlytics.

Attempt to track OOM crashes

Since Compass no longer relies on Fabric and its OOM crash reporting feature, we spent some time researching a similar alternative to use or implement. We found out that this technique was actually first published by Facebook here, and later adopted by Fabric.

The algorithm identifies why the app is starting, based on the circumstances under which the last session was terminated. It consists of a sequence of inference and elimination steps which, if none of them explains why the app closed, should leave us with only one possible cause left: an out-of-memory crash.

Note that this algorithm makes it possible to distinguish between Foreground Out-Of-Memory (FOOM) and Background Out-Of-Memory (BOOM) crashes:

While this clever and systematic approach seemed promising, it is also reputed to be somewhat unreliable and prone to false positives, which we suspect might be the reason why Google decided not to port this feature to Crashlytics. Therefore, we decided at Compass to look for other ways to increase visibility into OOM crashes.

For completeness, we decided to include this section for readers who might be interested in exploring this option for their app. Note that an Objective-C library implementing this logic is available here.

Track low memory warnings

Instead, keeping in mind that memory warnings are often fired shortly before the app is terminated by the system, we decided to track them in order to gain more visibility into memory pressure issues:
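A minimal sketch of such tracking, assuming a hypothetical `logAnalyticsEvent` helper; the footprint is read via the Mach `task_vm_info` API:

```swift
import UIKit

// Placeholder for your analytics client (assumption, not Compass's actual API).
func logAnalyticsEvent(_ name: String, properties: [String: String]) { /* ... */ }

// Read the app's physical memory footprint, the figure the system uses
// when judging memory pressure.
func currentMemoryFootprintMB() -> Double? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(
        MemoryLayout<task_vm_info_data_t>.size / MemoryLayout<integer_t>.size)
    let result = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    guard result == KERN_SUCCESS else { return nil }
    return Double(info.phys_footprint) / (1024 * 1024)
}

// Fire an analytics event every time the system emits a low memory warning.
func startTrackingMemoryWarnings() {
    NotificationCenter.default.addObserver(
        forName: UIApplication.didReceiveMemoryWarningNotification,
        object: nil, queue: .main
    ) { _ in
        let footprint = currentMemoryFootprintMB().map { "\(Int($0)) MB" } ?? "unknown"
        logAnalyticsEvent("low_memory_warning", properties: ["memory": footprint])
    }
}
```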

It will produce a dictionary like ["memory": "7080 MB"] that can be logged as an Analytics event, or can be persisted locally and sent to your server later (e.g. next time the app is launched). In the latter case, it can be a good idea to tie this to a user id in order to know the number of users impacted.

Pros:

  • Increases visibility on number of users impacted by memory pressure and how many times it happens every day
  • Can populate an “app health” dashboard and show evolution and potentially spikes over time

Cons:

  • Low memory warnings don’t always result in a crash
  • Low memory warnings are not always triggered before OOM crash
  • Memory stress triggering this warning can be caused by another app

This Analytics event, triggered every time the warning is fired, enabled us to draw the following graph showing the # of memory warnings per day.

It is important to remember that tracking low memory warnings is not an exact science (see the cons mentioned above). However, monitored over a long period of time, it should still help us identify curves, trends, and potentially spikes revealing new OOM crashes.

Raw data represented in graph (# events per day)

Despite fixing a large number of potential OOM causes, and despite our users reporting app reliability improvements, we could not see an obvious downward trend in crashes within almost 2 months of recorded data. This approach has not proved groundbreaking so far; giving it more time might be needed to reveal higher-level trends.

Improvement ideas:

  • Log the tree of UIViewController subclasses currently alive in the navigation stack when the warning is triggered, in order to gain more granular insight into where the memory pressure possibly comes from
  • Alternatively, we could fire this event from didReceiveMemoryWarning() in UIViewController, passing the name of the controller in the list of properties. All controllers alive when the warning is triggered will fire the event. The events can then be sorted by controller count, allowing us to investigate the ones ranking at the top of the list
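The second idea can be sketched as a base class that every screen inherits from (the `logAnalyticsEvent` helper and event name are illustrative):

```swift
import UIKit

// Placeholder for your analytics client (assumption, not Compass's actual API).
func logAnalyticsEvent(_ name: String, properties: [String: String]) { /* ... */ }

// Every controller alive when the system fires a memory warning reports itself,
// so events can later be grouped and ranked by controller name.
class TrackedViewController: UIViewController {
    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        logAnalyticsEvent("low_memory_warning",
                          properties: ["controller": String(describing: type(of: self))])
    }
}
```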

Track retain cycles

Another metric we decided to track is the number of retain cycles in our app. We know that retain cycles are the main root cause of memory leaks and OOM crashes. There is consequently a suggested correlation between the number of retain cycles and the number of OOM crashes. Tracking them and fixing all of them should, in theory, almost completely eradicate these crashes.

Every time a retain cycle is detected, the app sends an Analytics event that allowed us to draw the following chart. Monitoring this chart over time (especially after app releases) proved to be the most helpful way to check our memory health to date, helping us to identify and fix memory leaks.

Pros:

  • Tracks all retain cycles before the app crashes, no loss of data
  • Good level of insight into where the issue lives, by sending the name of the object as a property of the event directly

Cons:

  • Tracks only one root cause of OOM crashes: retain cycles. This doesn't cover caching, images, or any other potential source of memory pressure

We will describe later in this article how to catch retain cycles at runtime and trigger an Analytic event to create the above graph.

Crash Curation

While infinite loops can be relatively easy to reproduce, other crashes, like those due to leaks and retain cycles, can make it particularly challenging to identify the steps that trigger the faulty piece of code.

Catch retain cycles

Catching retain cycles is not a straightforward task: it is not automatic, and it is not well integrated into our everyday workflow. Instead, it requires you to put on your investigator hat. You will need to spend dedicated research time, monitor performance, and run Instruments and analyze their results, among other things.

At Compass, we have updated our workflow so that every developer can spend around half a day every sprint to track and fix retain cycles. Here are some of the tools we use:

Debug navigator:

When a View Controller is not deallocated, its memory footprint persists after you pop or dismiss it from the navigation stack. The debug navigator is a good tool to catch this behavior.

We pushed and popped a view controller known for running CPU-intensive operations, 4 times in a row. See our results below with and without a memory leak. In the former case, the Usage over time graph draws a beautiful "stairway" of memory usage, a typical smell of a retain cycle.

Retain Cycle:

Note the unhealthy percentage used of 180%. Note the 4 “stairs” of memory usage adding up every time a new view controller is loaded.

Normal:

Note the healthy and steady percentage used of 72%. Note the 4 spikes of memory usage going back to normal every time a new view controller is loaded.

Debug memory graph:

First presented at WWDC 2016 alongside Xcode 8, the debug memory graph is a formidable way to catch memory leaks.

At any point when running your app, open the debug memory graph and inspect the objects that are currently alive in memory (left panel). You will be looking for:

  • View Controllers that are no longer displayed on screen because they have been popped or dismissed
  • View Controllers with a count higher than 1, meaning that there are at this moment that many instances alive (they are retained)
  • This isn't restricted to View Controllers; don't hesitate to extend your search to other objects, notably those known for being heavy

Retain Cycle:

Note the (4) instances of the same View Controller retained alive simultaneously, and the closed loop in the right graph representing the reference cycle.

Normal:

Note the only (1) instance of the same View Controller retained alive, and the open graph not containing any closed reference cycle.

Instruments: Allocations

The Allocations instrument gives you detailed information about all the objects that are being created and the memory that backs them.

There are 2 tricks that will make your investigation super efficient:

  • Use the bottom left Filter text field to narrow down the results. Type in the name of your module to display only its specific objects (and filter out any obscure object owned by other modules). This is a fantastic way to highlight View Controllers only for example.
  • Check the # persistent column, corresponding to the number of instances currently retained in memory. Pay attention to any number greater than 1 and ask yourself: "does it make sense that several instances of this object are alive right now?"

Retain Cycle:

Use the filter text field at the bottom left to narrow down results. Notice once again the “stairs” of allocations, as well as the count of 4 persistent view controllers.

Normal:

Note there are no “stairs” of allocations, and the count of 1 persistent view controller (even if opened/closed 4 times in a row)

Another great feature of Allocations is the ability to mark generations: every time you create a new generation, Instruments will tag all of the objects allocated since the last one. For example, you might want to check a screen that allocates a bunch of objects (e.g. UIViewControllers, views, and other custom classes) and expect them to be released when the screen is dismissed. “Generations” make it easy to narrow down to only the allocations marked between those two generations.

Retain Cycle:

Click the arrow next to any generation name to focus the analysis on just the objects created during that generation. Note the 2 view controllers still persistent after having been popped.

Normal:

Instruments: Memory Leaks

The Memory Leaks instrument is another great tool to catch retain cycles. Start recording, navigate through the flows of your app and it will alert you whenever it finds leaks. With a little bit of investigation, you will identify where those leaks are coming from and be able to fix them.

Use the bottom left Filter text field again to filter out on your app-specific objects. We recommend filtering on ViewController to pinpoint retain cycles.

Retain Cycle:

Use the bottom left Filter text field to narrow down to View Controllers (or other objects). Note the “stairs” of memory, and tool reported leaks after 45 seconds.

Normal:

The tool did not report any memory leak, note that there are no “stairs” of memory usage.

Note: it is recommended to run the instruments on a physical device and in Release mode, in order to test in conditions as close to reality as possible.

Implement deinit

Once you have a clearer idea of which View Controller (or other object) is being retained and responsible for a memory leak, you can manually implement its deinit method and set a breakpoint in order to check whether it is called when the object should be deallocated (e.g. on pop, dismiss, set to nil, etc.).

This will enable you to investigate the potential cause and iterate until the breakpoint is hit and the object is correctly deallocated.
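For example (the controller name is illustrative):

```swift
import UIKit

// If this log never appears after the screen is popped or dismissed,
// something is keeping the controller alive.
final class ListingDetailViewController: UIViewController {  // hypothetical screen
    deinit {
        print("\(type(of: self)) deallocated")  // set a breakpoint here
    }
}
```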

With all those mechanisms in place, we have been able to catch over 100 retain cycles forcing 27 view controllers and other heavy objects to be retained in memory indefinitely. We caught them in about 3 months' time. Some of them had been there for over a year. While it is not possible to assess how many crashes this group of leaks caused, it has been confirmed that some of them could crash the app by simply opening the "infected" View Controller 8–10 times!

Crash Prevention

OK, finding and fixing existing memory leaks is great. But this left us with a sour taste at Compass: we didn't feel fully satisfied, and we didn't want to stop there. Everyone knows the expression "better safe than sorry". We decided to follow this wise phrase and come up with ways to detect any potential cause of OOM at development time, before the application gets into the hands of our users.

Catch retain cycles

1 - Our most efficient way to catch a wide range of retain cycles was to create a MemoryChecker class. It implements a static function that checks if a given object has been deallocated within x seconds after the function was called. If the object has not been deallocated, there is a retain cycle. If the application is running in DEBUG then a fatalError will crash it and log the name of the culprit class:
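A minimal sketch of such a checker; the names and the 2-second delay are illustrative choices, not Compass's exact implementation:

```swift
import UIKit

enum MemoryChecker {
    // Verify that `object` gets deallocated within `delay` seconds.
    // The closure captures it weakly so the check itself doesn't retain it.
    static func verifyDeallocation(of object: AnyObject,
                                   after delay: TimeInterval = 2.0) {
        let className = String(describing: type(of: object))
        DispatchQueue.main.asyncAfter(deadline: .now() + delay) { [weak object] in
            if object != nil {
                #if DEBUG
                fatalError("\(className) is leaking: still alive \(delay)s after it should have been deallocated")
                #endif
            }
        }
    }
}
```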

This function can be called in various relevant and useful places such as:

a) Override UINavigationController and check if a UIViewController is properly deallocated on pop:
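A sketch of this override, assuming a `MemoryChecker.verifyDeallocation(of:)` helper like the one described above:

```swift
import UIKit

// Every popped view controller is expected to deallocate shortly after the pop.
class LeakCheckingNavigationController: UINavigationController {
    override func popViewController(animated: Bool) -> UIViewController? {
        let popped = super.popViewController(animated: animated)
        if let popped = popped {
            MemoryChecker.verifyDeallocation(of: popped)
        }
        return popped
    }
}
```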

b) Override UINavigationController and check if all of its viewControllers are properly deallocated when the Navigation Controller itself is deallocated (e.g. on dismiss, when set to nil, etc.)

This technique is only possible because UINavigationController does not keep a strong reference to view controllers once they have been popped (see details here).

c) Override UITableView and check if its cells are properly deallocated when the view is deallocated (e.g. the View Controller containing it is deallocated):
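A sketch of this idea, including a `WeakBox` wrapper so the table view doesn't retain its cells; it assumes the same hypothetical `MemoryChecker` helper:

```swift
import UIKit

// Holds a weak reference so tracking a cell doesn't extend its lifetime.
final class WeakBox<T: AnyObject> {
    weak var value: T?
    init(_ value: T) { self.value = value }
}

class LeakCheckingTableView: UITableView {
    private var allCells: [WeakBox<UITableViewCell>] = []

    override func dequeueReusableCell(withIdentifier identifier: String,
                                      for indexPath: IndexPath) -> UITableViewCell {
        let cell = super.dequeueReusableCell(withIdentifier: identifier, for: indexPath)
        allCells.append(WeakBox(cell))
        return cell
    }

    deinit {
        // When the table view goes away, its cells should follow shortly.
        allCells.compactMap { $0.value }
                .forEach { MemoryChecker.verifyDeallocation(of: $0) }
    }
}
```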

Naturally, this also applies to UICollectionView:

Note that we had to create a wrapper WeakBox object keeping a weak reference to each cell contained in the new allCells property we created, so that it doesn't increment their reference count.

Note: the above examples are what we implemented in our codebase, but this list is of course not exhaustive and you might find different cases that are more appropriate for your app.

2 - Set a useful Xcode breakpoint that plays a pop sound (and prints a log) every time a View Controller is deallocated:

Source: Cédric Luthi who shared this awesome trick with the community

3 - Manually implement deinit and print a log when it is called. If we ever find a situation where the log is not displayed, something odd is happening. Note that a breakpoint can be manually set on this line as well.

4 - Unit test leaks. While this technique might not cover all situations your View Controller can go through, it is a good way to automate catching retain cycles and integrate it into your CI pipeline:
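A sketch of such a test; the controller under test and the helper name are illustrative:

```swift
import XCTest
import UIKit

final class ListingViewController: UIViewController {}  // hypothetical screen under test

final class LeakTests: XCTestCase {
    // Create the object inside an autorelease pool, keep only a weak reference,
    // then assert that releasing the strong reference deallocated it.
    func assertDeallocated(_ makeObject: () -> AnyObject,
                           file: StaticString = #file, line: UInt = #line) {
        weak var weakRef: AnyObject?
        autoreleasepool {
            let object = makeObject()
            weakRef = object
            // Give view controllers a normal lifecycle before releasing them.
            (object as? UIViewController)?.loadViewIfNeeded()
        }
        XCTAssertNil(weakRef, "Object was retained (possible retain cycle)",
                     file: file, line: line)
    }

    func testListingViewControllerDoesNotLeak() {
        assertDeallocated { ListingViewController() }
    }
}
```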

5 - Linters relying on regexes can help you detect some of the retain cycles at compile time, for instance delegate variable declarations that are not annotated weak.

6 - Use retain cycle tracker libraries like LifetimeTracker. They detect leaks at runtime and make them visible, provided you know how many instances of a certain object should be alive at a time. While known for being a great tool to have running in development mode, we did not take this approach at Compass.

Use caching cautiously

If you don’t need to implement custom caching, Apple recommends using NSCache since it internally implements smart mechanisms that will purge your cache when the system is running out of memory.

If your app requires a more sophisticated caching implementation, be mindful of setting explicit limits on the amount of data that can be persisted, and implement some logic to purge old objects so this limit is never exceeded.
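For example, `NSCache` exposes `countLimit` and `totalCostLimit` for exactly this purpose (the numbers below are arbitrary):

```swift
import UIKit

let imageCache = NSCache<NSString, UIImage>()
imageCache.countLimit = 100                    // at most 100 images
imageCache.totalCostLimit = 50 * 1024 * 1024   // ~50 MB, honored when costs are provided

// Provide a cost on insertion so totalCostLimit is meaningful.
func cache(_ image: UIImage, forKey key: NSString) {
    // Rough decoded size: pixel width × pixel height × 4 bytes (sRGB).
    let cost = Int(image.size.width * image.scale * image.size.height * image.scale * 4)
    imageCache.setObject(image, forKey: key, cost: cost)
}
```

Note that `NSCache` also evicts objects on its own under memory pressure, so the limits are upper bounds rather than guarantees.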

Be mindful when loading images

If your app needs to display lots of large, high resolution images, keep in mind that the image buffer used to render them on screen could generate a high memory consumption.

There are multiple ways to lower the memory impact of displaying images. This excellent article describes a number of these alternatives in detail.

Add defensive code

While fixing the root cause of a potential issue should always be preferred, it can be wise to add client defensive code around some logic that you know might be at risk.

For instance, before adding annotations to a map, check that their number is not so excessively high that it could cause a crash. Handle the situation differently with defensive code whenever needed (clustering, an error message, displaying only a subset, etc.).
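A hedged sketch of such a guard; the threshold and the fallback strategy are illustrative:

```swift
import MapKit

let maxAnnotations = 500  // arbitrary threshold for this example

func add(_ annotations: [MKAnnotation], to mapView: MKMapView) {
    guard annotations.count <= maxAnnotations else {
        // Fallback: display a subset only (clustering or an
        // error message are equally valid strategies).
        mapView.addAnnotations(Array(annotations.prefix(maxAnnotations)))
        return
    }
    mapView.addAnnotations(annotations)
}
```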

Handle low memory warnings

This can be another good place to purge some cached data getting out of control or restrict an overuse of image processing:

1 - Override applicationDidReceiveMemoryWarning in your app delegate to be notified at the application level when your app places too much pressure on the system memory

2 - Override didReceiveMemoryWarning in any view controller subclass for more targeted notifications and local handling, especially if you know of any particular view controller that may cause problems

3 - Subscribe to UIApplicationDidReceiveMemoryWarningNotification to catch memory warnings from any other place in your app
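The three hooks above can be sketched side by side (the gallery controller and the purge logic are illustrative):

```swift
import UIKit

// 1 - Application level: purge app-wide caches here.
class AppDelegate: UIResponder, UIApplicationDelegate {
    func applicationDidReceiveMemoryWarning(_ application: UIApplication) {
        // e.g. imageCache.removeAllObjects()
    }
}

// 2 - Controller level: drop this screen's expensive resources.
class GalleryViewController: UIViewController {   // hypothetical screen
    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // e.g. release prefetched full-resolution images
    }
}

// 3 - From anywhere else in the app.
func observeMemoryWarnings() {
    NotificationCenter.default.addObserver(
        forName: UIApplication.didReceiveMemoryWarningNotification,
        object: nil, queue: .main
    ) { _ in
        // purge caches, cancel prefetching, etc.
    }
}
```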

Test on actual device

While using the iOS simulator is convenient in our daily workflow, keep in mind that it is still a simulator. It may not accurately represent memory use in your system and how it might perform under memory pressure on a real device.

In fact, the Simulator uses Mac hardware and memory. You can simulate a “memory warning” to test if your response to that warning behaves correctly. Other than that, you should really be testing memory use on a physical device.

For instance, we created a View Controller performing extremely memory-intensive allocations. We purposely introduced a retain cycle and opened it several times on the simulator vs. an actual device. In the former case, the Mac got very slow but the app on the simulator kept running. In the latter case, the app crashed.

Conclusion

OOM crashes were a blind spot for us because there is no formal way of observing these events and their frequency. No one likes it when an app suddenly closes. With some tooling, research and a bit of cleverness, we were able to make our app more reliable and ensure that it wouldn’t suddenly close while you were opening a web view to an interesting article — like this one!

Compass is hiring! Discover more at https://www.compass.com/careers/
