The new gold standard for iOS releases: 99.99% crash-free 🏆

Eric Coffey Woods
Published in Turo Engineering
8 min read · Dec 20, 2018

I’ve had the pleasure of working on the Turo iOS app ever since version 1.0 was released on the App Store back in September 2012. A lot has changed in that time, but I am particularly proud of the progress the iOS team has made over the past year in improving our crash rates.

About a year ago, the iOS team was a strong but lean 5 engineers, and our crash rate was steadily improving (from around 99% crash-free sessions to around 99.5% crash-free sessions). One proposed goal for the team was to codify our new low crash rate for all new releases:

99.5% crash-free sessions, 99% crash-free users (last 7 days)

I shared this quote with the iOS team recently, because in hindsight this goal seems remarkably facile. I’d like to walk through how we were able to not just meet but surpass these goals, and take a deep dive into a particularly tricky crash we had to track down this year.

Code review

Preventing crashes often starts with quality code review. It is the responsibility of both the developer of a feature and the code reviewers to ensure that we aren’t introducing new crashes. This isn’t limited to crashes today; it includes crashes in the future. Engineers need to consider future uses of their code, and how our API responses might change over time. Defensive programming is the key to a sustainable codebase.

These are a few questions we often ask each other during code review:

  • Should this be a weak reference? Could it lead to retain cycles, especially with delegates and completion blocks?
  • Could this variable ever be nil? How do we handle that case gracefully?
  • Is there duplicated logic that should be abstracted?
  • Magic numbers: can this be defined as a constant? Or could you make use of NS_ENUM? The act of giving a name to a constant can often shed light on the fragility of code, or provide clearer context to the logic.
  • Literal strings: for KVO you could make use of NSStringFromSelector. Can we use NSStringFromClass as a storyboard identifier? Is this user-facing, and if so, should it be localized?
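
A few of these questions are easier to see in code. The sketch below is in Swift rather than the Objective-C idioms named above (the same ideas apply with __weak and NS_ENUM), and every type in it (TripStatus, TripLoader, TripScreen) is hypothetical, made up for illustration rather than taken from the Turo codebase.

```swift
// Hypothetical types for illustration only.

// A named enum instead of magic numbers such as 0, 1, 2.
enum TripStatus: Int {
    case booked = 0
    case inProgress = 1
    case completed = 2
}

// "Could this ever be nil?" A missing or unknown raw value from the API
// produces a fallback label here instead of a crash.
func statusLabel(forRawValue rawValue: Int?) -> String {
    guard let rawValue = rawValue, let status = TripStatus(rawValue: rawValue) else {
        return "Unknown"
    }
    switch status {
    case .booked: return "Booked"
    case .inProgress: return "In progress"
    case .completed: return "Completed"
    }
}

// Stand-in for a networking object that stores its completion handler.
final class TripLoader {
    var completion: ((Int?) -> Void)?
    func finish(with rawStatus: Int?) { completion?(rawStatus) }
}

final class TripScreen {
    let loader = TripLoader()
    private(set) var title = ""

    func load() {
        // The screen owns the loader and the loader owns this closure,
        // so capturing self strongly here would form a retain cycle.
        loader.completion = { [weak self] rawStatus in
            guard let self = self else { return }
            self.title = statusLabel(forRawValue: rawStatus)
        }
    }
}
```

The weak capture means the closure does nothing if the screen has already gone away, and the enum lookup turns an unexpected server value into a visible fallback that a reviewer can question.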

Read more about how to Polish Your PRs 💅 + Rev Up Your Reviews by my colleague Catherine Patchell.

Monitoring

We’ve been using Crashlytics to monitor crash rates, and have been tracking our progress informally over the years. All of our iOS engineers share the responsibility to check in on the latest release, especially to see if the features they’ve built introduced any new crashes.

One of the best tools Apple gives us for monitoring crashes is the Phased Release. This allows us to monitor releases before they reach all of our existing users, and pause the release if we see anything problematic.

More than just crashes, we also monitor warnings about unexpected conditions. Many times the easiest fix for a crash may leave the user in a weird or undesirable state, such as unexpectedly logging out the user, or displaying an error message. This might be better than a crash, but it’s still important to track how the user got into this state, and how often it’s happening. Tools like Crashlytics and New Relic both support these kinds of reports, so consider tracking these situations when adding a crash fix. Sometimes they can even give you the breadcrumbs necessary to track down tricky crashes.
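
As a rough sketch of that idea: instead of crashing on an unexpected condition, recover and record a non-fatal report so its frequency stays visible. NonFatalReporter, InMemoryReporter, SessionState and resolveSession are hypothetical names; no real Crashlytics or New Relic API is shown here, and a real app would forward the call to whichever SDK it uses.

```swift
// Hypothetical reporting interface; a real app would forward these
// calls to its crash reporting SDK instead of keeping them in memory.
protocol NonFatalReporter {
    func record(_ message: String, context: [String: String])
}

final class InMemoryReporter: NonFatalReporter {
    private(set) var reports: [(message: String, context: [String: String])] = []

    func record(_ message: String, context: [String: String]) {
        reports.append((message: message, context: context))
    }
}

enum SessionState: Equatable {
    case loggedIn(userID: String)
    case loggedOut
}

// A missing token for a known user is an unexpected condition.
// Rather than force-unwrapping (and crashing), fall back to logged out
// and leave a breadcrumb describing where it happened.
func resolveSession(userID: String?,
                    token: String?,
                    source: String,
                    reporter: NonFatalReporter) -> SessionState {
    guard let userID = userID else { return .loggedOut }
    guard token != nil else {
        reporter.record("Missing auth token for logged-in user",
                        context: ["source": source])
        return .loggedOut
    }
    return .loggedIn(userID: userID)
}
```

Keeping the reporter behind a small protocol is one possible design choice: the recovery path can be unit tested without the SDK, and the recorded context is what later serves as the breadcrumb.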

Picking your battles

I think a key part of actually fixing crashes is having empathy for the user. A crash is really one of the worst possible user experiences you can have on a mobile app, but not all crashes are created equal. If your app crashes in the background, it’s possible the user doesn’t even realize it.

You also need to assess where in the flow the user crashed. Was it critical for revenue, like checkout? Did the same user continue crashing as they tried the same action again and again? Signals like these are important in prioritizing which crashes to work on first.

UIFeedbackGenerator

Sometimes crashes actually aren’t under your own control. By summer 2018, one of our biggest remaining crashes was attributed to UIFeedbackGenerator.

Exception raised while auto-deactivating <UIImpactFeedbackGenerator: 0x1d4314010: prepared=0> for style 2: force deactivating <UIImpactFeedbackGenerator: 0x1d4314010: prepared=0> with style TurnOn which is not active (activationCount = -1) configuration: <_UIImpactFeedbackGeneratorConfiguration: 0x1d46a7da0: isEnabled=1, activationStyle=2, requiredSupportLevel=2> activationCount: -1, styleActivationCount: -1 engines: {( <_UIFeedbackHapticOnlyEngine: 0x1d02d6dc0: state=4, numberOfClients=1, prewarmCount=0, _isSuspended=0> )}

This was quite mysterious, given that we don’t use UIImpactFeedbackGenerator at all, and neither do any of our third party libraries.

Fatal Exception: NSInternalInconsistencyException
0 CoreFoundation 0x183f3ed8c __exceptionPreprocess
1 libobjc.A.dylib 0x1830f85ec objc_exception_throw
2 CoreFoundation 0x183f3ec6c -[NSException initWithCoder:]
3 UIKit 0x18e58fec0 -[UIFeedbackGenerator _autoDeactivate]
4 UIKit 0x18e58fcfc __48-[UIFeedbackGenerator _setupAutoDeactivateTimer]_block_invoke
5 libdispatch.dylib 0x183830ae4 _dispatch_client_callout
6 libdispatch.dylib 0x18386d7a8 _dispatch_continuation_pop$VARIANT$armv81
7 libdispatch.dylib 0x183876c20 _dispatch_source_invoke$VARIANT$armv81
8 libdispatch.dylib 0x183871c44 _dispatch_main_queue_callback_4CF$VARIANT$armv81
9 CoreFoundation 0x183ee7070 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__
10 CoreFoundation 0x183ee4bc8 __CFRunLoopRun
11 CoreFoundation 0x183e04da8 CFRunLoopRunSpecific
12 GraphicsServices 0x185de7020 GSEventRunModal
13 UIKit 0x18dde578c UIApplicationMain
14 Turo 0x1041a8ff4 main (main.m:37)
15 libdyld.dylib 0x183895fc0 start

After hitting our heads against the wall with this UIFeedbackGenerator crash, we noticed during the iOS 12 beta period that no crashes were being reported on the new OS. And once iOS 12 had widespread adoption this fall, we were able to confirm that this crash was no longer an issue. Although it’s easy to assume a crash isn’t your fault, it’s important to not get too complacent in that mindset, as our next example will show.

ERROR_CGDataProvider_BufferIsNotReadable

The biggest crash we had in our app was fairly mysterious for quite a while. We were never able to reproduce it locally, and because our own code was nowhere to be found in the stack trace, we had been attributing it to Apple:

0 CoreGraphics 0x183b58828 ERROR_CGDataProvider_BufferIsNotReadable + 16
1 CoreGraphics 0x183b58548 CGDataProviderRetainBytePtr + 216
2 QuartzCore 0x186226c20 CA::Render::(anonymous namespace)::create_image_from_image_data(CGImage*, CGColorSpace*, unsigned int, unsigned int, double) + 196
3 QuartzCore 0x186224e58 CA::Render::create_image(CGImage*, CGColorSpace*, unsigned int, double) + 868
4 QuartzCore 0x1862279e0 CA::Render::copy_image(CGImage*, CGColorSpace*, unsigned int, double, double) + 472
5 QuartzCore 0x186227da4 CA::Render::prepare_image(CGImage*, CGColorSpace*, unsigned int, double) + 20
6 QuartzCore 0x186337b0c CA::Layer::prepare_commit(CA::Transaction*) + 420
7 QuartzCore 0x186299ba0 CA::Context::commit_transaction(CA::Transaction*) + 576
8 QuartzCore 0x1862c15d0 CA::Transaction::commit() + 580
9 QuartzCore 0x1862c2450 CA::Transaction::observer_callback(__CFRunLoopObserver*, unsigned long, void*) + 92
10 CoreFoundation 0x18215a910 __CFRUNLOOP_IS_CALLING_OUT_TO_AN_OBSERVER_CALLBACK_FUNCTION__ + 32
11 CoreFoundation 0x182158238 __CFRunLoopDoObservers + 412
12 CoreFoundation 0x182158884 __CFRunLoopRun + 1436
13 CoreFoundation 0x182078da8 CFRunLoopRunSpecific + 552
14 GraphicsServices 0x18405b020 GSEventRunModal + 100
15 UIKit 0x18c05978c UIApplicationMain + 236
16 Turo 0x1020340a4 main (main.m:37)
17 libdyld.dylib 0x181b09fc0 start + 4

A few members of our iOS team were lucky enough to make it to WWDC this year, where Apple sets up engineering hours with their employees to help debug issues like this. This is an extremely valuable resource, so prepare questions about specific issues you’d like to get resolved.

With the stack trace above in hand, we met with engineers from Apple’s CoreGraphics team, who walked through our crash logs with us. They verified that, although the stack trace contained some calls to copy_image, our image loading calls looked safe, and suggested that the real cause must be a memory leak somewhere else that was causing corrupted data.

The advice from Apple’s engineers had given us some important hints, but it was up to us to find the fix. Using Instruments we had detected a few memory leaks when displaying vehicle information to travelers, but we hadn’t yet dedicated the resources to track them down. This is one of the most complicated parts of our app, with over a dozen ViewControllers working in conjunction to display information about vehicles.

At WWDC we also had the opportunity to learn more about Xcode’s visual debugging tools. Instruments’ Leaks is great for determining where in the app leaks are happening, but to find the actual cause of the issue it seemed that the Memory Graph would be the best tool for the job.

Watch out for those purple alerts!

After pausing the application with Memory Graph, I was able to find a complicated retain cycle in one of our networking calls, seen in this list of NSOperations on the left, where Xcode calls out retain cycles with the purple exclamation mark.

Below you can see all the strong references these NSBlockOperations have to one another.

Retain cycle!
Stack trace in the inspector

Using the inspector on the right, we can see the stack trace of where this memory was allocated. As you can see above, the reference count of the NSOperation will never reach zero, so this memory can never be freed by ARC, resulting in a leak.

Using a weak reference to the operation, instead of the default strong reference, unties the retain cycle, and thousands of crashes were solved by changing a couple of lines of code.
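
The shape of that fix can be sketched as follows. This is a simplified Swift illustration and not the actual Turo networking code: BlockTask is a small hypothetical class standing in for an NSBlockOperation whose block refers back to the operation itself, so the example stays deterministic without an operation queue.

```swift
// Hypothetical stand-in for an NSBlockOperation: it owns its block.
final class BlockTask {
    var isCancelled = false
    var block: (() -> Void)?
    func run() { block?() }
}

// Leaky version: the block captures `task` strongly, and `task` owns the
// block, so the reference count can never drop to zero.
func makeLeakyTask() -> BlockTask {
    let task = BlockTask()
    task.block = {
        if task.isCancelled { return }
        // ... do the work ...
    }
    return task
}

// Fixed version: capturing the task weakly breaks the cycle.
func makeFixedTask() -> BlockTask {
    let task = BlockTask()
    task.block = { [weak task] in
        guard let task = task, !task.isCancelled else { return }
        // ... do the work ...
    }
    return task
}

// Returns true if the task produced by `make` is deallocated once the
// last outside reference to it goes away.
func isReleased(_ make: () -> BlockTask) -> Bool {
    weak var probe: BlockTask?
    do {
        let task = make()
        task.run()
        probe = task
    }
    return probe == nil
}
```

Under these assumptions, isReleased(makeFixedTask) is true and isReleased(makeLeakyTask) is false. The only difference between the two factories is the [weak task] capture list, which mirrors the couple-of-lines nature of the real fix.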

For more info on Xcode’s Memory Graph, check out the video on Visual Debugging from WWDC 2016 (at 24:45) and Pete Smith’s blog post.

Third party libraries

One last point about crash rates I wanted to touch upon is dealing with third party libraries.

For open source projects, it’s important to make sure you’re picking libraries with frequent active development that will keep up with new iOS releases and respond to issues that developers find. The other great thing about open source is that you can fix the issue yourself and contribute back to the repository. As part of this effort, we had to cut ties with an image browsing library that had gone stale, had not been updated in a couple of years, and wasn’t fully supporting iPhone X. Ultimately we ended up building our image browser ourselves, which also gave us a lot more flexibility on the UX side.

Third party vendors can sometimes be even more of a challenge, since you have no access to their source code, but we had great success working with ThreatMetrix earlier this fall. We had been using an outdated version of their SDK, and although the upgrade fixed most of the crashes we were seeing, there were a couple of new ones introduced in the newest version. We sent crash logs over to their engineering support team, and they were able to have a new release for us within a week that fixed all of our issues!

Results

We’re pleased to announce our new gold standard is > 99.9% crash-free sessions and > 99.95% crash-free users over the past 7 days for our new releases:

So many 9s!

This has significantly increased the signal-to-noise ratio — allowing new issues to appear on our radar very quickly without getting lost in a sea of known issues. We can then pause a phased release before autoupdates have spread to the majority of our users, assess the severity, and fix the problem quickly. We’re continuing to raise the bar, and pushing further towards 99.99% crash-free with every release.
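
To put those percentages in perspective, here is a small sketch of the arithmetic, assuming the crash-free session rate is simply the share of sessions that did not end in a crash. The function names and the 100,000-session volume are made up for illustration and are not Turo's actual numbers.

```swift
// Share of sessions that did not end in a crash, as a percentage.
func crashFreeRate(totalSessions: Int, crashedSessions: Int) -> Double {
    guard totalSessions > 0 else { return 100.0 }
    return 100.0 * Double(totalSessions - crashedSessions) / Double(totalSessions)
}

// How many crashed sessions a given crash-free percentage allows
// for a given session volume, rounded to the nearest whole session.
func allowedCrashes(totalSessions: Int, crashFreePercent: Double) -> Int {
    let crashed = Double(totalSessions) * (100.0 - crashFreePercent) / 100.0
    return Int(crashed.rounded())
}
```

For a hypothetical 100,000 sessions, 99.5% crash-free allows 500 crashed sessions, 99.9% allows 100, and 99.99% allows 10, which is why each extra 9 makes a newly introduced crash stand out so much more clearly.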
