App Zombies: Feeding the iOS Deadlock

Dan Post
Motiv Engineering
4 min read · Apr 26, 2018

Hard things are hard

Prepare yourself. This is one of the biggest horrors of Motiv app engineers, the bane of our existence:

The app, not being useful.

We call this the “infinite spinner.” That, to be fair, is technically incorrect, as none of us have waited past the heat death of the universe to see if it resolves. It’s much more likely to be starvation, deadlock, or livelock.

We almost never see this issue. That doesn’t negate the user pain of having to wait, then decide it’s not going to fix itself, kill the app, launch it (waiting for the OS to completely evict the prior process), wait for it to load, and manually sync the ring and wait for the data to transfer and process, since the starvation probably prevented automatic background syncs from occurring.

This leads us to problem two: since the user kills the app, and it is not detected by a crash reporter, we have zero visibility into the frequency or root cause of this problem. Zilch. Nada. No visibility means prioritizing it and debugging it is impossible!

But wait. I have a screenshot of the behavior, and know that it happened. It turns out that seeing this on an alpha build and knowing what to do are actually sufficient! We don’t have to know the size of the iceberg at first; just knowing that there is one helps us steer around it and bring other specialty tools to bear. Let’s look at a few different doors.

Door 0: Avoid the problem

Some architectures lend themselves to reducing certain classes of human errors.

For example, statically typed languages catch at compile time many kinds of bugs that would be runtime bugs in dynamic languages. Some environments may offer thread safety analysis.

However, exhaustively proving that every instance of this class of problems has been eradicated is, in general, most likely equivalent to the halting problem, which makes it undecidable rather than merely expensive. Additionally, rewriting an entire system to solve one bug is a tough and highly speculative investment to make. (What if it doesn’t solve the problem?) So let’s leave this door closed for now.

Door 1: Sysdiagnose

There is a very useful, but little-known, system feature in iOS called sysdiagnose. You trigger it on recent devices by briefly pressing both volume buttons and the side (sleep/wake) button at the same time. Your phone should vibrate, indicating sysdiagnose has done its job. There are articles on how to trigger sysdiagnose, but nothing about how to get value out of it as a developer.

Sysdiagnose writes some files named stacks-YYYY-MM-DD-TTTTTT.ips. Once you sync with iTunes, you’ll find them in ~/Library/Logs/CrashReporter/MobileDevice/DEVICENAME/. These files contain a record of every single process in the system, each thread for each process, and each stack frame in each thread! Sounds useful.

Here’s the downside: they only contain program counter offsets. Useless!

However, if you’ve been a good iOS engineer, you’ve archived all of your dSYMs from every alpha and external build somewhere permanent (in our case, in an S3 bucket).

Enter symbolicate-stackshot: a simple Ruby command-line utility to transmogrify your stack file plus the dSYM into something human-readable! Now, visibility!
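To make "symbolication" a bit more concrete, here is a minimal conceptual sketch in Swift. It is not how symbolicate-stackshot is implemented. It only illustrates the core lookup: subtract the binary's load address from a frame address, then find the last symbol that starts at or before that offset. The Symbol type, the table entries, and both addresses are made up for illustration.

```swift
/// One entry of a (hypothetical) symbol table: where a function starts
/// inside the binary, and what it is called.
struct Symbol {
    let start: UInt64
    let name: String
}

/// Returns the last symbol that starts at or before `offset`.
/// `symbols` must be sorted by `start` in ascending order.
func symbolicate(offset: UInt64, in symbols: [Symbol]) -> Symbol? {
    // Binary search for the first symbol that starts *after* the offset.
    var low = 0
    var high = symbols.count
    while low < high {
        let mid = (low + high) / 2
        if symbols[mid].start <= offset {
            low = mid + 1
        } else {
            high = mid
        }
    }
    // The symbol just before that position is the one we want, if any.
    return low == 0 ? nil : symbols[low - 1]
}

// Made-up table and addresses, purely for illustration.
let table = [
    Symbol(start: 0x1000, name: "functionA"),
    Symbol(start: 0x2400, name: "functionB"),
    Symbol(start: 0x2e00, name: "functionC"),
]
let loadAddress: UInt64 = 0x1_0000_0000   // assumed image load address
let frameAddress: UInt64 = 0x1_0000_2498  // assumed program counter value
let match = symbolicate(offset: frameAddress - loadAddress, in: table)
print(match?.name ?? "<unknown>")
```

A real symbolicator does more than this: it reads the table from the dSYM, checks symbol sizes, and uses debug line information to recover file names and line numbers.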

Here is one of several threads stuck in similar places:

    *******************************************************************
    Process 1150
    ========================================
    Thread 843020
    0x000000010097d7c4 (in Motiv)
    0x0000000100802498 (in Motiv)
    0x0000000100802e88 (in Motiv)
    0x00000001007f4b44 (in Motiv)
    1836ceb4c
    @objc ReadableMotivModelContext.performAtomicallySerially(readBlock:) (in Motiv) (ReadableMotivModelContext.swift:0)
    -[MotivModelController performReadOnly:] (in Motiv) (MotivModelController.m:214)
    +[MVDailyTargetService weeklyTargetForWeekContaingDate:forType:] (in Motiv) (MVDailyTargetService.m:81)

In our case, the primary failure revealed an issue with our CoreData setup that caused starvation of the original main thread and temporary saturation of background threads. Embarrassingly, it also caused several operations that should have been durable/synchronous database updates to be asynchronous.

From the commit message from one of our developers:

A simple execution of the following code would cause a deadlock

    for _ in 0...50 {
        DispatchQueue.global().async {
            MotivModelController.instance.performAtomically { (context) in
                // insert some object
            }
        }
    }

The awkwardly small diff:

    - parentContext.perform {
    + parentContext.performAndWait {
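Why does one word matter so much? The sketch below is a plain Grand Central Dispatch analogue, not our Core Data code: queue.async stands in for perform (schedule the block and return right away) and queue.sync stands in for performAndWait (block the caller until the block has run). The EventLog helper, the queue labels, and the function names are all invented for this example.

```swift
import Foundation
import Dispatch

/// Tiny thread-safe event log so we can observe ordering across queues.
final class EventLog: @unchecked Sendable {
    private let lock = NSLock()
    private var events: [String] = []

    func append(_ event: String) {
        lock.lock(); defer { lock.unlock() }
        events.append(event)
    }

    var snapshot: [String] {
        lock.lock(); defer { lock.unlock() }
        return events
    }
}

/// Analogue of perform: the caller continues before the "save" has happened.
func saveWithAsync() -> [String] {
    let queue = DispatchQueue(label: "example.context.async")
    let log = EventLog()
    let finished = DispatchSemaphore(value: 0)
    queue.async {
        Thread.sleep(forTimeInterval: 0.05)  // simulate a slow write
        log.append("saved")
        finished.signal()
    }
    log.append("returned")  // runs while the write is still pending
    finished.wait()         // only here so the demo can inspect the final order
    return log.snapshot
}

/// Analogue of performAndWait: the caller resumes only after the "save".
func saveWithSync() -> [String] {
    let queue = DispatchQueue(label: "example.context.sync")
    let log = EventLog()
    queue.sync {
        Thread.sleep(forTimeInterval: 0.05)  // simulate a slow write
        log.append("saved")
    }
    log.append("returned")
    return log.snapshot
}

let asyncOrder = saveWithAsync()
let syncOrder = saveWithSync()
print(asyncOrder, syncOrder)
```

In the async variant the caller should log "returned" before "saved", which is exactly the situation where an update the caller believes is durable has not actually happened yet. In the sync variant the order flips.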

Door 2: Cerberus, the watchdog

The venerable watchdog is old hat to embedded programmers. The concept is simple: part of your system (usually strictly separated from the other parts of your system) keeps time, and when it reaches a predetermined time value, it resets the system. When your system is operating normally, it “kicks” the watchdog, which restarts the countdown. If something blocks the normal flow, the kicks stop, the timer expires, and your system automatically recovers.

The watchdog is usually implemented in hardware, though some systems have software watchdogs. Hybrid systems can be powerful; a software watchdog, triggered by a timer in ISR context, can save diagnostic info when it fires so you have information to debug from, and a hardware watchdog is your safety net from total and complete lockup. But I digress.

Let’s apply the watchdog concept to a mobile app.

For complex or top-level user operations (example: taps on ‘week’ view), we could start an NSThread that sleeps, then logs or crashes. Completion of the regular code path (data loading and refreshing the screen) would then cancel the watchdog thread.
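Here is a minimal sketch of that idea in Swift, under a few assumptions: the Watchdog and Recorder types, the operation names, and the timeouts are invented for this example and are not our production code. The watchdog thread sleeps for the timeout, then calls a handler unless the operation cancelled it first.

```swift
import Foundation

/// Fires `onTimeout` after `timeout` seconds unless `cancel()` is called first.
final class Watchdog: @unchecked Sendable {
    private let lock = NSLock()
    private var cancelled = false

    init(operation: String, timeout: TimeInterval,
         onTimeout: @escaping @Sendable (String) -> Void) {
        // The thread keeps the watchdog alive until it has slept and checked.
        let thread = Thread {
            Thread.sleep(forTimeInterval: timeout)
            if !self.isCancelled { onTimeout(operation) }
        }
        thread.start()
    }

    /// Call this when the regular code path has completed.
    func cancel() {
        lock.lock(); defer { lock.unlock() }
        cancelled = true
    }

    private var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }
}

/// Thread-safe list of operations that timed out (stands in for real logging).
final class Recorder: @unchecked Sendable {
    private let lock = NSLock()
    private var names: [String] = []

    func record(_ name: String) {
        lock.lock(); defer { lock.unlock() }
        names.append(name)
    }

    var all: [String] {
        lock.lock(); defer { lock.unlock() }
        return names
    }
}

func runWatchdogDemo() -> [String] {
    let recorder = Recorder()
    // This operation "completes" in time, so its watchdog stays silent.
    let fast = Watchdog(operation: "loadWeekView", timeout: 0.2) { recorder.record($0) }
    fast.cancel()
    // This operation never completes, so its watchdog fires.
    _ = Watchdog(operation: "syncRing", timeout: 0.2) { recorder.record($0) }
    Thread.sleep(forTimeInterval: 0.5)  // give the watchdog threads time to run
    return recorder.all
}

let timedOut = runWatchdogDemo()
print(timedOut)
```

Passing the handler in as a closure keeps the choice between logging and crashing out of the watchdog itself. There is still a small window where an operation can finish just as the watchdog wakes up, which is one more reason to start with logging only.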

We have selectively deployed a limited version of a watchdog that only logs initially. It brings visibility to problems that we would otherwise hear about only occasionally, through customer reports that are non-actionable because they come with no data to debug.

What to do if you make the watchdog too sensitive, and your users experience crashes far too frequently? Consider having a server-controlled parameter (feature flag) that switches the watchdog between logging and crashing. Also consider setting the timeout relatively high, as your users may navigate faster through your app than you anticipated or tested for, leading to a backlog of operations to complete!
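One way to wire up that switch, sketched in Swift under stated assumptions: the WatchdogMode enum, the "log" and "crash" values, and both function names are hypothetical, and how the value reaches the app from the server is left out entirely. Anything missing or unrecognized falls back to logging, so a bad flag value can never start crashing users.

```swift
/// What the watchdog does when it fires. The raw values are assumed to be
/// the strings a server-side flag would send.
enum WatchdogMode: String {
    case log
    case crash
}

/// Maps a remotely fetched flag value to a mode, defaulting to the safe option.
func watchdogMode(fromRemoteValue value: String?) -> WatchdogMode {
    guard let value = value,
          let mode = WatchdogMode(rawValue: value.lowercased()) else {
        return .log  // missing or unknown values never enable crashing
    }
    return mode
}

/// Acts on a timeout according to the current mode.
func handleTimeout(operation: String, mode: WatchdogMode, log: (String) -> Void) {
    let message = "watchdog: \(operation) exceeded its timeout"
    switch mode {
    case .log:
        log(message)
    case .crash:
        // Crashing on purpose lets the crash reporter capture the stuck state.
        fatalError(message)
    }
}

var logged: [String] = []
handleTimeout(operation: "loadWeekView",
              mode: watchdogMode(fromRemoteValue: nil)) { logged.append($0) }
print(logged)
```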

Wrapping up

Check out our basic sysdiagnose stackshot symbolicator over here on GitHub. Pull requests welcome!

I hope you’ve learned a bit more about failure modes of multithreaded programs, and now have some tools for diagnosing or working around some of the most difficult classes of bugs.

Motiv is hiring — if you want to work with us, check out our careers page!
