Preventing Repeated Crashes on Launch in Our SDK

By Andrew
Published in Specto
3 min read · Jan 12, 2021

Our goal at Specto is to help you improve your app’s performance. It should go without saying that our code should run as efficiently as possible, but we also take great pains to ensure we don’t adversely affect your app’s stability. Simply put, we don’t want to crash your app!

Last month, we gave you a peek at our stability efforts in our Android SDK with a post about our recently open-sourced error-handling library, Belay. Today, let’s look at some things we do in our iOS SDK in the same spirit of reliability.

Launch time

When you initialize our SDK in your app, we perform some housekeeping: gather information about the environment, set up the filesystem, and communicate with our backend servers. Some of this is necessarily synchronous, but the rest is lower-priority work that we offload to background queues. Performance 📈
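As a rough illustration of that split, here is a minimal sketch in Swift. The function name `initializeSDK`, its parameters, and the queue label are hypothetical and are not taken from our SDK; the only point is that the must-finish work runs inline while everything else is handed to a low-priority queue.

```swift
import Foundation

// Hypothetical helper, for illustration only.
// `criticalWork` runs synchronously on the caller's thread;
// `deferredWork` is dispatched to a utility-priority background queue.
func initializeSDK(
    criticalWork: () -> Void,
    deferredWork: @escaping () -> Void,
    backgroundQueue: DispatchQueue = DispatchQueue(label: "sdk.setup.deferred", qos: .utility)
) {
    // Finishes before this function returns.
    criticalWork()

    // Returns immediately; the work runs later, off the calling thread.
    backgroundQueue.async {
        deferredWork()
    }
}
```

The caller only pays for the synchronous part during its own launch; the deferred part competes for CPU time at a lower quality of service.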

What if something goes wrong during our launch, which happens during your own app’s launch? Many developers have at some point had to debug a crash on launch, and even the best engineering organizations have experienced it in widespread, persistent ways.

To avoid persistent crashes at launch, we do a quick check at the beginning of our setup to see if things worked last time as a heuristic for whether it is safe to try again. Two pieces of information help us answer this: whether the previous launch finished successfully, and if not, the reason it was interrupted.

Startup and shutdown

During any given launch, we write checkpoints after completing various pieces of work. If a subsequent launch sees that a previous launch left off at a crucial point, it may not be safe to try running that code again. Because we run the highest-priority work synchronously, the checkpoint that marks the end of that synchronous work is a pivotal milestone.

We write checkpoints while performing initialization so subsequent launches can know whether everything ran smoothly the last time, or if not, where problems occurred. Determining whether it’s safe to initialize, right at the beginning, is covered in the next diagram.
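One way to picture checkpointing is a tiny file that always holds the last milestone reached. The sketch below is an assumption about how this could be done, not a description of our actual storage: the checkpoint names, the `CheckpointFile` type, and the plain-text file format are all made up for the example.

```swift
import Foundation

// Hypothetical milestones; the real checkpoint names are not listed in this post.
enum LaunchCheckpoint: Int {
    case started = 0
    case criticalSectionFinished = 1
    case backgroundWorkFinished = 2
}

// Persists the most recent checkpoint so the next launch can read it back.
struct CheckpointFile {
    let url: URL

    // Overwrites the file with the checkpoint's raw value.
    // `.atomic` writes to a temporary file first, so a crash mid-write
    // does not leave a half-written checkpoint behind.
    func write(_ checkpoint: LaunchCheckpoint) throws {
        let data = Data(String(checkpoint.rawValue).utf8)
        try data.write(to: url, options: .atomic)
    }

    // Returns nil when no checkpoint exists or the file is unreadable.
    func read() -> LaunchCheckpoint? {
        guard let data = try? Data(contentsOf: url),
              let text = String(data: data, encoding: .utf8),
              let rawValue = Int(text) else {
            return nil
        }
        return LaunchCheckpoint(rawValue: rawValue)
    }
}
```

In this sketch, a launch would write `.started` first, `.criticalSectionFinished` once the synchronous setup returns, and `.backgroundWorkFinished` from the background queue. The next launch calls `read()` before doing anything else.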

If we see that our critical section didn’t complete, we assume it encountered a problem. If it finished, but our lower-priority work on the background queue did not, we look at why it might have been interrupted. If it was due to a crash, we won’t retry until we see an updated version of our SDK, the app, or the operating system; if the user simply backgrounded or terminated the app, initialization proceeds as normal.

Depending on the last checkpoint passed in the previous initialization attempt, we decide if it’s safe to try again based on a few pieces of information about the last termination and any differences in the current execution environment.
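The decision can be sketched as a small pure function. Everything here is hypothetical naming for illustration (`ExecutionEnvironment`, `PreviousLaunch`, `isSafeToInitialize`), and two behaviors are assumptions on our part for the example: a first-ever launch is treated as safe, and a changed SDK, app, or OS version re-enables initialization after an incomplete critical section as well as after a background crash.

```swift
// All types and names below are illustrative stand-ins.
struct ExecutionEnvironment: Equatable {
    var sdkVersion: String
    var appVersion: String
    var osVersion: String
}

// How far the previous launch got, derived from its last checkpoint.
enum PreviousProgress {
    case criticalSectionIncomplete
    case backgroundWorkIncomplete
    case completed
}

enum TerminationReason {
    case crash
    case backgrounded
    case userTerminated
}

struct PreviousLaunch {
    var progress: PreviousProgress
    var terminationReason: TerminationReason
    var environment: ExecutionEnvironment
}

func isSafeToInitialize(previous: PreviousLaunch?, current: ExecutionEnvironment) -> Bool {
    // Assumption: with no record of a previous launch, there is nothing to distrust.
    guard let previous = previous else { return true }

    // Assumption: any change in SDK, app, or OS version justifies a fresh attempt.
    if previous.environment != current { return true }

    switch previous.progress {
    case .completed:
        return true
    case .criticalSectionIncomplete:
        // The synchronous work never finished, so assume it hit a problem.
        return false
    case .backgroundWorkIncomplete:
        // Only a crash blocks a retry; backgrounding or user termination is benign.
        return previous.terminationReason != .crash
    }
}
```

Keeping the decision free of I/O makes each branch easy to unit test with hand-built `PreviousLaunch` values.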

If we detect an unsafe situation, we disable our SDK and upload logs to our servers for alerting and diagnosis. But what if our logger or diagnostic uploader itself is causing a crash? The next launch will try to upload diagnostics again and we’re back in a crash loop. To mitigate this, we only attempt the diagnostic upload once; subsequent launches will silently disable themselves, waiting to be upgraded when we release a patch.
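The once-only upload comes down to recording the attempt before starting it. The sketch below assumes a persisted boolean flag; the `DiagnosticFlags` class is an in-memory stand-in for whatever storage would really back it, and all names are hypothetical.

```swift
// In-memory stand-in for a flag that would be persisted across launches.
final class DiagnosticFlags {
    var uploadAttempted = false
}

enum UnsafeLaunchOutcome {
    case attemptedDiagnosticUpload
    case disabledSilently
}

// Called only after a launch has been judged unsafe. The SDK stays disabled
// in both outcomes; the only question is whether diagnostics are sent.
func handleUnsafeLaunch(flags: DiagnosticFlags, upload: () -> Void) -> UnsafeLaunchOutcome {
    guard !flags.uploadAttempted else {
        return .disabledSilently
    }

    // Set the flag BEFORE uploading. If the uploader itself crashes,
    // the next launch still sees the flag and does not try again.
    flags.uploadAttempted = true
    upload()
    return .attemptedDiagnosticUpload
}
```

The ordering is the whole trick: setting the flag after a successful upload would leave a crashing uploader free to run again on every launch.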

Cleared for launch

We’re trying to take SDK reliability practices to the next level. We can turn off problematic versions of our SDK with remote configuration, and our SDK can self-arrest if it thinks it’s in trouble. Being a good citizen in your app is a foundational part of our goal to help you increase performance.

If you’re interested in monitoring your app’s performance, check out what we’re building at Specto!


iOS @ Specto. Previously at Twitter/Crashlytics & Layer.