Conquering Our Android Crash Count

Published in Strava Engineering · Nov 13, 2014

At Strava, we strive to produce quality software to serve and motivate the world’s athletes. As part of that mission, we are constantly updating and refining our user experience. And few things make for a worse user experience than the app crashing.

We’ve all experienced a crash at some point — the app suddenly disappears and is replaced by the system error dialog. This is jarring and interrupts your flow: it forces you to stop mid-thought, navigate back to where you were in the app, and finally restart what you were originally doing. Additionally, if you entered the app from a transient call-to-action (e.g. a push notification), it can be difficult or even impossible to find your place again. As developers, we try to avoid crashes in the first place, and to quickly find & fix those that slip through.

For a long time, the Android team at Strava has been tracking crash statistics via the Google Play Store. When Strava crashes, the phone displays a prompt to either report or ignore the crash; reporting it sends the stack trace and other non-identifying device information to Google, where we can examine it. Our crash counts looked reasonable — a few hundred per day across 200,000 daily sessions — but we suspected this wasn’t the whole story. To get a clearer picture, and collect data for crashes that weren’t being reported, we integrated Crashlytics — a crash reporting service owned by Twitter. We were in for a shock.

New Numbers

Our first major release with Crashlytics was Strava 4.1.6, which included updates to Training Videos and assorted bug fixes. We released it on the morning of Thursday 24 July and kept an eye on both Google Play and Crashlytics to make sure we hadn’t introduced any unexpected show-stopping bugs. We hadn’t, but we were unpleasantly surprised to see nearly 8,000 crashes reported in the first 3 days. The crash rate only accelerated from there, putting us over 42,000 crashes after a week. After the initial ramp-up, as users upgraded, we averaged around 7,000 crashes per weekday and 8,500 per weekend day: an unacceptably high count for a company as focused on quality as Strava.

Crashlytics reported that 95–96% of the athletes using Strava on any given day were crash-free. In other words, the remaining 4–5% experienced a crash that day. App instability annoys users, generates poor reviews and increases churn.

We had never seen many of the crashes that our athletes were experiencing, and they often did not appear on the top pages of the Play Store crash list. For instance:

  • The most prevalent crash, with 20,000 occurrences in the first week, was an IllegalStateException in the Android Text-to-Speech (TTS) system: the platform was incorrectly reporting successful binding to the TTS service. We’d never seen it before (nor have we reproduced it since), but we could eliminate close to 50% of our crashes by catching and logging the exception.
  • A further 3,500 crashes were due to fragment state loss in a commonly-viewed Activity.
  • A number of other smaller crashes were due to not checking that a Fragment was attached to an Activity in an asynchronous Receiver before using the result of Fragment#getActivity().
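The "catch and log" fix from the first bullet can be sketched in plain Java. This is only an illustration under stated assumptions: `SafeSpeaker` and its nested `Engine` interface are hypothetical names, and `Engine` is a stand-in for the platform speech object, not the Android TextToSpeech API or our production code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper: logs a speech failure instead of letting it crash the app. */
final class SafeSpeaker {
    /** Hypothetical stand-in for the platform speech engine. */
    interface Engine {
        void speak(String text);
    }

    private final Engine engine;
    private final List<String> errorLog = new ArrayList<>();

    SafeSpeaker(Engine engine) {
        this.engine = engine;
    }

    /** Returns true if the text was handed to the engine, false if the engine failed. */
    boolean speak(String text) {
        try {
            engine.speak(text);
            return true;
        } catch (IllegalStateException e) {
            // Catch only the exception type seen in the crash reports, and record it
            // so the failure stays visible even though the app keeps running.
            errorLog.add("TTS unavailable: " + e.getMessage());
            return false;
        }
    }

    List<String> errors() {
        return errorLog;
    }

    public static void main(String[] args) {
        // Simulates an engine that claimed a successful bind but is not usable.
        SafeSpeaker speaker = new SafeSpeaker(text -> {
            throw new IllegalStateException("not bound to TTS engine");
        });
        System.out.println(speaker.speak("One mile completed")); // prints "false"
        System.out.println(speaker.errors());
    }
}
```

Catching the narrowest exception type keeps unrelated bugs from being silently swallowed, and the log entry preserves the signal that a crash report would otherwise have carried.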

Plugging the Leaks

On Thursday 14 August, three weeks after releasing Strava 4.1.6, we quietly unveiled Strava 4.1.7, a bug fix release targeting the worst crashes of Strava 4.1.6. A week later, our numbers were a lot more encouraging:

  • 55% fewer crashes than 4.1.6: 4,500 in the first 3 days and 18,000 after a week.
  • Averaged 4,000 crashes each weekday (after an initial ramp-up) and 5,500 each weekend day, about 3,000 fewer than 4.1.6.
  • 1–2% of athletes experienced a crash on any given day.

It’s difficult to overstate the importance of this improvement of 3–4 percentage points. At roughly 200,000 daily sessions, it represents 6,000–8,000 fewer sessions that end in a crash: 6,000–8,000 more daily opportunities for us to impress our athletes instead of disappointing or alienating them.
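The arithmetic behind the 6,000–8,000 figure can be checked with a few lines of Java. This is a rough sketch: it assumes the 200,000 daily sessions quoted earlier, and it treats the share of athletes who crash as the share of sessions that crash, which is a simplification. The class and method names are made up for the example.

```java
/** Back-of-the-envelope estimate of sessions that no longer end in a crash. */
final class CrashImpact {
    /**
     * @param dailySessions     total sessions per day
     * @param improvementPoints drop in the daily crash rate, in percentage points
     */
    static long sessionsSaved(long dailySessions, double improvementPoints) {
        return Math.round(dailySessions * improvementPoints / 100.0);
    }

    public static void main(String[] args) {
        long dailySessions = 200_000; // figure quoted earlier in the post
        System.out.println(sessionsSaved(dailySessions, 3.0)); // prints 6000
        System.out.println(sessionsSaved(dailySessions, 4.0)); // prints 8000
    }
}
```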

The most common remaining crashes in Strava 4.1.7 were overwhelmingly OutOfMemoryErrors (OOMs). The three most common crashes, totaling 8,000 of the 18,000, were OOMs. There was still a lot of work left to be done, but now we were facing the same issues that we knew every other Android developer struggled with: a system with limited resources and aggressive memory management.

Small Fix, Big Impact

Since releasing a redesigned feed in Strava 4.0 this past March, our app has had a larger memory footprint. We’ve tried reducing the size of our remote image cache, but this did not result in any significant reduction in OOMs. We’ve made a number of other small tweaks: lazy-loading drawable resources, removing duplicate drawable loads, more aggressively recycling bitmaps, reducing common object instantiations; but OOMs kept occurring.

We observed that 25–30% of OOMs came from Google Play Services Maps. This is a known issue, and particularly affects newer devices. Since we don’t have control over Play Services image loading and caching, we needed to take a different approach.

To address our OOM problem from another angle, we set largeHeap=true [1] in our application manifest. This asks Android devices running API 11 (Honeycomb) and above to give the application a larger maximum heap when it is launched. This performed well in internal testing, so we rolled it out to 5% of our install base to see how well it would work in production. The signs were good, so we added it to our latest release, Strava 4.2.0.
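In the manifest, the setting is the `android:largeHeap="true"` attribute on the `<application>` element. One way to see what heap ceiling a process actually received is to ask the runtime. The snippet below is a minimal sketch in plain Java using `Runtime.maxMemory()`; the class name `HeapCeiling` is only for illustration, and the value printed depends entirely on the device or JVM it runs on.

```java
/** Prints the maximum heap size the current runtime will attempt to use. */
final class HeapCeiling {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the upper bound on heap growth for this process.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Logging this value with and without the flag on a few test devices is a cheap way to confirm that the request was honored before rolling it out more widely.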

The Present

On Tuesday 30 September, we released Strava 4.2.0, which included a redesign of the profile screen, weekly progress & goals, run and ride auto-pause, a home screen widget, and a new ongoing recording notification, as well as largeHeap=true. Despite many exciting changes, this release has been our most stable yet, with only 3,200 crashes in the first 3 days and 12,000 after a week (around 1,700/day), down 71% from 4.1.6.
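The 71% figure follows from the first-week totals quoted in this post. A quick check, as a sketch with illustrative names, using 42,000 as the 4.1.6 baseline (the post says "over 42,000", so the true reduction is slightly larger):

```java
/** Computes the percentage drop between two first-week crash totals. */
final class CrashReduction {
    static double percentReduction(long baseline, long current) {
        return 100.0 * (baseline - current) / baseline;
    }

    public static void main(String[] args) {
        long firstWeek416 = 42_000; // approximate; quoted as "over 42,000"
        long firstWeek420 = 12_000;
        long rounded = Math.round(percentReduction(firstWeek416, firstWeek420));
        System.out.println(rounded + "% fewer crashes"); // prints "71% fewer crashes"
    }
}
```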

Today, over 99% of athletes using Strava on a given day do not experience a crash. The most common crash (accounting for 2,200 of the 12,000) is due to a known issue in a beta release of Android L. Our most prevalent OOM, at just over 1,000 occurrences, is now only the third most common crash. Among OOMs, totaling roughly 2,600 of the 12,000, 99% of occurrences are on legacy devices that do not support largeHeap. On the most common devices, we see almost 85% fewer crashes than 3 months ago.

The Future

While we’re proud of getting our crash count down, there is still a lot of work left to bring it down further. Some crashes are obscure and unavoidable, but many more can still be prevented. It’s great that 99% of our athletes go crash-free on any given day, but we’d prefer to see 99.9% or better. Crashlytics automatically prioritizes crashes by the number of occurrences; we’ll continue to pick off the worst offenders to make using Strava a more stable and enjoyable experience.

We’ve learned that the Play crash reporting system is limited: it only sees crashes that users choose to report. We missed tens of thousands of crashes a week, a difference that can make or break an app’s user experience. The data provided by Crashlytics has completely changed our perspective on our app’s stability, and given us new targets to work toward.

Footnotes

[1] The Android documentation discourages the use of largeHeap, stating that it is better to fix the underlying memory problems than to rely on the system to provide more memory. In particular, the documentation notes that largeHeap does not guarantee the system will allocate more memory: it is considered a hint that may be ignored. Our case is exceptional, though, because many of our OOMs are due to image upscaling performed by Google Play Services Maps. We rely heavily on maps and cannot control the memory usage of third-party code. Many OOMs not directly caused by maps are still influenced by them; for instance, there may be less available heap space due to tile image caching. In addition, using largeHeap did not degrade app performance. A larger heap can make garbage collection more expensive, which can lead to dropped frames and a less-responsive user experience. In our testing, however, we did not see any more dropped frames than before, and the user interface was more responsive than ever.

Originally published at labs.strava.com by Byron Hood.
