Finding bugs in the long tail of your App’s crash report
The subject of my master’s thesis was a microarray data set from UCSF that captured the expression of tens of thousands of genes from placental tissue at various stages of fetal development.
We knew the data contained insights that would help focus the medical research striving to identify the cause of, and cure for, preeclampsia and other complications affecting pregnancy. But we worried that some of the more subtle details were being lost because the size of the data precluded the expert biologists from participating in the analysis.
Humans vs. Computers
Data mining techniques provide a way to tease out information from large data sets. Our collaborators at UCSF had already published a paper rooted in clustering, complete with the pretty scatter plots that attempt to make the results less opaque. But data mining results are often hard to understand, hard to trace, and heavily parameterized.
Humans, on the other hand, are terrible with large amounts of data, yet are good at spotting patterns and applying knowledge and experience. That is especially true of experts analysing their own large data sets. Computers are good at storing, recalling, and manipulating lots of data. Humans are good at extracting meaning from observations.
Combining the two is where the magic happens: humans lead the analysis while computers disguise the size and complexity of the data. That was my goal: writing software that enabled experts to explore and experience their data directly.
All of this was over 10 years ago, but the experience and philosophy have affected almost every aspect of my professional life.
Recently I had an experience with one of our Crashlytics integrations, which made me realize that their unconventional product focus is heading in exactly this direction, and it’s thrilling to see it being developed on such a huge scale.
We’ve been using Crashlytics (now part of Fabric, from Twitter) since our 2014 builds of the Official NFL Fantasy Football mobile apps. Combined with Fabric’s Answers product, we’ve come to rely on it to get a daily feel for the adoption, performance, and usage of the mobile apps we build.
During that first season, we saw a spike in a specific application code bug, and Crashlytics fell short of enabling us to pinpoint the exact time of the first crash. When we pressed on the issue, it was clear there were limits to the questions we could get meaningful answers to via Crashlytics. (Side note: they were actually super helpful via email. Shout out to Hemal Shah and team.)
This was just one of many times we found ourselves scratching our heads about apparent gaps in their product, thinking that conventional wisdom would have filled them long ago. But recently it has become clear that, rather than providing exhaustive access to the data they’re collecting (as Google Analytics does, for example), they realize that the real value of a platform such as theirs is in surfacing data in a way that lets the developer identify problems and begin to interpret, rationalize, and solve them. It’s software that empowers developers to understand their application code across platform heterogeneity (different devices, OSes, etc.) and countless other variables.
Listening to the data
Using Answers and Crashlytics, we know that push notifications for Fantasy Movie League lead directly to spikes in sessions, and those sessions decay rather gradually.
For World Surf League, we noticed that folks are self-organizing to show up when Kelly Slater surfs a heat. We also noticed that crashes build with the popularity of event broadcasts.
And that there are some lingering iOS 8 issues in our long tail.
The most impressive application of Fabric came this week, though, with the rollout of out-of-memory (OOM) reporting:
For a long time I’ve suspected that many of our long-tail crashes (the issues with 1–20 occurrences) were really just the app dumping out under OS or resource pressure. Here’s a snippet of our newly available OOM crashes for the World Surf League:
I sent the link to David because we hadn’t seen OOM data flowing yet, and his first question was “what’s the deal with iPad Pro?!” It’s easy to gloss over how powerful this insight is. iPad Pro is the best-resourced iOS device in existence, but it’s got the most OOM crashes. Its modest usage had, until now, hidden this particular feature.
This cut and presentation of data had identified an app-code issue that, as the developer, I could diagnose and fix. That is magic.
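As a rough illustration of the kind of cut that made iPad Pro stand out, here is a minimal Python sketch that groups OOM events by device model. The event records, field names, and counts are all invented for illustration; they are not our World Surf League numbers, and this is not a Crashlytics export format.

```python
from collections import Counter

def oom_counts_by_device(events):
    """Count OOM events per device model, most frequent first."""
    counts = Counter(e["device"] for e in events if e["type"] == "oom")
    return counts.most_common()

# Hypothetical events: in the overall crash list, the iPad Pro entries
# would be buried in the long tail because so few sessions come from it.
events = [
    {"device": "iPhone 6", "type": "crash"},
    {"device": "iPhone 6", "type": "crash"},
    {"device": "iPhone 6", "type": "oom"},
    {"device": "iPad Pro", "type": "oom"},
    {"device": "iPad Pro", "type": "oom"},
    {"device": "iPad Pro", "type": "oom"},
    {"device": "iPad Air 2", "type": "oom"},
]

for device, count in oom_counts_by_device(events):
    print(f"{device}: {count}")
```

Filtering to one failure type before grouping is what lets a low-traffic device surface at the top of the list.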
Explaining the feature (aka fixing the bug)
So what was different about iPad Pro? My thoughts turned to how we request and cache images. We rely on Fast Image Cache to maximize the fps while scrolling on the home feed, which basically takes images we load of a predetermined size and reads them in and out of a much larger in-memory image. We’ve been doing that for over a year, without any negative side effects.
The second component is that we rewrite URLs to get images at an appropriate resolution and quality for the device. The thinking was that on iPhone 5, we should be getting images 320px wide, and on iPad, we should be getting larger images to look good on the larger screen. That logic extends to iPad Pro, where the images being requested are, well, yuge.
So, an in-memory cache that scales with the size of the images it’s initialized to contain. Bingo.
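To get a feel for how quickly that adds up, here is a back-of-the-envelope sketch in Python. Only the 320px iPhone 5 width comes from our setup as described above; the 2048px iPad Pro width, the 16:9 aspect ratio, the 4 bytes per pixel, and the 50-entry cache size are assumptions picked for illustration, not the app’s real configuration or anything specific to Fast Image Cache.

```python
BYTES_PER_PIXEL = 4  # assumption: uncompressed 32-bit pixels

def cache_bytes(width_px, height_px, entries):
    """Memory needed to hold `entries` uncompressed images of one fixed size."""
    return width_px * height_px * BYTES_PER_PIXEL * entries

def cache_megabytes(width_px, entries, aspect=(16, 9)):
    """Same estimate in MB, deriving height from an assumed aspect ratio."""
    height_px = width_px * aspect[1] // aspect[0]
    return cache_bytes(width_px, height_px, entries) / (1024 * 1024)

# Hypothetical per-device image widths; 50 entries is an arbitrary cache size.
for name, width in [("iPhone 5", 320), ("iPad Pro", 2048)]:
    print(f"{name}: {cache_megabytes(width, entries=50):.1f} MB")
```

The footprint grows with the square of the image width, so a 6.4x wider image costs roughly 41x the memory for the same number of entries.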
Don’t neglect your tail
It’s natural to focus on chasing your noisiest crashes, and the summary stats. But we’re increasingly able to identify and tackle app-code issues in the long tail, whose impact may be more modest, but no less satisfying to solve. I’m looking forward to rolling a build that will make the World Surf League more stable for our largest iOS screens.
And if your engineering team isn’t poring over Crashlytics and Answers on a daily basis, encourage them to, because this is one of a growing number of tools that are becoming indispensable to a modern dev team.