Recently I came across an article from The Verge reporting that a Google app was downloading excessive data, even when the app was closed or in the background. That reminded me I should share our own story: perhaps someone can learn from it, or it can simply serve as a future reminder for myself.
We have a set of Android applications with around 400k monthly active users, and all was well until we received a report like this:
First reaction was… WTF!
What could be causing such behaviour? We looked thoroughly into our code, our UI layer, our network layer, etc. We looked into every 3rd-party service doing background work or making network requests. We looked into any component registering wake-locks or background services. We had some suspicions, but nothing out of the ordinary popped up.
We were pretty sure our own code was safe, so we focused on the many 3rd-party SDKs we currently have. To provide some context, just for advertising and analytics, we have ~14 different SDKs. It sucks, but they’re required to meet business requirements and generate revenue 🤷‍♂️.
(It’s hard to keep your app clean when working in a big company that requires integration with multiple services and you don’t control everything that gets inside your app.)
Weeks went by and no one else complained, so we closed one eye and hoped it was a one-time situation that wouldn’t happen again…
We were wrong (obviously) and other complaints appeared. We tried to reproduce it, but never succeeded. We tried different devices, but never succeeded. We looked at every log, we used Android Studio Network Profiler, we installed network monitoring apps, we stress tested the app, we captured bug reports and analysed the device using the Battery Historian tool. We could not reproduce it. We asked a few Google engineers for help, but there wasn’t much they could do… We had to look into our codebase and keep digging. (Although by now, we were running out of ideas.)
The situation lived on for some months. At this point we knew it was not a frequent bug: sometimes we had a couple of complaints per month, other times we had months without any complaint at all. When it happened, though, it was a serious problem. We had to take a step back, re-evaluate all possible culprits and make a new plan.
Part of this plan was to implement our own network monitor. We wanted to track how much data our app was consuming, per user. Android has some APIs to help us do that. Once we had this network monitor in place, we could do some A/B testing and slowly start excluding specific components of the app.
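To give an idea of what such a monitor can look like, here is a minimal sketch (illustrative only, not the code we shipped). It assumes you snapshot the app's byte counters when a session starts and again when it ends, then report the difference. On Android those counters could come from `TrafficStats.getUidRxBytes(Process.myUid())` and `TrafficStats.getUidTxBytes(Process.myUid())`; here they are injected as `LongSupplier`s so the class runs on a plain JVM. The `SessionDataMonitor` name and its methods are hypothetical.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical per-session data usage tracker (all names are illustrative).
 * On Android the two suppliers could be backed by
 * TrafficStats.getUidRxBytes(Process.myUid()) and
 * TrafficStats.getUidTxBytes(Process.myUid()).
 */
final class SessionDataMonitor {
    /** Returned when usage cannot be computed (TrafficStats.UNSUPPORTED is also -1). */
    static final long UNKNOWN = -1L;

    private final LongSupplier rxBytes; // total bytes received so far
    private final LongSupplier txBytes; // total bytes sent so far
    private long startRx = UNKNOWN;
    private long startTx = UNKNOWN;

    SessionDataMonitor(LongSupplier rxBytes, LongSupplier txBytes) {
        this.rxBytes = rxBytes;
        this.txBytes = txBytes;
    }

    /** Snapshot both counters when a session begins. */
    void start() {
        startRx = rxBytes.getAsLong();
        startTx = txBytes.getAsLong();
    }

    /**
     * Bytes transferred (received + sent) since start(), or UNKNOWN if the
     * counters are unsupported, start() was never called, or a counter went
     * backwards (for example after a reboot).
     */
    long stop() {
        long endRx = rxBytes.getAsLong();
        long endTx = txBytes.getAsLong();
        if (startRx < 0 || startTx < 0 || endRx < startRx || endTx < startTx) {
            return UNKNOWN;
        }
        return (endRx - startRx) + (endTx - startTx);
    }
}
```

The value returned by `stop()` is what you would attach to your analytics event for that session, along with whichever A/B variant the user was in.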
And that was a good call. We still didn’t know what the problem was, but we could now have a better idea of how much data was transferred on average per user session, on which devices, etc.
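As an illustration of the kind of breakdown this enables, the sketch below averages per-session bytes by device model and Android version. It is a simplified example under assumptions, not our actual analytics pipeline: `SessionReport` and `UsageByDevice` are hypothetical names, and on Android the model and version strings could be taken from `Build.MODEL` and `Build.VERSION.RELEASE`.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical record of one session's data usage (illustrative names). */
final class SessionReport {
    private final String deviceModel;    // e.g. the value of Build.MODEL
    private final String androidVersion; // e.g. the value of Build.VERSION.RELEASE
    private final long bytes;            // bytes transferred during the session

    SessionReport(String deviceModel, String androidVersion, long bytes) {
        this.deviceModel = deviceModel;
        this.androidVersion = androidVersion;
        this.bytes = bytes;
    }

    /** Grouping key combining device model and OS version. */
    String deviceKey() {
        return deviceModel + " / Android " + androidVersion;
    }

    long bytes() {
        return bytes;
    }
}

final class UsageByDevice {
    private UsageByDevice() {}

    /** Average bytes per session for each device/OS combination. */
    static Map<String, Double> averageBytes(List<SessionReport> reports) {
        return reports.stream().collect(
                Collectors.groupingBy(
                        SessionReport::deviceKey,
                        Collectors.averagingLong(SessionReport::bytes)));
    }
}
```

Sorting the resulting map by value makes the outliers stand out: a device/OS combination whose average sits far above the rest is a good candidate for manual testing.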
We began to notice that users on some Samsung devices typically transferred more data than others, so we began to manually test more often on those specific devices. Soon after, we noticed that a large percentage of the excessive usage occurred on one combination: a Samsung S7 running Android 7.0 (not Android 7.1, not Android 6.0). And then, on one particular day, Rob Slama was randomly testing the app and found a way to reproduce the data leak.
It happened on many devices, just more frequently on that combination of device and Android version. And you may ask… after all, what was causing the data leak?
In an attempt to clear resources obtained by a 3rd-party SDK, we had this piece of code inside an Activity#onDestroy:
Which (confusingly) caused this library to keep holding resources for some reason. We removed it, updated the library to the latest version, and problem solved.
In our defence, there was little to no documentation, and this is a terrible design decision — in my opinion, naming a method ‘off’ implies stopping whatever the object was doing. Anyway, in the end, it was our mistake for incorrectly integrating the library. Lesson learned (the hard way).
As a final thought, months before our issue appeared, we reported a memory leak to the developers of this same library. The memory leak was never fixed as it was ‘not a priority’ for the 3rd-party library. Had we insisted on a fix, perhaps our problem would’ve never existed.
A part of me still thinks I did a terrible job by not fixing this issue any sooner. Another part is still relieved that we actually fixed it. Some problems are not taught in classes or available on Stack Overflow, and you just have to adapt. I like to think that’s when you grow ;)
I hope you enjoyed this article, feel free to share some feedback in the comments =)
Follow me at https://twitter.com/belchii