Production Profiling of Mobile Applications With Specto
In 2006, Akamai ran a performance case study in which they found that users will wait a maximum of 4 seconds for a page to load before they develop a negative perception of the company. In 2009, they ran a similar study and found that the threshold had dropped to 2 seconds.
More than 10 years later, we are now at a point where performance is no longer optional — it is an expectation. It is also the reason why large companies (think Google, Facebook, Uber, Amazon, Netflix etc.) employ teams of highly skilled performance engineers to optimize their applications and it’s why this level of optimization is not always in reach of smaller companies.
There’s also a difference in the way users expect mobile and web applications to work. On mobile, users tend to be more demanding when it comes to overall performance. Applications must launch in the blink of an eye and become interactive as quickly as possible.
Hermes, the JavaScript engine built for React Native, is an ahead-of-time (AOT) compiling engine, which means it tries to do as much work as possible at compile time and leaves the runtime with as little as possible. It is a great step forward, but we should also investigate what we can do at runtime to further optimize application performance.
The state of Mobile APMs today
We believe that the traditional mobile APM approach is flawed, as it rarely gives engineers the detailed level of visibility that is required to debug client-side performance issues. The value of such tools is often capped by the level of experience of the engineers using them, as demonstrated by engineers moving timers around a suspected bottleneck until they can isolate the issue (if they are lucky). Hard-to-find issues may only happen in certain environments and contexts that we do not have immediate access to, meaning we need to release new application versions and hope for the best. Tools like this come at the expense of long fix turnaround times, paid for by the patience of your users.
This is where Specto is different from other mobile APMs: we give engineers function-level insight into their mobile application’s production environment. Because Specto profiles production traffic, the data you see in your dashboard directly reflects what your users experience. Additionally, performance issues can occur under very specific conditions (poor network, low battery, etc.), which makes them hard to replicate and easy to miss when using local performance profilers or testing in CI/device labs. Similarly, a single profile does not represent your users very well: a local profiler may execute a block of code in 100ms, and we may dismiss it as not worth optimizing, not knowing that the 75th percentile duration of that same block of code in our production environment could be 5x what we are seeing locally.
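To make the gap between a single local measurement and a production percentile concrete, here is a minimal Swift sketch. It uses the nearest-rank method, and the sample durations are made up for illustration; this is not how Specto computes its aggregations, only one simple way to think about them.

```swift
import Foundation

/// Returns the value at percentile `p` (0...100) using the nearest-rank method,
/// or nil when there are no samples or `p` is out of range.
func percentile(_ p: Double, of samples: [Double]) -> Double? {
    guard !samples.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    // Nearest rank: the smallest sample such that at least p% of samples are <= it.
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Hypothetical durations (in ms) of the same block of code across production sessions.
let productionDurationsMs: [Double] = [90, 95, 100, 110, 120, 300, 480, 520]
// Hypothetical duration of that block in a single local profiling run.
let localDurationMs = 100.0

if let p75 = percentile(75, of: productionDurationsMs) {
    print("local: \(localDurationMs) ms, production p75: \(p75) ms")
}
```

With these made-up samples, the local run sits near the production median while the 75th percentile is several times slower, which is exactly the kind of gap a single profile hides.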
Let’s look at some of the data from our Specto testing application that displays information about movies.
Finding a performance issue
At Specto, everything starts with a trace — our SDK will automatically start some for you, but you may start your own custom traces to instrument any part of your application. The Traces tab shows the latest traces we’ve ingested, and they are queryable across application versions, interactions, and different device attributes. Can you already spot some outliers?
Click on a trace from the chart to view a breakdown of the spans, call trees, and function durations for that trace. After Specto ingests the traces, we compute aggregations for you. The functions tab surfaces the most expensive code running in your application for a given interaction, and it can be further filtered by thread name along with any other specific device attributes. This makes it easy to compare data over time, validate improvements, or track down regressions.
An expensive image decoding problem
If you look at the screenshot above, the interaction we are querying for is called load-movie-details. In practice, this happens when a user clicks on a particular movie from a feed, and the application requests data from a web API and renders a few images.
If we sort our view by descending duration, we can see that the top 5 functions executed during this interaction all belong to the Kingfisher library — this is the library used by our application to fetch and render images. Let’s look at the call stack of the first function call in the list to try and understand what’s going on.
The view above identifies the threads, functions, and libraries where most of your application’s execution time is spent, so it is a great place to start looking for performance bottlenecks. From here, we can click on a frame and open it in a top-down or bottom-up view.
We recommend looking at the bottom-up tree view to surface the call trees associated with the most expensive frames, ordered by self-time.
In our case, we see that some of our application code is taking ~400ms to execute, so let’s open this frame in a top-down view to gain more context as to where this code is being invoked from.
From the stack trace above, we can see that our code is doing a few things:
- We are updating the render tree via CATransaction::commit()
- Our render pass is decoding an image — we know that because the calls further up the stack point to ImageIO which results in calls to MediaToolBox and ultimately, AppleJPEG.
Alternatively, we can also look up fetchBackdrop in our application’s source code to validate our assumption. In our case, we use Kingfisher to download and cache the image on a background thread, but dispatch to the main thread to render the image. From the call tree, we can see that the image decode happens on the main thread when the transaction is committed.
There are a few reasons why you should be concerned about long image decoding times, as Apple noted in a WWDC 2018 conference talk.
Solutions to long decoding times
There are two separate things we can do to improve image decoding. If you are using a backend service or a CDN provider that supports image resizing on the fly, first see if you can request smaller images from that service. This has the added benefit of decreasing the memory footprint of your application while also reducing the decoding time. Smaller, non-contiguous blocks of memory may reduce the risk of out-of-memory terminations as well.
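As a rough sketch of requesting a smaller image, the helper below appends a width to the image URL based on the size of the view that will display it. The host name and the `width` query parameter are assumptions made for illustration; resizing services each define their own URL scheme, so check your provider's documentation.

```swift
import Foundation

/// Builds a URL that asks a (hypothetical) resizing service for an image
/// no wider than the view needs. The `width` parameter name is an assumption.
func resizedImageURL(base: URL, viewWidthInPoints: Double, screenScale: Double) -> URL? {
    guard var components = URLComponents(url: base, resolvingAgainstBaseURL: false) else {
        return nil
    }
    // Convert points to pixels so the image stays sharp on high-density screens.
    let pixelWidth = Int((viewWidthInPoints * screenScale).rounded(.up))
    var items = components.queryItems ?? []
    items.append(URLQueryItem(name: "width", value: String(pixelWidth)))
    components.queryItems = items
    return components.url
}

// Hypothetical backdrop URL; a 375pt-wide view on a 2x screen needs 750 pixels.
let original = URL(string: "https://images.example.com/backdrops/123.jpg")!
let resized = resizedImageURL(base: original, viewWidthInPoints: 375, screenScale: 2)
```

Sizing the request in pixels rather than points avoids downloading and decoding more data than the screen can show.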
While reducing image size will improve decoding performance and memory footprint, it may not be enough to prevent frame drops on the main thread.
One option is to preload our images and force their decoding (shown by Peter Steinberger in this gist), which ensures that at render time our asset has already been decoded and cached. Another option, described in Apple’s WWDC 2018 talk, is to use a downsampling technique.
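The downsampling approach can be sketched with ImageIO roughly as follows. This is a minimal version modeled on the technique from the talk, not the code from our application; the function name and the choice to take `Data` and return a `CGImage` are ours.

```swift
import Foundation
import CoreGraphics
import ImageIO

/// Decodes `imageData` directly into a thumbnail that fits `pointSize` at `scale`,
/// so the full-size bitmap is never decoded. Intended to run off the main thread.
func downsample(imageData: Data, toFit pointSize: CGSize, scale: CGFloat) -> CGImage? {
    // Do not decode or cache the full-size image when creating the source.
    let sourceOptions: [CFString: Any] = [kCGImageSourceShouldCache: false]
    guard let source = CGImageSourceCreateWithData(imageData as CFData,
                                                   sourceOptions as CFDictionary) else {
        return nil
    }

    let maxDimensionInPixels = max(pointSize.width, pointSize.height) * scale
    let thumbnailOptions: [CFString: Any] = [
        kCGImageSourceCreateThumbnailFromImageAlways: true,
        // Decode now, on the calling thread, instead of lazily at render time.
        kCGImageSourceShouldCacheImmediately: true,
        // Respect the orientation stored in the image metadata.
        kCGImageSourceCreateThumbnailWithTransform: true,
        kCGImageSourceThumbnailMaxPixelSize: maxDimensionInPixels
    ]
    return CGImageSourceCreateThumbnailAtIndex(source, 0, thumbnailOptions as CFDictionary)
}
```

A typical use would be to call this on a background queue once the download completes, wrap the result with `UIImage(cgImage:)`, and then dispatch to the main thread only to assign the image to the view, so that committing the transaction no longer has to decode anything.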
I recommend you go and watch the talk yourself: it’s 30 minutes packed with great information about the rendering pipeline, memory management, and best practices for handling images.
Ship fast, test in prod
We’ve seen how detailed performance profiles from real users in production can help us pinpoint issues down to the exact function causing the problem, and how we can mitigate or reduce the impact of such bottlenecks. We believe this is a game-changer for mobile application performance monitoring, as it empowers engineers to release performance improvements with confidence and allows them to monitor production environments for possible regressions with a level of detail that today’s traditional APMs fall short of.
Engineers can now focus on shipping features and let Specto worry about continuously monitoring performance in production.
Specto is still in private beta, but we are looking to open the platform for public access soon. If you want to receive updates, sign up on our website and we’ll let you know as soon as we are live!