How we achieved a 6x reduction of ANRs - Part 1: Collecting Data
One of the worst things that can happen to your app’s responsiveness is an Application Not Responding (ANR) dialogue. A high ANR rate may affect user experience and, potentially, Google Play search positions and featuring.
At the beginning of the year, our ANR rate was above the Google Play threshold for the Badoo application, which is operated by its parent company, Bumble. So, we set up an ad-hoc team to work on this problem and spent a few months trying different approaches to fix it. By the end of the period, we had reduced the number of ANRs by a factor of more than six.
In this blog post series, I will share our journey of reducing this type of error, describe which approaches were most effective, and show how you can apply them to your application. In Part 1 of this series, we will discuss what an ANR is and the best ways of tracking ANRs. If you already know this part, feel free to jump straight to Part 2, where I explain our approaches to fixing ANRs.
What is an ANR error?
Usually, any software that has a visual interface performs all UI-related operations and rendering in a dedicated “UI” thread. Android is no exception, with the main thread of the app running a loop responsible for all UI operations.
When using this loop, it is crucial not to execute any long-running operations, as they directly affect the responsiveness of your app. If you are doing too much work on the main thread, you may decrease the frame rate or even hang the application’s user interface.
To provide a clear indication that the app is stuck, Android has introduced a concept called ANR. This is what the official documentation says about it:
When the UI thread of an Android app is blocked for too long, an Application Not Responding (ANR) error is triggered.
An ANR will be triggered for your app when one of the following conditions occurs:
1) While your activity is in the foreground, your app has not responded to an input event or BroadcastReceiver, such as key press or screen touch events, within 5 seconds
2) While you do not have an activity in the foreground, your BroadcastReceiver hasn’t finished executing within a considerable amount of time
If the ANR is triggered when your application has an activity in the foreground, Android shows a dialogue offering to either close the app or wait.
You can force an ANR by simply putting a Thread.sleep() call in a UI event handler, such as a button click listener. Then, if you click the button, you will see the ANR dialogue.
Having ANRs not only affects your application’s user experience but also, according to Google’s documentation, can affect search ranking and Google Play promotions.
To prevent ANRs, we need to avoid performing any long-running operations on the main thread. This sounds simple enough, but sometimes it can be tricky to identify the underlying problem. Therefore, it is vital to have good ANR reporting to ease this process.
Let’s look at the options and tools that can help with the investigation of these issues.
Analysing ANRs locally
If you are able to reproduce an ANR locally, then that’s great news because it should be fairly easy to pinpoint the source of the problem and there are many ways to dig into it.
A good start is to check the thread dump. When an ANR occurs, Android captures a thread dump which can help analyse the cause. Thread dumps are usually stored in the /data/anr/ directory. The exact name of the file can be found in the logs: just check logcat right after the ANR is reported.
The thread dump contains a list of stack traces: for each thread, you will see which line was being executed at the time. Together, they represent the state of the application at the precise moment the ANR occurred.
In most cases, you will find the reason for the ANR just by looking at the main thread stack trace. In other cases, check out this great documentation page from Google that describes how to investigate ANRs locally using different tools.
Google Play Tracking
Google Play automatically sends ANR reports if users have opted in to crash report collection. The Google Play console has several metrics and tools for analysing ANRs.
Firstly, it provides a way to check aggregated graphs for both the absolute ANR count per day and the ANR rate. The ANR rate is defined as the ratio of daily sessions affected by at least one ANR to total daily sessions. This metric also has a threshold of 0.47%; exceed this and it will be classified as “bad behaviour”, which can negatively impact your Google Play listing.
Secondly, there is an option to check individual ANR reports grouped by similarity. In the Android Vitals section, you can check the top ANR groups. This can work as the main tool for identifying the most common causes of ANRs in your application.
Occasionally, you may find that the capabilities of the Google Play console are insufficient. For example, there is no way to attach additional logging information to individual reports. Another problem is that it is impossible to tune the grouping logic: it sometimes puts ANRs with different causes in one group and splits ANRs with the same cause across different groups.
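The definition above can be sketched as a small calculation. This is only an illustration of the formula as described in this post, not Google Play's implementation; the class and method names are hypothetical, and the threshold constant is simply the 0.47% figure quoted above.

```java
final class AnrRate {
    // Threshold quoted in the text, expressed in percent (assumed to be a strict "greater than" check).
    static final double BAD_BEHAVIOUR_THRESHOLD_PERCENT = 0.47;

    /** Percentage of daily sessions that were affected by at least one ANR. */
    static double percent(long sessionsWithAnr, long totalSessions) {
        if (totalSessions <= 0) {
            throw new IllegalArgumentException("totalSessions must be positive");
        }
        return 100.0 * sessionsWithAnr / totalSessions;
    }

    /** True when the daily ANR rate is above the quoted threshold. */
    static boolean exceedsThreshold(long sessionsWithAnr, long totalSessions) {
        return percent(sessionsWithAnr, totalSessions) > BAD_BEHAVIOUR_THRESHOLD_PERCENT;
    }
}
```

For example, 50 affected sessions out of 10,000 gives a rate of 0.5%, which is above the threshold, while 40 out of 10,000 gives 0.4%, which is below it.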
These issues make it difficult to see clearly what the main problems are and how to fix them. So how can we improve it?
Downloading data from Google Play
To work around the grouping problem, you can try downloading ANR reports from Google Play for further analysis. Previously, there was a way to download raw detailed reports using Google Cloud Storage, but a few years ago Google removed this option.
While it is still possible to view individual reports in the console, how can you export thousands of reports without doing tons of repetitive work?
There are many ways of scraping information from websites. The cleanest way is to download it using APIs, but Google does not provide any official APIs to get this data. In this case, you can emulate user behaviour by automatically clicking on links and buttons in a browser and saving the displayed text.
A web scraper can be implemented using a commonly used tool called Selenium which provides simple APIs to interact with web pages. It was initially designed to create automated tests for web applications and is available in various languages, including Java/Kotlin.
Using this scraper, we downloaded raw ANR reports for one release and were able to perform analysis not possible with Google Play Console’s built-in tools.
For example, just by searching for “Application.onCreate” in the reports, we found out that approximately 60% of ANR reports occurred while the Application.onCreate method was executing.
Another way to collect more data and perform more advanced analysis is to implement custom ANR reporting. In the past, we had experienced similar problems with crash reporting tools. Therefore, we created our own internal crash analytics tool called “Gelato”.
It provides similar functionality to other crash reporting tools, such as Firebase Crashlytics or App Center, but allows us to fully control the data that we store, change the grouping logic, and perform complex filtering.
We decided to track ANRs in Gelato as well, in the hope that it would provide additional insight. To do this, we needed to know when the application gets an ANR.
Whilst there is a new API in Android 11 that provides information on the latest process exit reasons, the majority of our users are still using lower Android versions, so we had to find something else.
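The kind of search described above can be sketched in a few lines once the reports are available as plain text. This is a minimal illustration rather than the script we actually used; the class name, the method name, and the assumption that each report is held in memory as a single string are all hypothetical.

```java
import java.util.List;

final class ReportSearch {
    /** Percentage of reports whose text contains the given substring. */
    static double percentContaining(List<String> reports, String needle) {
        if (reports.isEmpty()) {
            return 0.0; // nothing to analyse
        }
        long hits = reports.stream()
                .filter(report -> report.contains(needle))
                .count();
        return 100.0 * hits / reports.size();
    }
}
```

Calling percentContaining(reports, "Application.onCreate") on the downloaded reports would give the share of reports that mention that method anywhere in their stack traces. A plain substring match is crude, but it is often enough for a first pass.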
We found a simple but effective idea that is commonly used to track main thread freezes. We could run a “watchdog” thread that periodically tries to run a task on the main thread. If the task fails to complete within the specified timeout, then we could simply dump the current state of all threads and report it to our crash reporting tool.
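The watchdog idea can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the library mentioned below and not our production code: the class and method names are hypothetical, and the main thread is represented by a generic Executor so that the sketch runs outside Android. In an Android app, that executor would presumably be backed by a Handler attached to the main Looper.

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class MainThreadWatchdog {
    /**
     * Posts a no-op "ping" task to the main executor and waits for it to run.
     * Returns null if the ping ran within the timeout (or the wait was interrupted),
     * otherwise returns a dump of all thread stack traces.
     */
    static String probe(Executor mainExecutor, long timeoutMs) {
        CountDownLatch ping = new CountDownLatch(1);
        mainExecutor.execute(ping::countDown);
        try {
            if (ping.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                return null; // main thread is responsive
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
        StringBuilder dump = new StringBuilder();
        dump.append("Main thread unresponsive for ").append(timeoutMs).append(" ms\n");
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            dump.append('"').append(entry.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : entry.getValue()) {
                dump.append("    at ").append(frame).append('\n');
            }
        }
        return dump.toString();
    }

    /** Starts a daemon thread that probes periodically and passes any dump to the reporter. */
    static Thread start(Executor mainExecutor, long timeoutMs, long intervalMs,
                        Consumer<String> reporter) {
        Thread watchdog = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                String dump = probe(mainExecutor, timeoutMs);
                if (dump != null) {
                    reporter.accept(dump);
                }
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return; // stop when interrupted
                }
            }
        }, "anr-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        return watchdog;
    }
}
```

Note that this sketch reports again on every probe for as long as the main thread stays blocked; a real implementation would likely deduplicate reports for the same freeze.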
There is a great library that implements this logic and we used it to add ANR reporting to our internal crash reporting tool. It allowed us to perform more in-depth analysis and facilitated better integration with our infrastructure. For example, now we are able to compare the main thread freezes between different A/B test variants.
Here is an example of a report in our system:
One useful tip is to collect and send recent analytics events with the report. Sometimes it can give literal steps to reproduce the problem.
If you don’t plan to introduce an in-house crash reporting solution, you can try third-party tools. For example, you can track ANRs as custom crashes in App Center or Firebase Crashlytics.
It is important to know that these reports cannot be considered 1:1 alternatives to Google Play ANR reports, because Android has slightly different rules for detecting ANRs, as we mentioned before. They can still give an overall picture of the main problems: if there are many reports of main thread freezes in some part of your app, it is highly likely that this part also produces ANRs.
So, we’ve discussed what an ANR is and how we can track it. In Part 2 of this blog post series, I will share our approaches to reducing the ANR rate and provide some of our own results.