How we achieved a 6x reduction of ANRs - Part 2: Fixing ANRs

Nickolay Chameyev
Bumble Tech
13 min read · Dec 11, 2020


In the first part, we discussed what an ANR is and how to track it. In this article, you will find out what problems we found in our application, how we fixed them, and the results we achieved.

Application Startup

The first thing to do if you want to reduce the ANR rate is to find the reasons for the errors. The most straightforward way to do this is to analyse the top ANR groups in Google Play. When we checked the console, it looked like this:

Overall, almost every ANR group had the title “Broadcast of Intent { act=com.google.android.c2dm.intent.RECEIVE } …”. Checking the actual stack traces that we collected using our scraper, we noticed that about 60% of reports were in the Application.onCreate method at the moment the ANR was triggered.

The Application.onCreate method is part of the app startup critical path and is called whenever the application process is initiated. This gave us the idea to investigate how application startup time may affect ANRs.

The easiest way to check it is simply to add an artificial delay to Application.onCreate and check different scenarios. Here is something interesting we discovered:

  • When a user manually triggers an app launch from the launcher, no ANR is reported for blocking the main thread in Application.onCreate, even if it is blocked for several minutes
  • When an app is launched by a broadcast receiver, an ANR is reported if the main thread is blocked for longer than about 10 seconds. The timing is not very strict and has some wiggle room, but it is much stricter than in the previous case

There is an important note for the second case: by default, if your application is in the background state Android won’t show any dialogues and this ANR will be reported silently. There is a way to enable displaying such dialogues using the “Enable background ANR dialogues” option in the developer menu. Practically speaking, it means that your users will most likely not notice anything and it will not significantly affect your application.
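For reference, reproducing the experiment requires nothing more than a blocking call at the top of onCreate. A minimal sketch (the class name is just an example):

```kotlin
import android.app.Application

class MyApplication : Application() {

    override fun onCreate() {
        super.onCreate()
        // Artificial delay used only for the experiment: block the main thread
        // long enough to exceed the broadcast handling limit, then compare what
        // happens for launcher-triggered vs broadcast-triggered launches.
        Thread.sleep(20_000)
    }
}
```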

Given this information, we concluded that the primary cause of our ANRs was most likely that we were doing too much work in the Application.onCreate method, and that when the app was started by a Firebase Cloud Messaging broadcast we sometimes exceeded the 10-second limit, causing a background ANR.

We checked our internal analytics, where we record the time elapsed from the creation of the Application class to the end of the Application.onCreate method: we call it ‘cold app startup time’. This is the largest part of the startup process for our application and includes the initialisation of all content providers as well as the onCreate method itself.

We also split this data into two groups: background cold app startup and foreground cold app startup. We used a naive approach to background launch detection: whether the screen is on or off. It may not be precise, but it gives us an overall picture. Here is what we got:

Background cold app startup time
Foreground cold app startup time

According to our analytics, a foreground launch took about 2.2 seconds on average, and a background launch about 5 seconds. About 3% of background launches took longer than 10 seconds, which suggested that these launches could be the cause of our ANRs. To confirm or refute our theory, we decided to try to reduce our application startup time.

Reducing Application Startup Time

The easiest way to identify the potential areas to improve on the critical path of app startup is to capture a CPU trace dump. There are great tools available in Android Studio that can help you do this.

There are buttons in the Android Studio profiler to run an app with profiling, or to start and stop it manually, but there are also special system methods in the Debug class that can start and stop profiling from code. These methods can help you capture consistent CPU dumps reliably:

  • startMethodTracingSampling
  • startMethodTracing
  • stopMethodTracing

Application startup can be captured by starting tracing in the Application class’s static initialiser block and stopping it at the end of Application.onCreate or in onResume of the first activity. It may look like this:
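Here is a minimal sketch of that approach (the buffer size and sampling interval are arbitrary illustration values, not the ones we used):

```kotlin
import android.app.Application
import android.os.Debug

class MyApplication : Application() {

    companion object {
        init {
            // The static initialiser runs before content providers are created,
            // so tracing covers the whole startup critical path.
            // Arguments: trace name, buffer size in bytes, sampling interval in microseconds.
            Debug.startMethodTracingSampling("app-startup", 8 * 1024 * 1024, 100)
        }
    }

    override fun onCreate() {
        super.onCreate()
        // ...all the usual application initialisation...

        // Stop tracing once the startup-critical work is done; the resulting
        // .trace file can then be pulled from the device and opened in Android Studio.
        Debug.stopMethodTracing()
    }
}
```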

After downloading the trace file and opening it in Android Studio, you’ll see something like this:

It is important to bear in mind the following when analysing the results:

  • The difference between sample-based and method-based tracing. Method-based tracing is more prone to skewing the captured execution time of specific methods, while the sample-based approach is more precise but may miss some method calls. You cannot rely on absolute time values from either method, but sample-based tracing gives more usable data for a relative comparison
  • The difference between a production and a debug app. If you want to capture a CPU trace in an environment as close as possible to production, you need to disable all debug tools such as LeakCanary, or compile the app using the release build type.

After checking the CPU trace, we found several places that took a relatively long time to complete. We didn’t want to rewrite the whole application to optimise the startup time (although this may sometimes be necessary 🙂), so we tried to get the most results with the least amount of effort. We used different techniques, and here are a few of them that you can apply to your application:

Correctly scope your components

One of the easiest ways to optimise your application startup time is to initialise all your components lazily. It might sometimes be tempting to create some data structures or classes in the application scope so they are available for the whole application lifetime, but this is best avoided if possible. Additionally, correct scoping may improve not only the startup time but also memory usage, because you won’t be holding references to these objects for the whole application lifetime.
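As a toy illustration (the component names here are made up, not taken from our codebase), compare an eagerly created application-scoped object with one created lazily in the scope that actually needs it:

```kotlin
import android.app.Application
import androidx.lifecycle.ViewModel

// A hypothetical heavy component that only one screen actually needs.
class ImageCache(val maxSizeBytes: Int)

// Avoid: created eagerly on startup and kept for the whole application lifetime,
// even if the screen that needs it is never opened.
class EagerApplication : Application() {
    lateinit var imageCache: ImageCache

    override fun onCreate() {
        super.onCreate()
        imageCache = ImageCache(maxSizeBytes = 32 * 1024 * 1024)
    }
}

// Better: scoped to the screen that uses it and created on first access.
class GalleryViewModel : ViewModel() {
    val imageCache by lazy { ImageCache(maxSizeBytes = 32 * 1024 * 1024) }
}
```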

Background and delayed initialisation

If you still need to initialise something when your application starts, ask yourself whether it can be postponed for at least a few seconds. If the answer is yes, it makes sense to try running this code on a background thread. If that is not possible, even just delaying the initialisation by a few seconds may help, because by distributing the tasks over time you reduce the chance of blocking the main thread.
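A minimal sketch of both options, assuming the work neither touches the UI nor has to finish before the first screen is drawn:

```kotlin
import android.os.Handler
import android.os.Looper
import java.util.concurrent.Executors

private val backgroundExecutor = Executors.newSingleThreadExecutor()
private val mainHandler = Handler(Looper.getMainLooper())

// Option 1: run the initialisation off the main thread entirely.
fun initialiseInBackground(initBlock: () -> Unit) {
    backgroundExecutor.execute { initBlock() }
}

// Option 2: if it must run on the main thread, push it a few seconds
// past the startup critical path so the work is spread over time.
fun initialiseDelayed(initBlock: () -> Unit, delayMillis: Long = 5_000L) {
    mainHandler.postDelayed({ initBlock() }, delayMillis)
}
```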

Third-party content providers

Some third-party libraries invoke their initialisation code from content providers. It is always important to check what ends up in the merged AndroidManifest.xml because, by default, the Android build toolchain merges the content providers of all third-party libraries, and this may affect application startup time.

If you don’t need the automatic initialisation of a certain library’s content provider, you can disable it by adding an entry like the following to your manifest:
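A sketch of such an entry (the provider class name below is a placeholder for whichever library provider shows up in your merged manifest):

```xml
<!-- AndroidManifest.xml: the tools namespace must be declared on the root element. -->
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    package="com.example.app">

    <application>
        <!-- Tell the manifest merger to drop this provider from the final manifest,
             so its initialisation code is not run automatically on startup. -->
        <provider
            android:name="com.example.somelibrary.SomeLibraryInitProvider"
            android:authorities="${applicationId}.somelibraryinitprovider"
            tools:node="remove" />
    </application>
</manifest>
```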

This can be useful if you want to control the initialisation process manually: for example, you can initialise a library by invoking the same code that was in the content provider, but only when you actually want to use the provided functionality. Be careful with this approach and perform thorough testing afterwards, as the library may have been designed for application-wide initialisation.

When we released an update with these optimisations, we reduced the 95th percentile of cold application startup time by about 50% (from ~10 seconds to ~5 seconds):

And what about the ANR count? Here is what we got from the Google Play console:

Absolute ANRs count graph in the Google Play console
Source: picture from The Office series

OK, so we had found out that app startup directly affects the ANR rate, and we had reduced the ANR count by a third. But it was still above the threshold, so we had to continue our search.

SharedPreferences and apply() method

Another thing we noticed while analysing the ANR stack traces in Google Play and examining our internal reports was these strange groups:

These showed that in many cases the main thread was blocked on something related to shared preferences. Generally, it is not recommended to do disk writes on the main thread because disk IO can have unexpected delays and cause UI freezes.

We checked our codebase and found no usages of the commit method (which performs a blocking write of shared preferences to disk). All our modifications were done using the apply method. So how is it possible that we were experiencing such issues while using a non-blocking API?

Going deeper into the Android source code helps us see what is happening. The default implementation of shared preferences is located in SharedPreferencesImpl.java. While shared preferences are edited through the Editor interface, all modifications are saved in a temporary hashmap, which is then applied to the main in-memory cache when commit or apply is invoked. When the map is applied to the in-memory cache, the implementation also calculates which keys have changed, so that it knows later what to write to disk and which shared preferences listeners to notify. This information is stored in MemoryCommitResult.

If we check the body of the apply() method, we can see that it schedules a background disk write using enqueueDiskWrite(): nothing wrong with that. Here is the simplified implementation of the apply method:
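Paraphrased from SharedPreferencesImpl.java (linked in Useful Links); treat this as a sketch of the upstream code rather than an exact copy:

```java
public void apply() {
    final MemoryCommitResult mcr = commitToMemory();

    // A runnable that blocks until the disk write for this commit has finished.
    final Runnable awaitCommit = new Runnable() {
        @Override
        public void run() {
            try {
                mcr.writtenToDiskLatch.await();
            } catch (InterruptedException ignored) {
            }
        }
    };

    // Registered as a "finisher": QueuedWork may run it on the main thread later.
    QueuedWork.addFinisher(awaitCommit);

    Runnable postWriteRunnable = new Runnable() {
        @Override
        public void run() {
            awaitCommit.run();
            QueuedWork.removeFinisher(awaitCommit);
        }
    };

    // The actual disk write is scheduled on a background thread.
    SharedPreferencesImpl.this.enqueueDiskWrite(mcr, postWriteRunnable);

    // Listeners are notified immediately, since the in-memory cache already
    // reflects the changes.
    notifyListeners(mcr);
}
```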

Taking a closer look, we see that it first creates a Runnable that waits for the write to complete in a synchronous fashion. This runnable is added to QueuedWork. When we check the JavaDoc for that class, we see the following:

Internal utility class to keep track of process-global work that’s outstanding and hasn’t been finished yet.

This was created for writing SharedPreference edits out asynchronously so we’d have a mechanism to wait for the writes in Activity.onPause and similar places, but we may use this mechanism for other things in the future.

This class holds all pending asynchronous operations in a list with the ability to execute them in a synchronous way on several events such as Activity.onStop, Service.onStartCommand and Service.onDestroy. This was most likely done to reduce the probability of losing data on unexpected process shutdowns.

So, simply using standard Android components may sometimes lead to all pending shared preferences disk writes being run on the main thread. If any of these lifecycle events happens right after the apply method is called, then apply effectively becomes a synchronous commit.

We suspect that this was done to reduce potential data loss when the application gets killed by the system. But how likely is such a situation? We usually use shared preferences when the application is in the foreground, and it is highly unlikely that Android will kill a foreground application. Moreover, we frequently use shared preferences to cache values that can easily be restored from the server side, so losing them is no big deal.

To check how shared preferences affect the ANR rate, and whether this synchronous apply logic is useful, we decided to implement an A/B test that disables it. To do this, we replaced all creation of shared preferences with our own factory function:
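A minimal sketch of such a factory (the names are ours, for illustration):

```kotlin
import android.content.Context
import android.content.SharedPreferences

// The single place in the codebase that is allowed to create SharedPreferences,
// so the underlying implementation can be swapped later.
object SharedPreferencesFactory {

    fun create(context: Context, name: String): SharedPreferences =
        context.getSharedPreferences(name, Context.MODE_PRIVATE)
}
```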

Now we have control over the implementation of shared preferences in our application, and we can create an alternative one. We created a simple class that delegates everything to the original implementation except the apply method: there we call the commit method instead and schedule it to be executed on a background thread. The implementation is very similar to binary prefs, a library that provides an alternative to shared preferences, except that we don’t change the serialisation/deserialisation mechanism, to make backward migration easier in case of problems.
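A sketch of the idea using Kotlin interface delegation (the class name and executor wiring are illustrative, not our production code):

```kotlin
import android.content.SharedPreferences
import java.util.concurrent.Executor

// Delegates everything to the platform implementation except apply(),
// which becomes a commit() executed on a background executor.
class NonBlockingSharedPreferences(
    private val delegate: SharedPreferences,
    private val writeExecutor: Executor
) : SharedPreferences by delegate {

    override fun edit(): SharedPreferences.Editor = NonBlockingEditor(delegate.edit())

    private inner class NonBlockingEditor(
        private val editor: SharedPreferences.Editor
    ) : SharedPreferences.Editor by editor {

        override fun apply() {
            // commit() still performs the disk write, but on our own thread,
            // so QueuedWork has nothing to wait for on the main thread.
            writeExecutor.execute { editor.commit() }
        }
    }
}
```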

So, having introduced a new implementation, we can use it under an A/B test:
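For example, the factory above could branch on the experiment flag (isNonBlockingApplyEnabled is a hypothetical A/B test check, not our real flag name):

```kotlin
import android.content.Context
import android.content.SharedPreferences
import java.util.concurrent.Executors

object SharedPreferencesFactory {

    private val writeExecutor = Executors.newSingleThreadExecutor()

    // Hypothetical flag resolved from the A/B testing framework.
    var isNonBlockingApplyEnabled: Boolean = false

    fun create(context: Context, name: String): SharedPreferences {
        val original = context.getSharedPreferences(name, Context.MODE_PRIVATE)
        return if (isNonBlockingApplyEnabled) {
            NonBlockingSharedPreferences(original, writeExecutor)
        } else {
            original
        }
    }
}
```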

Next, we started slowly rolling out the A/B test, keeping an eye on the metrics. We have very good coverage of product-related metrics, including ones that would be affected if there were problems with the new shared preferences implementation.

As a result, we found no issues with the new implementation, and we achieved about a 4% reduction in total ANRs compared to the control group. That’s not bad, but we were still above the required threshold.

Push notifications handling

Still not being where we wanted to be, we had to figure out a way to reduce the ANR rate even more.

We had already found that there is an almost linear correlation between app startup time and ANR rate, most likely because the majority of ANRs happen while broadcast receivers are handled during the Application.onCreate stage. Unfortunately, we had exhausted all the easiest ways of reducing the startup time, and everything that remained would require major refactoring and a lot of work.

Another thing we had noticed in our analytics was that most of our process launches were coming from handling push notification broadcasts. This gave us an idea: maybe we could perform all notification handling in a separate process that doesn’t need all the data structures and services to be initialised on application startup?

Just for some context: by default on Android, your application runs within a single process. When you tap the app icon, Android creates a process for your application. When your application receives a push notification broadcast, it starts a new process if one has not been started yet. Each time Android starts an application’s process, it invokes the Application.onCreate method:

But there are ways to run some parts of the application in separate processes. In that case, the Application class is instantiated independently for each process, and there is no shared memory between the processes. This can help in our case because we can move all push-related operations to a separate process and remove almost everything from the Application.onCreate method in the push process. This way, we both significantly reduce the push broadcast handling time and lower the probability of getting an ANR:

You can control which process your component is launched in using AndroidManifest.xml. For example, to run a broadcast receiver in a non-default process, we need to set a name for the process using the “android:process” attribute on the “receiver” tag. But how can we do this for external libraries such as Firebase Cloud Messaging?

There are special tags that control the manifest merging process. We can patch the original FCM broadcast receiver declaration using the tools:node="replace" attribute. Apart from the FCM receiver, there is also a FirebaseMessagingService responsible for handling push notifications, and we want to run it in a separate process too. Overall, we need to add the following manifest entries:
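A sketch of those entries; the receiver class name, permission and action are the ones the firebase-messaging library declared at the time of writing, and the service name is a placeholder, so check everything against the merged manifest of your exact library version before copying:

```xml
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    package="com.example.app">

    <application>
        <!-- Replace the library's receiver declaration so the c2dm broadcast
             is delivered to a separate ":push" process. -->
        <receiver
            android:name="com.google.firebase.iid.FirebaseInstanceIdReceiver"
            android:exported="true"
            android:permission="com.google.android.c2dm.permission.SEND"
            android:process=":push"
            tools:node="replace">
            <intent-filter>
                <action android:name="com.google.android.c2dm.intent.RECEIVE" />
            </intent-filter>
        </receiver>

        <!-- Our own FirebaseMessagingService subclass, pinned to the same push process. -->
        <service
            android:name=".PushMessagingService"
            android:exported="false"
            android:process=":push">
            <intent-filter>
                <action android:name="com.google.firebase.MESSAGING_EVENT" />
            </intent-filter>
        </service>
    </application>
</manifest>
```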

Now, whenever Google Play Services sends a broadcast with a new cloud message, we can skip the usual initialisation in Application.onCreate by checking whether we are in the main process or not:
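A minimal sketch of that check; the helper below is one possible way to resolve the current process name, and the implementation linked underneath covers more edge cases:

```kotlin
import android.app.ActivityManager
import android.app.Application
import android.content.Context
import android.os.Build
import android.os.Process

class MyApplication : Application() {

    override fun onCreate() {
        super.onCreate()
        if (isMainProcess()) {
            // Full initialisation: DI graph, analytics, feature components, etc.
        } else {
            // Push process: only the bare minimum needed to handle notifications.
        }
    }

    private fun isMainProcess(): Boolean = currentProcessName() == packageName

    private fun currentProcessName(): String? {
        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
            return Application.getProcessName()
        }
        // On older API levels, look up our own process entry by pid.
        val pid = Process.myPid()
        val activityManager = getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
        return activityManager.runningAppProcesses?.firstOrNull { it.pid == pid }?.processName
    }
}
```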

There are many ways of checking this. You can check one of the implementations here.

Most likely, if you don’t do anything special with pushes, this will work out of the box: the code responsible for displaying notifications will be executed in the separate process, and if you pass a PendingIntent for an activity, it will still be launched in the default process.

After implementing this, we immediately noticed that notifications were now displayed significantly more quickly than before:

Notification handling comparison

This gave us hope that it should improve our situation with the ANR rate as well. We released a new update with these changes and carefully rolled it out while keeping an eye on the main metrics.

After about a week most of our users had updated to the new version and here is what we then had:

  • Badoo’s ANR rate dropped from 0.80% to 0.41% and eventually fell below that of our peers, to 0.28%
  • The absolute ANR count per day dropped by more than half

Although these changes had a significant impact on the ANR rate, there are a few things you need to consider before implementing them in your app:

  • Modifying the FCM client library’s behaviour by patching the manifest may not be intended by the developers of the library. There are no official guides or documentation on how to do this, and it could break at any point in the future, so it requires thorough testing after each client library update.
  • Updating FCM libraries now requires extra caution because we must ensure that our patched manifest entries stay aligned with the original ones.
  • Launching an extra process adds memory and CPU overhead.

Given this, I would recommend using it only as a last resort or as a temporary fix to buy some time. If you can reduce the application startup’s critical path, it is better to focus on that first, as it will reduce not only the ANR rate but also the cold app startup time, both of which clearly benefit users.

Overall Results

After all these changes, we managed to reduce our ANR rate and absolute error count sixfold:

I hope you find our experience useful in helping you reduce your ANR rate and improve your application’s quality.

Have you encountered any interesting ANR bugs? Do share your experience in the comments below :)

Useful Links

  1. https://developer.android.com/reference/android/os/Debug
  2. https://eng.snap.com/dont-rewrite-your-app-unless-you-have-to/
  3. https://programmer.ink/think/count-the-slots-in-shared-preferences.html
  4. https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/java/android/app/SharedPreferencesImpl.java
  5. https://github.com/int02h/primaree
