How we reduced our ANR by three times

Anjal Saneen
Published in OkCredit
May 24, 2022

An “Application Not Responding” (ANR) error is one of the most frustrating experiences for users: the app hangs and the user has to terminate it. ANRs in a production Android app are notoriously hard to observe and fix because of their non-deterministic nature. After studying ANR stack traces, we reduced our ANR rate by 60% and improved cold startup performance by 70%. Our work was featured at Google I/O, and the Android Developers blog covered our ANR case study. This post discusses what we learned about ANRs: the challenges in solving them, how we track them, an analysis of the source code that generates them, and some interesting ANRs we found and how we resolved them.

What is ANR

The Android system expects an app to process certain events within a fixed time window. When processing exceeds that window, the system shows an ANR dialog that lets the user either wait or terminate the application immediately.

Android detects ANRs with a set of system-level mechanisms. ANRs fall into several categories, and each category has its own timeout, which also depends on whether the app is in the foreground or the background.

  1. InputDispatching Timeout (5s): A key press or touch event is not handled within the timeout. On Android, this timeout is 5 seconds by default.
  2. Broadcast Timeout (10s): A broadcast receiver fails to finish processing within the timeout (10 seconds in the foreground, 60 seconds in the background).
  3. Service Timeout (20s): A service fails to finish processing within the timeout (20 seconds in the foreground, 200 seconds in the background).

ANRs directly affect the user experience because the system dialog forces the user to stop using the application. When the app is in the background, the dialog is not shown; instead, the app is silently killed. The Play Store measures ANRs at the daily session level, covering both foreground and background. Crossing the Play Store’s bad behavior threshold (0.47%) hurts the app’s Play Store listing.

Ways to collect ANRs from Production App

1. Play Console Android vitals

Besides showing the daily percentage of ANR-affected sessions, Android vitals shows the method traces of all threads at the point of the ANR. Although the Play Console offers some observability into ANRs, it has limitations: ANR data could not be downloaded (Google recently added this capability to its Reporting API), traces are underreported and delayed, some ANRs have no traces, the grouping of traces is unsuitable, and the trend of a specific ANR cannot be tracked.

2. Firebase and AppExitInfo

Firebase reports ANR traces only for devices running Android 11 or later. It collects them from ApplicationExitInfo.getTraceInputStream(), which is available from API level 30. On Firebase you can also add breadcrumbs to get more insight into an issue. Additionally, you can send the trace files to your own server for deeper analysis using the code below.

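import android.app.ActivityManager
import android.app.ApplicationExitInfo
import android.content.Context
import android.os.Build
import java.io.File

private const val FILE_NAME = "anr_trace.txt" // illustrative value

fun saveLastAnrTrace(context: Context) {
    // ApplicationExitInfo is only available on Android 11 (API 30) and above.
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.R) return
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    // Ask only for the most recent exit record of this package.
    val lastExitInformation: ApplicationExitInfo =
        am.getHistoricalProcessExitReasons(context.packageName, 0, 1).firstOrNull() ?: return

    if (lastExitInformation.reason != ApplicationExitInfo.REASON_ANR) return
    val outputFile = File(context.filesDir, FILE_NAME)
    // traceInputStream is null when the system recorded no trace for this exit.
    lastExitInformation.traceInputStream?.use { input ->
        outputFile.outputStream().use { output -> input.copyTo(output) }
    }
    // Send outputFile to the server.
}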

3. ANR-WatchDog

ANR-WatchDog is a simple library that posts a message to the main thread every five seconds and watches for the UI thread being blocked. It collects traces and reports an ANR whenever the main thread is busy for more than five seconds (the interval is configurable). The Play Store does not count such a block as an ANR, but it is a potential source of one. By fixing all slow methods on the main thread, you can fix ANRs and improve the app’s rendering performance. In our experience, the traces captured by the watchdog have proven more valuable than those found in /data/anr/traces.txt. We have seen cases where the system trace only shows the native frame android.os.MessageQueue.nativePollOnce, while ANR-WatchDog captures meaningful traces that point to slow Java methods. By helping us remove slow methods from the app’s main thread, this library improves ANRs, frozen frames, and app startup.
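The watchdog idea can be sketched off-device in plain Java. This is not the library’s code: a single-thread executor stands in for the Android main thread, and the class and method names are hypothetical. The watchdog posts a no-op “tick” and treats the thread as blocked if the tick has not run within the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch of a watchdog check; the executor is a stand-in for the main thread. */
final class WatchdogSketch {

    /**
     * Posts a no-op tick to the watched thread and waits up to timeoutMs for it to run.
     * Returns true when the tick did not run in time, i.e. the thread looks blocked.
     */
    static boolean isBlocked(ExecutorService watched, long timeoutMs) {
        CountDownLatch tick = new CountDownLatch(1);
        watched.execute(tick::countDown);
        try {
            return !tick.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Watchdog interrupted", e);
        }
    }

    public static void main(String[] args) {
        ExecutorService fakeMain = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        try {
            System.out.println("idle, blocked? " + isBlocked(fakeMain, 1000));
            // Simulate a slow method occupying the watched thread.
            fakeMain.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            System.out.println("busy, blocked? " + isBlocked(fakeMain, 200));
        } finally {
            release.countDown();
            fakeMain.shutdownNow();
        }
    }
}
```

On a device, the same check would post to the main thread’s message queue and, on a miss, record the main thread’s stack trace, which is what makes the resulting report point at the slow method.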

Source code analysis of ANR

The following diagram shows how the relevant method calls are structured in the Android source code. In this section we look at how Android handles ANRs inside the system. The analysis is based on the android-12.0.0_r4 branch (the latest release of Android 12 at the time of writing).

We traced the ANR generation mechanism backwards from ProcessErrorStateRecord.appNotResponding(). As the image above shows, it is called from four places.

ANRs can be triggered by timeouts from four components: input dispatching, services, broadcast receivers, and content providers. To keep things simple, we reverse engineer only InputDispatchingTimeout in this article. You can skip the source code analysis if you prefer. However, understanding what happens at the system level helps you build better mental models of ANRs and solve some of the more difficult ones.

Let’s look at InputDispatchingTimeout. The InputManagerService class wraps the C++ InputManager and provides its callbacks. The native layer calls these callbacks when a window stops responding, which triggers notifyAppUnresponsive(), notifyWindowUnresponsive(), or notifyGestureMonitorUnresponsive() in AnrController. AnrController dumps the WMS state when an app becomes unresponsive and triggers ActivityManagerService.inputDispatchingTimedOut(). Services and broadcast receivers have similar timeout methods: the service timeout is handled in ActiveServices.serviceTimeout(), and the broadcast receiver timeout in BroadcastQueue.broadcastTimeoutLocked().

Each of these timeout methods builds an ANR message that describes the trigger and calls AnrHelper.appNotResponding() with information about the process. For an input dispatching timeout, the message is “Input dispatching timed out $reason” (for example, “Input dispatching timed out (d6e077e MainActivity (server) is not responding)”). For broadcast receivers, it is “Broadcast of $intent”, and for services, it is “executing service $serviceName”. The same message appears as the ANR title in Play vitals.

Inside AnrHelper.appNotResponding(), a new thread is spawned that calls ProcessErrorStateRecord.appNotResponding(), which gathers information about the ANR and stores it in /data/anr/traces.txt and /data/system/dropbox. Let’s take a closer look at what happens in this method.
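Because the title prefix identifies the component that timed out, exported ANR titles can be grouped with a few string checks. The following is a minimal sketch; the class and enum names are hypothetical, and it assumes titles start with exactly the prefixes quoted above.

```java
/** Groups ANR titles by the component that timed out, based on the title prefix. */
final class AnrTitleClassifier {

    enum Kind { INPUT_DISPATCH, BROADCAST, SERVICE, UNKNOWN }

    static Kind classify(String title) {
        if (title == null) {
            return Kind.UNKNOWN;
        }
        String t = title.trim();
        if (t.startsWith("Input dispatching timed out")) {
            return Kind.INPUT_DISPATCH;
        }
        if (t.startsWith("Broadcast of")) {
            return Kind.BROADCAST;
        }
        if (t.startsWith("executing service")) {
            return Kind.SERVICE;
        }
        // Anything else, e.g. a title format this sketch does not know about.
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("Broadcast of Intent { act=android.intent.action.TIME_SET flg=0x25200010"));
        System.out.println(classify("executing service your.package.name/.SomeService"));
    }
}
```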

Log ANR event

The ANR is logged to the system diagnostic event log.

EventLog.writeEvent(EventLogTags.AM_ANR, mApp.userId, pid, mApp.processName,
mApp.info.flags, annotation);

Get CPU And Memory Pressure

PSI (Pressure Stall Information) is an observability feature of Linux-based operating systems. It is exposed through files under /proc/pressure/, such as cpu and memory, but app developers do not have access to them. Even if your app is highly performant, an ANR can happen because other apps are consuming a lot of resources, so this information is critical to understanding some ANRs. Currently, it is not stored in /data/anr/traces.txt, but it is available through system events.

report.append(MemoryPressureUtil.currentPsiState());
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
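For reference, each line in a Linux PSI file has the form `some avg10=0.00 avg60=0.00 avg300=0.00 total=0` (or starts with `full`). The sketch below parses one such line; the class name and the sample values are made up, and, as noted above, a regular app generally cannot read these files on a device.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** One parsed line of a Linux PSI file such as /proc/pressure/memory. */
final class PsiLine {
    final String kind;                 // "some" or "full"
    final Map<String, Double> fields;  // avg10, avg60, avg300 (percent) and total (microseconds)

    private PsiLine(String kind, Map<String, Double> fields) {
        this.kind = kind;
        this.fields = fields;
    }

    static PsiLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a PSI line: " + line);
        }
        Map<String, Double> fields = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed PSI field: " + parts[i]);
            }
            fields.put(parts[i].substring(0, eq), Double.parseDouble(parts[i].substring(eq + 1)));
        }
        return new PsiLine(parts[0], fields);
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration.
        PsiLine p = parse("some avg10=1.50 avg60=0.75 avg300=0.10 total=123456");
        System.out.println(p.kind + " avg10=" + p.fields.get("avg10"));
    }
}
```

A high `avg10` under memory or cpu at the time of an ANR would suggest the device, rather than the app’s own code, was the bottleneck.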

Dump StackTraces of all threads

The stack traces are dumped into the /data/anr/traces.txt file.

File tracesFile = ActivityManagerService.dumpStackTraces(firstPids,
isSilentAnr ? null : processCpuTracker, isSilentAnr ? null : lastPids,
nativePids, tracesFileException, offsets, annotation);

Add Error to DropBox

Android’s DropBoxManager can read these logs. However, reading log data requires the READ_LOGS permission, which can be granted only to apps signed as part of the firmware build or installed on the privileged partition.

mService.addErrorToDropBox("anr", mApp, mApp.processName, activityShortComponentName,
parentShortComponentName, parentPr, null, report.toString(), tracesFile,
null, new Float(loadingProgress), incrementalMetrics, errorId);

Show ANR Dialog

We observed a delay of 2 to 3 seconds between the ANR timeout and the appearance of this dialog, caused by the time it takes to capture stack traces.

msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
msg.obj = new AppNotRespondingDialog.Data(mApp, aInfo, aboveSystem);

mService.mUiHandler.sendMessageDelayed(msg, anrDialogDelayMs);

When is the input dispatch timeout triggered?

When a window stops responding, the native layer triggers notifyAppUnresponsive(), notifyWindowUnresponsive(), and notifyGestureMonitorUnresponsive(). In this section, we examine how and when the native layer triggers these methods.

Callstack of notifyAppUnresponsive(), notifyWindowUnresponsive(), notifyGestureMonitorUnresponsive()

The call stack shows that all three types of ANR eventually trace back to the InputDispatcher::dispatchOnce function. It is the body of the InputDispatcher thread, which is started by the InputDispatcher::start function.

When input events occur after InputDispatcher has started, mLooper.wake() is called and InputDispatcher::dispatchOnce runs, which in turn calls processAnrsLocked to determine whether an ANR has occurred.
Let’s take a closer look at what happens in dispatchOnce().

The final call in dispatchOnce(), mLooper.pollOnce(timeoutMillis), blocks the thread for up to timeoutMillis, after which InputDispatcher::dispatchOnce is called again. mLooper blocks using the Linux epoll mechanism; the wait is an epoll_wait call with timeoutMillis as its timeout. The value of timeoutMillis is 5000, so it is typically a call to mLooper.wake() that leads to an ANR being detected. mLooper.wake() is called from many places; we counted more than 20 references. Therefore, no matter how long the UI is blocked, an ANR is not generated until an event triggers dispatch through mLooper.wake(). You can verify this by creating a click listener that sleeps indefinitely: no ANR is generated until the next input event, such as another click or a scroll.

Interesting case of ContentProvider ANR

We looked at all references to appNotRespondingViaProvider() inside ContentProvider. From our analysis, none of them can be reached by a regular app, so our hypothesis is that an app’s ContentProvider will not trigger an ANR. The two meaningful references to appNotRespondingViaProvider() are:

1. getProviderMimeType(): this method is called only once, by ActivityManagerShellCommand, which is the class used by adb shell.

2. NotRespondingRunnable: this trigger can only be reached through an API annotated with SystemApi.

In the OkCredit app, we did not find any ANR caused by a content provider, which is consistent with this hypothesis.

The challenges of Fixing ANR

The non-deterministic nature of ANRs makes observability a challenge. Among the problems are:

  1. The Google Play Console does not display a stack trace for every ANR.
  2. Because of the delay between the actual ANR and the trace dump, the stack trace may not point to the right place.
  3. It is difficult to follow the trend of an ANR because ANRs are grouped inconsistently. This prevents validating a hypothesis after shipping the predicted fix.
  4. Traces lack foreground/background information. Firebase addresses this with breadcrumbs on Android 11+ devices.
  5. The Play Console lacks information that is relevant for fixing ANRs, such as CPU usage and memory pressure at the point of the ANR. In multiple cases, an ANR occurred when a broadcast receiver or service woke the app up from the background, where high resource contention may cause the ANR.
  6. It was impossible to download ANR data from the Play Console, so a lot of manual work was required. Thanks to the Google Play team, the recently added Reporting API now enables downloading.

In his book System Performance, Brendan Gregg outlines some “anti-methodologies” for solving performance issues. To fix ANRs on Android, developers have been desperately following some of these anti-methodologies, such as trying random things in the hope of catching a win. This approach can be time-consuming and disruptive, and may ultimately overlook certain issues.

Here are some of the interesting ANRs we encountered

1. ANR in broadcast receivers and Services

WorkManager and Firebase services make up most of this category for us. You can identify these ANRs by their group title in Play vitals.

“executing service your.package.name/….systemjob.SystemJobService”

“Broadcast of Intent { act=android.intent.action.TIME_SET flg=0x25200010”

“Broadcast of Intent { act=com.google.android.c2dm.intent.RECEIVE flg=0x1080010 pkg=your.package.name cmp=com.google.firebase.iid.FirebaseInstanceIdReceiver (has extras) }”

“executing service your.package.name/com.google.firebase.auth.api.fallback.service.FirebaseAuthFallbackService”

This does not mean that the bottleneck is in the Firebase service or the WorkManager broadcast receiver. Many issues have been reported against these libraries because of the title alone. The ANR only tells you that a broadcast receiver or service failed to complete its processing within the time limit. Unless these components themselves do long-running work in BroadcastReceiver.onReceive(), Service.onCreate(), or Service.onStartCommand(), the timeout is most likely caused by your own code.

Almost all of these ANRs occurred in the background for us. If your app has been killed, the system has to start the main thread, initialise content providers, and run App.onCreate() before the component can execute. Timeouts can occur here if any of this is slow, and slow operations in App.onCreate() are typically to blame. There is an interesting read from bumble-tech about their ANRs in a notification service: they created a separate process to handle notifications. Launching an extra process consumes additional CPU and memory, so use this only as a last resort after resolving all issues in the cold startup path. Our recommendation is to optimize your cold startup time. We use FCM extensively for notification campaigns and critical background synchronization, and we solved these ANRs by optimizing cold startup. We describe what we learned about cold startup in a separate article.

In production, we measure the time spent by App.onCreate() and content provider initialization during cold startup using the following code.
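A minimal sketch of how such a measurement could be structured, shown as plain Java so it runs off-device. It is an illustration under stated assumptions rather than OkCredit’s production code: the class name, mark names, and call sites are hypothetical. The idea is to record a timestamp at the first content provider’s onCreate() and at the start and end of App.onCreate(), then report the differences.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Collects named timestamps during startup and reports the time between them. */
final class StartupTrace {
    private final Map<String, Long> marksNanos = new LinkedHashMap<>();

    /** Records a named point in time; only the first occurrence of a name is kept. */
    synchronized void mark(String name, long nowNanos) {
        marksNanos.putIfAbsent(name, nowNanos);
    }

    /** Returns the elapsed time between two recorded marks, in milliseconds. */
    synchronized double millisBetween(String from, String to) {
        Long start = marksNanos.get(from);
        Long end = marksNanos.get(to);
        if (start == null || end == null) {
            throw new IllegalStateException("Missing mark: " + (start == null ? from : to));
        }
        return (end - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        StartupTrace trace = new StartupTrace();
        // Hypothetical call sites in an Android app:
        trace.mark("provider_init_start", System.nanoTime()); // first ContentProvider.onCreate()
        trace.mark("app_on_create_start", System.nanoTime()); // top of App.onCreate()
        trace.mark("app_on_create_end", System.nanoTime());   // bottom of App.onCreate()
        System.out.println("providers ms: "
                + trace.millisBetween("provider_init_start", "app_on_create_start"));
        System.out.println("App.onCreate ms: "
                + trace.millisBetween("app_on_create_start", "app_on_create_end"));
    }
}
```

Tagging each report with whether the process was started in the foreground or the background makes it possible to compare the two distributions.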

Observation:

Whenever the app wakes up in the background, it is comparatively slow because it has less memory and CPU to work with. The P99 of the time App.onCreate() takes to execute is 2.3x longer in the background than in the foreground.

When a broadcast receiver or service wakes the app up, the system has to create the Application object, initialize content providers, and execute App.onCreate(). Under CPU starvation this can exceed the ANR timeout and trigger an ANR.

Recommendations:

  • Profile and optimise startup regularly using CPU profiling and Systrace/Perfetto.
  • Benchmark cold startup on CI using Macrobenchmark.
  • Avoid heavy work in BroadcastReceiver.onReceive(), Service.onCreate(), and Service.onStartCommand().
  • Avoid broadcast receivers that fire frequently.

2. ANR Due to slow methods

Slow methods in this category are usually easy to find because the thread trace points straight at them. We use ANR-WatchDog to catch slow methods running on the main thread. Avoid heavy computation and I/O on the main thread; move that work to a background thread.

3. ANR Due to Lock

In certain circumstances, the work that causes an ANR is not directly executed on the application’s main thread. For example, a background thread might hold a lock on a resource that the main thread needs to complete its work. This occurs frequently while using synchronized/mutex blocks.
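The situation can be reproduced off-device with two plain Java threads. In this sketch (all names hypothetical), a background thread holds a monitor while a stand-in for the main thread tries to enter the same synchronized block and ends up in the BLOCKED state.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Shows a stand-in "main" thread blocked on a lock held by a background thread. */
final class LockContentionDemo {

    /** Returns the state of the stand-in main thread while the lock it needs is held elsewhere. */
    static Thread.State mainStateWhileLockHeld() {
        final Object lock = new Object();
        final CountDownLatch lockTaken = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread background = new Thread(() -> {
            synchronized (lock) {        // background work grabs the shared lock
                lockTaken.countDown();
                awaitQuietly(release);   // and keeps it, like slow I/O done under a lock
            }
        }, "background");

        Thread fakeMain = new Thread(() -> {
            synchronized (lock) {
                // The "UI" work that needs the lock would run here.
            }
        }, "fake-main");

        background.start();
        awaitQuietly(lockTaken);
        fakeMain.start();

        // Wait (bounded) until the stand-in main thread is parked on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (fakeMain.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.yield();
        }
        Thread.State observed = fakeMain.getState();

        release.countDown();             // let both threads finish
        joinQuietly(background);
        joinQuietly(fakeMain);
        return observed;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("fake-main while lock is held: " + mainStateWhileLockHeld());
    }
}
```

Keeping the critical section short, or moving the slow work outside the lock, removes the wait on the main thread.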

Each thread entry in the trace contains some hints that can help you detect a deadlock. First of all, you can see the state of a thread: it usually is Runnable, Sleeping, Waiting, Native or Blocked. It tells you what state the thread was in at the moment the ANR took place, and you will be interested in the threads that are marked as Blocked. The threads marked as blocked typically tell you what mutex (lock) they’re trying to acquire as well as the thread ID of the thread holding that lock. If you scroll down to the entry corresponding to that thread in the list, you can see what mutex it is trying to acquire and which thread holds that lock.
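Scanning a trace for blocked threads can be automated. The sketch below assumes thread headers of the form `"main" prio=5 tid=1 Blocked` and lock lines of the form `- waiting to lock <0x...> (a java.lang.Object) held by thread 15`; the exact layout may differ between Android versions, and the sample trace and class name are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds threads in the Blocked state and the thread holding the lock they wait for. */
final class BlockedThreadFinder {
    // Thread header, e.g.: "main" prio=5 tid=1 Blocked
    private static final Pattern HEADER =
            Pattern.compile("^\"([^\"]+)\".*\\btid=(\\d+)\\s+(\\w+)\\s*$");
    // Lock line, e.g.: - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <(\\w+)>.*held by thread (\\d+)");

    static List<String> findBlocked(String trace) {
        List<String> result = new ArrayList<>();
        String name = null;
        String state = null;
        for (String line : trace.split("\\R")) {
            Matcher header = HEADER.matcher(line.trim());
            if (header.matches()) {
                // A new thread entry starts; remember its name and state.
                name = header.group(1);
                state = header.group(3);
                continue;
            }
            Matcher waiting = WAITING.matcher(line);
            if (name != null && "Blocked".equals(state) && waiting.find()) {
                result.add(name + " waits for lock " + waiting.group(1)
                        + " held by tid " + waiting.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "\"main\" prio=5 tid=1 Blocked",
                "  | group=\"main\" sCount=1 dsCount=0 flags=1",
                "  at com.example.Cache.get(Cache.java:42)",
                "  - waiting to lock <0x0a1b2c3d> (a java.lang.Object) held by thread 15",
                "\"worker\" prio=5 tid=15 Runnable",
                "  at com.example.Cache.refresh(Cache.java:77)");
        findBlocked(sample).forEach(System.out::println);
    }
}
```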

4. ANR in shared preferences

Calling apply() on SharedPreferences may cause an ANR: apply() updates the values in memory and writes them to disk asynchronously, and the main thread can end up waiting for those pending writes during callbacks such as onPause() and onReceive(). You can read more about the causes of this ANR in this interesting article. Our fix was to replace every apply() with a commit() on a background thread.

5. WaitingForGcToComplete()

ANR: "main" prio=5 tid=1 WaitingForGcToComplete

WaitingForGcToComplete suggests that the main thread might be blocked by the garbage collector, which indicates a high memory footprint and possibly an approaching OutOfMemoryError. We recommend optimizing memory allocation in the app and fixing memory leaks. We use LeakCanary on debug builds for this. LeakCanary has an option to expose the leaks it finds, and we send all of them to Firebase using this code.

Result

After digging deep into these ANRs and applying all the changes above, we managed to keep the ANR metric in Play vitals at 0.03%. We use multiple broadcast receivers and services, including FCM for notification campaigns and critical sync, and these may run when the device is low on resources. Therefore, we still have difficulty solving all ANRs. We intend to keep working on them until we reach diminishing returns.

We saw how cold startups and ANRs are connected. We were able to keep the cold startup at 0.66%.

Conclusion

Due to their nature and the observability challenges around them, ANRs are very hard to solve. The only way to avoid ANRs in Android apps is to keep the load on the UI thread as low as possible. In addition, we found that when the app wakes up in the background, there is always a chance of an ANR due to CPU starvation on the user’s device. Applications that run background services and background work are always at risk of an ANR.

Keeping track of mobile health is incredibly difficult for Android apps. The system needs to provide APIs for reporting the app behavior of Android apps in production. AOSP teams have built amazing tools for local investigation. However, there are no APIs or libraries for reporting app behavior in production. Linux has many observability tools. But app developers do not have access to them.

We hope that our experience will be valuable in helping you to reduce your ANR rate and improve your application quality. Thanks for reading. A huge thank you to Niharika Arora for helping us understand Android internals and navigate around ANR.

Credits to the entire Android team at OkCredit for helping to achieve these performance improvements. Nishant Shah, Rashanjyot Singh, Mohitesh, Harshitfit, Manas Yadav, Pratham Arora, Saket Dandawate, Shrey Garg.

Useful Links

  1. https://developer.android.com/topic/performance/vitals/anr
  2. https://developer.android.com/training/articles/perf-anr
  3. https://medium.com/bumble-tech/how-we-achieved-a-6x-reduction-of-anrs-part-1-collecting-data-7c473ceb1c83
  4. https://programmer.ink/think/count-the-slots-in-shared-preferences.html
  5. https://developer.android.com/stories/apps/okcredit
