Better Android Testing at Airbnb — Part 6: Consistent Mocking
In the sixth part of our series on Android Testing at Airbnb, we look at common sources of flakiness and how they can be mitigated.
In the previous article we detailed our test framework implementation. This was a high-level look at the approach we take to running our tests. However, many small details have been added to the framework to make it as stable as possible. This article details the most important ones.
Obstacles to Consistent Mocking
A constant battle with our test framework is minimizing sources of flakiness. These can manifest in many different ways, and lead to things like slight screenshot variances, differing fragment states, or spurious crashes.
Flakiness issues are compounded because of our use of Flank, which dynamically assigns tests into shards. This means that test ordering is constantly changing. One test may put the app in a certain state that affects a later test, but this won’t be consistent because of the unpredictability of the test ordering.
Using Android’s Test Orchestrator can help prevent some of these issues, but unfortunately we can’t currently use it because it makes our tests take seven times as long. Instead, we take the approach of manually clearing shared state between tests, and taking pains to prevent memory leaks which might lead to crashes after many tests are run.
Below we present some of the flakiness issues we have run into, and how we have resolved them. In general, we’ve found it necessary to have all product features use the same patterns and architecture so that sources of flakiness can be fixed once in the underlying tools, enabling a scalable solution. This may require providing a wrapper around vanilla APIs to enforce how they are used or mocked.
There are many ways to execute code asynchronously, and an engineer may choose to do something unpredictable in their feature. For example, common tools for asynchronous code are RxJava, Kotlin coroutines, AsyncTasks, Executors, or even manually creating and managing Threads. The test framework cannot control custom approaches. Often this is OK because the Fragment UI should be frozen to mocked State that can’t be changed, but occasionally asynchronous code can cause unexpected side effects.
To minimize the chance of unexpected side effects, we provide our own functions as access points to executing asynchronous code and encourage engineers to use those. This allows the test framework to either block the code from running, or detect when it is finished.
For example, MvRx uses an extension function named “execute” to subscribe to RxJava Observables. The MvRx mocking system allows the behavior of “execute” to be changed so that the observable is never subscribed to.
Additionally, we use dependency injection to inject a test Coroutine Scope into our view models so that coroutines are never actually run.
In both these cases we can report that an Observable or Coroutine was executed in response to a click, so that corresponding details are shown in the final interaction report.
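The single access point idea can be sketched without RxJava or coroutines. The names below (`AsyncGate`, `Mode`, `requestedLabels`) are hypothetical and are not MvRx's API; the sketch only illustrates how one entry point lets a test framework skip the work while still recording that it was requested.

```kotlin
import java.util.concurrent.Executor

/** How the access point treats submitted work. */
enum class Mode { RUN, BLOCK }

/**
 * Hypothetical single entry point for asynchronous work.
 * In RUN mode the task is handed to the executor. In BLOCK mode (tests)
 * the task is only recorded, so it can appear in an interaction report.
 */
class AsyncGate(private val executor: Executor, var mode: Mode = Mode.RUN) {
    private val recorded = mutableListOf<String>()

    /** Labels of every task requested so far, in submission order. */
    val requestedLabels: List<String> get() = recorded.toList()

    fun execute(label: String, task: () -> Unit) {
        recorded += label
        if (mode == Mode.RUN) {
            executor.execute { task() }
        }
    }
}
```

In a real app the `task` lambda would be replaced by an Observable subscription or a coroutine launch; the important property is that feature code never starts asynchronous work except through the gate.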
Cached State in Views
The Android View class often caches data in order to improve performance. For example, a draw call may do nothing if the view does not consider itself “dirty,” and measurement details can be cached so that subsequent measure passes don’t have to recalculate anything if the view has not changed. These caches can be problematic in screenshot testing when we need to force layout the Activity to show an entire RecyclerView and then draw everything to a custom Canvas. For example, ConstraintLayout has a measure cache that was causing our forced screenshot layout to not display as expected.
Our solution was to iterate through the view hierarchy and call invalidate() and requestLayout() on all views. This guarantees that they layout correctly for our screenshot, and that they completely draw themselves to our custom canvas.
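As a rough illustration of that traversal, here is a framework-free sketch. `ViewNode` is a hypothetical stand-in for `android.view.View` and `ViewGroup` so that the code runs on a plain JVM; on Android the children would come from `ViewGroup.getChildAt` and `getChildCount` instead.

```kotlin
/** Hypothetical stand-in for android.view.View / ViewGroup. */
interface ViewNode {
    val children: List<ViewNode>
    fun invalidate()
    fun requestLayout()
}

/** Depth-first walk that drops cached draw and measure state on every node. */
fun resetCachedState(root: ViewNode) {
    val stack = ArrayDeque<ViewNode>()
    stack.addLast(root)
    while (stack.isNotEmpty()) {
        val node = stack.removeLast()
        node.invalidate()     // force a full redraw on the next draw pass
        node.requestLayout()  // force a fresh measure and layout pass
        node.children.forEach { stack.addLast(it) }
    }
}
```

An explicit stack is used rather than recursion only so that very deep hierarchies cannot overflow the call stack; a recursive extension function would work equally well for typical layouts.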
If one mock changes shared preferences it can affect a later mock. Android Test Orchestrator can’t solve this problem either because our framework runs multiple mocks in a single test. The solution is straightforward: have the test framework clear Shared Preferences after each mock. This can be extended to any other type of storage such as cache, local files, or databases.
Fully mocking dates is crucial to reducing test flakiness. Dates are inherently flaky as UI code commonly references the current time. As tests are rerun with ever changing Date values, screenshots can show text that is constantly being updated, and interaction reports can also be affected if any arguments include Dates.
The only solution here is to be able to mock the Date framework so that calls to get the current date and time return a consistent mocked value. At Airbnb we use JodaTime wrapped in a custom internal API that hides the JodaTime implementation details. This allows us to intercept and mock any calls to now() or today().
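A minimal sketch of such a wrapper follows. It uses `java.time` rather than JodaTime so that it needs no extra dependency, and the names `AppTime` and `freezeTimeForTests` are hypothetical; Airbnb's internal API is not shown in this article.

```kotlin
import java.time.Clock
import java.time.Instant
import java.time.LocalDate
import java.time.ZoneOffset
import java.time.ZonedDateTime

/** Hypothetical wrapper; feature code calls this instead of the date library directly. */
object AppTime {
    /** Swapped by the test framework; defaults to the real system clock. */
    @Volatile
    var clock: Clock = Clock.systemDefaultZone()

    fun now(): ZonedDateTime = ZonedDateTime.now(clock)
    fun today(): LocalDate = LocalDate.now(clock)
}

/** Pins the clock to a fixed instant so every run sees the same "current" time. */
fun freezeTimeForTests(isoInstant: String = "2020-01-15T12:00:00Z") {
    AppTime.clock = Clock.fixed(Instant.parse(isoInstant), ZoneOffset.UTC)
}
```

Because every caller goes through `AppTime`, the test framework only has to replace one clock to make screenshots and reports stable. Any code that bypasses the wrapper and reads the system time directly would still be flaky.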
A recurring problem for us was screenshot differences due to slight pixel variations in icons that were loaded from drawable resources. For performance reasons, Android maintains a cache of drawables loaded from resources, and the underlying bitmap can be shared in multiple locations. Due to our changing test ordering we didn’t have consistency in which tests encountered already cached bitmaps, and the cached version could vary.
In one case the issue was caused by code that modified the drawable’s Bitmap. Changing that code to call mutate() on the drawable first prevented the cached version from being changed and fixed the flakiness. While you should take care to call mutate() before modifying a shared drawable for this reason, we weren’t able to use this approach to fix all of our issues with drawable flakiness.
We were able to solve the flakiness by forcing the cache to be cleared after each screenshot. There is no clearly exposed API to do this, so we take advantage of the fact that the cache is cleared on incompatible configuration changes, and force such a change ourselves.
Note that this approach isn’t ideal: it uses restricted and deprecated APIs, and relies on understanding implementation details of what these functions do. We only test on API 28 right now and it works well for our needs, but it may need to be adapted in the future.
Out Of Memory Exceptions
There are two potential sources for these crashes:
- Memory leaks in the code under test
- Inefficient management of bitmaps in the screenshotting process.
To limit the impact of memory leaks, we keep the target duration of each test shard to three minutes, which limits how many tests are run in a single process. Additionally, we run LeakCanary to detect and report leaks.
In the screenshotting library we may have to capture bitmaps up to 40,000 pixels long. In order to manage this, we reuse the same bitmap across all screenshots, and recycle and create a new, larger one if needed. We also run the app with large heap enabled in the manifest.
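The reuse-and-grow strategy can be sketched as follows. This is an illustration only: an `IntArray` stands in for `android.graphics.Bitmap` so the code runs on a plain JVM, and `ReusableCanvasBuffer` is a hypothetical name rather than part of the screenshotting library.

```kotlin
/** Hypothetical holder for one reusable pixel buffer of fixed width. */
class ReusableCanvasBuffer(private val width: Int) {
    init {
        require(width > 0) { "width must be positive" }
    }

    private var pixels = IntArray(0)

    /** Number of times a new buffer had to be allocated. */
    var allocationCount = 0
        private set

    /** Rows the current buffer can hold without reallocating. */
    val capacityRows: Int get() = pixels.size / width

    /** Returns a cleared buffer with room for at least [height] rows. */
    fun acquire(height: Int): IntArray {
        require(height > 0) { "height must be positive" }
        if (height > capacityRows) {
            // On Android this is where the old Bitmap would be recycled
            // before a larger one is created.
            pixels = IntArray(width * height)
            allocationCount++
        } else {
            pixels.fill(0) // erase the previous screenshot
        }
        return pixels
    }
}
```

The buffer only ever grows, so after the tallest screen has been captured once, later screenshots in the same process allocate nothing new.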
We found that features occasionally executed Runnable callbacks with Handler#postDelayed. A common use case for this was to emphasize some UI animation, such as waiting to finish an activity after a confirmation message is shown. While we are able to detect when the main thread is idle (as discussed previously), this idle detection cannot be applied to Runnables that are posted to run later, so we have no way to account for the delayed code.
In this case, we have feature code call a wrapper function when it needs to post a delayed callback. This wrapper executes the runnable immediately when the test framework is active. Another benefit to this wrapper approach is that we have provided a function that is more idiomatically Kotlin, such as Fragment.post(delayMs: Number = 0, callback: () -> Unit).
Note that in our State based architecture, ideally the UI does not execute arbitrary actions that are not tracked in the State. However, sometimes this is needed for animations, and is the simplest way to implement a basic UI behavior. This is fine as long as the UI is robust enough to recover from configuration changes that may interrupt the posted callback.
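The delayed-post wrapper described above might look roughly like this. `DelayedScheduler` and `PostWrapper` are hypothetical names, and the scheduler is an abstraction over a main-thread Handler so the sketch stays runnable outside Android; only the `post(delayMs, callback)` shape comes from the article.

```kotlin
/** Hypothetical abstraction over a main-thread Handler. */
fun interface DelayedScheduler {
    fun schedule(delayMs: Long, callback: () -> Unit)
}

/** Hypothetical wrapper that feature code calls instead of posting delayed work directly. */
class PostWrapper(
    private val scheduler: DelayedScheduler,
    private val isTestFrameworkActive: () -> Boolean
) {
    fun post(delayMs: Number = 0, callback: () -> Unit) {
        if (isTestFrameworkActive()) {
            // Under test: run right away so nothing is left pending
            // when the framework checks that the main thread is idle.
            callback()
        } else {
            scheduler.schedule(delayMs.toLong(), callback)
        }
    }
}
```

Running the callback immediately changes timing but not ordering relative to the caller, which is usually acceptable for the animation-emphasis cases mentioned above.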
For screen behavior to be deterministic and controllable the screen should only use data from the ViewModel’s State. At first glance this seems straightforward, but we occasionally have cases where it is violated. This can happen if the Fragment has any injected dependencies that it references, static method calls that may access a singleton, or OS level calls that may be variable, such as getting device Locale.
In our app we still have some legacy systems that are accessed via static methods instead of dependency injection, and those were commonly accessed incorrectly. Besides adding the possibility of test flakiness, abusing the State pattern like this also means that the test framework is not testing variations in the missed data. Lint rules can help to prevent anti-pattern usage like this, and good project architecture with dependency injection can enforce best practices.
UI that displays images must load those bitmaps asynchronously, which is problematic for testing. While loading images is ideal because it fully tests the image loading code and behavior, it is troublesome for several reasons:
- The test framework must be able to detect when all images have loaded.
- Image loading can have several states, such as a loading placeholder, incremental thumbnail, failure asset, and the final success state. We can’t easily test all of these separately, or easily differentiate between them.
- If images are loaded from network then there is the possibility for sporadic failure due to network issues, which results in flakiness.
- Waiting for images to load increases test time.
- Even if we allow images to load, when a screenshot is taken our test framework synchronously lays out the full activity including all RecyclerView items, which may not have originally been laid out. In this case we can’t wait for images in newly laid out views to be loaded since the screenshot process happens synchronously.
In our app, we compromise by overriding any image load requests and instead forcibly inserting a local drawable resource, which is loaded synchronously. This has the following benefits:
- Shows an actual image in screenshots instead of a blank spot
- Covers some image load behavior, such as ImageView scaleType
- Works synchronously so no complexity is needed to wait for image loads
While this solution is not perfect, it has been a good compromise for us. Additionally, we still allow screens to load different images in mocks by having a set of test urls that map to different local test assets. Mock state can choose which image urls to use to vary the type of image loaded in the screenshot, which helps to improve the quality of the screenshot.
This approach is possible because we have a centralized, custom ImageView architecture which all features use. This allows the mocking to happen in a single place inside our image infrastructure, completely opaque to product engineers.
Lastly, our JSON report captures the url that was set on the ImageView, so even though we don’t screenshot the resulting picture, we still test that the Fragment loaded the expected url.
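To make the substitution concrete, here is a small sketch of a test-mode loader that maps test URLs to local assets and records what was requested. All names (`MockImageLoader`, `ImageRequest`, the asset strings) are hypothetical, and asset names are plain strings here instead of Android drawable resource IDs.

```kotlin
/** Hypothetical record of what a view was asked to load, for the JSON report. */
data class ImageRequest(val url: String, val localAsset: String)

/**
 * Hypothetical test-mode image loader: never touches the network, maps known
 * test URLs to bundled assets, and falls back to one default asset.
 */
class MockImageLoader(
    private val testAssets: Map<String, String>,
    private val defaultAsset: String
) {
    private val requests = mutableListOf<ImageRequest>()

    /** Every request seen so far, so a report can assert on the intended URLs. */
    val recordedRequests: List<ImageRequest> get() = requests.toList()

    /** Resolves synchronously and returns the name of the local asset to display. */
    fun load(url: String): String {
        val asset = testAssets[url] ?: defaultAsset
        requests += ImageRequest(url, asset)
        return asset
    }
}
```

Keeping the original URL alongside the substituted asset is what allows the report to verify the Fragment's intent even though the real image is never fetched.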
WebViews face similar problems to ImageViews. Network requests to load content are flaky and difficult to accurately wait for. Additionally, the content that is loaded is out of our control and could change the screenshot bitmap at any time.
Consequently, we block all WebViews from loading content during our tests. This is accomplished by wrapping the Android WebView in our own custom view so we can mock it out in one central place.
And again, like ImageViews, our JSON report captures the intended url to load, as well as other data such as user agent, headers, and request type. This helps to validate even more data than a screenshot could have.
By default, RecyclerView’s LinearLayoutManager prefetches off screen views outside its viewport while the UI thread is idle between frames. We found that this can cause flakiness, because if a view is fully laid out by the system it can behave differently than if it was laid out synchronously by our screenshotting system. Notably, animations will be completed to 100% in the first case, but will only be in the starting state in the second.
More generally, we find it is best to disable systems that don’t behave deterministically. In this case the prefetching may or may not happen depending on how long we idle for before taking a screenshot.
We therefore disable this behavior: after the test activity lays out a fragment, it traverses the view hierarchy for RecyclerView instances and disables prefetching on all LayoutManagers via setItemPrefetchEnabled(false).
Next: CI Setup
This article presented a collection of common sources of flakiness and how we try to stomp them out at the roots in our test framework.
Next, in our final article, we will cover how our tests are automatically generated and run on CI when an engineer makes a code change.
This is a seven part article series on testing at Airbnb.
Part 3 — Automated Interaction Testing
Part 6 (This article)— Obstacles to Consistent Mocking
Part 7 — Test Generation and CI Configuration
Want to work with us on these and other Android projects at scale? Airbnb is hiring for several Android engineer positions across the company! See https://careers.airbnb.com for current openings.