Part two of this series covered our screenshot testing approach. As our first integration tests, screenshots filled a gaping hole in our testing strategy and were immediately effective at catching regressions and improving our developer experience. But while they cover a large swath of our app’s UI code, they provide no insight into the correctness of code that is run when the user interacts with the UI, such as click handling.
This “interaction handling” code can contain complicated logic and is a common source of bugs. It also represents a large percentage of the code in a product feature, so testing it well is important for high code coverage.
Testing interactions is fairly straightforward with Espresso — clicks can be manually forced and assertions on results can then be made. However, these tests are brittle for a variety of reasons:
- Views are manually identified by id or position, which commonly change across product updates
- Views in scrollable lists must be scrolled to
- Asynchronous results must be waited for, which can cause flakiness or require extra code to handle correctly
Even if these are addressed, manually writing tests for all possible interactions on a page is tedious, and likely to exclude small details such as the arguments that are passed or network requests that are made.
Just as screenshot tests automatically detect UI changes, we built a similar system to detect changes in interaction handling. We then leverage Approval Testing techniques to automatically update our tests. This allows us to automatically verify each screen’s behavior without writing any traditional Espresso tests.
The philosophy behind this is based on the following:
- All changes that result from a click are measurable, and can be represented with a textual description.
- All views in the activity hierarchy can be programmatically clicked and the results measured, allowing us to generate a report that maps each view to its onClick behavior.
- We can test a screen in isolation, and define its interface as any actions that may affect other screens, such as starting a new screen or returning a result.
- We don’t need end-to-end tests that link screens, as long as for each screen we test how it handles possible inputs (mock states and arguments) and validate its outputs (actions that affect other screens).
Our implementation of this is as follows:
- A mock is laid out and we wait for its view to stabilize.
- We iterate through each view in the fragment view hierarchy and programmatically click it.
- After each click, we record any actions that result, blocking them from actually occurring.
- A JSON file is produced that defines the results for each view.
- JSON files are diffed to detect changes in interactions, exactly as we do for screenshots.
This technique works surprisingly well, and has a lot of parallels with screenshot testing. In fact, we can reuse much of the same infrastructure we already built to run screenshot tests. Let’s look at each step in detail.
Laying Out Mocks
This step reuses the same code as the screenshot tests. We have a base Activity that takes a list of mocks to test and iterates through each one, displaying it and running test code on it once it stabilizes.
Subclasses of this base Activity handle the specifics of a test, such as screenshotting or performing clicks.
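As a rough sketch, the scaffold might look like the following. The class names, the Mock type, and the waitForIdle hook are illustrative stand-ins rather than our actual APIs.

```kotlin
import android.os.Bundle
import androidx.appcompat.app.AppCompatActivity
import androidx.fragment.app.Fragment

// Hypothetical base Activity shared by screenshot and interaction tests.
abstract class BaseMockTestActivity : AppCompatActivity() {

    // Each mock pairs a name with a factory that creates the fragment under test.
    data class Mock(val name: String, val createFragment: () -> Fragment)

    abstract val mocks: List<Mock>

    // Subclasses decide what to do with a stable, laid-out mock:
    // screenshot it, or click through it and record interactions.
    abstract fun onMockStable(mock: Mock)

    // Stand-in for the idle detection described later; resolves once layout,
    // animations, and async work have settled.
    protected abstract fun waitForIdle(onIdle: () -> Unit)

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        showMock(0)
    }

    private fun showMock(index: Int) {
        if (index >= mocks.size) {
            finish()
            return
        }
        val mock = mocks[index]
        supportFragmentManager.beginTransaction()
            .replace(android.R.id.content, mock.createFragment())
            .commitNow()
        waitForIdle {
            onMockStable(mock)
            showMock(index + 1)
        }
    }
}
```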
Iterating the View Hierarchy
Each view in the screen must be processed in turn. To accomplish this, we do a depth first search through the view tree and act on each view. We check for both clickable and long clickable views, and perform each if applicable. There are two difficult things about this process.
First, supporting RecyclerViews means that we need to programmatically scroll the screen down to reach every item. This requires asynchronously waiting for the new views to be laid out before continuing with the depth first search.
Second, each click can potentially change the view hierarchy, so we can’t immediately continue testing the next view. For example, the click could trigger a fragment transaction, show a dialog window, or expand some ViewGroup. Instead, we need to reset the view hierarchy to its initial state and then resume iteration from the previous view in the hierarchy.
The reliability of the test hinges on our ability to accurately reset the app to the original mock view after each click. This lets us smoothly continue on to test each subsequent view.
To support this resetting, we run the entire test in a single activity. After each click we remove all fragments and then add back a new instance of the mocked fragment.
If an AlertDialog was shown as a result of a click we programmatically close it. There is no clean way to do this, and we instead rely on reflection to access the global window manager.
Before each click we store our traversal location in the view hierarchy. After the view is reset the test resumes from that point. This traversal path is represented as a list of the child indices of each view group in the hierarchy.
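Here is a minimal sketch of that traversal bookkeeping. The helper names are hypothetical, but the idea is the same: the path to a view is the child index at each ViewGroup level, so after a reset we can walk back down to the same spot and continue from there.

```kotlin
import android.view.View
import android.view.ViewGroup

// A traversal path is the child index at each ViewGroup level, e.g.
// [0, 3, 1] means root.getChildAt(0).getChildAt(3).getChildAt(1).
typealias TraversalPath = List<Int>

// Find the next clickable view in depth-first order, skipping everything up to
// and including the previously tested path. Returns null when iteration is done
// (or the saved path no longer exists after a reset).
fun nextClickableAfter(root: View, previous: TraversalPath?): Pair<View, TraversalPath>? {
    var skipping = previous != null
    var found: Pair<View, TraversalPath>? = null

    fun visit(view: View, path: TraversalPath) {
        if (found != null) return
        if (skipping && path == previous) {
            // Reached the view we tested before the reset; resume after it.
            skipping = false
        } else if (!skipping && (view.isClickable || view.isLongClickable)) {
            found = view to path
            return
        }
        if (view is ViewGroup) {
            for (i in 0 until view.childCount) visit(view.getChildAt(i), path + i)
        }
    }

    visit(root, emptyList())
    return found
}
```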
Recording the Results of a Click
After each click, the test “listens” for resulting actions and records them. There are generally two categories of actions that can result:
- Android framework-level results, such as Fragment transactions or starting/finishing an Activity
- Airbnb app-specific events, such as submitting a change to MvRX state or executing a network request
Ideally the test framework can automatically record any results affecting the Android framework, but we also needed a clean way to detect changes to any of our internal systems.
Detecting Framework Results
To catch Activity-level results, our test activity simply overrides the methods on its Activity superclass that can produce results. For example, we override finish, startActivity, and onOptionsItemSelected. Calls to any of these methods are recorded, and the super call is blocked so the side effect doesn’t interfere with the test (we don’t want the click to actually finish our activity!). We also record details about the parameters the functions are called with, such as Intent information or Bundle data.
We also check whether a result was set on the Activity: once the interaction is over, we use reflection to inspect the values of the Activity’s result code and result data. This allows our test to catch changes to the results returned on click.
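A simplified sketch of this interception follows. The class name and the record hook are hypothetical, and the mResultCode/mResultData fields are internal framework details, so treat this as an illustration rather than a drop-in implementation.

```kotlin
import android.app.Activity
import android.content.Intent
import android.view.MenuItem
import androidx.appcompat.app.AppCompatActivity

// Hypothetical test Activity that records framework-level results instead of
// letting them happen. `record` appends an entry to the report for the
// currently clicked view.
abstract class InteractionRecordingActivity : AppCompatActivity() {

    protected abstract fun record(action: String, details: Map<String, Any?> = emptyMap())

    override fun finish() {
        // Record the finish (plus any result set on the Activity) but skip the
        // super call so the test Activity stays alive for the next click.
        record("finish", readResultViaReflection())
    }

    override fun startActivity(intent: Intent?) {
        record(
            "startActivity",
            mapOf(
                "component" to intent?.component?.className,
                "extraKeys" to intent?.extras?.keySet()?.toList()
            )
        )
        // Blocked: the new Activity is never actually started.
    }

    override fun onOptionsItemSelected(item: MenuItem): Boolean {
        record("onOptionsItemSelected", mapOf("itemId" to item.itemId, "title" to item.title?.toString()))
        return super.onOptionsItemSelected(item)
    }

    // Activity keeps setResult() values in private fields; these names are a
    // framework implementation detail and may differ across OS versions.
    private fun readResultViaReflection(): Map<String, Any?> {
        val codeField = Activity::class.java.getDeclaredField("mResultCode").apply { isAccessible = true }
        val dataField = Activity::class.java.getDeclaredField("mResultData").apply { isAccessible = true }
        return mapOf(
            "resultCode" to codeField.get(this),
            "resultData" to (dataField.get(this) as? Intent)?.extras?.keySet()?.toList()
        )
    }
}
```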
To detect Fragment transactions, a FragmentLifecycleCallbacks is registered on the test activity, and recursively detects any changes to the Fragment stack. It records the ending state of the fragment stack after everything has stabilized. We also record the arguments that each fragment contains, so we have a record of which arguments each fragment was started with.
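A sketch of that registration, again with a hypothetical record hook:

```kotlin
import android.content.Context
import androidx.fragment.app.Fragment
import androidx.fragment.app.FragmentManager

// Register recursively so transactions in child FragmentManagers are also seen.
fun registerFragmentRecording(
    fragmentManager: FragmentManager,
    record: (action: String, details: Map<String, Any?>) -> Unit
) {
    fragmentManager.registerFragmentLifecycleCallbacks(
        object : FragmentManager.FragmentLifecycleCallbacks() {
            override fun onFragmentAttached(fm: FragmentManager, f: Fragment, context: Context) {
                val args = f.arguments
                record(
                    "fragmentAttached",
                    mapOf(
                        "fragment" to f.javaClass.simpleName,
                        // The arguments the fragment was started with.
                        "arguments" to args?.keySet()?.associateWith { key -> args.get(key) }
                    )
                )
            }

            override fun onFragmentDetached(fm: FragmentManager, f: Fragment) {
                record("fragmentDetached", mapOf("fragment" to f.javaClass.simpleName))
            }
        },
        true // recursive
    )
}
```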
Finally we use reflection to access WindowManagerGlobal and check for windows added as a result of the click. If it was an AlertDialog or BottomSheetDialog we can get information about it such as the title, message, and button text. We also force close dialogs to prevent them from sticking around as the test progresses.
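A sketch of that reflection is below; WindowManagerGlobal and its mViews field are internal framework details, so this is inherently fragile and may need adjusting across Android versions.

```kotlin
import android.view.View

// Internal framework class: not a stable API, so this can break across OS versions.
private const val WINDOW_MANAGER_GLOBAL = "android.view.WindowManagerGlobal"

// Lists the root view of every window currently attached, which includes any
// dialog windows that were added as a result of the click.
fun currentWindowRootViews(): List<View> {
    val clazz = Class.forName(WINDOW_MANAGER_GLOBAL)
    val instance = clazz.getMethod("getInstance").invoke(null)
    val viewsField = clazz.getDeclaredField("mViews").apply { isAccessible = true }
    @Suppress("UNCHECKED_CAST")
    return (viewsField.get(instance) as List<View>).toList()
}

// Any window root other than the test Activity's own decor view is treated as a
// dialog shown by the click; we record what we can about it and then dismiss it.
fun dialogRootViews(activityDecorView: View): List<View> =
    currentWindowRootViews().filter { it !== activityDecorView }
```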
Detecting Changes to Custom Systems
Ideally our interaction test framework can also capture any changes in custom systems, such as our network layer, database, or logging. However, we’d like to avoid dropping test code into production systems, which would be the naive approach to recording events.
Instead we leverage interfaces and dependency injection to decouple the test interaction recording from the actual system. Here’s how we approach it:
- Create an interface that knows how to report actions to our test runner
- Use a test Dagger module to override creation of each dependency, and mock it to instead invoke the interaction reporting interface.
- Use Dagger multi binding to collect these reporter interfaces into a set that the test runner can be injected with.
A well thought out dependency injection graph, combined with multi binding, is crucial for this to work well. Once it is set up, it is extremely powerful because it allows us to measure and catch changes to how every click in the app interacts with our services.
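Here is a minimal sketch of the pattern. The interface, module, and class names are illustrative and scoping is omitted, but the multi binding idiom is standard Dagger.

```kotlin
import dagger.Binds
import dagger.Module
import dagger.multibindings.IntoSet
import javax.inject.Inject

// Each subsystem that a click can affect exposes a reporter describing what happened.
interface InteractionReporter {
    fun reportedActions(): List<Map<String, Any?>>
}

// In the test graph, the real network layer is replaced by a fake that records
// requests instead of executing them, and that fake doubles as a reporter.
class RecordingNetworkLayer @Inject constructor() : InteractionReporter {
    private val requests = mutableListOf<Map<String, Any?>>()

    fun onRequest(method: String, path: String) {
        requests += mapOf("type" to "networkRequest", "method" to method, "path" to path)
    }

    override fun reportedActions(): List<Map<String, Any?>> = requests.toList()
}

@Module
abstract class TestReportersModule {
    // @IntoSet collects every bound reporter into a single Set<InteractionReporter>.
    @Binds
    @IntoSet
    abstract fun bindNetworkReporter(impl: RecordingNetworkLayer): InteractionReporter
}

// The test runner is injected with the full set and drains it after each click.
class InteractionCollector @Inject constructor(
    private val reporters: Set<@JvmSuppressWildcards InteractionReporter>
) {
    fun collect(): List<Map<String, Any?>> = reporters.flatMap { it.reportedActions() }
}
```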
Capturing Non-Visual View Data
Beyond recording the result of a click interaction, our system can also help test the non-visual behavior of a view. This is data that isn’t captured in screenshot tests, such as:
- The contentDescription of a View, to check accessibility configuration
- The url loaded in a WebView or ImageView
- The configuration settings of a video view
To support this, the view iterator calls back with each view and gives us an opportunity to check its type and add arbitrary information about it to the report. This makes it extensible for any custom views or data about the view we want to capture.
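For example, a hypothetical callback might look like this (the exact properties we record differ by view type):

```kotlin
import android.view.View
import android.webkit.WebView
import android.widget.ImageView

// Called for every view during iteration; returns extra non-visual details to
// merge into that view's entry in the report.
fun nonVisualDetails(view: View): Map<String, Any?> {
    val details = mutableMapOf<String, Any?>()
    view.contentDescription?.let { details["contentDescription"] = it.toString() }
    when (view) {
        is WebView -> details["url"] = view.url
        // Hypothetical: however the image URL is exposed by the image loading library.
        is ImageView -> details["imageUrl"] = view.tag
    }
    return details
}
```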
Knowing When an Interaction is Over
When a view is clicked it may trigger asynchronous actions, such as a Fragment transaction, a view invalidation, or some data processing. We can’t allow these asynchronous actions to affect the stability of future tests, so we either block them (when possible) or wait for them to be done before proceeding to reset the view for the next click.
This idle detection is discussed in detail in Part 5 (coming in a few weeks).
JSON Report Output
Once all views have been clicked and results captured, the data is compiled into a report. The best format for this report is subjective, and there are many ways to present it. The example that follows shows the shape of ours.
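The key names, Epoxy model, and argument values in this entry are illustrative rather than exact:

```json
{
  "coordinator_layout/recycler_view/title_text": {
    "viewType": "TextView",
    "epoxyModel": "AccountDocumentMarqueeModel",
    "onClick": {
      "startActivityForResult": {
        "activity": "MvRxActivity",
        "fragment": "UserProfileFragment",
        "arguments": {
          "userId": 12345
        },
        "requestCode": 1001
      }
    }
  }
}
```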
This JSON object declares the behavior of a single view on the screen. The full report will have an entry like this for each clickable view.
The top level JSON object key identifies the view in the hierarchy. We use the view ids of each of the view’s parents to construct a chain that allows us to uniquely identify the view on screen.
We also note that this is a TextView in a RecyclerView. It’s within an AccountDocumentMarqueeModel, which is an Epoxy model representing the item view. Details like these allow developers to easily figure out which part of the screen this JSON refers to.
Finally, the onClick section of the report notes what happens when the view is clicked.
In the example above, it indicates that we are opening a UserProfileFragment in a MvRxActivity, and also notes the arguments and request code that are passed with it.
Through trial and error we arrived at this format with these points in mind:
The report should clearly describe what each view on the page does when it is clicked. Key names should be carefully chosen to make meaning intuitive.
While the report can contain metadata to help the user more easily identify which view has been affected, counterintuitively this should be minimized because it can harm consistency.
For example, if the metadata includes a RecyclerView item’s index (to make it easy to see which item changed), then adding a new item can change the indices of all other items and cause a large change in the report. In this way the goals of readability and consistency can be at odds.
While readability is important, it should not come at the cost of diff-ability or consistency. An item in the report should ideally only be shown as changed if its behavior was actually affected, otherwise reports become too flaky and burdensome to read.
We need to be able to compare reports and easily identify changes. We use JSON because there are good tools for JSON diffing, it is fairly readable, and it makes it easy to associate key/value pairs.
It is critical that each view has an identifier that is both unique within the view hierarchy and stable across branches. This identifier is the key for the set of results associated with the view, and changes to the key produce a confusing diff. We build the identifier from the chain of parent view groups above a view. We avoid child indices when possible, because they are subject to change when other views are added; instead, we use view ids.
If a diff shows that something changed, we need it to be easy for engineers to read the diff and identify the difference. If this isn’t easy then they are more likely to ignore a diff when it may represent a real regression.
The report should be consistent across branches, and only change when a view’s onClick behavior changes. If a PR reports an interaction diff when no behavior change has actually happened it conditions users to take these reports less seriously, and may lead to real regressions being missed.
JSON diffs are harder to read than screenshot diffs: a screenshot makes a visual difference obvious, whereas a JSON diff can require some study to understand what has changed (which is why the report must have good diff-ability).
For these reasons, consistency is very important, and we have made some design decisions to optimize for it. For example, JSON object keys are sorted to avoid spurious diffs caused by changes in action order.
One consistency problem we ran into was that the text representing data (such as Bundles or Intents) may not be consistent across runs.
There are two main reasons this may happen.
- A class does not implement toString(), and instead uses the default implementation, which embeds the object’s hash code (e.g. Person@372c7c43). To combat this, we use reflection to recursively generate a consistent String representation based on the properties in the class. We do this if we see the hash pattern in the original toString(), or if the object is a Kotlin data class.
- If an object is an integer it may represent an Android resource value. While these are constant for a single build, the integer value representing the same resource can change across builds as other resources are added or removed. To stabilize this, when generating the reflection-based string representation from (1) we look up integers in the resource table, and if there is a match we use the resource name (e.g. R.string.title_text) instead of the integer value.
Kotlin data classes are targeted for a custom String representation because of point (2) — the data class is commonly used to pass arguments and was the main place we saw String resources showing up. Additionally, since their toString() is already generated and unlikely to be custom it is safer for us to replace it with our own generated representation.
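A simplified sketch of both stabilizers is below; kotlin-reflect is assumed for memberProperties, and the real implementation handles more types and nesting.

```kotlin
import android.content.res.Resources
import kotlin.reflect.full.memberProperties

// 1. Replace hash-based default toString() output (e.g. Person@372c7c43) with a
// reflective representation built from property names and values.
fun stableToString(value: Any?): String = when {
    value == null -> "null"
    value::class.isData || value.toString().matches(Regex(".*@[0-9a-f]+$")) ->
        value::class.memberProperties.joinToString(
            prefix = "${value::class.simpleName}(", postfix = ")"
        ) { prop -> "${prop.name}=${stableToString(prop.getter.call(value))}" }
    else -> value.toString()
}

// 2. Replace integers that match an entry in the resource table with the stable
// resource name (e.g. R.string.title_text), since raw ids shift between builds.
fun stableResourceValue(resources: Resources, value: Int): String =
    try {
        resources.getResourceName(value)
    } catch (e: Resources.NotFoundException) {
        value.toString()
    }
```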
How Various Actions Are Represented
In the above example you saw a JSON report indicating that an Activity was launched, including the arguments and flags included with it. Reports can capture any other type of data we want, as long as we can programmatically define it. Here are a few examples from our system.
Finishing an activity
We capture when a click finishes the Activity, and any result data that was set to be returned.
Emitting a log
Our internal logging system uses a schema based approach. We note which schemas were logged, and how many times.
Recording View properties
We previously mentioned that arbitrary properties on a View can be recorded as well. This captures non-visual information about the View at the time it was clicked, such as its content description.
We capture other properties too, such as urls set on images.
Selecting a Toolbar option
If our Toolbar contains any options, those are all clicked as well. We record the option name and id, as well as the resulting actions that it may trigger afterwards.
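A sketch of how those options can be exercised, with a hypothetical record hook:

```kotlin
import android.app.Activity
import androidx.appcompat.widget.Toolbar

// Select every option in the Toolbar's menu, recording its id and title; the
// resulting actions are captured by the same hooks used for regular clicks.
fun clickToolbarOptions(
    activity: Activity,
    toolbar: Toolbar,
    record: (details: Map<String, Any?>) -> Unit
) {
    val menu = toolbar.menu
    for (i in 0 until menu.size()) {
        val item = menu.getItem(i)
        record(mapOf("toolbarOption" to item.title?.toString(), "itemId" to item.itemId))
        // Route the selection through the Activity the same way a real tap would.
        activity.onOptionsItemSelected(item)
    }
}
```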
Starting a Fragment
All Fragments have their arguments and lifecycle state recorded in the report, so we can detect Fragment navigation changes.
Network request executed
All network requests resulting from the click are recorded. We can get detailed information such as request type, parameters, headers, and body.
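One way to hook this in the test graph is to wrap the HTTP client with an interceptor that records and short-circuits requests. This OkHttp-based sketch is illustrative rather than our exact mechanism.

```kotlin
import okhttp3.Interceptor
import okhttp3.Protocol
import okhttp3.Response
import okhttp3.ResponseBody.Companion.toResponseBody

// Records every request triggered by a click and returns a canned response so
// nothing actually leaves the device during the test.
class RecordingInterceptor(
    private val record: (details: Map<String, Any?>) -> Unit
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        record(
            mapOf(
                "type" to "networkRequest",
                "method" to request.method,
                "url" to request.url.toString(),
                "headerNames" to request.headers.names().toList()
            )
        )
        return Response.Builder()
            .request(request)
            .protocol(Protocol.HTTP_1_1)
            .code(200)
            .message("stubbed by interaction test")
            .body("{}".toResponseBody())
            .build()
    }
}
```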
Updating ViewModel State
We detect changes to each ViewModel’s state, and record the exact property that changed and what the new value is.
We can even explicitly call out additions or removals to Lists and Maps.
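A sketch of the property-level diff, assuming immutable state data classes and kotlin-reflect on the classpath:

```kotlin
import kotlin.reflect.full.memberProperties

// Compare two immutable state instances property by property and report only
// the properties whose values actually changed, mapped to (before, after).
fun <S : Any> diffState(oldState: S, newState: S): Map<String, Pair<Any?, Any?>> =
    oldState::class.memberProperties
        .mapNotNull { prop ->
            val before = prop.getter.call(oldState)
            val after = prop.getter.call(newState)
            if (before != after) prop.name to (before to after) else null
        }
        .toMap()
```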
Overall, this JSON system allows us to record information as granularly as we want, which makes our tests extremely comprehensive. Manually writing Espresso tests to assert these same checks would be tedious to the point of impracticality. Instead, all of our data is generated automatically, viewed through a nice UI, and changes are approved and updated with a single click.
Diffing Reports to Find Changes
Once we have our report, how do we actually track and report changes? The system we use for this is almost identical to screenshot testing, and thankfully, Happo supports JSON diffing as well. This means that we can reuse our Happo screenshot library to snapshot JSON, upload it to AWS, and generate a Happo report.
The JSON snapshots are combined with UI screenshots to create a single report representing the behavior of a branch. Happo’s web UI shows both JSON and UI diffs, and our JSON diffing leverages all of the existing tooling that Happo offers, such as change subscriptions and component history.
For example, a report for a PR that changed the behavior of a ToggleActionRow to make a GraphQL request for a ListingsQuery automatically captured the behavior change and presented it clearly.
Additionally, we didn’t have to make any changes to our CI setup because these are just additional JUnit tests added to our existing app instrumentation test suite. The JSON diffs are added to the existing Happo report that the screenshot tests create. This is explained further in a subsequent article on our CI setup, and shows how easy this system is to extend.
Possible Future Extensions
We’ve only been using this interaction testing approach for a few months, but so far it is working well and is a promising way to generate integration tests automatically. The Approval Testing approach minimizes the effort to create and update tests, and allows them to be more exhaustive than manually written tests.
While we have focused first on capturing common actions and low-hanging fruit, we don’t yet capture all possible interaction behavior on a screen. We can continue to improve our test coverage by:
- Finding EditTexts in the view hierarchy, programmatically changing text, and observing results
- Capturing the behavior of onActivityResult callbacks
- Recording what happens when the fragment is set up or torn down (such as network requests or logging) and including that in the final report
Next: Testing ViewModel Logic
This article covered our solution for automating interaction tests. These tests capture ViewModel behavior on click such as making a State change or executing a request, but this can’t comprehensively test all edge cases of the ViewModel.
In Part 4, we’ll look at how we use unit tests to manually test all logic in a ViewModel, as well as the DSL and framework we created to make this process easy!
This is a seven-part article series on testing at Airbnb.
Part 3 (This article)— Automated Interaction Testing
Part 6 — Obstacles to Consistent Mocking
Part 7 — Test Generation and CI Configuration
Want to work with us on these and other Android projects at scale? Airbnb is hiring for several Android engineer positions across the company! See https://careers.airbnb.com for current openings.