A/B Testing on Android with a Clean Architecture

Damian Burke
9 min read · Apr 1, 2018


Implementing a clean architecture is a hot topic in the development world, especially in the Android community. Using interfaces to define entry and exit points, extracting business logic, keeping the view layer as dumb as possible, dependency injection — and more. Depending on the size of the code-base, it may seem that following these architecture proposals adds overhead to achieve simple goals.

Reading through discussions about why cleaning up your code-base is recommended, you will find that most comments point towards increased testability. On Android it is important to keep the business logic platform-agnostic so it can be tested with plain JUnit — instead of requiring the Android SDK — which vastly decreases test execution time.

Another benefit offered by a clean architecture is modularization, which has two main advantages: It lets you work on, say, a feature module without touching the other modules, creating an independent scope to work in and thus reducing merge conflicts. And thanks to incremental compilation, it decreases the application’s build time and speeds up development in general.
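As a minimal, hypothetical sketch of such a module split (the module names are illustrative, not taken from a real project), the Gradle settings file could look like this:

```kotlin
// settings.gradle.kts -- hypothetical module layout for the example app.
// ":app" wires everything together, ":core" holds the platform-agnostic business logic,
// and each feature lives in its own module that only depends on ":core".
include(":app", ":core", ":feature-videolist")
```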

Taking a look at a clean architecture, a certain set of layers usually emerges: typically a View layer, a Presenter layer, and the underlying business-logic and data layers.

The communication between these layers is defined by interfaces, which makes it easy to mock the incoming and outgoing layers for unit and integration testing. It also allows us to replace components such as the presenter or the view, resulting in different behavior or a different user interface.

This is where it is time to think about A/B testing.

A growing user-base with good ratings indicates that people like your application, which is good. To keep your application’s growth rate stable you are most likely trying to improve your application by adding new features or improving certain flows and funnels. Stalling the application’s development allows competitors to catch up and copy — or even improve — your application’s unique selling points. In the start-up world this might become a problem. On the other hand, rushing features into your application is probably not the best idea either. Whenever adding new features, you want to make sure your user-base is actually going to accept, use and like the feature. Going back and thinking about the diversified user-base, creating features that most of your users are going to like seems to be a hard task.

This is where we reap the benefits of the application’s architecture.

Being able to swap out, for example, the view layer for a certain feature allows us to test our shiny new feature. As an example, think of an application that displays a list of videos and lets users stream them on their phones. Initially, our screen was just that — a plain list with each video’s title and a thumbnail. This seems to work: people look at the list, they know what it is, and they watch the videos we provide — great.

Our baseline. A simple list of video-titles.

Trying to improve the screen, we want to add video suggestions above the list — for example to drive users towards consuming our best video, our highest-rated video or simply the latest video we’ve added to our database.

Our test variant with a recommended video on top of the list to draw the user’s attention.

To achieve this, we have to change our View layer. We need to add an appropriate view element to our layout, as well as a method to our View interface for displaying the recommended video.
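A minimal sketch of what such a View contract could look like (the names VideoListView, Video and showRecommendedVideo() are illustrative, not taken from the actual project):

```kotlin
// Illustrative View contract for the video list screen.
interface VideoListView {
    fun showVideos(videos: List<Video>)

    // New entry point for the experiment: the Presenter pushes the recommended video here.
    fun showRecommendedVideo(video: Video)
}

// Simple model used throughout the sketches below.
data class Video(val id: String, val title: String, val thumbnailUrl: String)
```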

Also, our Presenter layer needs to be adjusted to forward the recommended video to the view.
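A hypothetical Presenter sketch, assuming a VideoRepository interface in the data layer (both names are illustrative); note that the Presenter does not know or care whether the attached View actually renders the recommendation:

```kotlin
// Illustrative Presenter: loads the list as before and additionally forwards
// a recommended video to whatever View implementation is attached.
class VideoListPresenter(
    private val repository: VideoRepository, // assumed data-layer interface
    private val view: VideoListView
) {
    fun onViewAttached() {
        view.showVideos(repository.loadVideos())
        // The baseline View treats this call as a no-op; the experiment View renders it.
        repository.loadRecommendedVideo()?.let { view.showRecommendedVideo(it) }
    }
}

// Assumed data-layer contract, purely for illustration.
interface VideoRepository {
    fun loadVideos(): List<Video>
    fun loadRecommendedVideo(): Video?
}
```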

Now that we’ve adapted our View interface and our Presenter, which is passing on the recommended video to the attached view, we also have to change our View implementation to reflect these changes and actually display the recommended video.

Since we want to test if our new feature improves the application’s quality, we will duplicate our previously implemented view and create a new file:

  • VideoListActivity.kt
  • VideoListExperimentActivity.kt

(To make this more obvious, you can group experiments and their files into packages, or use a naming convention that includes, for example, the experiment’s name.)

The VideoListActivity.kt is our current implementation of the View interface. Since we have to comply with our interface, but we are not using the recommended video in this View implementation, we will implement it as a no-operation.
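A sketch of what that no-op could look like, assuming the VideoListView contract from above (the AndroidX AppCompatActivity and the list binding are only hinted at):

```kotlin
import androidx.appcompat.app.AppCompatActivity

// Baseline screen: complies with the VideoListView contract but ignores the recommendation.
class VideoListActivity : AppCompatActivity(), VideoListView {

    override fun showVideos(videos: List<Video>) {
        // Bind the plain list exactly as before, e.g. submit it to the RecyclerView adapter.
    }

    override fun showRecommendedVideo(video: Video) {
        // Intentionally left empty: the baseline screen shows no recommendation.
    }
}
```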

In our VideoListExperimentActivity.kt, which is our duplicated file, we want to actually display the recommended video.
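The experiment variant could look roughly like this; the header view and its binding are only hinted at in comments:

```kotlin
import androidx.appcompat.app.AppCompatActivity

// Experiment screen: same list as the baseline, plus a recommendation header on top.
class VideoListExperimentActivity : AppCompatActivity(), VideoListView {

    override fun showVideos(videos: List<Video>) {
        // Same list binding as the baseline.
    }

    override fun showRecommendedVideo(video: Video) {
        // Populate the header with the video's title and thumbnail
        // and make the previously hidden container visible.
    }
}
```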

(In our Android application, we’d also have to adapt the layout XML file — we can either stay with one XML layout file and add the recommended-video layout marked with visibility="gone" — which will not trigger any measuring or rendering of the view — or we can duplicate the XML to keep a separation between experiments and our current production version. With more and more active experiments, the duplication approach will make removing and finalizing versions easier.)

Now that we have prepared our Presenter, our View interface and both of our View implementations, we have to take care of the following:

  • Displaying the right view
  • Measuring our key performance indicators

Displaying the right view depends on multiple factors, first of all on how you manage your user segmentation — there are several ways to approach this. Some services offer randomized user segmentation, others offer segmentation based on constraints (for example, if a test is only supposed to run in certain countries or languages). For simplicity, let’s assume you are using Firebase Remote Config.

Creating an A/B Test (Experiment) in Firebase Remote Config. Firebase also allows us to specify how many of our total users should be included in the test.

In Firebase Remote Config you can create parameters for your test — in our example this will be recommended_video. The test’s variants will be evenly distributed among the users in the test segment. Other services let you specify the conditions manually, in which case we would segment users by random percentiles and split them at 50%.

In this example there are two variations of the test:

  • baseline
  • recommended_video

Specifying parameters and variants for our test.

So we would create a control group for baseline which covers the first 50% of the user segment, and for recommended_video we add a variant covering the other 50%. Firebase will take care of segmenting your users.

Firebase also allows you to specify a goal which will be used to determine the winning variation of your A/B test. Other services might not offer this feature, which is why we will also cover manual test evaluation later on.

Defining a Goal for our A/B Test in Firebase Remote Config.

After fetching and activating the configuration from Firebase, you can retrieve the values through your FirebaseRemoteConfig instance:
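A sketch of this step, assuming a reasonably recent Remote Config SDK and the parameter key recommended_video configured above; the default value ensures offline users fall back to the baseline:

```kotlin
import com.google.firebase.remoteconfig.FirebaseRemoteConfig

// Fetches the remote configuration and hands the assigned variant to the caller.
fun fetchExperimentVariant(onVariant: (String) -> Unit) {
    val remoteConfig = FirebaseRemoteConfig.getInstance()
    // Default keeps users without connectivity in the baseline.
    remoteConfig.setDefaultsAsync(mapOf("recommended_video" to "baseline"))
    remoteConfig.fetchAndActivate().addOnCompleteListener {
        onVariant(remoteConfig.getString("recommended_video"))
    }
}
```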

This variable will be either baseline or recommended_video. (Since there are only two cases, you could also use a boolean here and simplify the procedure.)

Depending on how you handle navigation in your application, a simple check is all that is needed to display the correct screen:
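For example, a hypothetical navigation helper could switch between the two Activities like this:

```kotlin
import android.content.Context
import android.content.Intent

// Routes users in the recommended_video variant to the experiment screen,
// everyone else to the baseline.
fun openVideoList(context: Context, variant: String) {
    val target = if (variant == "recommended_video") {
        VideoListExperimentActivity::class.java
    } else {
        VideoListActivity::class.java
    }
    context.startActivity(Intent(context, target))
}
```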

That’s it. Make sure to initialize your testing framework as soon as possible. If you are fetching the user’s segment on demand from a backend or service, this will result in unnecessary delays — which can have a negative impact on your application’s user experience. Firebase takes care of caching the fetched values on the device; if you are using your own solution, or one that does not support caching, you will have to handle this yourself. Depending on your use-case, it makes sense to add a fallback option (for example baseline) to make sure users are not stuck in an undefined state if they have no internet connection.

At this point, we have created our test and are displaying the right view with the right behavior to our test user segment. What’s left is measuring the impact of our new feature.

To measure the impact of the feature, key performance indicators / conversion events have to be defined. In the case of our video-streaming application, this might be a video view. Each time a user clicks on a video and starts to watch it, an event is fired to an analytics service of our choice (for example Firebase Analytics, Amplitude, …). This allows us to analyze how many videos are being watched over time, but it alone does not allow us to differentiate between our test user segments.
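Such a conversion event could be logged like this; the event name video_view and the parameter video_id are illustrative choices, not names prescribed by Firebase:

```kotlin
import android.os.Bundle
import com.google.firebase.analytics.FirebaseAnalytics

// Fired whenever a user starts watching a video.
fun trackVideoView(analytics: FirebaseAnalytics, videoId: String) {
    analytics.logEvent("video_view", Bundle().apply {
        putString("video_id", videoId)
    })
}
```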

Each of our users is in one of our two buckets — baseline and recommended_video. These are exclusive and bound to the user (as long as the test is running), which allows us to store them as a user property and send that property to our analytics service. For our test with the video recommendation, we’ll create a user property called test_video_recommendation and set its value to the String we receive from Firebase (baseline vs. recommended_video). Every time we fetch and activate the remote configuration from Firebase, we’ll update the user’s property value (since it might change once the test has concluded).
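With Firebase Analytics this boils down to a single call, sketched here for the assumed variant string:

```kotlin
import com.google.firebase.analytics.FirebaseAnalytics

// Persists the assigned variant as a user property so all subsequent events
// can be grouped by it; called after every fetch-and-activate.
fun updateExperimentProperty(analytics: FirebaseAnalytics, variant: String) {
    analytics.setUserProperty("test_video_recommendation", variant)
}
```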

The user property allows us to group our analytics data and compare the segments. Additional filters might give us better results, depending on the conditions we have applied within Firebase Remote Config. In our test case, we are only utilizing the random percentiles, which means the only choice we have is to filter by “new users” vs. “all users” (if your analytics service doesn’t provide this, you could also use a sign-up or onboarding event to determine whether a user is a “new user”). Our feature is supposed to increase user engagement and retention. Since there is no clear conversion event like a purchase or subscription, and retention matters more here, we will compare all of our gathered data to determine whether the test was successful or not.

To be able to draw conclusions from our test, we will let it run until we have achieved a certain statistical significance and trust in our results. This can be calculated with, for example, the Kissmetrics significance calculator by entering the number of users in each segment and the number of conversion events (in our case, video views).
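For a rough idea of what such a calculator does under the hood, here is a two-proportion z-test comparing the conversion rates of the two variants; an absolute z-score above roughly 1.96 corresponds to about 95% confidence (two-sided):

```kotlin
import kotlin.math.sqrt

// Two-proportion z-test: compares the conversion rate of the baseline vs. the variant.
fun zScore(conversionsA: Int, usersA: Int, conversionsB: Int, usersB: Int): Double {
    val rateA = conversionsA.toDouble() / usersA
    val rateB = conversionsB.toDouble() / usersB
    val pooled = (conversionsA + conversionsB).toDouble() / (usersA + usersB)
    val standardError = sqrt(pooled * (1 - pooled) * (1.0 / usersA + 1.0 / usersB))
    return (rateA - rateB) / standardError
}
```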

Firebase displaying the changes towards our defined goal metric grouped by variants. In this case, our goal was “Daily user engagement”.

Once we have trustworthy test results, we can go back to Firebase Remote Config and send all of our traffic to the winning variation. This happens instantly for all new users and rolls out gradually for returning users of our application due to Firebase’s caching policy (depending on your set-up).

All that’s left for us is to clean up our code — which, thanks to our architecture, is pretty straightforward. In our case, the video recommendation clearly won and increased both user engagement and retention. Cleaning up, we remove VideoListActivity.kt, rename VideoListExperimentActivity.kt to match our naming conventions, remove the original layout XML file, and remove the navigation switch for this test.

Once your application reaches a certain number of users and your team grows, you will be able to run multiple tests in parallel by defining multiple user segments. Keeping track of running tests and removing old ones has to be an important part of keeping the code quality high — once a test has concluded, all losing variations become “unused code”. With 10–50 tests running at the same time, some of them with significant changes throughout the different layers of your application, you will be glad you chose a flexible architecture.

You can find the complete example on my GitHub profile:

Feel free to share your thoughts on this. If you have worked with A/B tests before, I’d appreciate it if you shared your experience on how you keep experiments separated.
