Product Optimization: A/B Testing with native mobile apps for iOS and Android

Paul Hackenberger
Published in Axel Springer Tech
10 min read · Feb 14, 2023
A/B Testing with Google Firebase

A/B testing is an established method for product optimization, but it’s hard to do right for native mobile apps.

Usually one or more commercial tools are involved, and an ongoing series of A/B tests is run, sometimes even by a specialized A/B testing team.

The main purpose of A/B testing is to optimize the product funnel, but it also serves other goals, like product discovery or optimizing specific user KPIs, for example push subscriptions.

On the web you can easily inject custom JavaScript and dynamically adapt the appearance according to the A/B testing targets while the page runs in the browser. All this can happen almost completely independently of the main team developing the web page.

Native apps with native iOS Swift or Android Kotlin code can’t be changed (easily) at runtime, especially after having passed the app store review, being released in the app stores and running in the mobile OS sandbox on the user’s device.

Even though it’s not easy, it is still possible to do some A/B testing with native mobile applications, depending on your requirements and use case.

In the following, I would like to inform you about the main types of A/B testing that you can use for product optimization with native apps for Android and iOS.

Synthetic Control Method

The Synthetic Control Method aims to estimate the counterfactual performance of an application — as if it had not undergone any changes — by creating a synthetic version of the app from a combination of multiple untreated units with similar characteristics. This estimated data is then compared with the actual data from the modified application.
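
Roughly, in the notation commonly used for this method (a sketch; the exact setup varies), the synthetic counterfactual for the modified app is a weighted average of the untreated comparison units:

$$\hat{Y}_{1t} = \sum_{j=2}^{J+1} w_j^{*}\, Y_{jt}, \qquad w_j^{*} \ge 0, \quad \sum_{j=2}^{J+1} w_j^{*} = 1$$

Here unit 1 is the modified app, units $2,\dots,J+1$ are the untreated units, $Y_{jt}$ is the metric of unit $j$ at time $t$, and the weights $w_j^{*}$ are chosen so that the synthetic app matches the treated app’s pre-change data as closely as possible. The estimated effect of the change at time $t$ is then $Y_{1t} - \hat{Y}_{1t}$.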

Impressive formula, right?

Following the release of a new app version with additional features, its real-world performance metrics are compared against the projections made by the Synthetic Control Method. This comparison helps in drawing conclusions about the impact of the new features. This method is particularly effective for applications whose performance is not significantly influenced by external events and exhibits a stable, predictable pattern of data.

AS National Media and Tech works in the news media environment, where we see recurring working-time and night-time patterns, interrupted by push message peaks. While the main patterns recur, a special news event, like the discovery of alien life 👽 on a far, far away planet, could totally change the default pattern. Such pattern changes can’t be reflected by the Synthetic Control Method and could therefore lead to wrong conclusions.

Typical two-day usage statistic with night/day changes and push peaks

Double Feature Implementation

The naive approach is to implement all the alternative versions of a feature you want to test in the application, and then control per configuration which option is displayed.

You then compare the statistics of the user group with test feature A enabled versus the user group with test feature B.
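
As a minimal sketch of this idea (the names and the configuration source are hypothetical, not our actual code), both variants ship in the binary and a remotely controlled value decides which one a given user sees:

```kotlin
// Hypothetical sketch: both implementations are in the app binary,
// and a configuration value decides per user which one is shown.
enum class CommentsVariant { A, B }

// In practice the value would come from your remote configuration tool.
fun resolveCommentsVariant(configValue: String): CommentsVariant =
    if (configValue == "B") CommentsVariant.B else CommentsVariant.A

fun showComments(variant: CommentsVariant) {
    when (variant) {
        CommentsVariant.A -> showInlineComments()      // variant A: comments below the article
        CommentsVariant.B -> showCommentsBottomSheet() // variant B: comments in a bottom sheet
    }
}

// Stubs for the two competing implementations.
fun showInlineComments() { /* ... */ }
fun showCommentsBottomSheet() { /* ... */ }
```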

The disadvantage of this approach is the requirement to fully specify, design, implement and test two versions of the same feature, and then discard the version which proved to be less performant in the test.

Developer time is expensive and velocity in feature output is crucial in a fast-moving market; so this was not the way for us to go.

Implement feature for one OS only — and compare

Similar to a phased release, you could implement a feature for one OS only and see the impact on the figures compared to the untouched OS.

We don’t use this type explicitly. But sometimes the development speed of the native iOS and Android teams differs, and if one team makes faster progress, we might decide to release a feature for one OS first.

Of course in this case we watch the numbers and check the success of the feature or change!

Beta Test Group/UX Lab

An obvious call: try new features first with your beta tester or crowd tester group, at any implementation depth (teaser, prototype, MVP, …), and collect usage stats plus user feedback. All this can be done based on the same data being provided both to users in production and to beta users who get the additional features.

If you are running an internal test group with Apple TestFlight, you don’t even have to wait for the beta review by Apple that is mandatory for public test groups.

Android has even fewer restrictions with its Early Access or Beta testing groups.

If you can afford it, you could also collect feedback on a paper prototype or a click-dummy prototype in a UX lab with an eye tracker!

Of course, we do that!

Prototype Implementation plus Poll

Similar to the double feature implementation, you could just place a hint or a button for a new feature in the app, without completely implementing the feature, to check whether any users are interested in the new feature at all.

You could also add a poll describing the planned feature and asking users about their expectations.
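
A minimal sketch of such a “fake door” teaser, assuming you track interest with Google Analytics for Firebase (the event and parameter names here are made up for illustration):

```kotlin
import com.google.firebase.analytics.ktx.analytics
import com.google.firebase.analytics.ktx.logEvent
import com.google.firebase.ktx.Firebase

// Called when the user taps the teaser for the not-yet-implemented feature.
fun onFeatureTeaserTapped() {
    // Log the interest so you can count how many users would actually use the feature.
    Firebase.analytics.logEvent("feature_teaser_tapped") {
        param("feature", "audio_articles") // hypothetical feature name
    }
    // Then show the poll or a short "coming soon" explanation.
}
```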

It strongly depends on your user group whether such approaches lead to more engaged users who look forward to the real implementation of an interesting feature, or whether you trigger product disappointment and the feeling of being an abused, unpaid beta tester for a banana product that ripens at the customer…

Phased Release/Staged Rollouts Testing

Both the iOS and Android app stores allow phased releases or staged rollouts. This gives you a free A/B testing capability out of the box!

When you release a new version, adoption will not be instant, especially when you stretch the distribution via a phased release.

By simply comparing the statistics of the different app versions, you can compare the performance of the new version against the existing one, on the same data (sic!) and without changing anything.

Because phased release testing comes practically without cost and involves neither additional coding nor a negative impact on our customers, we do this type of passive testing by default with each release.

Feature Toggles and Config Options

A promising approach, especially when you are writing modular apps, is to use feature toggles to hide or show features at runtime for specific users.

You may even combine feature toggles with feature config options, to not only show or hide specific features but also provide per-feature configuration options, like the number of elements displayed, the colors used or the texts shown for a specific feature.

It requires some additional architectural and design considerations, but once done right, you can run pretty interesting A/B tests with different combinations of features and feature config options.
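
As an illustration (the names are hypothetical, not our actual code), a per-feature configuration could look like this, with a toggle plus the feature-individual options behind one small interface:

```kotlin
// Hypothetical sketch of a feature toggle combined with per-feature config options.
data class FeatureConfig(
    val enabled: Boolean,      // feature toggle: show or hide the feature
    val maxItems: Int,         // config option: number of elements displayed
    val accentColor: String,   // config option: color used by the feature
    val title: String          // config option: text shown for the feature
)

// The provider hides where the values come from (local defaults, remote config, ...).
interface FeatureConfigProvider {
    fun configFor(featureKey: String): FeatureConfig
}

fun renderPodcastTeaser(provider: FeatureConfigProvider) {
    val config = provider.configFor("podcast_teaser") // hypothetical feature key
    if (!config.enabled) return
    // Render at most config.maxItems teasers, using config.accentColor and config.title.
}
```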

There are some mature tools and services available that might prove helpful in this matter.

I am personally a big fan of the Google Firebase services (greetings to Luiz Gustavo, who unfortunately switched to TensorFlow), which took over providing all the basic services required to run apps, in a similar fashion to what Parse did before.

The interesting services in our context are Firebase Remote Config in combination with Firebase A/B-Testing.

With Firebase Remote Config you can specify app parameters and update them on the fly, to change the behavior and appearance of your app without publishing an app update.

It’s important to know that, for performance reasons, configuration updates are by default triggered by the client-side SDK and happen only every 12 hours. If you need real-time updates, you might want to read about the propagation of Remote Config updates in real time.
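
A minimal Android/Kotlin sketch of this setup; the 12-hour value mirrors the default minimum fetch interval, and lowering it is something you would typically only do in debug builds:

```kotlin
import com.google.firebase.ktx.Firebase
import com.google.firebase.remoteconfig.ktx.remoteConfig
import com.google.firebase.remoteconfig.ktx.remoteConfigSettings

fun setUpRemoteConfig(isDebugBuild: Boolean) {
    val remoteConfig = Firebase.remoteConfig

    // By default the SDK fetches new values at most every 12 hours (43,200 seconds).
    // For debug builds a much shorter interval makes config changes visible quickly.
    val settings = remoteConfigSettings {
        minimumFetchIntervalInSeconds = if (isDebugBuild) 60 else 43_200
    }
    remoteConfig.setConfigSettingsAsync(settings)

    // Fetch the latest values and activate them so the app can read them.
    remoteConfig.fetchAndActivate().addOnCompleteListener { task ->
        if (task.isSuccessful) {
            // Values are now available via getBoolean/getString/getLong/getDouble.
        }
    }
}
```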

With Firebase Remote Config you can organize and provide all the parameters required for feature toggles and feature configuration options.

Our team wrote a little helper to retrieve default values at compile time, which might be useful for you as well.

Now, in combination with Firebase A/B Testing, you can run experiments that give different user segments different parameters and compare them against the baseline regarding the impact on specified KPIs.
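
From the app’s point of view, an experiment just changes the values the SDK returns; the client code is the same whether a user is in the baseline or in a variant (the parameter keys here are made up):

```kotlin
import com.google.firebase.ktx.Firebase
import com.google.firebase.remoteconfig.ktx.remoteConfig

// The experiment assigns different values to these parameters per user segment;
// the app simply reads whatever value is currently active for this user.
fun applyStageConfig() {
    val remoteConfig = Firebase.remoteConfig
    val showWeatherWidget = remoteConfig.getBoolean("show_weather_widget")
    val teaserCount = remoteConfig.getLong("stage_teaser_count")
    // Build the stage accordingly ...
}
```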

If the experiment wins over the baseline, you can keep the new configuration, and roll back in case it had no impact or even a negative one.

Firebase Remote Config changes are versioned, which proved to be useful in case of a rollback.

My advice would be to not only look at and optimize specific parameters, but to always keep an eye on your most important global KPIs with every experiment. It’s counterproductive if you optimize a small tracking parameter but globally lose revenue through a short-sighted optimization!

Legal Considerations

Capturing the KPIs required for experiment evaluation requires Google Analytics for Firebase; in fact, Google Analytics is the heart of all Firebase services.

You should check with your legal department which setup is appropriate to comply with the European GDPR (DSGVO in Germany), or the CCPA in the States, before implementing it.

Greetings to our former lawyer Matthias Horn, who helped us a lot in understanding the legal options and limitations!

Automatic personalization

On top of the manually driven experiments, Firebase also started a service that can automatically optimize the parameters per individual user, to find the optimal parameter combination for each user.

We haven’t given it a shot yet, but it sounds like a really interesting approach to automated, individual product optimization.

Think about it: instead of globally optimizing a product, with automatic personalization you arrive at something like user-specific optimization, which results in a completely new way of planning, executing and monitoring product optimizations.

I would imagine that, at the end of the day, user segments sharing the same config are implicitly created. But how do you track and understand which segments are being created and why that set of configuration values was optimal for them?

I would be very interested if anyone has put this into production and can share some interesting findings!

Server-driven UI and Dynamic Endpoints

Endpoint configuration via Firebase Remote Config

The implementation of feature toggles and feature options, using the right tools, is extremely powerful when it comes to A/B testing.

But for specific use-cases, there are even more flexible options available.

In our news apps we follow the Server-Driven UI (SDUI) principle, like many other companies, some of which also publish their libraries as open source.

Server-driven UI comes with some complexity, but for specific use cases it makes perfect sense.

We provide news apps for the BILD and WELT newspapers, which mainly display stages and articles with recurring elements.

You may visualize the elements as LEGO bricks (yes, I still love LEGO!) that you can put together in any way. The more generic the LEGO bricks are, the more creative you can get with different builds.

Using the SDUI approach, we can easily apply any layout changes to articles and stages as long as the LEGO bricks themselves do not change, and we can make those adaptations in the backend without app releases.

Server-driven UI works for visual elements like stacking LEGO bricks, allowing any combination
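
To make the LEGO-brick idea concrete, here is a rough sketch of how a server-driven stage could be modeled on the client (our actual models differ; the component names are made up and kotlinx.serialization is just one possible choice for the JSON mapping): the backend sends a list of generic components, and the app only knows how to render each component type.

```kotlin
import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable

// Hypothetical sketch of generic "LEGO bricks" delivered by the backend as JSON.
// The app renders whatever combination the backend sends, in any order.
@Serializable
sealed class StageComponent {

    @Serializable
    @SerialName("headline")
    data class Headline(val text: String) : StageComponent()

    @Serializable
    @SerialName("article_teaser")
    data class ArticleTeaser(
        val title: String,
        val imageUrl: String,
        val articleId: String
    ) : StageComponent()

    @Serializable
    @SerialName("ad_slot")
    data class AdSlot(val placementId: String) : StageComponent()
}

// A stage is simply an ordered list of bricks; new layouts need no app release
// as long as the brick types themselves stay the same.
@Serializable
data class Stage(val components: List<StageComponent>)
```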

It’s not only layout changes that work without app releases, but also some functional changes: using SDUI we can push much more of the business logic into the backend, enabling fast changes without app development or a release.

On top of that, business logic that you push to the backend no longer needs to be implemented, tested and released twice, once for Android and once for iOS, but only once in the backend!

This enables us to move fast and apply a lot of changes on the fly, while our app teams focus on user experience and features. It also has a positive impact on the team and its members: we are becoming more and more of a cross-functional team, and the mobile developers are increasingly becoming full-stack developers covering mobile and backend!

Coming back to the A/B-Testing: The combination of SDUI and Firebase Remote Config just works like a charm.

Besides feature toggles and feature options, with SDUI we are now also able to put endpoint configurations into Firebase Remote Config.

We now have the power to serve completely different article or stage layouts, logic or personalization options by just implementing different logic in our backend and publishing a new endpoint.

And since the endpoints are configured in Firebase Remote Config, we can run different experiments with Firebase A/B-Testing with just a few clicks!
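
As a small sketch of this pattern (the parameter key and URLs are made up), the app reads the stage endpoint from Remote Config and requests whatever endpoint the experiment assigned to this user:

```kotlin
import com.google.firebase.ktx.Firebase
import com.google.firebase.remoteconfig.ktx.remoteConfig

// Hypothetical Remote Config parameter holding the endpoint for the home stage.
// A Firebase A/B test can assign different endpoints to different user segments,
// e.g. a variant backend that serves a different layout or personalization logic.
fun homeStageUrl(): String {
    val endpoint = Firebase.remoteConfig.getString("home_stage_endpoint")
    return endpoint.ifEmpty { "https://example.com/api/v1/stage/home" } // fallback default
}
```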

Summary

Don’t let yourself be stopped by the nature of native apps, with code that can’t be changed at runtime and the almighty app store review and release barrier.

A/B testing for product optimization is still possible if you think about how to use the possibilities that the native app environment provides out of the box, and if you make active decisions about how to architect your app to enable dynamic A/B testing in the future.
