The ultimate guide to more effective A/B testing on Google Play Store

Mateusz Wrzeszcz
17 min read · Apr 9, 2022


🚨 April 13th, 2022 UPDATE: Just this week, Google announced an update on how store listing experiments will run in Play Console, allowing for more control over the experiments plus improved statistical robustness (including the MDE calculator)!

Source: ASO Stack, Author: Maryna Assonova

This guide goes deep into the concept of MDE (Minimum Detectable Effect), the new configuration parameter in Play Console, as well as other statistics that should make it easier to understand the implications MDE has on your tests.

What’s the guide about?

After Apple released their A/B testing tool called Product Page Optimization and I lost all hope that they would eventually provide us with a *usable* A/B testing tool for ASO, I realized it was time to re-evaluate my experiment roadmaps and apologize to Google Play Store Experiments for all the bad words I’d said about its reliability and flaws.

Even though it’s not entirely correct to call it an A/B testing tool, Play Store Experiments are still my go-to solution to quantitatively validate all assumptions I got in my testing backlog. And seeing how this is also the case for many other ASO teams, I suppose beggars can’t be choosers.

Consider this post the first in what will most likely be a recurring series of posts sharing my discoveries on how to run more effective experiments on Google Play Store and ensure their results are statistically valid. All of which have been honed through the past years of research and hands-on experience, improving the conversion rates for dozens of apps.

➡️ Understanding the Minimum Detectable Effect for ASO A/B testing

A proper understanding of MDE (Minimum Detectable Effect) and its implication for your experiments is crucial for designing and controlling the A/B testing process. With a solid grasp of MDE, you can understand what degree of change (and traffic volume) is required in particular markets to get more reliable results from your A/B tests.

MDE is a statistic most typically used to calculate the sample size required for A/B tests. While every sample size calculator needs an MDE to return results, sadly, not many guides detail how MDE impacts these calculators, which in my opinion is a significant oversight.

Below is a screen capture from one of the most popular sample size calculators in digital marketing, where the Minimum Detectable Effect is, alongside the baseline conversion rate, a crucial configuration parameter:

Source: https://www.evanmiller.org/ab-testing/sample-size.html
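To give a rough sense of what such a calculator does with the MDE you enter, here is a minimal sketch in Python. It uses the standard two-proportion z-test approximation, not necessarily the exact formula behind Evan Miller’s page or Play Console, and the baseline CVR, MDE values, significance level, and power below are illustrative assumptions:

```python
from scipy.stats import norm

def visitors_per_variant(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Rough visitors-per-variant estimate for a two-sided two-proportion z-test."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)        # treatment CVR implied by the MDE
    pooled = (p1 + p2) / 2
    z = (norm.ppf(1 - alpha / 2) * (2 * pooled * (1 - pooled)) ** 0.5
         + norm.ppf(power) * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5)
    return z ** 2 / (p2 - p1) ** 2

# 40% baseline CVR: halving the MDE roughly quadruples the traffic required
for mde in (0.10, 0.05, 0.025, 0.01):
    print(f"MDE {mde:.1%}: ~{visitors_per_variant(0.40, mde):,.0f} visitors per variant")
```

With a 40% baseline, this sketch reports roughly 9.5k visitors per variant for a 5% relative MDE, but well over 200k per variant for a 1% MDE, which is exactly why the rest of this post keeps coming back to the size of the change you test.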

However, we are going to save the calculator discussion for another time.

Instead, this post will focus on understanding MDE and its role in ensuring your experiments have enough statistical power to detect changes between your variants. I will share my discoveries and suggest a validated method that will empower you to utilize the MDE in your testing process for ASO — to prioritize the most valuable markets, decide on the required degree of differentiation between variants, and finally, calculate the minimum sample size.

How to interpret the Minimum Detectable Effect for A/B testing purposes?

MDE isn’t a complicated concept, but it can be tricky to understand just how it impacts A/B testing experiments, especially ASO A/B testing.

So, I’ve split the definition of Minimum Detectable Effect into two parts to make things a bit more digestible.

1) Minimum Detectable Effect is the minimum relative difference in conversion rate between the default variant and the treatment* that we want to be able to reliably** detect.

*the “Version B” of your test
**to a certain degree of statistical significance

2) Minimum Detectable Effect is used to specify the minimum expected improvement in CVR below which the experiment is not worth running, because the potential positive impact would be too small to justify the time, effort, or money devoted to it.

To give you some context, MDE has significant application in more “sophisticated” tests, such as e-commerce platforms testing the checkout flow or sign-up forms. These tests require changes in the platform’s source code and thus developer involvement, which ultimately leads to additional costs and can make the test not worth the time and money it needs.

Another example is the case of third-party testing tools, like SplitMetrics or StoreMaven. With these tools, a specific volume of users is needed for each test you run; they first have to be brought to a landing page (a fake app store page) through paid campaigns and then redirected to the original store page/listing. In such a case, MDE can be weighed against the costs of using the tool plus the budget needed to drive enough traffic from paid campaigns vs. the expected ROI (Return on Investment). ROI in that case simply means how much money you’d make if you managed to improve the CVR to the desired level with this specific experiment and bring more users into your app.

Minimum Detectable Effect and native ASO testing

While it’s relatively easy to decide on the MDE when you use third-party testing tools, in native A/B testing for ASO (Play Store Experiments & Product Page Optimization) the situation looks a bit different. The traffic which gets into experiments does not cost us anything (or is a cost you’d have to take on anyway), so weighing ROI against the costs spent on testing isn’t applicable.

Even so, we still want to make sure that the results given by our testing tool are statistically valid, so now it’s time to lean towards the first part of the definition given above, which focuses more on the statistical validity of the observed results.

The lack of costs connected to native A/B testing mentioned above doesn’t mean it’s completely free (and I’m firmly against treating native A/B testing as a cost-free tool). Still, the resources needed for native ASO testing are significantly lower than for any project involving paid campaigns or costly changes to the software’s source code.

Being aware of the above and knowing that MDE is “the minimum relative difference in conversion rate”, some may ask: “why not simply aim for the slightest possible change in CVR?” For example, MDE = 1%, as it’s evident that even a minor change in CVR will be valuable from the performance perspective (imagine your app generates 100k downloads each month; a 1% relative lift in CVR would bring roughly 1,000 additional downloads per month, or about 12,000 per year!).

And that’s where usually the first confusion appears…

Obviously, most developers would like to implement a variant bringing even the smallest possible lift in CVR, because every lift ultimately brings additional users to the app. However, if you want the tool you use to catch minor, reserved changes in the design and still guarantee that the results are statistically valid, the number of Installers that has to be included in the test to ensure proper accuracy becomes prohibitively large.

In plain words, the reason for this is that if the difference between two variants is minor (e.g. you decide to change the color of the phone mockup frame you use to present screens from the app), users’ behavior doesn’t differ much between the two samples. To reach a given confidence level, many Installers are needed to compare the samples and report on each variant’s performance — in statistical terms, we’re talking about a concept called “statistical power”, which we’ll briefly touch on in the next section.

In a proper statistical test, to precisely control your experiments, you have to account for statistical power, which Georgi Georgiev, the founder of Analytics Toolkit, explains as the “ability to detect a difference between test variations when a difference actually exists”.

According to the same expert, statistical power is a function of sample size and confidence level, describing the probability of rejecting the null hypothesis over all possible values of the parameter of interest (in our case, CVR).

Here is a great example created by the same author, depicting the difference between low- and high-powered experiments.

Source: https://blog.analytics-toolkit.com/2017/importance-statistical-power-online-ab-tests/

A commonly used value for statistical power is 80%, which means that the test has an 80% chance of detecting a difference equal to the Minimum Detectable Effect. As mentioned previously, a test has a lower probability of detecting smaller lifts and a higher probability of detecting larger lifts.

What’s the easiest way to impact statistical power? Increase your sample size or the estimated effect you’re trying to detect — the aforementioned Minimum Detectable Effect.
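To see that relationship in numbers, here is a small sketch using the normal approximation for a two-proportion test. The baseline CVR, relative lift, and visitor counts are illustrative assumptions, not Play Console’s internal model:

```python
from scipy.stats import norm

def approx_power(baseline_cvr, relative_mde, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    # standard error of the difference in conversion rates between the two variants
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant) ** 0.5
    z_alpha = norm.ppf(1 - alpha / 2)
    # probability that the observed difference clears the significance threshold
    return norm.cdf(abs(p2 - p1) / se - z_alpha)

# A 2% relative lift on a 40% baseline CVR:
print(approx_power(0.40, 0.02, 5_000))    # ~0.13 -> badly underpowered
print(approx_power(0.40, 0.02, 80_000))   # ~0.90 -> comfortably powered
```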

In order to not complicate the understanding of MDE, I’m not going into the nitty-gritty of statistical power. If you want to dive deeper into the concept, check this great article from Georgi Georgiev.

What is crucial for you to take away from this concept is:

The smaller the implemented difference to your treatment is, the more installers you need to be able to reliably detect the change in variants’ performance.

Therefore, “playing” with MDE is about finding a trade-off between implementing minor, hardly noticeable changes and running the experiments for a prohibitively long period of time.

👉 If you want to make sure your experiment’s purpose is sensible, always ask yourself these three simple questions:

Is the change strong enough to reach visitors’ minds?

Will visitors notice any difference in the first 3 seconds of looking at the store listing assets?

Is the change designed to impact visitors’ willingness to download the app, or is it only an aesthetic adjustment?

If you answered “yes” to all the above questions, you are on the right track to getting the most out of your experiments.

Remember that the majority of your visitors don’t work in design/marketing fields, and unless the change is really undeniable and outstanding, there’s a high chance no one will catch it.

Practical application of the MDE in testing

To help you better understand the concept, below I pasted a graphic from a sample size calculator I used in the past, which represents the relation between MDE and the time needed to reliably detect the changes for your experiment.

Here is the screenshots experiment for a fake app, called A$O5:

A$O5 is a well-established and moderately popular app, generating approx. 350k downloads per year in the UK itself.

Daily performance details:
Store listing visitors: 2500 per day,
Store listing acquisitions: 1000 per day,
Conversion Rate: 40%.

Source: https://abtestsamplesize.herokuapp.com/

As you can see, in 7 days* we’re able to reliably detect the winning variant only if the relative change in CVR is at least 4.7%.

*(to be exact, 6.7 days, but it’s good practice to round the length of experiments up to full weeks to cover the most typical business cycle for mobile apps)

The number of days needed to reach a particular MDE level is calculated based on A$O5’s current traffic and the required sample size (16,912 visitors).

Calculations:
Required Sample Size / Daily Visitors = Days Needed to Reach the Required Sample Size

16912 / 2500 = 6.76 ~ 7 days
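The same arithmetic as a quick sanity check in code, using the made-up A$O5 numbers from above:

```python
import math

required_sample_size = 16_912   # visitors needed in total, as reported by the calculator
daily_visitors = 2_500          # store listing visitors entering the experiment per day

days_needed = required_sample_size / daily_visitors
print(round(days_needed, 2))           # 6.76
print(math.ceil(days_needed / 7) * 7)  # rounded up to full weeks -> 7
```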

If your app, like A$O5, is well-established and you’ve already made some optimization efforts, achieving a 4.7% relative change in CVR will require drastic and bold changes in the design (which, in practice, is almost impossible). In most cases, this is not a result you can achieve by changing the background color, slightly playing with the wording, or swapping the order of the 6th and 7th screenshots, but rather by a bold redesign of the whole set and/or adopting a completely different design style direction.

👉 If you haven’t been aware so far that reserved, granular, and risk-averse testing requires enormous volumes of visitors, you really should consider reevaluating your experiments’ backlog!

So, unless your app is getting an enormous volume of traffic, you can’t be timid with your testing ideas. If you want to balance the time needed for tests with detecting meticulous changes, you have to go bold!

Below I’m presenting my subjective opinions on some of the tests I managed to find using AppTweak’s “timeline” feature, to give you a better understanding of what I consider low-potential, reserved, and bold experiments.

Examples of tests that had a relatively low potential to provide conclusive results (predicted effect on CVR <1%):

  • changing the order of the screens visible in the first-impression frame
An experiment HBO MAX ran some time ago in the US. Since all four screens are visible without scrolling, it makes no considerable difference in what order you position them. It would, however, make a difference on the App Store, where users initially see only the 1st screenshot and half of the 2nd, not four as on the Play Store.
  • changing very minor elements which don’t have a lot of potential to impact visitors’ motivation to download the app:
Can you see the difference between these two variants? This experiment run by Amazon Music probably didn’t have much chance to be conclusive due to how minor the implemented change was. (If you still can’t find the difference, look once again at the captions in the first screenshot)
No matter how significant a percentage of the Adidas audience are complete sneaker-heads, changing the sneaker model visible on the first screen also didn’t have a high potential to provide conclusive results. Even if the sneaker was a limited edition or the best-selling shoe of the month, the behavior of the majority of users probably wouldn’t differ much.
Nike removed the black phone frames from their screenshot set and reverted the background colors of the 3rd and 4th screenshots. Again, such an adjustment has a very low potential for being a meaningful change impacting visitors’ willingness to download the app, hence having low chances of getting conclusive results.

Examples of tests I consider reserved (predicted effect on CVR >1%, <3%) — these are definitely worth the effort if you have enough traffic (or a lot of time), since they can provide granular insights on a specific design element change or style direction:

  • changing only one screen in the first-impression frame (screenshots visible without scrolling)
GetUpside added a totally new screenshot in the 1st position, pushing the remaining screenshots back in the order. This is an excellent example of iterative testing, which can validate certain assumptions about specific design elements. Here: the importance of the “cash back” feature.
  • redesigning the set aesthetics without changing the main message
PayPal tested screenshots with a very different design style than their default image set, but kept the main message, screenshots’ order, and focus on features the same as the original variant.
Adidas probably noticed that their previous experiment (listed above) didn’t have much chance to bring conclusive results. So they did their homework and tested a more differentiated (although still pretty reserved) variant. The aesthetics changed, but the messaging & presented features remained the same.

Examples of tests I consider bold (predicted effect on CVR >3%) — these should be your go-to solution if you’re not a well-known brand and can’t drive tons of traffic to your app each day of the experiment, or you simply don’t want your tests to take months:

Coinbase went really bold with one of their recent experiments, which is an excellent example of how diversified your test’s variants should be if you expect a high effect on CVR. It’s not just the background colors, but the whole messaging, design style, and feature order that changed significantly.
Clue also decided to test experiment variants focused on entirely different features, presented a new design style (with prominent pop-outs, modified backgrounds), and even changed the default screenshot shape.
Spotify tested the red/orange variant against the default image set. For a long time, I wasn’t sure whether to treat this as a bold or actually a reserved test, but eventually, I’d categorize it as semi-bold, because both the aesthetics and the messaging were modified to a distinct degree.

Given that even the most popular apps like Coinbase, Clue, or Spotify, which generate thousands of downloads each day, test bold design assumptions and completely redesign their screenshots, would you still find it reasonable to test a single design element change at a time, or a granular aesthetic adjustment, with a less popular app?

I’m not saying that you should give up iterative testing altogether, but you should at least try to diversify your roadmaps with both bold, highly differentiated tests and more reserved, iterative ones.

Also, keep in mind that the primary goal of A/B testing should be to quantitatively validate qualitative assumptions about your audience, not to blindly shoot in the dark and count on dumb luck.

👉 Rather than trying to understand how a singular design element can impact your CVR, base your testing roadmaps on validating solid hypotheses built on strong qualitative insights.

👉 Even if you don’t have any qualitative insights yet, try leveraging what other teams in your organization have learned from areas like performance marketing or CRM, to support your hypothesis.

⚠️ Google has recently pre-announced changes to Play Console, which are meant to add “statistical robustness and additional experiment configuration options”. However, setting MDE in the experiment configuration will not directly impact the way in which your experiment is conducted. Its main purpose is only to calculate the estimated time needed for the experiment to end. ⚠️

Where to find the ‘estimated effect’ in Play Store Experiments?

To understand the MDE and its implications on your experiments, first you have to know where to look for the ‘estimated effect’ your experiments have.

As a starting point, I suggest you check past experiments in your Play Console library and try to understand the typical effect on conversion for specific listing elements, such as screenshots, the featured graphic, or the short description. Stick only to experiments which reached statistical significance — these are all experiments where Google Play Store displayed any of these ‘recommendations’:

  • “Variant X performed best.”
  • “Current listing performed best.”
  • “All variants performed similarly to your current store listing.”

Be mindful, though, that this is not always the optimal approach, since you have no guarantee that your past experiments were conducted properly, especially whether they reached the required sample size* (you only know that they reached a 90% confidence level, provided you stick to experiments that gave one of the specific ‘recommendations’ mentioned above).

*More about statistical significance & reaching sample size in this post.

Remember that experiments that produce negative or inconclusive results are as important as tests with positive ones. There is always insight to be gained from a well-run experiment, even if it “fails”.

Additionally, you can exclude all experiments that haven’t lasted for at least a full 14 days (7 if you’re a big-name app) to provide another level of assurance that your ‘estimated effect’ is valid.

The ‘estimated effect’ of the experiment is the average of the lower and upper bounds of the given performance range.

In the example given above, you should calculate the average of the lower (-1.4%) and upper (+4.8%) bounds. In our case, that’s 1.7%, which is the estimated effect this variant may have on CVR if it were applied.*

*more on how to interpret the performance bar in future publications.
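For completeness, the same averaging step in code, using the bounds from the example above:

```python
# Performance range Play Console reported for the variant in the example above
lower_bound = -1.4   # % (lower end of the performance bar)
upper_bound = 4.8    # % (upper end of the performance bar)

estimated_effect = (lower_bound + upper_bound) / 2
print(round(estimated_effect, 1))   # 1.7 -> the value to note down as this variant's estimated effect
```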

What MDE value to adopt if you have no valid historical data?

It can happen that none of the experiments in your library reached statistical significance or were conducted correctly. If that’s the case, below are the general values you can expect from your experiments:

(All values present relative (not absolute) difference of CVR.)

Newly launched app, with no previous optimization efforts:

  • distinct change to the icon: 5% - 25%
  • distinct change to the screenshots 1–4: 5% - 20%
  • distinct change to the screenshots 5–10: 1% - 5%
  • redesign of the whole screenshots set: 5% - 25%
  • distinct change to the featured graphic (only if the video is added):
    5% - 20%
  • adding/removing the app preview video: 5% - 20%

Well-established app, already in a process of conversion optimization:

  • distinct change to the icon: 1% - 4%
  • distinct change to screenshots 1–4: 1.5% - 3%
  • distinct change to screenshots 5–10: >1.5%
  • redesign of the whole screenshots set: 2% - 6%
  • distinct change to featured graphic (only if the video is added): 1.5% - 5%
  • adding/removing the app preview video: 1% - 5%

Remember that these values are only indicative, based on experiments I’ve run as an ASO specialist. Even though I tried to average them across results from various apps and industries, your app has a different audience base. Also, your screenshots might be more or less optimized; therefore, your typical estimated effect may be completely different from mine. Consequently, it’s impossible to predict the exact impact the experiment’s variants may have on CVR — if we knew it beforehand, we wouldn’t need A/B testing at all.

Bonus tip for testing screenshots (since you’ve come this far in this article)

As you can see in the list above, the MDE changes significantly depending on which screenshots you decide to change. This indicates how important it is to prioritize the set of screenshots visible on your app page without scrolling (a.k.a. the first-impression frame, similar to the ‘above the fold’ rule in desktop A/B tests).

The reason for that is pretty straightforward — most users never scroll through your screenshots (according to the most recent studies done by StoreMaven, as little as 4% of users scroll through the portrait screenshots gallery on average). Of course, this number can differ significantly depending on the app’s specific audience or traffic split. Still, I’d say it’s reasonable to assume that the majority of users generally don’t scroll through the screenshots.

PlutoTV and Snapchat store listing pages in Google Play Store.

Considering how few users interact with the screenshots visible after scrolling, plus the fact that the Play Store doesn’t validate whether visitors have actually seen your experiment’s variants, it doesn’t make much sense to test changes introduced to screenshots positioned in the middle (or at the end) of your set, no matter what volume of traffic your app gets.

Building your own MDE Library

If you’re serious about your A/B testing process, it makes sense to maintain an MDE library, regularly gathering all your experiment results with the estimated impact per market and element. Thanks to this approach, you’ll be able to create even more accurate estimates of what MDE you can expect from your experiments.

Don’t stress too much about the accuracy, though. It’s not about getting it exactly right, but about understanding how a certain degree of change impacts your CVR.

Example of a simple MDE Library maintained on a regular basis
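How you keep that library is up to you; a spreadsheet works perfectly well. For the programmatically inclined, here is a minimal, hypothetical sketch (all column names, the file name, and the values are illustrative only) that appends each finished experiment to a CSV file:

```python
import csv
import os

LIBRARY_PATH = "mde_library.csv"   # hypothetical file name
FIELDS = ["date", "market", "element", "change_type",
          "lower_bound_pct", "upper_bound_pct", "estimated_effect_pct", "conclusive"]

# Illustrative entry: the estimated effect is the midpoint of the reported performance range
experiment = {
    "date": "2022-03-01", "market": "GB", "element": "screenshots 1-4",
    "change_type": "bold redesign", "lower_bound_pct": -1.4, "upper_bound_pct": 4.8,
    "estimated_effect_pct": 1.7, "conclusive": True,
}

write_header = not os.path.exists(LIBRARY_PATH)
with open(LIBRARY_PATH, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()   # write the header only when the file is new
    writer.writerow(experiment)
```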

Now that (I hope) you have a better understanding of MDE and where to look for it, you’re ready to prioritize & regroup the experiments in your backlog based on your traffic & conversion rate.

A good idea would be to calculate the required minimum sample size (e.g. utilizing the calculator I mentioned above), group the markets you run tests in into buckets, and plan the whole ideation & creation process accordingly:

Example of a theoretical grouping based on markets bandwidth

Obviously, each market is a unique combination of CVR and traffic volume, so the possibilities are endless (e.g. you can have a high-traffic market with an extremely low CVR, or the opposite), making it essential to calculate the sample size and understand your market’s capacity.
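A quick, illustrative way to do that grouping in code (all market names, traffic figures, sample sizes, and bucket thresholds below are made up; plug in your own numbers):

```python
import math

# Hypothetical per-market traffic and required sample sizes (total visitors per experiment)
markets = {
    "US": {"daily_visitors": 12_000, "required_sample": 30_000},
    "GB": {"daily_visitors": 2_500,  "required_sample": 17_000},
    "DE": {"daily_visitors": 800,    "required_sample": 24_000},
    "BR": {"daily_visitors": 250,    "required_sample": 20_000},
}

def bucket(days):
    """Illustrative grouping by how long an experiment would need to run in that market."""
    if days <= 14:
        return "high bandwidth: both reserved and bold tests are feasible"
    if days <= 45:
        return "medium bandwidth: favour bold, highly differentiated variants"
    return "low bandwidth: only drastic redesigns are worth testing"

for name, market in markets.items():
    days = math.ceil(market["required_sample"] / market["daily_visitors"])
    print(f"{name}: ~{days} days -> {bucket(days)}")
```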

Optimizely, a well-known platform for desktop A/B testing, does a great job of explaining this concept and suggests using the MDE for prioritization purposes in a straightforward method:

Rather than trying to get your MDE exactly right, use it to set boundaries for your experiment so you can make informed business decisions. With a more nuanced understanding of how MDE affects sample size and goals, you can decide when to keep running an experiment given certain operational constraints.

Use it to benchmark how long to run an experiment and the impact you are likely to see, so you can prioritize experiments according to expected ROI. Depending on how granular you want your results to be, you can set expectations for how long it may take to run an experiment based on MDE.

🧠 Key takeaways from the text: 🧠

  • Avoid testing changes on screenshots that are not visible without scrolling (unless that’s a part of a whole set redesign or you know from prior research that a high percentage of your audience scrolls through the screenshots),
  • Generally aim for testing bold, distinct changes, as it’s easier to get reliable results from bigger modifications than meticulous changes. Reserved, risk-averse testing requires vast volumes of visitors in your experiment.
  • Treat MDE as a handy tool to prioritize your experiments and understand the degree of changes needed per each market.
  • If you have the app preview added, prioritize testing the featured graphic, as the rest of the screenshots are not entirely visible without scrolling.
  • Test the featured graphic only when you have the app preview added. After the April 2018 update to the Google Play Store, it’s no longer a commonly displayed element at the top of your store listing page.
  • Few people scroll through the screenshots, and even fewer read the full description. So, don’t waste your time testing this element unless it’s to get keyword ranking benefits.
  • Always calculate your Minimum Sample Size per market and try to reach it while running experiments. Do not end the test as soon as GPE gives you a recommendation.

Remember! ⚠️

Google Play Store Experiments, like many other A/B testing tools, do not validate whether your experiment reached the required sample size; they only display recommendations once your experiment has reached statistical significance, which is not a stopping rule! (more about it in this post)

If you want to know how to calculate the sample size for Play Store experiments (utilizing the knowledge you got today about MDE) follow my next publications for more tips.

As this is my first post, I’d highly appreciate you sharing your honest thoughts and suggestions, as well as constructive criticism on how I can improve the quality of the content shared with you :)
