The ultimate guide to more effective A/B testing on Google Play Store
🚨 April 13th, 2022 UPDATE: Just this week, Google announced an update on how store listing experiments will run in Play Console, allowing for more control over the experiments plus improved statistical robustness (including the MDE calculator)!
This guide goes deep into the concept of MDE (Minimum Detectable Effect), now a configuration parameter in Play Console, as well as other statistics, which should ease understanding of the implications MDE has for your tests.
What's the guide about?
After Apple released their A/B testing tool called Product Page Optimization and I lost all hope that they would eventually provide us with a *usable* A/B testing tool for ASO, I realized it's time to re-evaluate my experiment roadmaps and apologize to Google Play Store Experiments for all the bad words I've said about its reliability and flaws.
Even though it's not entirely correct to call it an A/B testing tool, Play Store Experiments is still my go-to solution to quantitatively validate the assumptions in my testing backlog. And seeing that this is also the case for many other ASO teams, I suppose beggars can't be choosers.
Consider this post the first in what will most likely be a recurring series sharing my discoveries on how to run more effective experiments on Google Play Store and ensure their results are statistically valid. All of these learnings have been honed through years of research and hands-on experience improving the conversion rates of dozens of apps.
⚡️ Understanding the Minimum Detectable Effect for ASO A/B testing
A proper understanding of MDE (Minimum Detectable Effect) and its implications for your experiments is crucial for designing and controlling the A/B testing process. With a solid grasp of MDE, you can understand what degree of change (and traffic volume) is required in particular markets to get more reliable results from your A/B tests.
MDE is a statistic most typically used to calculate the sample size required for A/B tests. While every sample size calculator needs an MDE to return results, sadly, not many guides detail how MDE impacts these calculations, which in my opinion is a significant oversight.
Below is a screen capture from one of the most popular sample size calculators in digital marketing; besides the baseline conversion rate, the Minimum Detectable Effect is a crucial configuration parameter:
However, we are going to save the calculator discussion for another time.
Instead, this post will focus on understanding MDE and its role in ensuring your experiments have enough statistical power to detect changes between your variants. I will share my discoveries and suggest a validated method that will empower you to utilize the MDE in your testing process for ASO: to prioritize the most valuable markets, decide on the required degree of differentiation between variants, and finally, calculate the minimum sample size.
How to interpret the Minimum Detectable Effect for A/B testing purposes?
MDE isn't a complicated concept, but it can be tricky to understand just how it impacts A/B testing experiments, especially in ASO A/B testing.
So, I've split the definition of Minimum Detectable Effect into two parts to make things a bit more digestible.
1) Minimum Detectable Effect is the minimum relative difference in conversion rate between the default variant and the treatment* that we want to be able to reliably** detect.
*the "Version B" of your test
**to a certain degree of statistical significance
2) Minimum Detectable Effect is used to specify the minimum expected improvement in CVR below which the experiment is not worth running, because the potential positive impact would be too small to justify the time, effort, or money devoted to it.
To give you some context, MDE has significant application in more "sophisticated" tests, such as e-commerce platforms testing the checkout flow or sign-up forms. These tests require changes to the platform's source code and thus developer involvement, ultimately leading to additional costs that can make a test not worth the time and money it needs.
Another example is the case of third-party testing tools, like SplitMetrics or StoreMaven. With these tools, there's a specific volume of users needed for each test you run, who first have to be brought to a landing page (a mock app store page) through paid campaigns and then redirected to the original store page/listing. In such a case, MDE can be weighed against the costs of using the tool plus the budget needed to drive enough traffic from paid campaigns vs. the expected ROI (Return On Investment). ROI here simply means how much money you'd make if you manage to improve the CVR to the desired level with this specific experiment and bring more users to your app.
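To make that trade-off concrete, here is a minimal back-of-the-envelope sketch. All the figures (tool cost, install volume, value per install) are hypothetical placeholders of mine, not benchmarks from any tool:

```python
# Back-of-the-envelope ROI check for a paid third-party A/B test.
# Every figure below is a hypothetical placeholder.
test_cost = 4000.0           # tool fee share + paid-campaign budget for one test
monthly_installs = 100_000   # current install volume
relative_lift = 0.03         # the relative CVR improvement we hope to detect (MDE)
value_per_install = 0.25     # blended value of a single new install

extra_installs_per_year = monthly_installs * 12 * relative_lift
expected_return = extra_installs_per_year * value_per_install
roi = expected_return - test_cost

print(f"Extra installs/year: {extra_installs_per_year:.0f}, ROI: ${roi:.2f}")
```

If the expected return doesn't clear the test cost at your chosen MDE, the experiment isn't worth buying traffic for.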
Minimum Detectable Effect and native ASO testing
While it's relatively easy to decide on the MDE when you use third-party testing tools, in native A/B testing for ASO (Play Store Experiments & Product Page Optimization) the situation looks a bit different. The traffic that gets into experiments doesn't cost us anything (or is a cost you'd have to take on anyway), so weighing ROI against testing costs isn't applicable.
Still, we want to make sure that the results given by our testing tool are statistically valid, so now it's time to lean on the first part of the definition given above, which focuses on the statistical validity of the observed results.
The lack of direct costs in native A/B testing doesn't mean it's completely free (and I'm firmly against treating native A/B testing as a cost-free tool). Still, the resources needed for native ASO testing are significantly lower than for any project involving paid campaigns or costly changes to the software's source code.
Being aware of the above and knowing that MDE is "the minimum relative difference in conversion rate", some may ask: "why not simply aim for the slightest possible change in CVR?" For example, MDE=1%, as it's evident that even a minor change in CVR is valuable from a performance perspective (imagine your app generates 100k downloads each month; a 1% relative lift in CVR would then bring 12,000 additional downloads per year!).
And that's where the first confusion usually appears…
Obviously, most developers would like to implement a variant bringing even the smallest possible lift in CVR, because each lift ultimately brings additional users to the app. However, if you want the tool you use to catch minor, reserved changes in the design and still guarantee that the results are statistically valid, the number of installers that has to be included in the test to ensure proper accuracy would be prohibitively large.
In plain words, the reason is that if the difference between two variants is minor (e.g. you change the color of the phone mockup frame you use to present screens from the app), users' behavior doesn't differ much between the two samples. To reach a given confidence level, many installers are needed to compare the samples and report each variant's performance. In statistical terms, we're talking here about a concept called "statistical power", which we'll briefly touch on in the next section.
In a proper statistical test, to precisely control your experiments, you have to account for statistical power, which Georgi Georgiev, the founder of Analytics Toolkit, explains as the "ability to detect a difference between test variations when a difference actually exists".
Following the same expert, statistical power is a function of sample size and confidence level, expressing the probability of rejecting the null hypothesis over all possible values of the parameter of interest (in our case, CVR).
Here is a great example created by the same author, depicting the difference between low- and high-powered experiments.
A commonly used value for statistical power is 80%, which means that the test has an 80% chance of detecting a difference equal to the Minimum Detectable Effect. As mentioned previously, a test has a lower probability of detecting smaller lifts and a higher probability of detecting larger lifts.
What's the easiest way to influence statistical power? Increase your sample size or the estimated effect you're trying to detect, i.e. the aforementioned Minimum Detectable Effect.
To avoid complicating the understanding of MDE, I'm not going into the nitty-gritty of statistical power. If you want to dive deeper into the concept, check this great article by Georgi Georgiev.
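Still, the 80% figure can be made tangible with a small Monte Carlo sketch (plain Python, hypothetical numbers of my own): simulate many A/B tests with a known true difference and count how often a standard two-proportion z-test flags it. The fraction detected is the empirical power:

```python
import math
import random
from statistics import NormalDist

def simulated_power(n_per_variant, cvr_a, cvr_b, alpha=0.05, trials=1000):
    """Fraction of simulated A/B tests in which a two-sided
    two-proportion z-test detects the (real) CVR difference."""
    z_crit = -NormalDist().inv_cdf(alpha / 2)  # ~1.96 for alpha=0.05
    detected = 0
    for _ in range(trials):
        # Draw installs for both variants from their true conversion rates
        a = sum(random.random() < cvr_a for _ in range(n_per_variant))
        b = sum(random.random() < cvr_b for _ in range(n_per_variant))
        pooled = (a + b) / (2 * n_per_variant)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_variant)
        if se > 0 and abs(a - b) / n_per_variant / se > z_crit:
            detected += 1
    return detected / trials

# Hypothetical example: 40% baseline CVR vs a 15% relative lift,
# 1,000 installers per variant -- empirical power lands near 0.8.
print(simulated_power(1000, 0.40, 0.46))
```

Shrink the gap between `cvr_a` and `cvr_b` (i.e. lower the MDE) and watch the detection rate collapse unless `n_per_variant` grows.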
What is crucial for you to take away from this concept is:
The smaller the difference implemented in your treatment, the more installers you need to reliably detect the change in the variants' performance.
Therefore, "playing" with MDE is about finding a trade-off between implementing minor, hardly noticeable changes and running the experiments for a prohibitively long time.
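This trade-off can be quantified with the standard textbook sample-size formula for comparing two proportions (the same math behind most online calculators, though exact outputs differ slightly between tools). The sketch below is my own plain-Python illustration, not anything Play Console exposes; note how a 1% MDE needs roughly 25x the traffic of a 5% one:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test.
    Standard textbook formula; online calculators may differ slightly."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)  # treatment CVR at the MDE boundary
    z_a = -NormalDist().inv_cdf(alpha / 2)  # ~1.96 for 95% confidence
    z_b = NormalDist().inv_cdf(power)       # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# 40% baseline CVR:
print(sample_size_per_variant(0.40, 0.01))  # roughly 236k visitors per variant
print(sample_size_per_variant(0.40, 0.05))  # roughly 9.5k visitors per variant
```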
👉 If you want to make sure your experiment's purpose is sensible, always ask yourself these three simple questions:
Is the change strong enough to reach visitors' minds?
Will visitors notice any difference in the first 3 seconds of looking at the store listing assets?
Is the change designed to impact visitors' willingness to download the app, or is it only an aesthetic adjustment?
If you answered âyesâ to all the above questions, you are on the right track to getting the most out of your experiments.
Remember that the majority of your visitors don't work in design or marketing, and unless the change is truly unmistakable, there's a high chance no one will notice it.
Practical application of the MDE in testing
To help you better understand the concept, below I've pasted a graphic from a sample size calculator I used in the past, which shows the relation between MDE and the time needed to reliably detect changes in your experiment.
Here is the screenshots experiment for a fake app called A$O5:
A$O5 is a well-established and moderately popular app, generating approx. 350k downloads per year in the UK alone.
Daily performance details:
Store listing visitors: 2500 per day,
Store listing acquisitions: 1000 per day,
Conversion Rate: 40%.
As you can see, in 7 days* we're able to reliably detect the winning variant only if the relative change in CVR is no less than 4.7%.
*(6.76 days to be exact, but it's good practice to round the length of experiments to full weeks to cover the most typical business cycle for mobile apps)
The number of days needed to reach a particular MDE level is calculated based on A$O5's current traffic and the required sample size (16,912 visitors).
Calculations:
Required Sample Size / Daily Visitors = Days Needed to Reach the Required Sample Size
16912 / 2500 = 6.76 ~ 7 days
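The same arithmetic in code form, using the A$O5 numbers from above and rounding up to a full week as suggested:

```python
import math

required_sample_size = 16_912  # visitors, from the sample size calculator
daily_visitors = 2_500         # A$O5's current store listing traffic

days_needed = required_sample_size / daily_visitors   # 6.76 days
weeks = math.ceil(days_needed / 7)                    # round up to full weeks
print(f"{days_needed:.2f} days -> run the test for {weeks * 7} days")
```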
If your app, like A$O5, is well-established and you've already made some optimization efforts, achieving a 4.7% relative change in CVR will require drastic and bold changes in the design (in practice, it's almost impossible). In most cases, this is not a result you can achieve by changing the background color, slightly playing with the wording, or swapping the 6th and 7th screenshots, but rather by a bold redesign of the whole set and/or adopting a completely different design style direction.
👉 If you haven't been aware so far that reserved, granular, and risk-averse testing requires enormous volumes of visitors, you really should consider reevaluating your experiments' backlog!
So, unless your app is getting an enormous volume of traffic, you can't be timid with your testing ideas. If you want to keep the time needed for tests reasonable, you have to go bold!
Below I'm presenting my subjective opinions on some of the tests I found using AppTweak's "timeline" feature, to give you a better understanding of what I consider a low-potential, a reserved, and a bold experiment.
Examples of tests that had a relatively low potential to provide conclusive results (predicted effect on CVR <1%):
- changing the order of the screens visible in the first-impression frame
- changing very minor elements which don't have much potential to impact visitors' motivation to download the app:
Examples of tests I consider reserved (predicted effect on CVR >1%, <3%), which are definitely worth the effort if you have enough traffic (or a lot of time), since they can provide granular insights on a specific design element change or style direction:
- changing only one screen in the first-impression frame (screenshots visible without scrolling)
- redesigning the set aesthetics without changing the main message
Examples of tests I consider bold (predicted effect on CVR >3%), which should be your go-to solution if you're not a well-known brand and can't drive tons of traffic to your app each day of the experiment, or you simply don't want your tests to take months:
Given that even the most popular apps like Coinbase, Clue, or Spotify, which generate thousands of downloads each day, test bold design assumptions and completely redesign their screenshots, would you still find it reasonable to test a single design element change at a time, or a granular aesthetic adjustment, with a less popular app?
I'm not saying you should give up iterative testing altogether, but you should at least try to diversify your roadmaps with both bold, highly differentiated tests and more reserved, iterative ones.
Also, keep in mind that the primary goal of A/B testing should be to quantitatively validate qualitative assumptions about your audience, not to blindly shoot in the dark and count on dumb luck.
👉 Rather than trying to understand how a single design element can impact your CVR, base your testing roadmaps on validating solid hypotheses built on strong qualitative insights.
👉 Even if you don't have any qualitative insights yet, try leveraging what other teams in your organization have learned from areas like performance marketing or CRM to support your hypotheses.
⚠️ Google has recently pre-announced changes to Play Console, which are meant to add "statistical robustness and additional experiment configuration options". However, setting MDE in the experiment configuration will not directly impact the way your experiment is conducted; its main purpose is only to calculate the estimated time needed for the experiment to end. ⚠️
Where to find the "estimated effect" in Play Store Experiments?
To understand MDE and its implications for your experiments, you first have to know where to look for the "estimated effect" your experiments have.
As a starting point, I suggest you check past experiments in the Play Console library and try to understand the typical effect on conversion for specific listing elements, such as screenshots, the featured graphic, or the short description. Stick only to experiments that reached statistical significance: these are all experiments where Google Play Store displayed any of these "recommendations":
- "Variant X performed best."
- "Current listing performed best."
- "All variants performed similarly to your current store listing."
Be mindful, though, that this is not always the optimal approach, since you have no guarantee that your past experiments were conducted properly, and in particular whether they reached the required sample size* (you only know they reached a 90% confidence level, provided you stick to experiments that gave the specific "recommendations" mentioned above).
*More about statistical significance & reaching sample size in this post.
Remember that experiments that produce negative or inconclusive results are as important as tests with positive ones. There is always insight to be gained from a well-run experiment, even if it âfailsâ.
Additionally, you can exclude all experiments that haven't lasted for at least 14 full days (7 if you're a big-name app) to provide another level of assurance that your "estimated effect" is valid.
The "estimated effect" of the experiment is the average of the lower and upper bounds of the given performance range.
In the example given above, you should calculate the average of the lower (-1.4%) and upper (+4.8%) bounds. In our case, that's +1.7%, which is the estimated effect this variant may have on CVR if it were applied.*
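In code, the midpoint calculation is just:

```python
# Estimated effect = midpoint of the performance range Play Console reports.
lower_bound = -1.4  # % (worst plausible effect on CVR)
upper_bound = +4.8  # % (best plausible effect on CVR)

estimated_effect = (lower_bound + upper_bound) / 2
print(f"Estimated effect on CVR: {estimated_effect:+.1f}%")
```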
*more on how to interpret the performance bar in future publications.
What MDE value to adopt if you have no valid historical data?
It can happen that none of the experiments in your library reached statistical significance or were conducted correctly. If that's the case, below are the general values you can expect from your experiments:
(All values present relative (not absolute) difference of CVR.)
Newly launched app, with no previous optimization efforts:
- distinct change to the icon: 5% - 25%
- distinct change to the screenshots 1-4: 5% - 20%
- distinct change to the screenshots 5-10: 1% - 5%
- redesign of the whole screenshots set: 5% - 25%
- distinct change to the featured graphic (only if the video is added): 5% - 20%
- adding/removing the app preview video: 5% - 20%
Well-established app, already in a process of conversion optimization:
- distinct change to the icon: 1% - 4%
- distinct change to screenshots 1-4: 1.5% - 3%
- distinct change to screenshots 5-10: <1.5%
- redesign of the whole screenshots set: 2% - 6%
- distinct change to featured graphic (only if the video is added): 1.5% - 5%
- adding/removing the app preview video: 1% - 5%
Remember that these values are only indicative, based on experiments I ran as an ASO specialist. Even though I tried to average them across results from various apps and industries, your app has a different audience base. Also, your screenshots might be more or less optimized; therefore, your typical estimated effect may be completely different from mine. Consequently, it's impossible to predict the exact impact an experiment's variants may have on CVR; if we knew it beforehand, we wouldn't need A/B testing at all.
Bonus tip for testing screenshots (since you've come this far in the article)
As you can notice in the list above, the MDE changes significantly depending on which screenshot you implement changes to. This indicates how important it is to prioritize the set of screenshots visible on your app page without scrolling (a.k.a. the first impression frame, similar to the "above the fold" rule in desktop A/B tests).
The reason for that is pretty straightforward: most users never scroll through your screenshots (according to the most recent studies by StoreMaven, as little as 4% of users scroll through the portrait screenshots gallery on average). Of course, this number can differ significantly depending on the app's specific audience or traffic split. Still, I'd say it's reasonable to assume that the majority of users generally don't scroll through the screenshots.
Considering how few users interact with the screenshots visible after scrolling, plus the fact that Play Store doesn't validate whether visitors have actually seen your experiment's variations, it doesn't make much sense to test changes introduced to a screenshot positioned in the middle (or at the end) of your set, no matter what volume of traffic your app gets.
Building your own MDE Library
If you're serious about your A/B testing process, it makes sense to maintain an MDE library, regularly gathering all your experiment results with the estimated impact per market and element. Thanks to this approach, you'll be able to create ever more accurate estimates of what MDE you can expect from your experiments.
Don't stress too much about accuracy, though. It's not about getting it exactly right, but about understanding how a certain degree of change impacts your CVR.
Now that (I hope) you have a better understanding of MDE and where to look for it, you're ready to prioritize & regroup the experiments in your backlog based on your traffic & conversion rate.
A good idea would be to calculate the required minimum sample size (e.g. utilizing the calculator I mentioned above), group the markets you run tests in into buckets, and plan the whole ideation & creation process accordingly:
Obviously, each market is a unique combination of CVR and traffic volume, so the possibilities are endless (e.g. you can work with a high-traffic market with an extremely low CVR, or the opposite), making it essential to calculate the sample size and understand each market's capacity.
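One way to sketch this bucketing is shown below. The markets, their traffic numbers, and the bucket thresholds are all hypothetical choices of mine, and the sample size comes from Lehr's rule of thumb (n ≈ 16·p(1−p)/δ² per variant, for roughly 80% power at a 5% significance level), so treat the outputs as rough planning estimates only:

```python
import math

def days_to_result(daily_visitors, baseline_cvr, relative_mde):
    """Rough days needed for a 2-variant test, via Lehr's rule of thumb."""
    delta = baseline_cvr * relative_mde  # absolute CVR difference to detect
    n_per_variant = 16 * baseline_cvr * (1 - baseline_cvr) / delta ** 2
    # Both variants split the daily traffic, so double the per-variant need
    return math.ceil(2 * n_per_variant / daily_visitors)

# Hypothetical markets: (daily store listing visitors, baseline CVR)
markets = {"US": (9000, 0.32), "UK": (2500, 0.40),
           "BR": (6000, 0.18), "PL": (400, 0.35)}

for name, (visitors, cvr) in markets.items():
    days = days_to_result(visitors, cvr, relative_mde=0.05)
    bucket = "iterate" if days <= 14 else "bold only" if days <= 60 else "skip/merge"
    print(f"{name}: ~{days} days at 5% MDE -> {bucket}")
```

Markets that can feed a test within a week or two can absorb more granular experiments; markets that would need months should only see bold variants (or be grouped together).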
Optimizely, a well-known platform for desktop A/B testing, does a great job of explaining this concept and suggests using MDE for prioritization purposes in a straightforward way:
Rather than trying to get your MDE exactly right, use it to set boundaries for your experiment so you can make informed business decisions. With a more nuanced understanding of how MDE affects sample size and goals, you can decide when to keep running an experiment given certain operational constraints.
Use it to benchmark how long to run an experiment and the impact you are likely to see, so you can prioritize experiments according to expected ROI. Depending on how granular you want your results to be, you can set expectations for how long it may take to run an experiment based on MDE.
🧠 Key takeaways from the text 🧠
- Avoid testing changes on screenshots that are not visible without scrolling (unless that's part of a whole-set redesign or you know from prior research that a high percentage of your audience scrolls through the screenshots).
- Generally aim for testing bold, distinct changes, as it's easier to get reliable results from bigger modifications than from meticulous ones. Reserved, risk-averse testing requires vast volumes of visitors in your experiment.
- Treat MDE as a handy tool to prioritize your experiments and understand the degree of change needed per market.
- If you have the app preview added, prioritize testing the featured graphic, as the rest of the screenshots are not entirely visible without scrolling.
- Test the featured graphic only when you have the app preview added. After the April 2018 update to Google Play Store, it's no longer a commonly displayed element at the top of your store listing page.
- Few people scroll through the screenshots, and even fewer read the full description. So, don't waste your time testing this element unless it's for keyword ranking benefits.
- Always calculate your Minimum Sample Size per market and try to reach it while running experiments. Do not end the test as soon as GPE gives you a recommendation.
Remember! ⚠️
Google Play Store Experiments, like many other A/B testing tools, does not validate whether your experiment reached the required sample size; it only displays recommendations once your experiment reaches statistical significance, which is not a stopping rule! (more about it in this post)
If you want to know how to calculate the sample size for Play Store experiments (utilizing the knowledge you gained today about MDE), follow my next publications for more tips.
As this is my first post, I'd highly appreciate you sharing your honest thoughts and suggestions, as well as constructive criticism on how I can improve the quality of the content shared with you :)