How to make sure your Play Store experiments deliver statistically valid results?

This post is the second part of “The ultimate guide to more effective A/B testing on Google Play Store”.

Mateusz Wrzeszcz
8 min read · Apr 26, 2022

My previous post about the Minimum Detectable Effect was meant to serve as a teaser, illustrating the value of a correct statistical approach and well-considered testing in Play Store Console.

And judging by the feedback I received, it seems it turned out to be useful for at least some of you, which is really heartwarming after the serious amount of time I spent on it.

Today, in the second part of the guide to more effective testing, we will focus on what is probably the most frequently overlooked statistical principle: reaching the minimum sample size with your A/B tests, which helps ensure your experiments are adequately powered and statistically valid.

If you read the previous article I published, you already know that every experiment you plan to launch requires some groundwork beforehand, with the main steps being:

  1. Picking the markets you want to run your tests on and matching their bandwidth (traffic volume & conversion rate) with specific ideas from your experiment’s backlog.
  2. Making sure the experiments you have chosen are differentiated enough to have the chance to reach statistical significance and validity in the projected period of time.
  3. Estimating the desired change in CVR (your MDE) you aim to achieve with your experiments and calculating your minimum sample size (to ensure your experiments will have enough time to reach statistical validity).

Those three steps are notoriously skipped by developers, marketers, and even professional agencies responsible for ASO experimentation, which leaves a lot to be desired when it comes to ensuring that the results you observe from an experiment are statistically valid.

False results and illusory gains

When talking with colleagues and clients about A/B testing for ASO, I hear many of them complain about the lack of replicability of their results. Experiments that were supposed to deliver, on average, a 5–10% increase in conversion rate turned out to have little to no impact once applied. Sometimes they even decreased the CVR. I’ve heard stories where, after a whole year of testing and applying “positive” results, the conversion rate sat at the same level it had been on the first day of optimization.

Does this also sound familiar to you? 🤔

One of the most common reasons for this state of affairs is that companies base decisions on false experiment results, also known as false positives or false negatives, which stem from an incorrect approach to testing and underpowered experiments.

The authors of the Netflix Technology Blog did a great job of explaining the concepts of false positives and false negatives:

There are two types of mistakes we can make in acting on test results. A false positive (also called a Type I error) occurs when the data from the test indicates a meaningful difference between the control and treatment experiences, but in truth there is no difference. This scenario is like having a medical test come back as positive for a disease when you are healthy.

The other error we can make in deciding on a test is a false negative (also called a Type II error), which occurs when the data do not indicate a meaningful difference between treatment and control, but in truth there is a difference. This scenario is like having a medical test come back negative — when you do indeed have the disease you are being tested for.

Source: https://netflixtechblog.com/interpreting-a-b-test-results-false-positives-and-statistical-significance-c1522d0db27a

Even though it may sound a bit knotty, I have good news for you. Both of these highly unwanted outcomes can be minimized (you can never get rid of them entirely) by sticking to one basic statistical principle: reaching the minimum sample size with your experiments.

Why do you even need to care about the minimum sample size?

Sometimes I hear people saying: “I’m sure my experiments are statistically valid, since I apply the winning variant only when Google Play Store displays recommendations as to which variant outperforms the rest”.

Recommendations displayed by Play Store Console once your experiment reaches statistical significance.

The biggest surprise? Your A/B testing tool (in our case Google Play Store Experiments) won’t tell you whether your sample size has been reached. It only tells you whether your experiment has temporarily reached the targeted statistical significance, and that doesn’t mean your experiment is adequately powered to provide statistically valid results.

It’s your role, as the ASO/CRO manager, to calculate the required sample size for each experiment and each market you plan to launch it in, and then to monitor whether your variants have reached the required volume of Installers.
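
To give you an idea of what that calculation looks like, here’s a minimal sketch in Python that uses the normal approximation for a two-proportion test. The baseline CVR, the relative MDE, and the 90% confidence / 80% power settings are my own illustrative assumptions, so plug in your market’s real numbers.

```python
# Minimal sample-size sketch (normal approximation, two-proportion test).
# All inputs below are illustrative assumptions, not GPE defaults.
from statistics import NormalDist

def min_sample_size_per_variant(baseline_cvr: float,
                                relative_mde: float,
                                alpha: float = 0.10,   # 90% confidence level
                                power: float = 0.80) -> int:
    """Store listing visitors needed per variant to detect `relative_mde`
    (a relative lift over `baseline_cvr`) in a two-sided test."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1  # round up: no fractional visitors

# Example: 30% baseline CVR, aiming to detect a +5% relative lift
print(min_sample_size_per_variant(0.30, 0.05))  # roughly 11,700 per variant
```

Notice how quickly the number grows as the MDE shrinks: detecting a +2% relative lift at the same baseline would require several times more traffic, which is exactly why matching your experiment ideas with the right markets (step 1 above) matters so much.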

Why reaching statistical significance shouldn’t be your stopping rule

The moment when Play Store gives you recommendations is NOT the moment when your experiment reaches statistical validity; it’s merely the moment it reaches statistical significance.

In fact, your Play Store experiment can reach statistical significance many times over the period when it’s live, oscillating between significant and insignificant at many points.

The above graphic shows how an experiment can fluctuate between significant and insignificant as it continues to gather data. Source: Marketo

If you’ve run multiple experiments in Play Store, you might have noticed a similar situation. After a few days of displaying the “More data needed” message, GPE eventually suggests that “Variant B performed best”. That’s exactly the moment when your experiment reached statistical significance.

However, if you waited a few more days, you might see the message change to the complete opposite, now suggesting that “Variant C performed best”, or back to “More data needed”. That’s because your experiment keeps crossing above and below the set 90% confidence level (0.1 statistical significance), as shown in the graphic above.

If you blindly follow Google’s suggestion and apply the recommended variant without checking your sample size, you have just drastically increased the chance of applying a false positive.

And if this happens over and over, you can only imagine how many of the learnings gathered from such tests are completely misleading, with no actual evidence to support them.

Mats Einarsen showed this in a study in which two identical pages were tested against each other:

  • 771 experiments out of 1,000 reached 90% significance at some point
  • 531 experiments out of 1,000 reached 95% significance at some point

What does this mean for us? As many as 77% of A/A tests (two identical pages used to evaluate the setup of the experiment) will reach 90% significance at some point. If tests featuring two identical pages can reach statistical significance 3 times out of 4, you certainly cannot be confident that the results of your own tests are valid by relying on this statistic alone.
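
If you’d like to see this effect with your own eyes, below is a rough simulation of the same idea: two identical variants, with a significance check once per “day”, which is how most of us tend to peek at GPE. The conversion rate, daily traffic, and duration are made-up numbers, so the exact percentage will differ from Einarsen’s, but the pattern holds: far more than 10% of identical A/A tests look “significant” at some point.

```python
# Rough simulation of A/A tests with daily "peeking"; all numbers are
# illustrative assumptions, not Einarsen's actual setup.
import random
from statistics import NormalDist

def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def aa_test_ever_significant(cvr=0.03, visitors_per_day=1000, days=30, alpha=0.10):
    """Run one A/A test, peeking daily; report whether any peek crossed alpha."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        n_a += visitors_per_day
        n_b += visitors_per_day
        conv_a += sum(random.random() < cvr for _ in range(visitors_per_day))
        conv_b += sum(random.random() < cvr for _ in range(visitors_per_day))
        if p_value_two_proportions(conv_a, n_a, conv_b, n_b) < alpha:
            return True
    return False

random.seed(42)
runs = 200
hits = sum(aa_test_ever_significant() for _ in range(runs))
print(f"{hits / runs:.0%} of identical A/A tests looked significant at some peek")
```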

Reaching statistical significance means nothing if you haven’t reached your minimum sample size and tested for at least a full week. Therefore, it shouldn’t be treated as a stopping rule for A/B tests.

If you’d like to learn more about the rules for stopping A/B tests, I recommend reading this CXL Institute’s article and other related posts, available on their blog.

When should you finish the experiment and apply a new variant?

What should be the correct indication that you may finish your tests and apply a new variant? Rather than a single rule, there’s a sequence I stick to (a rough code sketch of it follows the list):

  1. Conduct your experiments for at least 7 days (full business cycle).
  2. Calculate and reach the minimum sample size per specific market and experiment you plan to run.
  3. Reach statistical significance (that’s the moment when GPE displays recommendations).
  4. Stick to all the advice from the previous article on Minimum Detectable Effect.
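
For what it’s worth, here’s how I’d express that sequence as a hypothetical helper. The field names and thresholds are my own invention, not anything the Play Console exposes, and point 4 (the MDE advice) still has to live in your head rather than in code.

```python
# A hypothetical stopping checklist; field names and thresholds are my own,
# not anything Play Console actually exposes.
from dataclasses import dataclass

@dataclass
class ExperimentStatus:
    days_running: int
    installers_in_smallest_variant: int
    required_sample_size: int        # from your pre-test calculation
    gpe_shows_recommendation: bool   # e.g. "Variant B performed best"

def ready_to_conclude(status: ExperimentStatus) -> bool:
    """True only when the whole stopping sequence is satisfied."""
    return (
        status.days_running >= 7                                                   # 1. full business cycle
        and status.installers_in_smallest_variant >= status.required_sample_size   # 2. minimum sample size
        and status.gpe_shows_recommendation                                        # 3. statistical significance
    )
```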

My take on deciding which experiments to apply

Obviously, with A/B testing nothing is ever clear-cut, and sometimes, under specific circumstances, I decide to apply variants from experiments that haven’t reached statistical significance (so Google still shows the “More data needed” message).

An example of such a case is when I see that all the other “requirements” from the list above are checked and the performance bar of my variant shows a strong, positive trend, with the treatment outperforming the rest of the variants throughout the entire experiment period (as shown in the graphic below). In such a case, depending on the type of experiment, the estimated effect, and my risk appetite, I decide either to apply the treatment or to stick with the default.

Sometimes, even when an experiment doesn’t reach statistical significance, I apply the variant because its performance over time is promising and the sample size has been reached. It’s important to validate such a result afterwards and not to treat it as a learning equal to statistically significant ones.

Generally, because of the way GPE is built, and especially the lack of control over traffic sources and segmentation (about which Gabe Kwakyi wrote an excellent article), the situation I described above happens pretty often, due to a phenomenon called results smothering (I’m not sure that’s the official name, but it’s the one Gabe uses in his article and it reflects the case pretty well, in my opinion).

Google Play Experiments don’t allow for segmentation control, or even segmented reporting, leaving us with no way to confirm whether dilution may or may not be occurring. Source: Incipia’s Blog

What’s really important to remember is that you should never take the results GPE produces for granted. Treat them as a directional cue rather than an oracle, and make decisions only after validating the results with a separate method, such as:

  • sequential analysis (a.k.a. before/after analysis, which unfortunately can deliver unclear results due to the influence of external factors such as seasonality or paid acquisition)
  • backward testing (which can be a good solution for markets where paid acquisition is active and unstable)
  • A/B/B testing (the most universal method, on which Apptweak’s Simon Thillay wrote a great article); a small consistency check for this method is sketched below
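
To make the A/B/B idea a bit more concrete, here is a small sanity check I might run on the raw numbers, assuming you can pull the visitor and conversion counts per variant from your console or reporting tool: compare the two identical “B” variants with a two-proportion z-test, and if they differ significantly from each other, treat the whole experiment’s outcome as suspect. The counts below are made up for illustration.

```python
# A/B/B consistency check: the two identical "B" variants should not differ.
# Visitor and conversion counts below are made-up, illustrative numbers.
from statistics import NormalDist

def two_proportion_p_value(conv_1, visitors_1, conv_2, visitors_2):
    """Two-sided p-value for the difference between two conversion rates."""
    p_pool = (conv_1 + conv_2) / (visitors_1 + visitors_2)
    se = (p_pool * (1 - p_pool) * (1 / visitors_1 + 1 / visitors_2)) ** 0.5
    z = (conv_1 / visitors_1 - conv_2 / visitors_2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_bb = two_proportion_p_value(310, 10_000, 325, 10_000)
if p_bb < 0.10:
    print("B vs B differ significantly: don't trust this experiment's result")
else:
    print(f"B variants are consistent (p = {p_bb:.2f}), so the A vs B comparison is more credible")
```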

Only after combining the results you received with a validation method, and, in the perfect-case scenario, with other qualitative insights your organization owns, do you get the full picture and become ready to make a deliberate decision about what to do with your experiments.

Key steps to take if you care about the reliability of your GPE tests:

  • don’t end your test as soon as Google gives you recommendations. Always calculate your Minimum Sample Size for each market and try to reach it with your experiments,
  • treat GPE results as a directional cue rather than an oracle, and always validate your results with another method,
  • remember that no matter how sophisticated your approach is from a statistical point of view, your experiments are only as good as the research you did beforehand and the hypotheses you based them on,
  • a good way to think about A/B testing is that it’s a quantitative method for validating the qualitative insights gathered beforehand by your organization. This helps exclude ideas with low potential to impact the CVR, such as testing different device mockup frames or a slight background color change.
  • generally it’s easier to get reliable results from experiments with highly differentiated variants, about which I wrote a whole article.
