Why we use lenient p-value thresholds like 0.4 for A/B experiments at Agoda - Part 2

Agoda Engineering & Design · Dec 20, 2023

by Varat Intaraprasonk

This is part 2 of the article Why we use lenient p-value thresholds like 0.4 for A/B experiments at Agoda. In part 1, we argued that in an industry setting, it is more beneficial to maximize the total improvement to your system (STAL: Sum Taken Actual Lift), and that to achieve this goal, a more lenient p-value threshold should be used.

In this part, we move from the theoretical approach to a more practical one, showing how to determine the best p-value threshold from your own set of past experiments. We then discuss some disadvantages of using a more lenient p-value threshold, such as a higher false positive rate, possible moral hazards, and an overestimated value of the taken experiments, and how to deal with these problems.

Calculating the actual lift distribution of a set of experiments and finding the optimal p-value threshold

Even though we do not know the actual lift of each experiment, we can infer the actual lift distribution of a set of past experiments from its observed lift or p-value distribution. Then, we can use the actual lift distribution to calculate the optimal p-value threshold. The steps are as follows:

  1. Use the results of past experiments to get the observed lift distribution.
  2. Use statistics to infer the actual lift distribution from the observed lift distribution. Note that in the previous section, actual lift takes discrete values, but in this section, we generalize so that the actual lift is also a continuous distribution.
  3. Calculate the expected fraction of taken experiments and the corresponding STAL contribution. This is similar to the steps in Fig 6.
  4. Get STAL as a function of the p-value threshold, then find the threshold that maximizes STAL.

Even though we avoided math in the previous sections, we will include math and some Python scripts here so that we can show a practical example. We expect readers to know some statistical basics, such as the normal distribution and statistical tests.

One simplification we make is to express both the actual lift and the observed lift in terms of z-scores, meaning the noise is already normalized away; for example, if we measure a mean metric across allocated users, we would use Welch’s t-test to get the z-score. The relationship between a z-score and its one-sided p-value can then be computed with the code below, where z2p converts z to p and p2z converts p back to z.

import scipy.stats as stat
import numpy as np
import math

def p2z(inp):
    # Convert a one-sided p-value to the corresponding z-score
    return -stat.norm(0, 1).ppf(inp)

def z2p(inp):
    # Convert a z-score to the corresponding one-sided p-value
    return 1.0 - stat.norm(0, 1).cdf(inp)
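
As a quick check of these helpers, the familiar one-sided pair p = 0.05 and z ≈ 1.645 round-trips correctly:

print(p2z(0.05))    # ~1.645
print(z2p(1.645))   # ~0.05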

Step 1: Get observed lift distribution from past experiments.

First, gather the results (z-scores) of your past experiments. For example, if you have 100 experiments, put the 100 z-scores into a list. Here, we will instead generate them from random numbers drawn from a normal distribution (mean 1, SD 2 in this example).

np.random.seed(123)
observeZList = np.random.normal(1.0, 2.0, 100)   # 100 simulated observed z-scores, N(mean=1, SD=2)

Step 2: Get actual lift distribution

As shown in Fig 1, a single value of actual lift (or a delta function of the actual lift distribution) generates a standard normal distribution N(0,1) of observed lift due to noise. Therefore, if we generalize the actual lift to a continuous distribution, the observed lift distribution is the actual lift distribution convolved with N(0,1). Conversely, the actual lift distribution is the observed lift distribution deconvolved with N(0,1). Deconvolution is usually not a simple calculation, but we can simplify it by using the fact that the convolution of two normal distributions has a nice closed form (* denotes convolution):

N(m1, s1²) * N(m2, s2²) = N(m1 + m2, s1² + s2²)

This means that if we can approximate the observed lift distribution as a normal distribution N(m, s’²), the actual lift distribution is simply N(m, s’²-1). (This also means that if you use random numbers in step 1, make sure the standard deviation (SD) of the normal distribution is greater than 1; otherwise, it does not represent a typical distribution of real experiments.) Approximating a distribution as a normal distribution is easy: m and s’ are just the mean and the standard deviation of the observed lifts, respectively.

observeZMean = np.mean(observeZList)          #result = 1.054
observeZSd = np.std(observeZList, ddof=1) #result = 2.268

Then, we can do the deconvolution.

actualZMean = observeZMean                   #result = 1.054
actualZSd = math.sqrt(observeZSd**2 - 1) #result = 2.035

Now, the actual lift distribution is approximately a normal distribution given by the mean and SD above.

Let’s take a step back and think about what this distribution means: the actual lifts of your experiments are no longer discrete values like in examples 1–5 of part 1 but are drawn from a distribution instead. With the above values of mean and SD, most of your experiments have a positive actual lift (mean > 0), but a sizable fraction of them have a negative actual lift (because SD ≈ 2). Therefore, we expect the optimal p-value threshold to be somewhere above 0.5 (because most actual lifts are positive) but not very far from 0.5 (because many experiments have negative actual lifts).
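
To make this concrete, we can ask what fraction of experiments has a negative actual lift under the fitted distribution (a quick check using the mean and SD computed above):

fracNegativeActual = stat.norm(actualZMean, actualZSd).cdf(0)   # ~0.30, i.e. roughly 30% of experiments hurt the system

So while the majority of experiments are positive, taking every experiment blindly would still mean absorbing a sizable amount of negative lift, which is why a threshold is needed at all.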

Step 3: Integrate to get expected STAL for each z-score threshold.

First, we convert each candidate p-value threshold into its corresponding z-score threshold.

pThresList = np.arange(0.02, 0.98, 0.02)
zThresList = p2z(pThresList)

Next, we integrate over the actual lift distribution. Recall that in the discrete examples 1–5 of part 1, for a given p-value threshold, we calculated two factors contributing to STAL: the probability that an experiment is taken and its actual lift if it is taken. For a continuous distribution, we also need a third term, the probability density function (pdf), which acts as the weight for each z_actual.

The formula for each term is shown in the illustration below.

Fig 8: the integrand of STAL contribution as a function of actual z and z-threshold.
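
Written out (reconstructed from the code below), the STAL for a given threshold z_thres is the integral over z_actual of

pdf(z_actual) × P(z_observed > z_thres | z_actual) × z_actual,

where P(z_observed > z_thres | z_actual) = 1 − Φ(z_thres − z_actual), because the observed z is distributed as N(z_actual, 1).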

The code to calculate STAL as a function of the z-score threshold is then given by:

import scipy.integrate as intg

stalList = []
for zThres in zThresList:
    # Contribution from the actual lift at zActual = pdf(zActual) * prob(obsZ > zThres) * zActual
    def integrand(zActual):
        probActual = stat.norm(actualZMean, actualZSd).pdf(zActual)
        probObserveZMoreThanZThres = z2p(-(zActual - zThres))
        return probActual * probObserveZMoreThanZThres * zActual

    stal = intg.quad(integrand, -5, 5)[0]
    stalList = stalList + [stal]

Step 4: Find the p-value threshold that maximizes STAL.

This can be done by a simple plot and argmax.

import matplotlib.pyplot as plt

optimalPThres = pThresList[np.argmax(stalList)]
plt.plot(pThresList, stalList)
plt.xlim((0, 1))
plt.axvline(optimalPThres, linestyle='--')

The resulting plot shows the optimal threshold at p = 0.6, which is slightly above 0.5, as we expected.

Fig 9: STAL vs p-value threshold for a given observed lift distribution.

If we vary the list of observed lifts in Step 1, we get different optimal p-value thresholds, as shown in the plot below. A more negative-heavy distribution has a lower optimal p-value threshold because we have to be stricter to avoid taking negative experiments.

Fig 10: STAL vs p-value threshold for different observed lift distributions in step 1. Top green: N(1, 2²). Middle blue: N(−0.5, 1.5²). Bottom red: N(−2, 2²).
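
To reproduce a plot like Fig 10, one convenient way (a sketch that simply wraps steps 1–4 above; the function name findOptimalPThres is ours) is to turn the whole procedure into a function and run it on different observed lift distributions:

def findOptimalPThres(observeZList):
    # Steps 2-4: fit the observed lifts, deconvolve, integrate STAL, take the argmax
    m = np.mean(observeZList)
    s = math.sqrt(np.std(observeZList, ddof=1)**2 - 1)   # actual-lift SD (requires observed SD > 1)
    pList = np.arange(0.02, 0.98, 0.02)
    stals = []
    for zThres in p2z(pList):
        integrand = lambda z: stat.norm(m, s).pdf(z) * z2p(zThres - z) * z
        stals.append(intg.quad(integrand, -10, 10)[0])
    return pList[np.argmax(stals)]

# The three cases of Fig 10, as (mean, SD) of the observed lift distribution
for mean, sd in [(1, 2), (-0.5, 1.5), (-2, 2)]:
    print(mean, sd, findOptimalPThres(np.random.normal(mean, sd, 100)))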

Further adjustments for practicality

We have shown how to calculate the optimal p-value threshold to maximize the improvement we make to our system. If our set of experiments mostly has a positive actual lift, then the optimal p-value threshold can be even higher than 0.5. At Agoda, even though most of the experiments are indeed positive, we still keep our p-value threshold around 0.3–0.45. There are many practical reasons not to go more lenient than this. We discuss them below, along with some practical considerations when using a lenient p-value threshold.

Maintenance cost

In this article, we assume that STAL is simply the sum of the actual lifts of the taken experiments, which means taking a large number of small positive experiments still adds value. This may not be the case for real experiments, as there is always a maintenance cost for adding more and more features that may not add much value. In this case, we can simply subtract a small fixed value for each taken experiment, which is equivalent to reducing the actual lift of each experiment to account for the extra maintenance cost. This lowers the optimal p-value threshold, which means we are stricter in taking an experiment.
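
As a sketch of this adjustment (the cost value below is made up purely for illustration), we can subtract a fixed maintenance cost inside the STAL integrand from step 3 and recompute the optimum, which ends up lower than the 0.6 found above:

maintenanceCost = 0.2   # hypothetical fixed cost per taken experiment, in the same z-score units as the lift

stalWithCostList = []
for zThres in zThresList:
    def integrand(zActual):
        probActual = stat.norm(actualZMean, actualZSd).pdf(zActual)
        probTaken = z2p(zThres - zActual)   # P(observed z > threshold | actual z), same as in step 3
        return probActual * probTaken * (zActual - maintenanceCost)
    stalWithCostList.append(intg.quad(integrand, -5, 5)[0])

print(pThresList[np.argmax(stalWithCostList)])   # a stricter (lower) optimal p-value threshold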

Maintaining good experiment quality

Using historically observed lifts to estimate the optimal p-value threshold assumes that your future observed lift distribution remains the same. This may not be the case if the p-value threshold is too lenient and the experimenters start running lower-quality experiments. Therefore, we pick a threshold around 0.3–0.45 to make sure people keep the quality of experiments high and do not just run noisy experiments. We also check the observed lift distribution regularly to detect any drop in overall experiment quality.
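
One simple way to do this regular check (a sketch; the 80/20 split and the two-sample KS test are our own choices here, and in practice you would split by experiment date) is to compare the most recent observed z-scores against the historical ones:

historicalZ = observeZList[:80]   # older experiments
recentZ = observeZList[80:]       # most recent experiments
print(np.mean(recentZ), np.std(recentZ, ddof=1))    # watch for a drop in the mean or a rise in the spread
print(stat.ks_2samp(historicalZ, recentZ).pvalue)   # a small p-value suggests the distribution has shifted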

Reporting the total improvement

In optimizing STAL, we accept a higher false positive rate. This means STAL (sum taken actual lift) will be much lower than the sum of the observed lifts of the taken experiments, because we take some negative experiments that happen to have a positive observed lift due to noise. Therefore, if you usually report the observed lift of taken experiments as the KPI, the actual improvement (STAL) can be much lower. There are two ways to account for this.

  1. For each taken experiment, run another identical experiment and report this second lift instead. This removes the selection bias, as negative experiments will likely show a negative observed lift in the new run.
  2. Calculate the theoretical ratio between STAL and the sum of observed lifts. This involves an integration similar to the previous section, as sketched below. Then, apply this correction factor to convert the observed lift into an estimate of STAL.
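
A minimal sketch of option 2, reusing the fitted actual lift distribution and the optimal threshold from above (it relies on the standard closed form for the mean of a truncated normal; the variable names are ours):

zThresTaken = p2z(optimalPThres)   # z-score threshold corresponding to the chosen p-value threshold

def stalIntegrand(z):
    # expected actual lift contributed by taken experiments with actual lift z
    return stat.norm(actualZMean, actualZSd).pdf(z) * z2p(zThresTaken - z) * z

def observedIntegrand(z):
    # expected observed lift contributed by taken experiments with actual lift z:
    # E[zObs; zObs > t | z] = z * (1 - Phi(t - z)) + phi(t - z)
    takenObserved = z * z2p(zThresTaken - z) + stat.norm(0, 1).pdf(zThresTaken - z)
    return stat.norm(actualZMean, actualZSd).pdf(z) * takenObserved

correctionFactor = intg.quad(stalIntegrand, -5, 5)[0] / intg.quad(observedIntegrand, -5, 5)[0]
# Multiply the reported sum of observed lifts by correctionFactor to estimate the real improvement (STAL).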

Why not just increase experiment power?

One may argue that we can both maintain a low false positive rate and achieve a high STAL by carefully choosing the experiment sample size and duration, so that each experiment has enough power. This can be impractical because, in order to estimate the required sample size and duration, we first have to estimate the actual lift of each experiment.

This is difficult, especially for front-end changes such as rearranging the components on a webpage. Moreover, even if we can estimate the actual lift, we might find that the required duration is too long and hurts production velocity. Finally, even if the power is not high enough, we can still learn a lot from the experiment, for example from secondary metrics such as conversion rate that may not be the main metric of the experiment. In short, fully rigorous experiment design may not be practical, and using a lenient p-value threshold can be a good compromise.
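
To get a feel for the numbers, here is a rough back-of-the-envelope sample size calculation using the standard two-proportion power formula (the baseline rate and target lift below are purely illustrative):

# Approximate users per variant for 80% power at two-sided alpha = 0.05:
# n ≈ 2 * (z_alpha/2 + z_beta)^2 * p(1-p) / delta^2
baselineRate = 0.04                      # illustrative baseline conversion rate
delta = baselineRate * 0.01              # a 1% relative lift we hope to detect
zAlpha = p2z(0.025)                      # ~1.96
zBeta = p2z(0.2)                         # ~0.84
nPerVariant = 2 * (zAlpha + zBeta)**2 * baselineRate * (1 - baselineRate) / delta**2
print(int(nPerVariant))                  # ~3.8 million users per variant for these numbers

Detecting small lifts on realistic conversion rates quickly requires traffic (or time) that many experiments simply cannot get.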

Summary

In a business setting, the goal of running experiments may not be finding the truth, as in academia, but rather maximizing the improvement of the product. This goal of optimizing for STAL instead of lowering the false positive rate can be achieved by using a more lenient p-value threshold.

The optimal threshold can be calculated from the distribution of past experiments and can even be higher than 0.5. However, for several practical reasons, we stick to a threshold of 0.3–0.45. When moving to this lenient threshold, we should also be careful to maintain experiment quality and not overestimate the actual improvement, given the higher false-positive rate.
