Why we use lenient p-value thresholds like 0.4 for A/B experiments at Agoda - Part 1
Like many companies, we use statistical tests based on one-tailed p-values, where a low p-value means the improvement in our A/B experiment is significant enough and we should “take” it, i.e., put the change into production. Industry and academia usually use a p-value threshold of 0.05; however, at Agoda, we use a much more lenient threshold of 0.3 to 0.45. In this article, we will explain why this lenient threshold, despite its much higher false positive rate, is actually more beneficial to the company.
We begin with an overview of key concepts: actual lift (the real underlying improvement from your experiment), observed lift (what we measure in the experiment), and p-value. Then, we argue that in a business context, the primary objective isn’t necessarily to minimize the false-positive rate. Instead, it’s to maximize the overall improvement of the product.
Finally, we’ll show that using a more lenient p-value threshold can generate greater overall improvement because, even if we mistakenly take some bad experiments, we gain more by simply taking more experiments. We end the article with an example of finding the optimal p-value threshold, along with some additional discussion of how this works in practice.
Understanding p-value basics
First, we will explain how the p-value works at a high level. When we conduct an A/B experiment, we have two versions of our product (for example, a page of our website), we let separate sets of users try each version, and we measure a metric that quantifies how good each version is (for example, conversion rate). We can then compare the metric between the A and B versions.
Actual lift vs. Observed lift
If our experiment has no effect or lift, meaning the B version has no improvement over the A version, we say the actual lift is 0. However, when we run the experiment and observe the lift, the observed lift will generally be near 0, but not exactly 0, due to some noise. In statistical terms, this means even though the actual lift is 0, there is a non-zero probability for the observed lift to be non-zero. The probability is still highest near 0.
In reality, we never know the actual lift of an experiment; we can only see the observed lift. Most of statistical testing is about inferring information about the actual lift from the observed lift.
The probability distribution of the observed lift depends on your noise level, which in turn depends on the nature of the experiment, the number of samples, and the duration of the experiment. Statisticians like to normalize things, so they came up with the concept of the p-value, a monotonic one-to-one mapping of the observed lift to a value between 0 and 1. The key property of the p-value is that if the actual lift is 0, the p-value is uniformly distributed between 0 and 1. If the observed lift is a large positive number, the p-value is near 0; if it is a large negative number, the p-value is near 1.
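To make this mapping concrete, here is a minimal sketch in Python, assuming the noise on the observed lift is Gaussian with a known standard error (the `observed_lift` and `standard_error` inputs are illustrative, not part of our production system):

```python
# A minimal sketch of the observed-lift -> p-value mapping, assuming the
# noise on the observed lift is Gaussian with a known standard error.
from scipy.stats import norm

def one_tailed_p_value(observed_lift: float, standard_error: float) -> float:
    """Map an observed lift to a one-tailed p-value between 0 and 1.

    Large positive lifts map to p-values near 0, large negative lifts map to
    p-values near 1, and if the actual lift is 0 the p-value is uniform.
    """
    z = observed_lift / standard_error  # normalize by the noise level
    return 1.0 - norm.cdf(z)            # one-tailed: is B better than A?

print(one_tailed_p_value(2.0, 1.0))   # large positive lift -> ~0.023
print(one_tailed_p_value(-2.0, 1.0))  # large negative lift -> ~0.977
```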
Now, what if your A/B experiment actually has a positive actual lift? Then the observed lift will be centered around the actual lift, meaning the observed lift has a higher chance of being positive. Because a positive observed lift corresponds to a p-value near 0, the p-value distribution will be skewed towards 0.
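A small simulation illustrates this skew. Under the same Gaussian-noise assumption as in the sketch above (with an assumed standard error of 1), the p-values are uniform when the actual lift is 0 and pile up near 0 when the actual lift is positive:

```python
# A small simulation (assumed Gaussian noise, standard error = 1) showing that
# the p-value is uniform when the actual lift is 0 and skewed towards 0 when
# the actual lift is positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000

for actual_lift in (0.0, 1.0):
    observed = rng.normal(loc=actual_lift, scale=1.0, size=n)  # observed lift = actual lift + noise
    p_values = 1.0 - norm.cdf(observed)                        # one-tailed p-values
    print(actual_lift, np.mean(p_values < 0.1))                # share of p-values below 0.1

# actual lift = 0 -> about 10% of p-values fall below 0.1 (uniform distribution)
# actual lift = 1 -> far more than 10% do (distribution skewed towards 0)
```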
Statistical test and p-value threshold
So far, our thought process goes from the actual lift to the observed lift to the p-value. Now, what if we ask the question backward: if we see a p-value near 0, what can we say about the actual lift? Intuitively, if a positive actual lift gives you a higher chance of seeing a low p-value, then seeing a low p-value should also make it more likely that the actual lift is positive. This is the central idea of statistical tests: we observe the result and compare it to the noise level to get the p-value; if the p-value is low enough, we say the experiment wins (the B version is better than the A version). In our business setting, the experiment can then be “taken,” meaning the B version gets put into production.
The next question is how low the p-value should be so that we can conclude that our experiment wins. In academia and industry, we usually use a p-value threshold of 0.05. (This is roughly the probability of tossing a coin and seeing all heads or all tails 5–6 times in a row, which is the point where we intuitively conclude the coin is not fair.) This strict threshold is generally good if we want to be confident about our results, but is this what we really want to achieve in a business setting?
Optimization goal: False positive rate vs Sum taken actual lift (STAL)
In academia, the goal of an A/B experiment is usually to see whether B is better than A or not. The result of the experiment is then used for further research; therefore, it is imperative that we are confident about it. This means great care is needed in pre-determining the expected size of the improvement and the required sample size or experiment duration to achieve enough experimental power. In this sense, the goal is to limit the false positive rate, which is the chance that a flat experiment (no improvement between B and A) would get taken just from noise. With this goal, it makes sense to set a low p-value threshold so that the false positive is rare enough.
In business, it is tempting to use the same low p-value threshold in experiments to be confident that each experiment has a positive actual lift. But is that the best for the company? Let’s step back a bit and see why we run experiments in a company. Each experiment is done for a change in the company’s product, and we want to see whether each change is a positive change (actual lift > 0); if yes, we put the change into production.
Say we make three changes that generate actual lifts of $100, $50, and -$50 per day, respectively. Ideally, we would want to take the first two experiments and drop (not take) the third one. This would net us an extra $150 per day, assuming the experiments are uncorrelated so that their actual lifts add up linearly.
In an ideal world where you know the actual lift of each experiment, you would want to take all experiments with positive actual lift and drop the negative ones. In this sense, we try to maximize the sum taken actual lift (STAL), which is the total benefit of your taken experiments.
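As a minimal sketch, using the illustrative $/day numbers from the three-experiment example above, the ideal STAL is simply the sum of the positive actual lifts:

```python
# A minimal sketch of the sum of taken actual lift (STAL): with perfect
# knowledge we would take exactly the experiments with positive actual lift.
# The lifts below are the illustrative $/day numbers from the example above.
actual_lifts = [100, 50, -50]

ideal_stal = sum(lift for lift in actual_lifts if lift > 0)
print(ideal_stal)  # 150 -> take the first two experiments, drop the third
```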
In reality, you do not know the actual lift of the experiments and only see the observed lift and the p-value of each experiment. If the p-value of an experiment is significant, then you would want to take it. However, imagine an experiment that gives you a positive result with a good but insignificant p-value of 0.2. Would it make sense to “drop” (not take) the experiment because the p-value is insignificant? Of course, we don’t know whether the actual lift is positive, but intuitively, it should have more chance to be positive than negative. So even if we are not that certain about its actual lift due to the insignificant p-value, we might as well take the experiment because it has more chance of increasing STAL. This means using a more lenient p-value threshold may give you better STAL, which aligns more with the company’s goal, even if you have a worse false positive rate.
Let’s illustrate this with an example below: we run seven experiments, each with a given actual lift (grayed out to emphasize that it’s unknown) and an observed p-value. If we use a strict p-value threshold of 0.05, then we would take one experiment and get STAL = 100. However, if we use a more lenient threshold of 0.30, we would take four experiments and get STAL = 170, even though we end up taking one negative experiment.
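The sketch below reproduces this example in code. The individual lifts and p-values are hypothetical numbers chosen only to be consistent with the totals described above; in a real experiment the actual lifts are, of course, unknown:

```python
# An illustrative sketch of the seven-experiment example. The individual lifts
# and p-values are hypothetical, chosen only to match the totals in the text
# (STAL = 100 at threshold 0.05, STAL = 170 at threshold 0.30).
experiments = [  # (actual lift in $/day, observed p-value)
    (100, 0.03),
    (50, 0.12),
    (70, 0.22),
    (-50, 0.28),
    (40, 0.55),
    (-30, 0.80),
    (-60, 0.95),
]

def stal(threshold: float) -> float:
    """Sum of actual lift over the experiments we take at this threshold."""
    return sum(lift for lift, p in experiments if p < threshold)

print(stal(0.05))  # 100 -> only the first experiment is taken
print(stal(0.30))  # 170 -> four experiments taken, including one negative
```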
We have shown with an example that using a more lenient threshold can benefit the company more. The average actual lift per experiment may be lower because we get more false positives, but the increase in the number of taken experiments would be higher, meaning the sum of taken actual lift can be higher.
In the next section, we show that even if we do not know the actual lift of each experiment, as long as we know the actual lift distribution of the set of experiments, we can calculate the optimal p-value threshold that maximizes STAL.
How to pick the p-value threshold that maximizes STAL
Let’s start with a set of assumptions about our experiments. Assume we run 100 experiments in total and know their actual lift distribution.
Example 1: all 100 experiments have actual lift = 0
Because all experiments have 0 actual lift, it does not matter which and how many experiments we take; the sum of taken actual lift will always be 0. So, the choice of p-value threshold does not matter.
Example 2: all 100 experiments have actual lift = 1
Because all experiments have a positive actual lift, we want to take as many experiments as possible. Due to noise, each experiment can create a p-value anywhere between 0 and 1, even with the positive lift, so we want to set the p-value threshold to 1 in order to take all experiments. The total STAL is then 100 (lift = 1 per experiment for a total of 100 experiments).
Example 3: all 100 experiments have actual lift = -1
Because all experiments have a negative actual lift, we want to take as few experiments as possible. So, we can simply set the p-value threshold to 0, take 0 experiments, and get a STAL of 0.
Example 4: 50 experiments have actual lift = 1, 50 experiments have actual lift = -1, but we don’t know which.
Now, this is trickier. If we set the p-value threshold too strictly, we would not take many experiments, but if we set it too leniently, we would take a lot of negative experiments, which hurts our STAL. This reasoning can be explained with the illustration below.
First, we look at the p-value distribution (left graphs) of the positive experiments (actual lift = 1) and the negative ones (actual lift = -1). For the positive ones, the p-value is mostly near 0, while for the negative ones, it is mostly near 1.
Next (the second graph from the left), we try varying the p-value threshold. As we increase the p-value threshold from 0, we take more experiments; however, because the positive experiments have a p-value distribution skewed towards 0, their number of taken experiments grows faster. For the negative experiments, we also take more as we increase the threshold, but not many until the p-value threshold is close to 1.
In the third graph, we look at the STAL contribution. Because the positive experiments have actual lift = 1 each, the STAL from them is just 1 multiplied by the number of taken experiments, so the second and third graphs for the positive experiments look the same. For the negative experiments, however, each taken one generates actual lift = -1, so the STAL from them is -1 multiplied by the number of taken experiments, making the third graph the upside-down version of the second.
Finally, in the fourth graph, we sum the STAL contributions from both positive and negative experiments. We can see that STAL is 0 at p-value thresholds of 0 and 1. This makes sense: if the threshold is 0, we take no experiments; if the threshold is 1, we take all experiments, and their actual lifts sum to 0. Notice that the maximal STAL happens at a p-value threshold of 0.5. So for this distribution of actual lift, we should use such a lenient threshold to maximize STAL. From the graph, we can see this gives much better STAL than the conventional threshold of 0.05.
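The following sketch reproduces this curve numerically, assuming a one-tailed Gaussian test in which the standard error of the observed lift is 1 (so actual lifts of +1 and -1 sit one noise unit away from 0); the effect size is an assumption for illustration, not a measured value:

```python
# A sketch of Example 4 in closed form, assuming a one-tailed Gaussian test
# where the standard error of the observed lift is 1.
import numpy as np
from scipy.stats import norm

def take_probability(actual_lift: float, threshold: float) -> float:
    """Probability that an experiment with this actual lift is taken (p-value < threshold)."""
    z_cut = norm.ppf(1.0 - threshold)           # observed-lift cut-off in noise units
    return 1.0 - norm.cdf(z_cut - actual_lift)  # chance the observed lift exceeds the cut-off

def stal(threshold: float, n_pos: int = 50, n_neg: int = 50) -> float:
    """Expected STAL: +1 per taken positive experiment, -1 per taken negative one."""
    return n_pos * take_probability(+1.0, threshold) - n_neg * take_probability(-1.0, threshold)

thresholds = np.linspace(0.001, 0.999, 999)
stal_curve = [stal(t) for t in thresholds]
best = thresholds[np.argmax(stal_curve)]
print(round(best, 2), round(stal(best), 1), round(stal(0.05), 1))
# optimum near a threshold of 0.5, with a much higher STAL than at the conventional 0.05
```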
Example 5: Same as Example 4, but we vary the number of experiments in each bin
Let’s try keeping the actual lifts of the two bins at 1 and -1 but varying the number of experiments in each bin, and then look at how STAL changes with the fraction in each bin.
In the graph above, we plot STAL as a function of the p-value threshold. The three lines have different numbers of experiments in each bin; for example, the red line has 70 experiments with actual lift = 1 and 30 with actual lift = -1. We can see that the more positive experiments we have, the higher the optimal p-value threshold is. This makes sense because if we know most of our experiments are positive anyway, we can just relax and take a lot of experiments. However, if most of our experiments are negative (green line), then we have to be cautious and set a stricter threshold.
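Under the same assumptions as the Example 4 sketch (one-tailed Gaussian test, standard error of 1, actual lifts of +1 and -1), we can sweep the split between positive and negative experiments and find the threshold that maximizes STAL for each case:

```python
# A sketch of Example 5: sweep the fraction of positive experiments among the
# 100 and report the STAL-maximizing p-value threshold for each split.
import numpy as np
from scipy.stats import norm

def stal(threshold: float, n_pos: int, n_neg: int) -> float:
    z_cut = norm.ppf(1.0 - threshold)
    take_pos = 1.0 - norm.cdf(z_cut - 1.0)  # chance a positive experiment is taken
    take_neg = 1.0 - norm.cdf(z_cut + 1.0)  # chance a negative experiment is taken
    return n_pos * take_pos - n_neg * take_neg

thresholds = np.linspace(0.001, 0.999, 999)
for n_pos in (30, 50, 70):  # e.g. 70 means 70 positive and 30 negative experiments
    curve = [stal(t, n_pos, 100 - n_pos) for t in thresholds]
    print(n_pos, round(thresholds[np.argmax(curve)], 2))
# The optimal threshold rises with the fraction of positive experiments.
```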
The optimal p-value threshold depends on the actual lift distribution of our experiments
As we have shown with these examples, the optimal p-value threshold can be anywhere between 0 and 1, depending on the actual lift distribution of your experiments. The more positive your actual lift distribution is, the higher the optimal p-value threshold.
This finding still sounds impractical because we do not know the actual lift distribution in the first place, so we don’t know how to pick the p-value threshold!
In the next part of this article, we will show how to estimate the actual lift distribution from a set of past experiments. This approach will enable us to determine the optimal p-value threshold suitable for your experiments. Additionally, we will also discuss the concerns of using a lenient threshold and how we make it more practical.