Expedia Group Technology-Data

How to Size For Online Experiments With Ratio Metrics

Sizing Derivation and Evaluation

Wenkai Bao
Expedia Group Technology
8 min read · Mar 28


Photo by Christopher Ruel on Unsplash

Expedia Group™, as a global online travel solution provider, heavily relies on A/B testing to continuously bring the best travel experience to our consumers and business partners. We at the Experimentation Science & Statistics team keep exploring new methodologies and techniques to add to the testing capability, enabling more robust and flexible experimentation across all Expedia Group™ brands.

One recent example is the capability to run experiments with ratio metrics and, in particular, to determine the sample size they require.

For conventional count or continuous metrics like conversion or booking value, the Central Limit Theorem gives asymptotically normal sampling distributions. These enable the statistical inference (p-values, confidence intervals) that informs launch decisions.

A different cluster of metrics concerns the ratio between two metrics. They are of particular interest because the random assignment of control or treatment is often done at a different granularity than the analysis. For example, we may bucket at the user level, whereas the metric of interest is clicks per visit (a visit usually is defined to end after being inactive for n minutes or longer).

This misalignment invalidates the independence assumption of the conventional sampling distribution, as the observations (visit level) from the same user are not independent. The workaround is to treat the metric as a ratio between, say, the number of clicks per user and the number of visits per user. Both the numerator and denominator are statistics with their own probabilistic distributions.
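As a minimal sketch of this workaround, the snippet below collapses a visit-level log into one (clicks, visits) pair per user and then forms the metric as a ratio of two user-level means. The log contents, the user IDs and the helper name `aggregate_by_user` are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical visit-level log: one row per visit, (user_id, clicks_in_visit).
visits = [
    ("u1", 2), ("u1", 0), ("u1", 1),
    ("u2", 0),
    ("u3", 3), ("u3", 1),
]

def aggregate_by_user(visit_rows):
    """Collapse visit-level rows into per-user (clicks, visits) totals."""
    totals = defaultdict(lambda: [0, 0])
    for user_id, clicks in visit_rows:
        totals[user_id][0] += clicks  # numerator y_i: clicks for this user
        totals[user_id][1] += 1       # denominator x_i: visits for this user
    return {user: tuple(pair) for user, pair in totals.items()}

per_user = aggregate_by_user(visits)
y = [clicks for clicks, _ in per_user.values()]
x = [n_visits for _, n_visits in per_user.values()]

# Clicks per visit expressed as a ratio of two user-level sample means.
ratio = (sum(y) / len(y)) / (sum(x) / len(x))
```

After aggregation the unit of analysis matches the unit of randomization (the user), so the pairs (yᵢ, xᵢ) can be treated as independent across users.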

Statistical inference on ratio metrics requires a sampling distribution approximated by the multivariate delta method (Deng, Knoblich and Lu 2018). The delta method sampling distribution is in turn used to compute the p-value. We walk through the how-to in §1.1, as well as its Python implementation. §1.2 shows how we formulate the sample size based on the delta method distribution. This formula ensures that the sample size (hence the test duration) can achieve both the desired true-positive rate and true-negative rate.

In §2, we use simulations to evaluate the delta method testing procedure for ratio metrics, checking whether the actual power and significance level are consistent with their nominal values. In §3, we analytically address a practical concern: whether ratio metrics make the test duration significantly longer than if they were treated as conventional metrics.

Throughout the article, we use clicks per visit or click-through rate (CTR) as the example metric to illustrate.

Table of Contents

§1 The Math Behind It

§2 Simulate to Evaluate Power and Significance

§3 Sample Size Comparison

§4 Summary

§1 The Math Behind It

§1.1 Statistical Distributions

First, consider the scenario below:

yᵢ: number of clicks for the iᵗʰ user ID,

xᵢ: number of visits for the iᵗʰ user ID.

By the Central Limit Theorem, under mild regularity conditions, the sample means of both {x} and {y} are asymptotically normally distributed. Using the delta method with a first-order Taylor expansion, the ratio metric R, the ratio of the sample mean of y to the sample mean of x, converges in distribution as

Ratio Metric Definition

The mean of R can be approximated by the ratio of the mean of y over the mean of x. The variance of R can be approximated by

Ratio Metric Sampling Variance Definition
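For reference, the approximation can be written out as below. This is a reconstruction from the `tau` expression in the Python function in §1.2, not a copy of the original figure. The notation is assumed here: μ, σ² and σ_xy are the mean, variance and covariance of the user-level y and x, and n is the number of users.

```latex
R = \frac{\bar{y}}{\bar{x}}, \qquad
\mathbb{E}[R] \approx \frac{\mu_y}{\mu_x}, \qquad
\operatorname{Var}(R) \approx \frac{1}{n}\,\frac{\mu_y^2}{\mu_x^2}
\left( \frac{\sigma_y^2}{\mu_y^2} + \frac{\sigma_x^2}{\mu_x^2}
       - \frac{2\,\sigma_{xy}}{\mu_x \mu_y} \right) = \frac{\tau}{n}
```

Here τ is the per-user variance term that the sizing function computes as `tau`.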

In A/B testing, we have two ratios and therefore a two-sample test. Assuming independence between control (C) and treatment (T), we are actually testing the null hypothesis of

Null Hypothesis

with the underlying distribution being

Sampling Distribution of the Delta
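A minimal sketch of the resulting two-sample test is shown below, assuming a two-sided z-test on the difference of the two ratios with the delta-method variances of control and treatment added together. The function names `delta_mean_var` and `ratio_z_test_pvalue` are hypothetical, and the code uses only the Python standard library; it is an illustration, not the production implementation.

```python
import math
from statistics import NormalDist

def delta_mean_var(y, x):
    """Return (ratio, approximate variance of the ratio) via the delta method."""
    n = len(y)
    mean_y, mean_x = sum(y) / n, sum(x) / n
    # Sample variances and covariance with an n-1 denominator.
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    cov_xy = sum((a - mean_y) * (b - mean_x) for a, b in zip(y, x)) / (n - 1)
    ratio = mean_y / mean_x
    # Algebraically equal to tau / n, with tau as in the sizing function of §1.2.
    var_ratio = (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (mean_x ** 2 * n)
    return ratio, var_ratio

def ratio_z_test_pvalue(y_c, x_c, y_t, x_t):
    """Two-sided p-value for H0: treatment ratio equals control ratio."""
    r_c, v_c = delta_mean_var(y_c, x_c)
    r_t, v_t = delta_mean_var(y_t, x_t)
    # Independence between control and treatment lets the variances add.
    z = (r_t - r_c) / math.sqrt(v_c + v_t)
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

The inputs are user-level lists: `y_c` and `x_c` hold clicks and visits per control user, `y_t` and `x_t` the same for treatment.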

§1.2 Ratio Metrics Sizing

Similar to the conventional cases, when the primary metric is a ratio, after determining the desired type I and type II error levels, we want to solve for the sample size n in the power function such that

Sizing Formula

where the sample size n becomes explicit if we replace the denominator of the right-hand side with the delta method variance formula in §1.1:

Sample Size Formula
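Written out, the per-variant sample size takes the form below. This is reconstructed from the Python function later in this section, so the notation is an assumption: δ is the absolute minimal detectable effect, and z denotes standard normal quantiles for a two-sided test at level α with power 1 − β.

```latex
n = \frac{2\,\tau_C\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2},
\qquad
\delta = \text{relative MDE} \times \frac{\mu_{y,C}}{\mu_{x,C}}
```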

Note that the τ here refers to the statistic’s value calculated from historical data, hence the subscript C, as we haven’t observed the treatment effect at the sizing stage.

Putting everything into action, below is a Python function implementing the logic mentioned above.

import math
from scipy.stats import norm

def ratio_sample_size(num_mean, denom_mean, num_var, denom_var, cov, alpha, power, relative_mde):
    '''
    Copyright 2023 Expedia, Inc. SPDX-License-Identifier: Apache-2.0.
    num_mean: mean of the variable in the numerator.
    denom_mean: mean of the variable in the denominator.
    num_var: sample variance of the variable in the numerator (ddof=1).
    denom_var: sample variance of the variable in the denominator (ddof=1).
    cov: sample covariance between the numerator and denominator variables (ddof=1).
    alpha: significance level, as a float (e.g. 0.05).
    power: desired 1-beta, as a float (e.g. 0.8).
    relative_mde: the minimal detectable effect, relative to the baseline, as a float (e.g. 0.05).
    '''
    # Per-unit variance of the ratio from the delta method (§1.1).
    tau = (num_mean**2)/(denom_mean**2)*(num_var/(num_mean**2) + denom_var/(denom_mean**2) - 2*cov/(num_mean*denom_mean))
    # Two-sided critical value and the quantile for the desired power.
    z_alpha = norm.ppf(1 - alpha/2)
    z_power = norm.ppf(power)
    baseline_ratio = num_mean/denom_mean
    # Convert the relative MDE into an absolute difference in the ratio.
    mde = baseline_ratio*relative_mde
    # Sample size per variant, rounded up.
    n = math.ceil((2*tau*(z_alpha + z_power)**2)/(mde**2))

    result = {'total_sample_size': n*2,
              'each_variant_sample_size': n,
              'baseline_effect': baseline_ratio,
              'expected_lift': mde
              }
    return result
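The five summary statistics the function expects can be computed from historical user-level data. The sketch below shows one way to do this with the standard library, using an n-1 denominator throughout to match the documented inputs. The helper name `sizing_inputs` and the toy data are hypothetical.

```python
from statistics import mean, variance

def sizing_inputs(clicks, visits):
    """Summarise user-level clicks (numerator) and visits (denominator)."""
    n = len(clicks)
    num_mean, denom_mean = mean(clicks), mean(visits)
    # Sample covariance with the same n-1 denominator as statistics.variance.
    cov = sum((y - num_mean) * (x - denom_mean)
              for y, x in zip(clicks, visits)) / (n - 1)
    return {
        "num_mean": num_mean,
        "denom_mean": denom_mean,
        "num_var": variance(clicks),
        "denom_var": variance(visits),
        "cov": cov,
    }

# Toy historical data: one entry per user.
inputs = sizing_inputs([3, 0, 4, 2], [3, 1, 2, 2])
```

The keys match the parameter names of `ratio_sample_size`, so the dictionary can be unpacked into it, for example `ratio_sample_size(**inputs, alpha=0.05, power=0.8, relative_mde=0.05)`.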

§2 Simulate to Evaluate Power and Significance

To evaluate the sample size calculator, we mainly want to know whether the calculated sample size achieves the nominal power and significance level (size). We can check this with the simulation outlined below:

  1. Extract a random sample of the data, both the numerator (number of clicks) and the denominator (number of visits). The sample size is determined by the ratio statistic sample size calculator, with certain values of input configurations such as type I error, power and Minimal Detectable Effect (MDE).
  2. Randomly assign 50% of the sample as control and the other 50% as treatment.
  3. Calculate the p-value of the comparison between control and treatment.
  4. For the data in the treatment group, lift the numerator metric by the true effect size wherever it is non-zero. For example, if a user has 3 clicks in the real data and we simulate a true effect of 5%, that user has 3.15 clicks in treatment.
  5. Calculate the p-value comparing the control with the newly augmented treatment in step 4.
  6. Repeat steps 1–5 N times; the proportion of p-values in step 3 that are smaller than α=0.05 is the empirical significance level. The proportion of p-values in step 5 that are smaller than α=0.05 is the empirical power.
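The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the results in this article: the user population is synthetic (`make_users` is a hypothetical generator), the helper names are made up, the MDE is set equal to the true effect, and only the standard library is used.

```python
import math
import random
from statistics import NormalDist

def make_users(count, seed=1):
    """Synthetic (clicks, visits) pairs; a stand-in for historical data."""
    rng = random.Random(seed)
    users = []
    for _ in range(count):
        visits = rng.randint(1, 5)
        clicks = sum(rng.randint(0, 2) for _ in range(visits))
        users.append((clicks, visits))
    return users

def tau(y, x):
    """Per-user delta-method variance of the ratio mean(y) / mean(x)."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    vy = sum((a - my) ** 2 for a in y) / (n - 1)
    vx = sum((b - mx) ** 2 for b in x) / (n - 1)
    cxy = sum((a - my) * (b - mx) for a, b in zip(y, x)) / (n - 1)
    r = my / mx
    return (vy - 2 * r * cxy + r ** 2 * vx) / mx ** 2

def p_value(y_c, x_c, y_t, x_t):
    """Two-sided delta-method z-test for a difference in ratios."""
    diff = sum(y_t) / sum(x_t) - sum(y_c) / sum(x_c)
    se = math.sqrt(tau(y_c, x_c) / len(y_c) + tau(y_t, x_t) / len(y_t))
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))

def simulate(users, effect, alpha=0.05, power=0.8, iterations=300, seed=0):
    """Return (empirical significance level, empirical power)."""
    rng = random.Random(seed)
    y_all, x_all = zip(*users)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    mde = effect * sum(y_all) / sum(x_all)  # assumption: MDE equals true effect
    n = math.ceil(2 * tau(y_all, x_all) * z ** 2 / mde ** 2)
    false_pos = true_pos = 0
    for _ in range(iterations):
        sample = rng.sample(users, 2 * n)             # step 1: draw 2n users
        control, treatment = sample[:n], sample[n:]   # step 2: 50/50 split
        y_c, x_c = zip(*control)
        y_t, x_t = zip(*treatment)
        false_pos += p_value(y_c, x_c, y_t, x_t) < alpha     # step 3: A/A test
        y_lift = [v * (1 + effect) for v in y_t]             # step 4: lift clicks
        true_pos += p_value(y_c, x_c, y_lift, x_t) < alpha   # step 5: A/B test
    return false_pos / iterations, true_pos / iterations     # step 6

users = make_users(20000)
alpha_hat, power_hat = simulate(users, effect=0.10)
```

Multiplying every numerator by `1 + effect` leaves zero-click users unchanged, which matches step 4. The sketch uses 300 iterations and a 10% effect to keep the pure-Python run short; both values are arbitrary choices.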

Note that the data shown here are for illustrative purposes only. They do not represent actual Expedia Group™ data.

First, consider 1,000 iterations where the nominal power is 80%, the significance level is 5%, and the true effect is 5%. In each iteration, we follow the simulation steps to create a 5% relative lift in the number of clicks. The two histograms below show the before-and-after effect of synthetically lifting the number of clicks by 5%:

Before augmentation, the distributions of the control and treatment are identical.
Data in the treatment group now has a lift configured by the true effect size (5%).

We also assume the minimal detectable effect is the same as the true effect size. Then we check whether the test p-value is below 5%, meaning the effect is successfully detected by our test procedure. Out of the 1,000 iterations, the test results are significant 83.4% of the time, so the actual power is close to, and slightly above, the 80% nominal level. This is demonstrated in the graph below.

The pink histogram of p-values without any synthetic augmentation approximates a uniform distribution, as expected from an A/A test. The light blue histogram is the p-values after 5% synthetic augmentation in the treatment groups.

Next, we simulate various true effect sizes. The nominal power and significance level are always 80% and 5%, respectively. We see that the simulated actual power and type I error rate are consistent with the nominal ones:

Simulation of various true effect sizes, each of which was done 1000 times. Actual power and significance level align with pre-defined nominal ones.

§3 Sample Size Comparison

A question often raised is “How much more sample does a ratio metric require, compared with treating the metric as an ordinary one?”

Since the ratio metric accounts for covariance between the numerator metric and denominator metric, experimenters are interested in knowing if the ratio metric will significantly increase the test duration. This question can be answered by analytically comparing the two versions of sample size calculators.

Recall the sample size calculator formula for ratio metrics:

Sample Size Formula (Again)

where τ is the population variance approximated by the delta method.

Now consider the sample size calculator when the metric is a conventional proportion or continuous one:

Sample Size Formula for Non-ratio Metrics

where σ² is the population variance of the conventional metric.

Comparing n and n’, we can answer the question “How much more sample does a ratio metric need?” by evaluating

Values of τ and σ² are estimated from historical data.
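One plausible way to write this comparison is sketched below. It assumes the conventional calculator is the standard two-sample formula and that both calculators use the same α, power and absolute MDE δ, in which case the shared factors cancel and the sample size ratio reduces to a ratio of variances. The notation is an assumption, not taken from the original figure.

```latex
n' = \frac{2\,\sigma^2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2},
\qquad
\frac{n}{n'} = \frac{\tau}{\sigma^2}
```

Under these assumptions, a ratio above 1 means the ratio metric needs more sample than the conventional treatment, and a ratio below 1 means it needs less.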

§4 Summary

At Expedia Group™, ratio metrics are used in experimentation more and more often. We use the delta method to derive the sampling distribution behind ratio metrics, and from it a sizing formula that ensures the test behaves as expected.

Our simulations show that the resulting sample size and testing procedure achieve the desired power and significance level, so they are appropriate to use when the primary metric in an A/B test is a ratio metric.

Acknowledgments

Special thanks to Jonny Carroll and Mirko Pace for their reviews and comments, to Juhi Pathak for the data generation process, and to Cristina McGuire for the insightful discussion during the research.
