Bayesian versus Frequentist A/B testing for the Curious Glossier Data Scientist

There are many articles about the differences between Bayesian and Frequentist philosophies, and most data scientists have pretty strong convictions about which one they prefer. So when it comes to picking which strategy your data science team will adhere to, it can be a pretty contentious topic.

A Frequentist-leaning data scientist might say a Bayesian approach is too subjective because a prior must be selected, while a Bayesian-leaning data scientist might say a Frequentist approach is foolish to ignore our knowledge about the world and focus solely on counts.

At Glossier, we hoped to settle this internal debate by simulating a series of experiments to test how using Bayesian and Frequentist A/B test evaluation methods affected our ability to detect small improvements in metrics. All else equal (sample size, baseline metric, difference between variants, and so on), which method would best detect differences between variants?

A Continuous Growth Experimentation Engine

On the Shopping Experience (ShopEx) team at Glossier, we are focused on building an online beauty shopping experience that’s more inclusive, inspirational, and connected with others than shopping IRL. To do that, we have built a culture around hypothesis-driven development and experimentation. We’re frequently running experiments to learn more about our customers and how we can improve their experience on our e-commerce platform.

A Glossier.com Homepage Component

A Shared Language

Creating this experimentation culture meant we needed a shared language and way of talking about the results of these tests. Most statistics education begins with frequentist testing methods, and partially for that reason we built our test evaluation suite around them. That way we could run many tests and be confident that our team understood why we made our decisions.

The Brutal Truth

Continuously learning how to improve the site may sound great, but the caveat was that many of these tests were smaller changes from which we didn't necessarily expect huge impacts. We had initially thought that if we could continuously make incremental improvements to the site, we would reap the rewards later on. Further, we could take the learnings from these smaller tests to better understand what our big levers were and where to invest in the future. But the reality is that testing for small incremental changes requires large samples.

Small Changes Mean Long Runtimes and High Opportunity Costs

We struggled to prioritize these smaller tests and to decide what kind of MDE (minimum detectable effect) to set. Using a realistic MDE based on results from previous tests, we were calculating rather large required sample sizes.

For more highly trafficked sites, a large sample may just mean turning up the exposure, but if you are already testing at 50/50, you have to run your test for longer. The opportunity cost of running a small test is that it delays a larger test that could be more impactful. On the other hand, when we sized our tests to larger MDEs to reduce experiment run times, we were likely to forgo detecting these small changes.

Below, if we have a conversion rate of 10%, we run a test where:

x% of the time the minimum size effect will be detected (the test's power)
x% of the time a difference will be detected when one does not exist (the significance level)

We can calculate the range of sample sizes we need based on the minimum detectable effect. To detect a 1% change in a 10% conversion rate requires 99 times more samples than detecting a 10% change in conversion. For many industries, a 1% change in a metric doesn't matter much, but for an e-commerce site with a relatively high average order value, we do care about changes that small.
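To see where numbers like that come from, here is a minimal sketch of the sample-size math using statsmodels' power calculations. The 80% power and 5% significance level are illustrative assumptions, not necessarily the thresholds used in our actual sizing.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10
analysis = NormalIndPower()

sample_sizes = {}
for relative_lift in (0.01, 0.10):  # a 1% vs. a 10% relative change
    effect = proportion_effectsize(baseline * (1 + relative_lift), baseline)
    # Per-group sample size at 80% power and a 5% significance level
    sample_sizes[relative_lift] = analysis.solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative='two-sided'
    )

# The 1% lift needs roughly two orders of magnitude more samples than the 10% lift
ratio = sample_sizes[0.01] / sample_sizes[0.10]
```

Because required sample size scales roughly with the inverse square of the effect size, shrinking the detectable lift by a factor of ten inflates the sample by a factor of about a hundred.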

Enter Bayesian Testing

I had heard that with Bayesian testing, we could run tests more quickly and more accurately, but wasn’t sure how much more quickly or how much more accurately. I decided to run a simulation of experiments and evaluate them using both methodologies. I was mostly interested in tracking false negative rates because we were worried that we were missing incremental improvements.

Evaluating Experiments with Slightly Better Variants

We care a lot about conversion rate on site. Let’s pretend that the baseline conversion rate is 10% and then let’s pretend we have a small relative increase in our variant’s conversion rate. We can use numpy to simulate the results of both our control and variant.

import numpy as np

slight_improvement = 0.01  # e.g., a 1% relative lift in the variant
control = np.random.binomial(1, 0.1, size=100000)
variant = np.random.binomial(1, 0.1 * (1 + slight_improvement), size=100000)

In our frequentist evaluation method, we can apply a Chi-Squared test, treating 0 as "no order" and 1 as "at least one order" as our categorical variables. If this test returned a p-value greater than 0.05, we would fail to reject the null hypothesis of the Chi-Squared test and should pick the control as the winner. This would be an example of a false negative because we know that the variant is actually better than the control.

import pandas as pd
from scipy.stats import chi2_contingency

# bucket is "control" or "variant"; result is the simulated outcome drawn from a binomial distribution
chi2, p_val, dof, ex = chi2_contingency(pd.crosstab(df['bucket'], df['result']))
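To make that snippet self-contained, one way to assemble `df` from the simulated draws looks like the following. This is a sketch; the long-format layout (one row per visitor) is an assumption about the data shape, not our actual pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
size = 100000
control = rng.binomial(1, 0.10, size=size)
variant = rng.binomial(1, 0.10 * 1.01, size=size)  # known 1% relative lift

# One row per visitor: which bucket they were in and whether they ordered
df = pd.DataFrame({
    'bucket': np.repeat(['control', 'variant'], size),
    'result': np.concatenate([control, variant]),
})

chi2, p_val, dof, ex = chi2_contingency(pd.crosstab(df['bucket'], df['result']))
# With a known 1% lift, p_val > 0.05 here would be a false negative
```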

For a Bayesian method, we model the conversion rate as a binomial distribution. The binomial distribution is the discrete probability distribution of the number of successes (purchases) in a sequence of n independent experiments (site visitors).

We establish a prior using a Beta distribution and a likelihood using our (simulated) experimental observations. The biggest question for me was what we should use as our prior. I chose one close to my known conversion rate (I know the conversion rate in this simulation, but I also know our historical on-site rate).
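As an illustration of what "close to my known conversion rate" can mean, one common way to encode a belief like "conversion is around 10%" is to choose Beta parameters whose mean matches the historical rate. The specific values below are illustrative, not the parameters used in the analysis.

```python
import numpy as np

# Illustrative assumption: encode "conversion is around 10%" by choosing
# Beta(alpha, beta) with mean alpha / (alpha + beta) = 0.10; a larger
# alpha + beta makes the prior more opinionated (lower variance).
alpha, beta = 2, 18
prior_mean = alpha / (alpha + beta)

# Sanity check by sampling from the prior
prior_draws = np.random.default_rng(0).beta(alpha, beta, size=10000)
```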

Next, we run Monte Carlo simulations (using pymc3's implementation of the NUTS sampler). From the results of that simulation, we can derive a probability that the variant is better than the control. If the probability that the variant is better is lower than 95%, we say the result is a false negative.
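Because the Beta prior is conjugate to the binomial likelihood, that decision rule can also be sketched without MCMC by sampling the two posteriors directly. This is a simplification of the PyMC3 approach below, with an illustrative seed, sample size, and lift.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated experiment: 10% baseline, known 2% relative lift in the variant
n1 = n2 = 100000
obs_v1 = rng.binomial(n1, 0.10)         # control conversions
obs_v2 = rng.binomial(n2, 0.10 * 1.02)  # variant conversions

# Beta(2, 2) prior + binomial likelihood -> Beta posterior (conjugacy),
# so we can draw posterior samples for each bucket's conversion rate
post_v1 = rng.beta(2 + obs_v1, 2 + n1 - obs_v1, size=50000)
post_v2 = rng.beta(2 + obs_v2, 2 + n2 - obs_v2, size=50000)

prob_variant_better = (post_v2 > post_v1).mean()
is_false_negative = prob_variant_better < 0.95  # we know the variant is better
```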

Sidenote: a big discussion for our team has been, if the p-value threshold has been so widely maligned, how is a 95% probability threshold for a Bayesian method any different?

import pymc3 as pm

with pm.Model() as model:
    # define priors
    prior_v1 = pm.Beta('prior_v1', alpha=2, beta=2)
    prior_v2 = pm.Beta('prior_v2', alpha=2, beta=2)
    # define likelihoods
    like_v1 = pm.Binomial('like_v1', n=n1, p=prior_v1, observed=obs_v1)
    like_v2 = pm.Binomial('like_v2', n=n2, p=prior_v2, observed=obs_v2)
    # define metrics
    pm.Deterministic('Variant - Control (Difference)', prior_v2 - prior_v1)
    pm.Deterministic('relation', (prior_v2 / prior_v1) - 1)
    trace = pm.sample(draws=50000, step=pm.NUTS(), progressbar=True)

_ = pm.plot_posterior(trace[1000:], varnames=['Variant - Control (Difference)'],
                      ref_val=0, color='#87ceeb')
Sample output from PyMC3 showing a false negative (78% probability the variant was larger than control, with a known 1% increase) and a true positive (97% probability the variant was larger than control, with a known 2% increase)

Simulating the Experiments and Calculating the False Negative Rate

What we learned pretty quickly during the simulation process was how much more computationally intensive the Bayesian method was. Simulating a simulation blows up fast. For that reason, I ran fewer simulations than I would have liked; in the future, I'd like to recreate and expand on this analysis by running the script remotely on a cluster.

For relative incremental increases in conversion of 1%, 1.5%, and so on up to 4.5%, I measured the false negative rate over 75 experiments using both the frequentist and Bayesian methods, then calculated a binomial confidence interval around those rates.

from statsmodels.stats.proportion import proportion_confint

# count = number of false negatives observed, nobs = number of simulated experiments
ci_low, ci_upp = proportion_confint(count, nobs, alpha=0.05, method='normal')
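Putting the pieces together, here is a minimal sketch of the frequentist arm of that simulation. The per-bucket sample size, lift, and seed are illustrative assumptions; the original analysis ran 75 experiments per lift value.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(7)
n_experiments = 75
size = 10000            # visitors per bucket (illustrative)
lift = 0.015            # known 1.5% relative improvement in the variant

false_negatives = 0
for _ in range(n_experiments):
    control = rng.binomial(1, 0.10, size=size)
    variant = rng.binomial(1, 0.10 * (1 + lift), size=size)
    table = pd.crosstab(
        np.repeat(['control', 'variant'], size),
        np.concatenate([control, variant]),
    )
    _, p_val, _, _ = chi2_contingency(table)
    if p_val > 0.05:    # the variant truly is better, so this is a miss
        false_negatives += 1

fn_rate = false_negatives / n_experiments
ci_low, ci_upp = proportion_confint(false_negatives, n_experiments,
                                    alpha=0.05, method='normal')
```

The Bayesian arm follows the same loop shape, with the chi-squared test swapped for the posterior-probability decision rule.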

The Bayesian method appeared to have a lower false negative rate than the frequentist method, but the difference was only statistically significant for an increase of 1.5%.

For the small changes, we found a very high false negative rate: over 90% for the Frequentist method and nearly 90% for the Bayesian method. That was much too high for me to justify the engineering time spent setting up an experiment when I only expected a 1–2% lift.

No Free Lunch, Maybe Free Hershey’s Kisses

As I expected, I couldn’t magically get a free lunch by switching to Bayesian evaluation methods. What I got was more like a free piece of candy off the receptionist’s desk. Using a Bayesian method did seem to reduce my false negative rate. However, that reduction didn’t reduce my false negative rate enough for me to justify running short tests I expected small increases from.

In the future, we will probably continue working with Bayesian evaluation methods but we will also be more judicious about which experiments we invest in.

This exercise did end up having an impact on how our team thinks about which tests to run. We are leaning more towards running bigger experiments where we think we can make a meaningful impact rather than testing continuously. We've found that spending too much time running small experiments is a waste of energy because, regardless of the evaluation method you use, you've got to size your experiment appropriately.

Next Steps

We’d like to rerun this analysis looking at false positives and false negatives over a wider range of priors and baseline metrics. Our product managers routinely estimate the impacts of their tests before they run them. We’d like to productize the results of this and similar simulations so our team is empowered to find out for themselves if their experiment is likely to detect the impact they except.