A/B Testing for Product Managers: The Data Science Behind the Decisions

Blake Swineford
4 min read · Aug 15, 2023

A/B testing has emerged as a powerful tool for making informed product decisions. But it’s not just about pitting two versions against each other and going with your gut. The real magic lies in the underlying math and statistics that validate these decisions. Here’s what every product manager should know about the process:

1. Determining Sample Size

Before you start an A/B test, you need to know how many users you should include to get reliable results. This is your sample size.

Factors affecting sample size:

  • Baseline conversion rate: This is your starting point. It’s easier to detect a 10% relative lift from a 50% baseline (50% to 55%) than from a 5% baseline (5% to 5.5%), even though both are the same relative increase. The lower your baseline conversion rate, the larger the sample you’ll need to detect a given relative change.
  • Minimum detectable effect: This is the smallest change that would matter for your business. It’s chosen based on business needs. A small company might need a 10% increase to justify a change, while a larger company might only need 2%.
  • Statistical power: Traditionally set at 80%, this means there’s an 80% chance of detecting a real effect when it exists. Why 80%? It’s a balance between confidence in your results and the practicality of collecting a larger sample.
  • Significance level (α): Commonly 5%, this is your risk of concluding there’s an effect when there isn’t (a false positive).

How do you determine the right sample size?

While the math can seem complex, there are tools to simplify the process. One of the most user-friendly calculators is provided by Evan Miller. Input your baseline conversion rate and the desired minimum detectable effect, and the calculator will return the required sample size for each group. (This tool assumes standard values for statistical power (80%) and significance level (5%), which are typical for most A/B tests.)
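If you’d like to sanity-check a calculator’s output yourself, here is a minimal sketch of the same calculation using the standard two-proportion sample size formula. The 10% baseline and 12% target are made-up numbers for illustration, and the function name is mine; the formula gives an approximation rather than an exact answer.

```python
# Approximate per-group sample size for comparing two conversion rates.
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical z for a two-sided test at level alpha
    z_power = norm.ppf(power)           # z corresponding to the desired power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

print(sample_size_per_group(0.10, 0.12))  # roughly 3,800 users per group
```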

2. Interpreting Results

Let’s say you’re on the advertising product team at Instagram. Your goal is to test whether increasing the frequency at which ads are displayed will have a positive outcome. Suppose users currently see an ad every 15 posts as they scroll: if that changed to every 10 posts, would revenue increase without negatively impacting user experience?

Let’s assume you have run an A/B test and collected data from both the control group (ads every 15 posts) and the test group (ads every 10 posts). Now you need to determine whether the observed changes in revenue and engagement are statistically significant. For that, let’s calculate the p-value.

Steps to Calculate the p-value:

  1. Set up Hypotheses:
  • Null Hypothesis (H0): There’s no difference in revenue or engagement between the two groups.
  • Alternative Hypothesis (H1): There is a difference in revenue or engagement between the two groups.

  2. Choose a Statistical Test: For A/B testing, the t-test is commonly used to compare the means of two groups.

  3. Perform the Test: Run the t-test on the data from both groups. This is where things get more involved, so lean on your data science teammates if you have them; if you don’t, you can run the test yourself in Python with SciPy (see the sketch after this list).

  4. Get the p-value: The output of the t-test will include a p-value.
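As a minimal sketch of steps 3 and 4, here is what the t-test looks like with SciPy. The arrays below are randomly generated stand-ins for your real per-user revenue exports, and using Welch’s t-test (equal_var=False) is one reasonable choice, not the only one.

```python
# Welch's t-test on per-user revenue for control vs. test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control_revenue = rng.normal(loc=2.00, scale=0.50, size=5000)  # ads every 15 posts
test_revenue = rng.normal(loc=2.10, scale=0.50, size=5000)     # ads every 10 posts

# equal_var=False runs Welch's t-test, which doesn't assume equal variances
t_stat, p_value = stats.ttest_ind(test_revenue, control_revenue, equal_var=False)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")
```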

Interpreting the p-value:

  • If the p-value is less than a predetermined significance level (commonly 0.05), you reject the null hypothesis, meaning the observed difference between the two groups is statistically significant.
  • If the p-value is greater than this level, you fail to reject the null hypothesis, meaning the observed difference might be due to random chance.

Back to our Instagram example: suppose the p-value for revenue is 0.01 and for engagement it is 0.02. Both are below 0.05, so you can conclude that the change in ad frequency likely caused the observed differences in revenue and engagement.

Making a Decision with Weighted Metrics:

Now that you’ve validated the results, you can weigh the metrics based on their importance:

  • Revenue increase: +10% (weight 0.7) = +7
  • Engagement drop: -5% (weight 0.3) = -1.5

Net effect = +5.5
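Here is the same weighted scoring in a few lines of Python, in case you want to experiment with different weights; the weights and percentage changes are just the example numbers above, not a standard.

```python
# Weighted net effect of the test: a positive score suggests the trade-off favors shipping.
metrics = {
    "revenue": {"change_pct": 10.0, "weight": 0.7},
    "engagement": {"change_pct": -5.0, "weight": 0.3},
}
net_effect = sum(m["change_pct"] * m["weight"] for m in metrics.values())
print(round(net_effect, 1))  # 5.5
```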

The net effect is positive, suggesting the change might be beneficial. However, always consider the broader implications. While the revenue boost is appealing, a consistent drop in user engagement could have long-term consequences that outweigh short-term financial gains.

3. Trade-offs and Business Impact

Even if a result is statistically significant, it might not be practically significant. For example, a 0.01% increase in click-through rate might not justify the costs of implementing a change.

Considerations:

  • Long-term effects: Will the change benefit or harm user experience in the long run?
  • Cost vs. Benefit: Is the potential revenue increase worth the drop in another metric?
  • External factors: Are there external events that could be influencing the results?

4. Advanced A/B Testing Methodologies

  • Sequential Testing: This approach allows for periodic checks on test results rather than waiting for the entire sample size to be reached. If a significant difference is observed early, you can conclude the test ahead of schedule, provided the stopping rule is built into the test design so that repeated peeking doesn’t inflate the false-positive rate.
  • Bayesian A/B Testing: Unlike traditional statistical methods, Bayesian testing produces a probability distribution over each variant’s performance, letting you answer questions like “what is the probability that the test variant beats the control?” (see the sketch below).
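Here is a minimal sketch of Bayesian A/B testing for a conversion metric, using Beta posteriors and Monte Carlo sampling. The conversion counts are made-up numbers, and the flat Beta(1, 1) prior is just one reasonable choice.

```python
# Bayesian comparison of two conversion rates with Beta posteriors.
import numpy as np

rng = np.random.default_rng(0)

# Observed data: conversions out of visitors for each variant (illustrative numbers)
control_conv, control_n = 480, 5000   # 9.6% conversion
test_conv, test_n = 530, 5000         # 10.6% conversion

# With a flat Beta(1, 1) prior, the posterior is Beta(conversions + 1, non-conversions + 1)
control_post = rng.beta(control_conv + 1, control_n - control_conv + 1, size=100_000)
test_post = rng.beta(test_conv + 1, test_n - test_conv + 1, size=100_000)

prob_test_better = (test_post > control_post).mean()
print(f"P(test beats control) = {prob_test_better:.1%}")
```

Instead of a single yes/no significance verdict, this gives you a direct statement like “there’s roughly a 95% chance the new ad frequency converts better,” which is often easier to reason about when weighing trade-offs.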

By understanding the nuances of sample size determination, result interpretation, and the broader business implications, you can ensure that your decisions are not only statistically sound but also aligned with long-term business goals.

Blake Swineford

(Ex-Twitter) Product Management, Data Science, & Machine Learning