Experimental Study in Tech: A/B Testing Structure
Experiment Design, Metric Definition, Post-Test Analysis, and A/B Testing Pitfalls & Strategies
Statisticians and data scientists often need to answer a question: does X cause Y?
There are two types of methods to answer this question:
- Observational Study: see this article for more details
- Experimental Study (also called A/B Testing in tech industry)
Experimental Study
By conducting experiments on a small group of users (an artificial intervention), we estimate the causal effect of intervention X on business metric Y, to guide future company strategy.
- In the tech industry, X can be a new feature in a web/app, a new product, or a new promotion strategy; Y can be conversion, user engagement, etc.
- In the clinical industry, X can be a new drug, a new dose, or a new treatment strategy; Y can be side effects, event rates, health condition, etc.
- A successful experimental study requires not only a solid statistical foundation but also domain knowledge specific to the industry and project, such as an understanding of the relevant variables and of user-behavior characteristics, because these are key factors in experimental design and post-test analysis.
A/B Testing Structure
- Define Null & Alternative Hypothesis (clarify research question)
- Metric Definition
- Sample Size Calculation (one of power analysis application)
- Randomization
- Post-Test Analysis
- Customer Targeting and strategies optimization (Please find more details in this article)
1. Define hypothesis (clarify research question)
- Target Population: all users, or users from a specific subgroup?
- Unit of Diversion: user ID? cookie? (i.e., how a subject or unit is defined in this experiment)
- Treatment: the intervention of interest in the experiment (X)
Usually only one intervention is defined
But if there is more than one treatment group: ANOVA + multiple t-tests
Or if a ratio evaluation metric has more than 2 categories: Chi-square test
The intervention must be clearly defined; it cannot be a vague concept
- Evaluation Metric: outcomes of interest in the experiment (Y)
2. Metric Definition
Usually we need to define two types of metrics
Evaluation Metric
- Counts: DAU (# of Daily Active Users), WAU, MAU
It's important to define specifically what user behavior counts as "active":
login time exceeding 5 minutes? Clicking on a certain interface (e.g., an Instagram story)?
- Distribution Metric: average session time on a site, average number of clicks before purchase/conversion
- Ratio / Probability: conversion rate (for a specific user action: purchase, clicks, upgrade, etc.), retention rate, user stickiness (DAU/MAU)
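As an illustration of how these count metrics combine, here is a minimal sketch computing DAU, MAU, and stickiness from a toy event log. The definition of "active" used here (at least one logged event that day) is an assumption for the example, per the note above that this must be defined explicitly.

```python
from datetime import date

# Hypothetical event log: (user_id, activity_date) pairs.
# "Active" is assumed to mean "logged at least one event that day".
events = [
    (1, date(2024, 3, 1)), (1, date(2024, 3, 2)),
    (2, date(2024, 3, 1)),
    (3, date(2024, 3, 2)),
]

def dau(events, day):
    """Count of distinct users active on a given day."""
    return len({u for u, d in events if d == day})

def mau(events, year, month):
    """Count of distinct users active at any point in the given month."""
    return len({u for u, d in events if d.year == year and d.month == month})

# Stickiness = DAU / MAU for the chosen day and month
stickiness = dau(events, date(2024, 3, 2)) / mau(events, 2024, 3)
```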
Invariant Metric
- Metrics that shouldn't show significant changes during the experiment, because their invariance preserves the randomization quality of the experiment.
- The selection of invariant metrics is related to the selection of evaluation metrics; the overall principle is to ensure the randomness of the experiment.
Ex. Pre-conversion behavior, Device type, User count/Visits, Geographic location
3. Sample Size Calculation
- [2]: two groups, treatment and control
- [μ_c − μ_t]: effect size
The practically significant difference between the treatment and control groups that the company would like to detect.
The smaller the effect size, the larger the required sample size, and the longer the experimental period. However, when the sample size is large enough, even a tiny effect can be statistically significant, yet such a tiny effect may have no meaningful impact on business decisions.
- [σ]: estimated standard deviation of the evaluation metric (outcome), based on expert domain knowledge OR historical data
- [α]: significance level (acceptable type I error: 0.1, 0.05, 0.01); a lower α requires a larger sample size
- [β]: type II error; power = 1 − β (e.g., power 0.8 → β = 0.2); a lower β (higher power) requires a larger sample size
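These inputs combine into the standard two-sample formula n = 2σ²(z₁₋α/₂ + z₁₋β)² / (μ_c − μ_t)² per group. A minimal sketch in Python using scipy (the delta and sigma values in the example are hypothetical):

```python
import math
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Required sample size per group for a two-sided, two-sample test.

    delta : minimum detectable effect (mu_c - mu_t) of practical interest
    sigma : estimated SD of the evaluation metric
    alpha : significance level (acceptable type I error)
    power : 1 - beta (probability of detecting a true effect of size delta)
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the type I error
    z_beta = norm.ppf(power)           # critical value for the type II error
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Example: detect a 0.5-unit lift on a metric with SD 4
n_per_group = sample_size_per_group(delta=0.5, sigma=4)
```

Note how halving delta roughly quadruples the required n, which is why chasing tiny effects makes experiments long and expensive.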
4. Randomization
Randomization has two implications:
- The sample in the experiment can represent the target population
- All covariates are uniformly distributed across the two groups
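In practice, randomization is often implemented as deterministic hash-based bucketing of the unit of diversion (e.g., the user ID), so each unit gets a stable, uniform assignment. A minimal sketch; the MD5 choice and the experiment name are illustrative assumptions:

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a unit into treatment or control.

    Hashing user_id together with an experiment-specific salt gives a
    stable, roughly uniform assignment, independent across experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a 0-99 bucket
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same group for a given experiment
group = assign_group("user_42", "new_checkout")
```

Salting with the experiment name prevents the same users from always falling into the treatment arm across different experiments.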
5. Post-test Analysis
Sanity Check
- The invariant metric should not show significant differences.
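One standard sanity check is a chi-square goodness-of-fit test on the group counts themselves (a sample ratio mismatch check), since under a 50/50 split the observed sizes should not differ significantly from the expected ones. A sketch with hypothetical counts:

```python
from scipy.stats import chisquare

# Hypothetical observed group sizes under an intended 50/50 split
observed = [50312, 49706]          # [treatment, control]
total = sum(observed)
expected = [total / 2, total / 2]  # expected counts under the split

stat, p_value = chisquare(observed, f_exp=expected)
# A very small p-value (e.g. < 0.01) flags a randomization problem
# worth investigating before trusting any treatment-effect estimate.
```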
Parametric Test
If the evaluation metric follows a Normal distribution
(a binary evaluation metric is also approximately Normal for large samples)
- Student's t-test: the SD of the evaluation metric is similar across the 2 groups
- Welch's t-test: the SD of the evaluation metric differs across the 2 groups
- ANOVA: when there is more than one treatment group
1. ANOVA: check whether at least one group differs significantly from the others
2. Multiple pairwise t-tests: check which two groups differ significantly (with p-value adjustment by Bonferroni correction OR False Discovery Rate)
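The parametric workflow above can be sketched with scipy on simulated data. The means, SDs, sample sizes, and group names below are illustrative assumptions, not values from the article:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # simulated metric values
treatment = rng.normal(loc=10.4, scale=2.5, size=500)  # different mean and SD

# Welch's t-test: does not assume equal SDs (equal_var=False)
t_stat, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

# With more than one treatment group: ANOVA first, then pairwise t-tests
# with a Bonferroni correction (multiply each p-value by the number of tests)
treatment_b = rng.normal(loc=10.8, scale=2.0, size=500)
f_stat, p_anova = stats.f_oneway(control, treatment, treatment_b)

groups = {"control": control, "treat_a": treatment, "treat_b": treatment_b}
names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
adjusted = {
    (a, b): min(1.0, stats.ttest_ind(groups[a], groups[b],
                                     equal_var=False).pvalue * len(pairs))
    for a, b in pairs
}
```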
Non-Parametric Test
If the evaluation metric doesn't follow a Normal distribution, the SDs differ, or N is small
- Wilcoxon Rank-Sum Test (Mann-Whitney U test): evaluates ranks instead of the actual values of the evaluation metric
If the evaluation metric is a categorical variable (≥ 2 categories)
- Chi-square test : when n>1000
- Fisher’s exact test: when n<1000
Although non-parametric tests are less powerful and less precise, they are more robust (small sample sizes; skewed distributions; outliers; unequal SDs). The same experimental result can often be evaluated with both parametric and non-parametric tests.
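The non-parametric options can be sketched with scipy as well. The simulated session times and the 2×2 conversion table below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Skewed (log-normal) session times, where a t-test would be unreliable
control = rng.lognormal(mean=1.0, sigma=0.8, size=40)
treatment = rng.lognormal(mean=1.4, sigma=0.8, size=40)

# Wilcoxon rank-sum / Mann-Whitney U: compares ranks, not raw values
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Binary outcome with small n: Fisher's exact test on a 2x2 table
#                 converted  not converted
table = [[18, 82],   # treatment
         [ 9, 91]]   # control
odds_ratio, p_fisher = stats.fisher_exact(table)
```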
A/B testing pitfalls & solutions
Stopping the A/B test too early
- Generally, experiment duration = sample size / average daily traffic
- But if the calculated duration is less than one week, or is not a multiple of 7 days, we should extend it to two weeks (two business cycles) to account for the seasonality of weekends and holidays.
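This duration rule can be sketched as follows; the 14-day floor implements the two-week (two business cycles) guideline, and the traffic numbers in the example are hypothetical:

```python
import math

def experiment_duration_days(n_per_group, n_groups, daily_traffic, min_days=14):
    """Experiment length in days: total required sample / average daily
    traffic, rounded up to a whole number of weeks and floored at two
    business cycles (14 days) to absorb weekend/holiday seasonality."""
    raw_days = math.ceil(n_per_group * n_groups / daily_traffic)
    weeks = math.ceil(max(raw_days, min_days) / 7)
    return weeks * 7

# 1005 users per group, 2 groups, 400 eligible users/day:
# 6 raw days, extended to the 14-day floor
duration = experiment_duration_days(1005, 2, 400)
```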
Network effect
- On social media platforms, the treatment and control groups may belong to the same social circle. The control group's behavior is influenced by exposure to the treatment group, so the treatment effect spills over into the control group, leading to an underestimate of the difference in effects: a false negative.
Novelty effect / Primacy effect
- When a new product is just launched, the audience's reaction may not reflect its true value; it could be driven by curiosity or bias.
- Extend the experiment duration and wait for the novelty effect to fade.
- OR limit the sample to new users, whose perspectives and experiences are fresh.
Evaluation metrics have opposite results
- Ex. the conversion rate increases, but the retention rate decreases
- This reflects the trade-off between the company's long-term and short-term goals
Summary
- Define Null & Alternative Hypothesis (clarify research question)
- Metric Definition
- Sample Size Calculation (one of power analysis application)
- Randomization
- Post-Test Analysis
- Customer Targeting and strategies optimization (Please find more details in this article)
If you find this article helpful, please clap and follow to inspire me. I will publish related blogs on data science and statistical analysis regularly!
Thanks for reading, and feel free to leave comments and discuss!