Experimental Study in Tech: A/B Testing Structure
Experiment Design, Metric Definition, Post-Test Analysis, and A/B Testing Pitfalls & Strategies
Statisticians and data scientists often need to answer a question: does X cause Y?
There are two types of methods to answer this question:
- Observational Study: see this article for more details
- Experimental Study (also called A/B Testing in tech industry)
Experimental Study
By conducting experiments on a small group of users (an artificial intervention), we estimate the causal effect of intervention X on business metric Y, to guide future company strategy.
- In the tech industry, X can be a new feature in a web/app, a new product, or a new promotion strategy; Y can be conversion, user engagement, etc.
- In the clinical industry, X can be a new drug, a new dose, or a new treatment strategy; Y can be side effects, event rates, health condition, etc.
- A successful experimental study requires not only a solid statistical foundation but also domain knowledge specific to the industry and project, such as an understanding of the relevant variables and of user-behavior characteristics, because these are key factors in experimental design and post-test analysis.
A/B Testing Structure
- Define Null & Alternative Hypothesis (clarify research question)
- Metric Definition
- Sample Size Calculation (one of power analysis application)
- Randomization
- Post-Test Analysis
- Customer Targeting and strategies optimization (Please find more details in this article)
1. Define hypothesis (clarify research question)
- Target Population: all users, or users from a specific subgroup?
- Unit of Diversion: user ID? cookie? (i.e., how a subject or unit is defined in this experiment)
- Treatment: the intervention of interest in the experiment (X)
Usually only one intervention is defined
But if there is more than one treatment group: ANOVA + multiple t-tests
Or if a ratio evaluation metric has more than 2 categories: Chi-square test
The intervention must be clearly defined; it cannot be a vague concept
- Evaluation Metric: outcomes of interest in the experiment (Y)
2. Metric Definition
Usually we need to define two types of metrics
Evaluation Metric
- Counts: DAU (# of Daily Active Users), WAU, MAU
It's important to define specifically what user behavior counts as "active":
login time exceeding 5 minutes? Clicking on a certain interface (e.g., an Instagram story)?
- Distribution Metric: average session time on a site, average number of clicks before purchase/conversion
- Ratio / Probability: conversion rate (for a specific user action: purchase, clicks, upgrade, etc.), retention rate, user stickiness (DAU/MAU)
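As an illustration of how these count metrics combine, here is a minimal sketch computing DAU, MAU, and stickiness from a toy event log. The definition of "active" used here (at least one logged event that day) is an assumption for the example, per the note above that this must be defined explicitly.

```python
from datetime import date

# Hypothetical event log: (user_id, activity_date) pairs.
# "Active" is assumed to mean "logged at least one event that day".
events = [
    (1, date(2024, 3, 1)), (1, date(2024, 3, 2)),
    (2, date(2024, 3, 1)),
    (3, date(2024, 3, 2)),
]

def dau(events, day):
    """Count of distinct users active on a given day."""
    return len({u for u, d in events if d == day})

def mau(events, year, month):
    """Count of distinct users active at any point in the given month."""
    return len({u for u, d in events if d.year == year and d.month == month})

# Stickiness = DAU / MAU for the chosen day and month
stickiness = dau(events, date(2024, 3, 2)) / mau(events, 2024, 3)
```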
Invariant Metric
- Metrics that shouldn't show significant changes during the experiment, because their invariance preserves the randomization quality of the experiment.
- The selection of invariant metrics is related to the selection of evaluation metrics; the overall principle is to ensure the randomness of the experiment.
Ex. Pre-conversion behavior, Device type, User count/Visits, Geographic location
3. Sample Size Calculation
- [2]: two groups, treatment and control
- [μ_c − μ_t]: effect size
The practically significant difference between the treatment and control groups that the company would like to detect.
The smaller the effect size, the larger the required sample size, and the longer the experimental period. However, when the sample size is large enough, even a tiny effect can be statistically significant, yet such a tiny effect may have no meaningful impact on business decisions.
- [σ]: estimated standard deviation of the evaluation metric (outcome), based on expert domain knowledge OR historical data
- [α]: significance level (acceptable type I error: 0.1, 0.05, 0.01); a lower α requires a larger sample size
- [β]: type II error; power = 1 − β (e.g., power 0.8 → β = 0.2); a lower β (higher power) requires a larger sample size
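These inputs combine into the standard two-sample formula n = 2σ²(z₁₋α/₂ + z₁₋β)² / (μ_c − μ_t)² per group. A minimal sketch in Python using scipy (the delta and sigma values in the example are hypothetical):

```python
import math
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Required sample size per group for a two-sided, two-sample test.

    delta : minimum detectable effect (mu_c - mu_t) of practical interest
    sigma : estimated SD of the evaluation metric
    alpha : significance level (acceptable type I error)
    power : 1 - beta (probability of detecting a true effect of size delta)
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the type I error
    z_beta = norm.ppf(power)           # critical value for the type II error
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Example: detect a 0.5-unit lift on a metric with SD 4
n_per_group = sample_size_per_group(delta=0.5, sigma=4)
```

Note how halving delta roughly quadruples the required n, which is why chasing tiny effects makes experiments long and expensive.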
4. Randomization
Randomization has two implications:
- The sample in the experiment can represent the target population
- All covariates are uniformly distributed across the two groups
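In practice, randomization is often implemented as deterministic hash-based bucketing of the unit of diversion (e.g., the user ID), so each unit gets a stable, uniform assignment. A minimal sketch; the MD5 choice and the experiment name are illustrative assumptions:

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a unit into treatment or control.

    Hashing user_id together with an experiment-specific salt gives a
    stable, roughly uniform assignment, independent across experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a 0-99 bucket
    return "treatment" if bucket < treatment_pct else "control"

# The same user always lands in the same group for a given experiment
group = assign_group("user_42", "new_checkout")
```

Salting with the experiment name prevents the same users from always falling into the treatment arm across different experiments.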
5. Post-test Analysis
Sanity Check
- The invariant metric should not show significant differences.
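One standard sanity check is a chi-square goodness-of-fit test on the group counts themselves (a sample ratio mismatch check), since under a 50/50 split the observed sizes should not differ significantly from the expected ones. A sketch with hypothetical counts:

```python
from scipy.stats import chisquare

# Hypothetical observed group sizes under an intended 50/50 split
observed = [50312, 49706]          # [treatment, control]
total = sum(observed)
expected = [total / 2, total / 2]  # expected counts under the split

stat, p_value = chisquare(observed, f_exp=expected)
# A very small p-value (e.g. < 0.01) flags a randomization problem
# worth investigating before trusting any treatment-effect estimate.
```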
Parametric Test
If the evaluation metric follows a Normal distribution
(a binary evaluation metric is also approximately Normal for large samples)
- Student's t-test: the SD of the evaluation metric is similar across the 2 groups
- Welch's t-test: the SD of the evaluation metric differs across the 2 groups
- ANOVA: when there is more than one treatment group
1. ANOVA: check whether at least one group differs significantly from the others
2. Multiple pairwise t-tests: check which two groups differ significantly (with p-value adjustment by Bonferroni correction OR False Discovery Rate)
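The parametric workflow above can be sketched with scipy on simulated data. The means, SDs, sample sizes, and group names below are illustrative assumptions, not values from the article:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # simulated metric values
treatment = rng.normal(loc=10.4, scale=2.5, size=500)  # different mean and SD

# Welch's t-test: does not assume equal SDs (equal_var=False)
t_stat, p_welch = stats.ttest_ind(treatment, control, equal_var=False)

# With more than one treatment group: ANOVA first, then pairwise t-tests
# with a Bonferroni correction (multiply each p-value by the number of tests)
treatment_b = rng.normal(loc=10.8, scale=2.0, size=500)
f_stat, p_anova = stats.f_oneway(control, treatment, treatment_b)

groups = {"control": control, "treat_a": treatment, "treat_b": treatment_b}
names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
adjusted = {
    (a, b): min(1.0, stats.ttest_ind(groups[a], groups[b],
                                     equal_var=False).pvalue * len(pairs))
    for a, b in pairs
}
```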
Non-Parametric Test
If the evaluation metric doesn't follow a Normal distribution, the SDs differ, or N is small
- Wilcoxon Rank-Sum Test (Mann-Whitney U test): evaluates ranks instead of the actual values of the evaluation metric
If the evaluation metric is a categorical variable (≥ 2 categories)
- Chi-square test : when n>1000
- Fisher’s exact test: when n<1000
Although non-parametric tests are less powerful and less precise, they are more robust (small sample sizes; skewed distributions; outliers; unequal SDs). The same experimental result can often be evaluated with both parametric and non-parametric tests.
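The non-parametric options can be sketched with scipy as well. The simulated session times and the 2×2 conversion table below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Skewed (log-normal) session times, where a t-test would be unreliable
control = rng.lognormal(mean=1.0, sigma=0.8, size=40)
treatment = rng.lognormal(mean=1.4, sigma=0.8, size=40)

# Wilcoxon rank-sum / Mann-Whitney U: compares ranks, not raw values
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Binary outcome with small n: Fisher's exact test on a 2x2 table
#                 converted  not converted
table = [[18, 82],   # treatment
         [ 9, 91]]   # control
odds_ratio, p_fisher = stats.fisher_exact(table)
```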
A/B testing pitfalls & solutions
Stopping the A/B test too early
- Generally, experiment duration = sample size / average daily traffic
- But if the calculated duration is less than one week, or is not a multiple of 7 days, we should extend it to two weeks (two business cycles) to account for the seasonality of weekends and holidays.
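This duration rule can be sketched as follows; the 14-day floor implements the two-week (two business cycles) guideline, and the traffic numbers in the example are hypothetical:

```python
import math

def experiment_duration_days(n_per_group, n_groups, daily_traffic, min_days=14):
    """Experiment length in days: total required sample / average daily
    traffic, rounded up to a whole number of weeks and floored at two
    business cycles (14 days) to absorb weekend/holiday seasonality."""
    raw_days = math.ceil(n_per_group * n_groups / daily_traffic)
    weeks = math.ceil(max(raw_days, min_days) / 7)
    return weeks * 7

# 1005 users per group, 2 groups, 400 eligible users/day:
# 6 raw days, extended to the 14-day floor
duration = experiment_duration_days(1005, 2, 400)
```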
Network effect
- On social media platforms, the treatment and control groups may belong to the same social circle. The control group's behavior is influenced by exposure to the treatment group, so the treatment effect spills over into the control group, leading to an underestimate of the difference in effects: a false negative.
Novelty effect / Primacy effect
- When a new product is just launched, the audience's reaction may not reflect its true value; it could be driven by curiosity or bias.
- Extend the experiment duration and wait for the novelty effect to fade.
- OR limit the sample to new users, whose perspectives and experiences are fresh.
Evaluation metrics have opposite results
- Ex. the conversion rate increases, but the retention rate decreases
- This reflects the trade-off between the company's long-term and short-term goals
Summary
- Define Null & Alternative Hypothesis (clarify research question)
- Metric Definition
- Sample Size Calculation (one of power analysis application)
- Randomization
- Post-Test Analysis
- Customer Targeting and strategies optimization (Please find more details in this article)
If you find this article helpful, please clap and follow to inspire me. I will publish related blogs on data science and statistical analysis regularly!
Thanks for reading, and feel free to leave comments and discuss!