Experimental Study in Tech: A/B Testing Structure

Experiment Design & Metric Definition & Post-Test Analysis & A/B Testing Pitfalls and Strategies

TING DS
5 min read · Jan 31, 2024

Statisticians and data scientists often need to answer a question: does X cause Y?

There are two types of methods to answer this question:

  • Observational Study: see this article for more details
  • Experimental Study (also called A/B Testing in the tech industry)

Experimental Study

By conducting experiments on a small group of users (an artificial intervention), we estimate the causal effect of intervention X on business metric Y, to guide future company strategies.

  • In the tech industry, X can be a new web/app feature, a new product, or a new promotion strategy, etc.; Y can be conversion, user engagement, etc.
  • In the clinical industry, X can be a new drug, a new dose, or a new treatment strategy, etc.; Y can be side effects, event rates, health condition, etc.
  • For a successful experimental study, in addition to mastering the statistical foundations of experimental research, it is also necessary to have domain knowledge specific to the industry and project, such as an understanding of the variables and the characteristics of user behavior, because these are key factors in experimental design and post-test analysis.

A/B Testing Structure

  • Define Null & Alternative Hypothesis (clarify research question)
  • Metric Definition
  • Sample Size Calculation (one of power analysis application)
  • Randomization
  • Post-Test Analysis
  • Customer Targeting and Strategy Optimization (please find more details in this article)
[Figure: Example of a user funnel in A/B testing]

1. Define hypothesis (clarify research question)

  • Target Population: all users? Users from a specific subgroup?
  • Unit of Diversion: user ID? Cookie? This defines what counts as a subject or unit in the experiment.
  • Treatment: the intervention of interest in the experiment (X)

Usually only one intervention is defined.

But if there is more than one intervention group: ANOVA + multiple pairwise t-tests.

Or if a ratio evaluation metric has more than 2 categories: Chi-square test.

The intervention must be clearly defined; it cannot be a vague concept.

  • Evaluation Metric: outcomes of interest in the experiment (Y)

2. Metric Definition

Usually we need to define two types of metrics.

Evaluation Metric

  • Counts: DAU (# of Daily Active Users), WAU, MAU

It’s important to define specifically which user behavior counts as “Active”.

Login time exceeding 5 minutes? Clicking on a certain interface (e.g., Instagram Stories)?

  • Distribution Metric: average session time on a site, average number of clicks before purchase/conversion
  • Ratio / Probability: conversion rate (for a specific user action: purchase, clicks, upgrade, etc.), retention rate, user stickiness (DAU/MAU)
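As a quick illustration, here is a minimal pandas sketch of how some of these metrics could be computed from a hypothetical event log (the `events` table and its columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per user action
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01",
                            "2024-01-01", "2024-01-15", "2024-01-31"]),
    "action": ["login", "purchase", "login", "login", "login", "purchase"],
})

# Counts: DAU for one day, MAU for one month
dau = events.loc[events["date"] == "2024-01-01", "user_id"].nunique()
mau = events.loc[events["date"].dt.month == 1, "user_id"].nunique()

# Ratios: user stickiness (DAU/MAU) and conversion rate (purchasers / all users)
stickiness = dau / mau
conversion = (events.loc[events["action"] == "purchase", "user_id"].nunique()
              / events["user_id"].nunique())
print(dau, mau, stickiness, round(conversion, 2))
```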

Invariant Metric

  • Metrics that should not show significant changes during the experiment, because their invariance preserves the randomization quality of the experiment.
  • The selection of invariant metrics is related to the selection of evaluation metrics; the overall principle is to ensure the randomness of the experiment.

Ex. Pre-conversion behavior, Device type, User count/Visits, Geographic location

3. Sample Size Calculation

Sample size calculation formula (per group, normal approximation):

n = 2 (z_{1−α/2} + z_{1−β})² σ² / (μ_c − μ_t)²
  • [2]: two groups, treatment and control
  • [μ_c − μ_t]: effect size

The minimum practically significant difference between the treatment and control groups that the company expects to detect.

The smaller the effect size, the larger the required sample size, and the longer the experimental period. Conversely, when the sample size is large enough, even a tiny effect can be statistically significant, yet such a tiny effect has no meaningful impact on business decisions.

  • [σ]: estimated standard deviation of the evaluation metric (outcome), based on expert domain knowledge OR historical data
  • [α]: significance level (acceptable type I error: 0.1, 0.05, 0.01); a lower α requires a larger sample size
  • [β]: type II error, where Power = 1 − β (e.g., Power = 0.8 means β = 0.2); a lower β (higher power) requires a larger sample size
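A minimal Python sketch of this calculation, using the normal-approximation formula above (the effect size, SD, and defaults are made-up illustrations):

```python
from scipy.stats import norm

def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sd^2 / effect^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the type I error
    z_beta = norm.ppf(power)           # critical value for power (1 - beta)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / effect ** 2
    return int(n) + 1                  # round up to stay conservative

# Example: detect a 0.5-unit lift when the metric's SD is 2.0
print(sample_size_per_group(effect=0.5, sd=2.0))  # ~252 per group
```

As a cross-check, statsmodels’ `tt_ind_solve_power(effect_size=0.25, alpha=0.05, power=0.8)` (where effect_size is the standardized effect, 0.5/2.0) returns nearly the same number using the t distribution.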

4. Randomization

Randomization has two implications:

  • The sample in the experiment can represent the target population.
  • All covariates are uniformly distributed across the two groups.
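In practice, random assignment is often implemented by deterministically hashing the unit of diversion, so the same user always lands in the same group across sessions. A minimal sketch (the experiment name and 50/50 split are assumptions for illustration):

```python
import hashlib

def assign_group(user_id: str, experiment: str = "new_checkout_v1") -> str:
    """Deterministically bucket a user: the same input always maps to the same group."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in 0-99
    return "treatment" if bucket < 50 else "control"  # 50/50 split

print(assign_group("user_42"))  # stable across calls and machines
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.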

5. Post-test Analysis

Sanity Check

  • The invariant metrics should not show significant differences between the two groups.
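For example, a chi-square test can check that an invariant metric such as device type is balanced between the two groups; a significant result would signal broken randomization. A sketch with made-up counts:

```python
from scipy.stats import chi2_contingency

# Made-up device-type counts per group (an invariant metric)
#          mobile  desktop  tablet
counts = [[5120, 3010, 870],   # control
          [5095, 3060, 845]]   # treatment

chi2, p, dof, expected = chi2_contingency(counts)
print(f"p = {p:.3f}")  # a small p-value (< 0.05) would fail the sanity check
```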

Parametric Test

Use a parametric test when the evaluation metric follows a normal distribution.

A binary evaluation metric also approximately follows a normal distribution when the sample is large.

  • Student’s T-test: the SD of the evaluation metric is similar between the 2 groups
  • Welch’s T-test: the SD of the evaluation metric differs between the 2 groups
  • ANOVA: when there is more than one treatment group

When there is more than one treatment group:

1. ANOVA: check whether at least one group is significantly different from the other groups

2. Multiple pairwise T-tests: check which two groups are significantly different (with p-value adjustment by Bonferroni Correction OR False Discovery Rate)
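A sketch of these tests with scipy and statsmodels on made-up samples:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 500)
treat_a = rng.normal(10.3, 2.0, 500)
treat_b = rng.normal(10.1, 3.0, 500)  # different SD -> Welch's t-test

# Student's t-test (similar SDs) vs. Welch's t-test (equal_var=False)
print(stats.ttest_ind(control, treat_a).pvalue)
print(stats.ttest_ind(control, treat_b, equal_var=False).pvalue)

# More than one treatment group: ANOVA first, then pairwise t-tests
print(stats.f_oneway(control, treat_a, treat_b).pvalue)
raw_p = [stats.ttest_ind(a, b).pvalue
         for a, b in [(control, treat_a), (control, treat_b), (treat_a, treat_b)]]
reject, adj_p, _, _ = multipletests(raw_p, method="bonferroni")  # or "fdr_bh"
print(adj_p)  # Bonferroni-adjusted p-values
```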

Non-Parametric Test

Use a non-parametric test when the evaluation metric doesn’t follow a normal distribution, the SDs differ, or N is small.

  • Wilcoxon Rank-Sum Test (Mann-Whitney U test): evaluates ranks instead of the actual values of the evaluation metric

If the evaluation metric is a categorical variable (≥ 2 categories):

  • Chi-square test: when n > 1000
  • Fisher’s exact test: when n < 1000

Although non-parametric tests have less power and are less precise, they are more robust (small sample size, skewed distribution, outliers, unequal SDs). The same experimental result can often be evaluated with both a parametric and a non-parametric test.
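A sketch of the non-parametric counterparts (all samples and counts are made up):

```python
from scipy import stats

# Skewed metric (e.g., session time): compare ranks instead of means
control = [1.2, 0.5, 3.4, 0.8, 2.1, 9.7, 0.3]
treatment = [1.9, 2.8, 0.9, 4.1, 7.5, 1.1, 3.2]
print(stats.mannwhitneyu(control, treatment).pvalue)  # Wilcoxon rank-sum

# Binary outcome: 2x2 table of (converted, not converted) per group
table = [[120, 880],   # control
         [150, 850]]   # treatment
print(stats.chi2_contingency(table)[1])  # chi-square, for large n
print(stats.fisher_exact(table)[1])      # Fisher's exact test, for small n
```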

A/B testing pitfalls & solutions

Stopping an A/B test too early

  • Generally, experiment duration = sample size / average daily traffic
  • But if the calculated duration is less than one week, or is not a multiple of 7 days, we should extend it to 2 weeks (2 business cycles) to account for the seasonality effects of weekends and holidays.
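A quick sketch of this rule (the traffic numbers are made up):

```python
import math

def experiment_duration_days(sample_size, daily_traffic, min_days=14):
    """Raw duration rounded up to full weekly cycles, with a two-week minimum."""
    raw_days = math.ceil(sample_size / daily_traffic)
    weeks = math.ceil(max(raw_days, min_days) / 7)  # whole business cycles
    return weeks * 7

print(experiment_duration_days(sample_size=50_000, daily_traffic=6_000))  # 14
```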

Network effect

  • On social media platforms, the treatment and control groups belong to the same social circle. The behavior of the control group is influenced by exposure to the treatment group, so the treatment effect spills over into the control group, leading to an underestimate of the difference in effects (a false negative). A common mitigation is to randomize at the cluster level (e.g., by social circle or geography) rather than at the individual level.

Novelty effect / Primacy effect

  • When a new product has just launched, the audience’s reaction may not be genuine; it could be driven by curiosity or bias.
  • Extend the experiment duration and wait for the novelty effect to fade.
  • OR restrict the sample to new users, because their perspectives and experiences are fresh.

Evaluation metrics have opposite results

  • Ex. conversion rate increases, but retention rate decreases
  • This reflects the trade-off between the company’s long-term and short-term goals.

Summary

  • Define Null & Alternative Hypothesis (clarify research question)
  • Metric Definition
  • Sample Size Calculation (one of power analysis application)
  • Randomization
  • Post-Test Analysis
  • Customer Targeting and Strategy Optimization (please find more details in this article)

If you find this article helpful, please clap and follow to inspire me. I publish blogs on data science and statistical analysis regularly!

Thanks for reading, and feel free to leave comments and discuss!
