A/B Testing

That’s the good thing about being halved.

Overview

  1. A/B testing is useful for climbing the mountain you are already on (optimizing the current product), but not so useful if you want to compare two different mountains (choosing between fundamentally different products).
  2. A/B testing cannot handle brand-new experiences, because there is no baseline and users need a long adaptation time. A/B testing also cannot tell you if you are missing anything in your product.
  3. Start with an initial hypothesis. Refine it by choosing an appropriate quantitative metric; consider total counts, rates, and probabilities, for example. Also beware of the time needed to collect the data, we can’t wait too long!
  4. Learn some statistics: distribution, confidence interval, null/alternative hypothesis, significance, reject/fail to reject null hypothesis, pooled standard error, etc.
  5. Beware the impact of sample size: a small sample size means a high false-negative rate (high beta, low power, less sensitive), while a larger sample size lowers beta and makes the test more sensitive; alpha, the false-positive rate, is set by the significance level you choose rather than by the sample size.
  6. Analyze your data (first step): estimate the pooled standard error (from the pooled probability, for example), estimate the observed difference between the experiment and control groups, calculate the margin of error (from the standard error and the z-score), then form the confidence interval as observed difference ± margin of error.
  7. Analyze your data (second step): look at your confidence interval. If the lower bound is larger than the minimum detectable effect (or minimum difference), reject the null hypothesis and smile; if the upper bound is smaller than the minimum detectable effect, the change is not worth launching, which is still a clear answer, so smile; otherwise, you probably need to collect more data or run additional tests, sigh. (A numeric sketch of steps 6 and 7 follows this list.)
  8. Life is not certain. If you cannot make a decision based on data alone, you need to consider more: business risks, return on investment, user feedback, etc.
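
A minimal numeric sketch of steps 6 and 7 above, in plain Python; the click and pageview counts, the 2% practical significance boundary, and the 95% confidence level are all made-up assumptions for illustration:

```python
import math

# Hypothetical counts: clicks and pageviews in each group.
x_cont, n_cont = 974, 10072    # control group
x_exp,  n_exp  = 1242, 9886    # experiment group
d_min = 0.02                   # minimum practically significant difference
z = 1.96                       # z-score for a 95% confidence level

# Step 6: pooled probability, pooled standard error, observed difference, CI.
p_pool  = (x_cont + x_exp) / (n_cont + n_exp)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
d_hat   = x_exp / n_exp - x_cont / n_cont
margin  = z * se_pool
lower, upper = d_hat - margin, d_hat + margin

# Step 7: compare the confidence interval against d_min.
if lower > d_min:
    print(f"CI [{lower:.4f}, {upper:.4f}]: reject the null and launch, smile")
elif upper < d_min:
    print(f"CI [{lower:.4f}, {upper:.4f}]: not practically significant, a clear answer too")
else:
    print(f"CI [{lower:.4f}, {upper:.4f}]: inconclusive, collect more data, sigh")
```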

Policy and Ethics for Experiments

  1. Four principles of reviewing an experiment: risk, benefit, choice and privacy.
  2. Risk: whether the risk to participants exceeds minimal risk.
  3. Benefit: what the benefits would be after completing the experiment.
  4. Choice: what alternative services users might have.
  5. Privacy: what the expectation of privacy and confidentiality is, how sensitive the collected data is, and what the re-identification risks are for users.
  6. Ask yourself: Are users being informed? What user identifiers are tied to the data? What type of data is being collected? What is the level of confidentiality and security?
  7. An internal training process for policy and ethics is necessary for any company that uses user data.

Choosing & Characterizing Metrics

  1. Define metrics: Sanity/invariant-checking metrics vs. evaluation/business metrics. Single metric vs. multiple metrics vs. composite metric (e.g. OEC). Identify the applicability scope of a metric.
  2. Start with a basic funnel model. Expand the funnel into a detailed, quantifiable metric funnel. Then consider horizontal effects, such as different platforms. Finally, each transition between stages should also be a metric; consider probabilities vs. rates.
  3. Some metrics are “bad” because they take too long to measure or are hard to track. Other techniques for coming up with metrics include brainstorming, validating metrics, external data, proxies, and retrospective analysis.
  4. User experience research (UER), focus groups and surveys help gather additional data. Beware of the trade-off between depth and coverage.
  5. Beware that technical details and bugs can affect the data that you collect. Use an issue-vs-metric matrix to analyze the effects.
  6. De-bias your data by filtering out spam and fraud and by segmenting data by region or demographics.
  7. Summary metrics: (1) sums and counts (2) means, medians and percentiles (3) probabilities and rates (4) ratios
  8. Beware the sensitivity and robustness of metrics. For instance, means are sensitive, medians are robust, and percentiles are in between. Evaluate sensitivity and robustness with previous experiments and retrospective analysis (e.g. looking at logs).
  9. Variability: the confidence interval depends on the standard deviation and on the distribution, so you need to consider the effect of different distributions. Estimate empirical variability with “A/A tests” (control vs. control), either by running many separate tests or by using the “bootstrap method” (resampling random subsets).
  10. Empirical confidence interval: sort the observed A/A differences, choose a confidence level (say 95%), and take the boundary elements as the interval bounds (say the 2nd smallest and 2nd largest values out of 40 tests). A small sketch follows this list.
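
A minimal sketch of items 9 and 10: it simulates 40 A/A tests with made-up, normally distributed data (the distribution and all numbers are assumptions for illustration) and reads off an empirical 95% confidence interval for the null difference:

```python
import random

random.seed(42)
diffs = []
for _ in range(40):  # 40 simulated A/A tests (control vs. control)
    # Both groups get the same experience, so any observed difference is pure noise.
    group_a = [random.gauss(0.10, 0.02) for _ in range(500)]
    group_b = [random.gauss(0.10, 0.02) for _ in range(500)]
    diffs.append(sum(group_b) / len(group_b) - sum(group_a) / len(group_a))

# Empirical 95% CI: sort the differences and take the 2nd smallest and
# 2nd largest values (40 * 2.5% = 1 value cut from each tail).
diffs.sort()
lower, upper = diffs[1], diffs[-2]
print(f"Empirical 95% CI for a null difference: [{lower:.5f}, {upper:.5f}]")
# A real experiment whose observed difference falls outside this interval is
# showing more than the noise you would expect by chance.
```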

Designing An Experiment

  1. Choose a “subject”: the unit of diversion. Commonly used units are user ID, anonymous ID (cookie), event, device ID, IP address, etc.
  2. Considerations: the consistency of diversion (different types of experiments need different levels of diversion unit) and the ethical implications of the chosen unit.
  3. The unit of analysis might be different from the unit of diversion. Beware that such a mismatch can cause higher empirical variance due to correlation within a diversion unit.
  4. Choose a population: inter-user experiment vs. intra-user experiment; what is your target population.
  5. Use a cohort instead of the whole population when you are looking for learning effects, examining user retention, trying to increase user activity, or doing anything that requires users to be established.
  6. Sizing matters. You certainly can adjust the statistical parameters (alpha, beta, the practical significance level) to affect sizing, but you can also decrease the required size by redesigning your experiment; see the sketch after this list.
  7. Duration vs. exposure: limit the duration of your experiment, and control the fraction of exposed users for safety reasons.
  8. Learning effects: users may adapt to a change through time.
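
A minimal sizing sketch for item 6, using the common normal-approximation formula for comparing two proportions; the baseline rate and the lift to detect are made-up examples, not from the course:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_baseline, d_min, alpha=0.05, power=0.80):
    """Approximate units of diversion needed per group to detect an absolute
    change of d_min in a baseline rate p_baseline (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for power = 0.80
    p_alt = p_baseline + d_min
    n = ((z_alpha * sqrt(2 * p_baseline * (1 - p_baseline))
          + z_beta * sqrt(p_baseline * (1 - p_baseline) + p_alt * (1 - p_alt))) ** 2
         / d_min ** 2)
    return ceil(n)

# Hypothetical: 10% baseline click-through rate, detect a 2% absolute lift.
print(sample_size_per_group(p_baseline=0.10, d_min=0.02))
# Tightening alpha, raising power, or shrinking d_min all push the required size up.
```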

Analyzing Results

  1. Sanity checks: choose a set of invariants, then check whether each invariant is within its margin of error (you may need an a priori probability to compute the standard error); if a sanity check fails, figure out why. A small sketch of items 1 and 2 follows this list.
  2. Single metric: determine statistical significance with a confidence interval, and cross-check it with a sign test.
  3. Simpson’s paradox: the results within sub-groups can differ from, or even reverse, the aggregate result, because sub-group composition drives the statistics.
  4. Multiple metrics: calculate the overall false-positive rate based on the individual confidence levels, or, more conservatively, use the Bonferroni correction.
  5. Other strategies: control the family-wise error rate (FWER) or the false discovery rate (FDR).
  6. Ask yourself a few questions before drawing a conclusion: Do I have statistically and practically significant results to justify the change? Do I understand how the change affects the user experience? Is it worth it for the business?
  7. Remember that in practice you may need to make a judgment call. It is ultimately a business decision rather than a pure statistics problem.
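
A minimal sketch of items 1 and 2, with made-up numbers; the invariant totals, the daily differences, and the 50/50 expected split are all assumptions for illustration:

```python
from math import comb, sqrt

# Sanity check: is the invariant metric (e.g. cookies) split 50/50 as expected?
cont_total, exp_total = 345543, 344660          # hypothetical totals per group
n_total = cont_total + exp_total
se = sqrt(0.5 * 0.5 / n_total)                  # standard error under a priori p = 0.5
lower, upper = 0.5 - 1.96 * se, 0.5 + 1.96 * se
observed = cont_total / n_total
print(f"invariant {observed:.4f} within [{lower:.4f}, {upper:.4f}]?",
      lower <= observed <= upper)

# Sign test: on how many days did the experiment beat the control?
daily_diffs = [0.003, 0.011, -0.002, 0.008, 0.004, 0.009, 0.006,
               0.010, -0.001, 0.007, 0.005, 0.012, 0.002, 0.008]  # hypothetical
n_days = len(daily_diffs)
positives = sum(d > 0 for d in daily_diffs)
# Two-sided p-value under the null that a "positive day" is a fair coin flip.
k = min(positives, n_days - positives)
p_value = min(1.0, 2 * sum(comb(n_days, i) for i in range(k + 1)) * 0.5 ** n_days)
print(f"sign test: {positives}/{n_days} positive days, p = {p_value:.4f}")
# With several evaluation metrics, compare each p-value against
# alpha / (number of metrics) if you apply the Bonferroni correction from item 4.
```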

This note is based on the original content of Google’s “A/B Testing” course on Udacity. Highly recommend this course if you want to learn more about A/B testing.