Book Summary: “Trustworthy Online Controlled Experiments” [Part V.]

Weonhyeok Chung
4 min read · Oct 13, 2022

(Preview) This part covers advanced topics in analyzing experiments: multiple hypothesis testing, A/A tests to check the validity of the experimentation setup, and checking whether spillover exists between the treatment and control groups.

This series is my summary of the book on A/B testing, “Trustworthy Online Controlled Experiments” (by Ron Kohavi, Diane Tang, and Ya Xu).

Links to other parts of the series:

Part I. Introductory Topics for Everyone

Part II. Selected Topics for Everyone

Part III. Complementary and Alternative Techniques to Controlled Experiments

Part IV. Advanced Topics for Building an Experimentation Platform

Part V. Advanced Topics for Analyzing Experiments

Ch17. The Statistics behind Online Controlled Experiments

Summary: This chapter covers the statistics behind experiments: hypothesis testing and statistical power. To compare the treatment group and the control group, we need not only the mean of each group but also its variance. When the sample size is large enough, the CLT (Central Limit Theorem) applies: the sampling distribution of the sample mean converges to a normal distribution. However, whether this normal approximation holds depends on the sample size of each group. When we choose the sample size, we usually target a statistical power (1 - type II error rate) of 80% in practice. There are also several ways to correct for multiple hypothesis testing.
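As a rough illustration of how power enters the sample-size choice, here is a minimal Python sketch of the standard calculation for a two-sided, two-sample z-test. The function names and the example numbers (sigma = 1, delta = 0.1) are my own assumptions, not taken from the book, and the sketch assumes equal group sizes and equal variances.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Users needed per group to detect an absolute difference `delta`
    with a two-sided z-test (assumes equal sizes and equal variance)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = std_normal.inv_cdf(power)           # quantile for power = 1 - type II error
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


def two_sample_z_test(mean_t, var_t, n_t, mean_c, var_c, n_c):
    """Return (z statistic, two-sided p-value) for the difference in means."""
    standard_error = math.sqrt(var_t / n_t + var_c / n_c)
    z = (mean_t - mean_c) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs: metric standard deviation 1.0, minimum detectable difference 0.1.
print(sample_size_per_group(sigma=1.0, delta=0.1))
print(two_sample_z_test(1.05, 1.0, 2000, 1.00, 1.0, 2000))
```

With alpha = 0.05 and 80% power, the factor 2 * (z_alpha + z_beta)^2 is about 15.7, which is where the familiar rule of thumb of roughly 16 * sigma^2 / delta^2 users per group comes from.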

New or curious concept (or questions): Regarding multiple hypothesis testing, I didn’t quite understand the advantages and disadvantages of the different correction methods from the book. I want to know more about what assumptions each method makes and how to interpret its results.
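To make the trade-off a bit more concrete, here is a small sketch of two widely used corrections (picking these two is my own choice for illustration). Bonferroni controls the family-wise error rate under any dependence between tests but is conservative. Benjamini-Hochberg controls the false discovery rate, assuming independent or positively dependent tests, and usually rejects more hypotheses. The p-values below are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


# Made-up p-values for ten metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(p_values)), "rejections with Bonferroni")
print(sum(benjamini_hochberg(p_values)), "rejections with Benjamini-Hochberg")
```

Without any correction, five of these ten p-values would fall below 0.05, so both methods are stricter than testing each metric on its own.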

Ch18. Variance Estimation and Improved Sensitivity: Pitfalls and Solutions

Summary: The statistical concepts used in experiments (statistical significance, p-value, statistical power, confidence interval, etc.) all depend on variance. We need to understand how to compute the variance correctly and how to reduce it.

New or curious concept (or questions): I wonder whether the variance of the “relative difference” (or percent delta) is also used for hypothesis testing.
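One common way to obtain this variance is the delta method. The sketch below builds a confidence interval for the percent delta, assuming the two groups are independent and the control mean is well away from zero. The function name and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_lift_ci(mean_t, var_t, n_t, mean_c, var_c, n_c, confidence=0.95):
    """Delta-method confidence interval for the relative difference
    mean_t / mean_c - 1, assuming independent treatment and control groups."""
    lift = mean_t / mean_c - 1
    var_mean_t = var_t / n_t  # variance of the treatment sample mean
    var_mean_c = var_c / n_c  # variance of the control sample mean
    # First-order Taylor expansion of the ratio mean_t / mean_c.
    var_lift = var_mean_t / mean_c ** 2 + (mean_t ** 2 / mean_c ** 4) * var_mean_c
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(var_lift)
    return lift, (lift - half_width, lift + half_width)


# Hypothetical summary statistics for one metric.
lift, (low, high) = relative_lift_ci(1.10, 1.0, 10_000, 1.00, 1.0, 10_000)
print(f"lift = {lift:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

If the interval excludes zero, the relative difference is statistically significant at the chosen level, so the percent delta can be tested in the same way as the absolute difference.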

Ch19. The A/A Test

Summary: To enhance the trustworthiness of the experiment, we first need to conduct an A/A test. The chapter covers how to run an A/A test, and what to do when the test fails.
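A quick simulation can illustrate what a healthy A/A test looks like: when both groups come from the same distribution, about 5% of tests should show p < 0.05. This is only a sketch under my own assumptions (a standard normal metric, a z-test, 200 users per group), and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def aa_p_value(rng, n=200):
    """Draw two groups from the SAME distribution and return the z-test p-value."""
    group_a = [rng.gauss(0, 1) for _ in range(n)]
    group_b = [rng.gauss(0, 1) for _ in range(n)]
    mean_a, var_a = mean_and_variance(group_a)
    mean_b, var_b = mean_and_variance(group_b)
    z = (mean_a - mean_b) / sqrt(var_a / n + var_b / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(num_tests=1000, alpha=0.05, seed=0):
    """Share of simulated A/A tests that are (wrongly) significant at `alpha`."""
    rng = random.Random(seed)
    p_values = [aa_p_value(rng) for _ in range(num_tests)]
    return sum(p < alpha for p in p_values) / num_tests


print(false_positive_rate())  # should be close to 0.05 for a healthy setup
```

If the rate on real A/A tests is far from alpha, or the p-values are clearly not uniform, that points to a problem in the platform or in the variance computation.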

New or curious concept (or questions): The book says that even when we randomize at the user level, the analysis unit of a metric can differ from the randomization unit, and it mentions PLT (page-load time), which is measured per page, as an example. I didn’t quite understand this part.
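One way to read this, as a hedged sketch: a metric like PLT is recorded per page, but pages from the same user are correlated, so treating pages as independent observations can misstate the variance. A common remedy is to aggregate to per-user totals and apply the delta method to the ratio of the two averages. The function names and data below are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _covariance(xs, ys):
    """Unbiased sample covariance (the sample variance when xs is ys)."""
    mean_x, mean_y = _mean(xs), _mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ratio_metric_variance(user_totals, user_counts):
    """Delta-method variance of sum(user_totals) / sum(user_counts),
    treating users (the randomization unit) as the independent observations."""
    n = len(user_totals)
    mean_total, mean_count = _mean(user_totals), _mean(user_counts)
    ratio = mean_total / mean_count
    variance = (
        _covariance(user_totals, user_totals)
        - 2 * ratio * _covariance(user_totals, user_counts)
        + ratio ** 2 * _covariance(user_counts, user_counts)
    ) / (n * mean_count ** 2)
    return ratio, variance


# Hypothetical data: total load time (seconds) and page views for three users.
total_load_time = [2.0, 9.0, 1.5]
page_views = [2, 6, 1]
print(ratio_metric_variance(total_load_time, page_views))
```

When every user has exactly one page view, this reduces to the usual variance of a sample mean, which is a useful sanity check.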

Ch20. Triggering for Improved Sensitivity

About this chapter: This chapter is about how to deal with users who do not take up the treatment.

My view about this chapter: I think the method in this book is too complicated. I would rather use the instrumental variable method (a causal inference technique widely used in econometrics). I would analyze the intent-to-treat (ITT) effect and the treatment-on-the-treated (TOT) effect.
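A minimal sketch of what I mean, using the Wald (instrumental variable) estimator with random assignment as the instrument. It assumes that assignment affects the outcome only through take-up and that control users cannot get the feature (one-sided non-compliance), in which case the estimate can be read as the TOT effect. The function name and the data are hypothetical.

```python
def itt_and_tot(outcomes_t, took_up_t, outcomes_c, took_up_c):
    """Return (ITT, TOT) from per-user outcomes and 0/1 take-up indicators.

    ITT = difference in mean outcomes by assignment.
    TOT = ITT / difference in take-up rates (Wald estimator).
    """
    def mean(values):
        return sum(values) / len(values)

    itt = mean(outcomes_t) - mean(outcomes_c)
    take_up_difference = mean(took_up_t) - mean(took_up_c)
    if take_up_difference == 0:
        raise ValueError("Assignment did not change take-up; TOT is not identified.")
    return itt, itt / take_up_difference


# Hypothetical data: half of the treatment group actually used the feature.
outcomes_treatment = [12, 12, 10, 10]
took_up_treatment = [1, 1, 0, 0]
outcomes_control = [10, 10, 10, 10]
took_up_control = [0, 0, 0, 0]

itt, tot = itt_and_tot(outcomes_treatment, took_up_treatment,
                       outcomes_control, took_up_control)
print(itt, tot)  # ITT is diluted by users who never took up the treatment
```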

Ch21. Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics

Summary: Experiments can be unreliable when the treatment group and the control group differ systematically. When the observed sample ratio between the two groups deviates from the designed ratio, it signals that the experiment has a problem.
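A sample ratio check can be sketched as a chi-squared goodness-of-fit test on the user counts. The code below is a minimal version that uses only the standard library. The counts are illustrative, and the alert threshold of 0.001 is an assumption rather than a fixed rule.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """p-value of a chi-squared test (1 degree of freedom) that the observed
    split matches the designed split."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of chi-squared with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Illustrative counts for an experiment designed as a 50/50 split.
p = srm_p_value(821_588, 815_482)
print(p, "-> possible SRM" if p < 0.001 else "-> no evidence of SRM")
```

The ratio in this example is about 0.993, which looks harmless at first glance, yet with this many users the p-value is tiny. That is why the check should be a statistical test rather than an eyeball comparison.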

New or curious concept (or questions): What are other important guardrail metrics, and what should we do when they reveal a problem?

Ch22. Leakage and Interference between Variants

Summary: “Spillover” (also called leakage or interference) can happen: one unit’s action can influence another unit. For example, suppose that I am treated with the “people-you-may-know” feature and I send a friend request to another person. Even if that person is in the control group, both of us get the “friendship” outcome. This chapter covers cases where SUTVA (the stable unit treatment value assumption) is violated.

New or curious concept (or questions): I read this chapter three or four times on an airplane while preparing for my presentation, but I find that the book skips details. I believe the authors know the concepts well, but I could not understand them without reading the relevant papers they cite.

In my other post, I added my own detailed explanations and examples from the book.

Ch23. Measuring Long-Term Treatment Effects

Summary: Short-term results from experiments sometimes diverge from long-term outcomes. Users can learn the feature over time, the experience or its measurement can be delayed, or the ecosystem can change. To address these pitfalls, long-term evaluation methods such as cohort analysis, post-period analysis, and time-staggered experiments can be beneficial.

New or curious concept (or questions): What is attribution?

Three takeaways:

(1) Spillover between treatment and control groups can happen.

(2) We need to check whether the treatment group and control group are similar.

(3) Short-term effects do not guarantee long-term effects.

Please feel free to leave any comments or questions! Thank you for reading my post.
