Book Summary: “Trustworthy Online Controlled Experiments” [Part I.]

Weonhyeok Chung
Oct 10, 2022

--

This is my summary of the A/B testing book “Trustworthy Online Controlled Experiments” by Ron Kohavi, Diane Tang, and Ya Xu.

Over the summer, I studied this well-known book with data analysts at PAP in Korea.

I find the book useful for anyone who wants a sense of how A/B testing works in practical business settings; it strikes a good balance between practice and theory. Here, I summarize the first four parts (out of five) of the book in my own words and note some questions I had after reading it.

I believe these summaries can help people interested in this book, or in A/B testing in general, get a rough sense of what the book covers.

Link to other parts of the series:

Part II. Selected Topics for Everyone

Part III. Complementary and Alternative Techniques to Controlled Experiments

Part IV. Advanced Topics for Building an Experimentation Platform

Part V. Advanced Topics for Analyzing Experiments

Part I. Introductory Topics for Everyone

Part I provides a nice overview of why and how to run experiments, common pitfalls to avoid while running them, and why and how to establish an experimentation platform and culture.


Ch 01. Introduction and Motivation

Summary: Experiments (A/B tests) support a firm’s decision-making with data. The analyst can define metrics relevant to the environment and evaluate whether a new policy improves them. In addition, the analyst can test a new MVP (minimum viable product) with low risk.

New or curious concept (or questions): What is CTR (click-through rate)?
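To answer my own question in brief: CTR is simply the fraction of impressions that receive a click. A minimal sketch (the counts below are made up for illustration):

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR = clicks / impressions (a fraction; multiply by 100 for a percentage)."""
    if impressions == 0:
        return 0.0  # avoid division by zero when an item was never shown
    return clicks / impressions

# hypothetical numbers: 37 clicks on 1,000 ad impressions
ctr = click_through_rate(37, 1000)  # 0.037, i.e. 3.7%
```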

Ch 02. Running and Analyzing Experiments

Summary: After choosing the important metric, the analyst performs statistical hypothesis testing. In general, the analyst selects a sample size that yields 80 to 90% statistical power. When interpreting the results of the experiment, practical significance matters as well as statistical significance.
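As a rough illustration of how sample size follows from the desired power, here is a standard two-sample normal approximation for a conversion-rate metric (this is my own sketch, not the book’s exact procedure, and the baseline numbers are made up):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect an absolute change of mde_abs
    in a conversion rate p, using a two-sided z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p * (1 - p)                         # Bernoulli variance
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * variance / mde_abs ** 2)

# hypothetical: 5% baseline conversion, detect a 0.5-point absolute lift
n = sample_size_per_variant(0.05, 0.005)  # roughly 30,000 users per variant
```

Note how quickly the required sample shrinks as the detectable effect grows: halving the minimum detectable effect roughly quadruples the users needed.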

New or curious concept (or questions): cache hit ratio, guardrail metric

Ch 03. Twyman’s Law and Experimentation Trustworthiness

Summary: A result that looks too good to be true should signal to the analyst that the results may contain an error. First, there is a risk of misinterpreting the estimate: the analyst may read the results incorrectly, or report only the statistically significant results out of numerous tests (cherry-picking A/B tests). Internal validity can be threatened when SUTVA (the Stable Unit Treatment Value Assumption) is violated, when treatment-group users are redirected to a different server or URL, or when the experiment itself changes back-end capacity. Also, when certain groups are oversampled, it is hard to generalize the results (external validity fails). In addition, the analyst needs to consider heterogeneity of the treatment effect across user characteristics.

New or curious concept (or questions): Sample Ratio Mismatch (SRM)

Ch 04. Experimentation Platform and Culture

Summary: As the firm grows, the scale of its experimentation grows with it. In an early stage, a firm may run fewer than ten experiments, but it can run many more as it matures. The firm can build its own experimentation platform or buy third-party tools. There are techniques for handling the problems that arise when multiple experiments overlap. In addition, the firm needs to surface visualized experiment results to its members.

New or curious concept (or questions): It was unclear to me how to resolve the issues that arise when multiple experiments overlap.

Regarding the questions and new concepts noted above, I will probably write about them in the near future.

Three takeaways (from Part I.):

(1) When people have conflicting views about a new feature, we can run an experiment to test which hypothesis is true (experiments provide evidence of the effect of new product features!).

(2) Data analysts need to know the statistical concepts (e.g., SUTVA, sequential testing, and multiple hypothesis testing) required to validate the results of experiments.

(3) Data analysts need to check whether the relevant assumptions hold in their experiments (e.g., if control-group users are affected by treatment-group users, the estimates are biased).

One additional comment: I think the best way to learn from this book is to discuss the practical issues you have faced in your own work with others. Since I came from the more theoretical side, I learned a lot from analysts with hands-on experience inside a firm. Conversely, if you work on the practical side, it is worth discussing your problems with people who are strong in statistical theory.


To anyone reading my posts, please feel free to leave any comments, questions, or suggestions.
