Book Summary: “Trustworthy Online Controlled Experiments” [Part IV.]

Weonhyeok Chung
4 min read · Oct 13, 2022


(Preview) I enjoyed this part, as I learned a lot of domain knowledge about experimentation platforms. It connects the data engineering side to data analytics.

This series is my summary of the book on A/B testing, “Trustworthy Online Controlled Experiments” by Ron Kohavi, Diane Tang, and Ya Xu.

Link to other parts of the series:

Part I. Introductory Topics for Everyone

Part II. Selected Topics for Everyone

Part III. Complementary and Alternative Techniques to Controlled Experiments

Part V. Advanced Topics for Analyzing Experiments

Part IV. Advanced Topics for Building an Experimentation Platform



Ch12. Client-Side Experiments

Summary: Client-side experiments differ from server-side experiments. The app owner cannot always control the release: in a client-side experiment, it takes time for users to download the update that contains the experiment, and within a limited experiment timeline this delay matters. Sometimes natural experiments can help correct the resulting bias. Also, when a user uses multiple devices, such as mobile and web, cross-device interactions can be problematic.

New or curious concept (or questions): Suppose we experiment on a new recommendation algorithm. Should we call that a server-side experiment? Is it a server-side experiment if the app fetches its data from the server with a query?

Ch13. Instrumentation

Summary: Client-side instrumentation captures information such as users’ behavior, performance, errors, and crashes. The instrumentation can be affected by the loading speed of JavaScript, and user log data can also differ across browser, mobile, and server environments. Building products without instrumentation is as risky as flying an airplane without a fuel gauge. It is important to create a culture of correct instrumentation inside the firm, and evaluating log quality is necessary.
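To make the idea of instrumentation concrete, here is a minimal sketch of emitting one structured client event as a JSON line. This is my own illustration, not code from the book; the field names and the `log_event` helper are assumptions.

```python
import json
import time

def log_event(user_id: str, action: str, **props) -> str:
    """Emit one structured instrumentation event as a JSON line.

    A minimal sketch (field names are my own, not the book's): each
    record captures who did what, when, and in which environment, so
    log quality can be audited later.
    """
    event = {
        "ts": time.time(),   # client timestamp
        "user_id": user_id,
        "action": action,    # e.g. "click", "page_view", "error", "crash"
        **props,             # free-form context: browser, page, latency_ms, ...
    }
    return json.dumps(event)
```

One line per event keeps the logs easy to join with experiment assignment data later, which is where log-quality checks usually happen.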

New or curious concept (or questions): What would be a one-sentence definition of instrumentation? I think the book describes what it is but does not define it explicitly.

Ch14. Choosing a Randomization Unit

Summary: We need to choose the unit of randomization, and there should be no correlation (interference) between units. With page-level randomization for a recommendation system, it is difficult to satisfy that assumption, so it is common to randomize at the user level. For a product with social-network features, it is better to randomize over clusters of users. If we have users’ signed-in IDs, it is easy to define a user by that ID; otherwise, we sometimes define the user by device. An IP address is not recommended, since a user’s IP can change.
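User-level randomization is usually implemented by hashing a stable ID so that the same user always lands in the same variant. Here is a minimal sketch; the function name, bucket count, and experiment/variant names are my own assumptions, not from the book.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically assign a user to control or treatment.

    Hashing (experiment, user_id) gives every user a stable bucket,
    so the same user sees the same variant on every visit, and
    different experiments get independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # 10,000 fine-grained buckets
    return "treatment" if bucket < treatment_pct * 10_000 else "control"
```

Because the assignment is a pure function of the ID, no assignment table is needed, and the split stays consistent across devices as long as the same signed-in ID is used.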

New or curious concept (or questions): The book suggests using the bootstrap or the delta method when the unit of analysis (e.g., click-through rate per page) is more granular than the unit of randomization. I didn’t quite understand the logic in this chapter, but it comes up again in Chapters 18 and 19.
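As I understand it, the issue is that pages from the same user are correlated, so the naive per-page variance of CTR is wrong when users are the randomization unit. The delta method fixes this by treating per-user click and view totals as the independent units. A sketch of that calculation (my own illustration, not code from the book):

```python
import numpy as np

def delta_method_ctr_var(clicks: np.ndarray, views: np.ndarray) -> float:
    """Approximate Var of overall CTR = sum(clicks) / sum(views)
    when users, not pages, are the randomization unit.

    clicks[i], views[i] are per-user totals; the delta method expands
    the ratio of means around (mean clicks, mean views).
    """
    n = len(clicks)
    mu_x, mu_y = clicks.mean(), views.mean()
    var_x, var_y = clicks.var(ddof=1), views.var(ddof=1)
    cov_xy = np.cov(clicks, views)[0, 1]
    # Var(Xbar/Ybar) ~= (1/n) * (var_x/mu_y^2 - 2*mu_x*cov/mu_y^3 + mu_x^2*var_y/mu_y^4)
    return (var_x / mu_y**2
            - 2 * mu_x * cov_xy / mu_y**3
            + mu_x**2 * var_y / mu_y**4) / n
```

A sanity check on the formula: if every user's CTR is exactly the same constant, the ratio of means never varies, and the expression collapses to zero.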

Ch15. Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk

Summary: It is less risky to ramp an experiment up gradually than to roll it out to all users at once. However, when the ramp is slow, the firm forgoes revenue it could have captured with a faster rollout. It is beneficial to evaluate the long-term effect, but it can be ethically questionable to keep some users on the lower-quality product. Still, it is useful to hold out some users as a control group from the first experiment. If the results from an experiment look too good to be true, we can run the same experiment again. Finally, we need to clean up the experiment code once the experiment is over.

New or curious concept (or questions): I didn’t understand the concept called MPR (maximum power ramp).
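My reading, which may be imperfect, is that the MPR is the ramp stage run at the traffic allocation that maximizes statistical power, which for a single treatment is a 50/50 split: the variance of the difference in means is smallest when neither group is starved of samples. A quick sketch with assumed numbers:

```python
def diff_variance(p: float, sigma2: float = 1.0, n: int = 100_000) -> float:
    """Variance of (treatment mean - control mean) when a fraction p
    of n total users is in treatment and 1 - p in control.

    Illustrative only: sigma2 and n are assumed values, not from the book.
    """
    return sigma2 / (n * p) + sigma2 / (n * (1 - p))
```

Plugging in small allocations like 1% or 5% gives a much larger variance than 50%, which is presumably why the maximum power ramp sits at an even split before the final rollout.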

Ch16. Scaling Experiment Analyses

Summary: It is necessary to clean and process the data. The members of the firm need to agree on the important metrics. And a summary of the results needs to be conveyed to all members, including non-analysts.

New or curious concept (or questions): There are two sections about data computation in the book, but it wasn’t clear to me how they work. The first seems to cover something like descriptive statistics, and the second the metrics computed from experiment results.

Three takeaways:

(1) It is necessary to understand the quality of the data on users’ behavior and to choose the correct unit of randomization.

(2) The gradual expansion of an experiment can help analysts understand users better (both short-term and long-term), but it has downsides in forgone revenue and a degraded experience for some consumers.

(3) A summary of the results from experiments needs to be provided to all members.

Please feel free to leave any comments or questions! Thank you for reading my post.
