Introduction
Around 2009, Google tested 41 shades of blue for a toolbar on its pages. In this post, we will review the background of Google's 41-shades-of-blue experiment. We will also build a simplified end-to-end example that captures the experiment's technical details in a multivariate test setting.
Background
To design a toolbar, a designer picked a shade of blue that everyone on his team liked. A product manager then ran a test comparing this blue against a different, greener blue and found that users were more likely to click on the toolbar with the greener blue. Marissa Mayer asked her team to test 41 shades of blue spanning the range between the two, the idea being to understand whether different blues produce different click-through rates.
This raises a philosophical question: who should have more power in shaping products, data science or design, assuming both act ethically? It is a super interesting topic but is outside the scope of this post. Here we focus on the data science only.
A simplified end-to-end example
In this example, we will test 3 shades of blue and assume there are 100 users in total. Since the test involves 3 variants (3 shades of blue), it is no longer a simple A/B test but a multivariate test.
Metric selection
As per Alphabet's 2022 Q2 financial results, ~58% of Alphabet's revenue came from the "Google Search & other" segment. Since clicks directly impact the revenue stream of search and ads, we will use the click-through rate (CTR) as the success metric.
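As a quick refresher, the metric is defined as:

$$\text{CTR} = \frac{\text{number of clicks}}{\text{number of impressions}}$$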
Test design
First, we will design the test using univariate analysis of variance (univariate ANOVA). Our goal is to investigate the effect of color on the click-through rate. The univariate ANOVA model can be decomposed as: click-through rate = mean + color treatment effect + residual.
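In the standard one-way ANOVA notation (the symbols below follow the textbook convention; they are not from the original experiment):

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, 3$$

where $y_{ij}$ is the CTR of the $j$-th user who sees color $i$, $\mu$ is the overall mean, $\tau_i$ is the treatment effect of color $i$, and $\varepsilon_{ij}$ is the residual.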
State the null hypothesis and the alternative hypothesis
- Null hypothesis: CTR for blue #1 = CTR for blue #2 = CTR for blue #3
- Alternative hypothesis: the CTRs for blue #1, blue #2, and blue #3 are not all the same (stated formally below)
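In terms of the treatment effects $\tau_i$ from the model above, the hypotheses read:

$$H_0: \tau_1 = \tau_2 = \tau_3 = 0 \qquad \text{vs.} \qquad H_1: \tau_i \neq 0 \text{ for at least one } i$$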
Choose the target population
- Since we don't particularly care about a specific segment, e.g., Canada or new users, we can set the target population to the entire population, that is, all 100 users.
Define the unit of randomization
- Since there is no network effect, it is appropriate to use user-based randomization to divert individual users to pages with different colors. A minimal sketch of one common implementation follows below.
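One common way to implement user-based randomization is to hash the user id, so the same user always lands in the same variant across visits. The test name and variant labels below are made up for illustration:

```python
import hashlib

VARIANTS = ["blue_1", "blue_2", "blue_3"]  # hypothetical variant labels

def assign_variant(user_id: str, test_name: str = "blue_shade_test") -> str:
    """Deterministically map a user to a variant by hashing the user id."""
    digest = hashlib.md5(f"{test_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(VARIANTS)
    return VARIANTS[bucket]

print(assign_variant("user_42"))  # the same user always gets the same variant
```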
Calculate the minimum sample size
- We can use the bootstrap method to calculate the minimum sample size for each variant. The reason we use the bootstrap rather than the famous sample size formula is that we cannot assume the underlying distribution is normal, nor can we rely on the central limit theorem here.
- We will not dive deep into the bootstrap details here; if you are interested in learning more, please let me know. Let us assume the bootstrap method gives a minimum sample size of 2 per variant (not a realistic number, but it keeps the illustration simple). Therefore we randomly sample 8 users (3 + 2 + 3, so each variant gets at least the minimum) from the target population and divert each of them to one of the three pages. A minimal sketch of a resampling-based sizing approach follows below.
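For the curious, here is a minimal, unoptimized sketch of one resampling-based way to size an experiment: simulate experiments of a candidate size from pilot click data and pick the smallest size whose estimated power reaches the target. The function names, defaults, and uplift mechanism are my own illustration, not necessarily the exact method in the referenced paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def power_at_n(pilot, n, mde, alpha=0.05, n_sim=200, n_boot=200):
    """Estimate power at per-variant size n by resampling pilot 0/1 click data,
    with no normality or central-limit assumption."""
    pilot = np.asarray(pilot, dtype=float)
    rejections = 0
    for _ in range(n_sim):
        control = rng.choice(pilot, size=n, replace=True)
        # Treatment: pilot behavior with roughly an `mde` absolute uplift in CTR.
        treated = (rng.choice(pilot, size=n, replace=True)
                   + (rng.random(n) < mde)).clip(max=1.0)
        observed = treated.mean() - control.mean()
        # Null distribution of the difference: resample from the pooled data.
        pooled = np.concatenate([control, treated])
        null_diffs = np.array([
            rng.choice(pooled, size=n, replace=True).mean()
            - rng.choice(pooled, size=n, replace=True).mean()
            for _ in range(n_boot)
        ])
        p_value = np.mean(np.abs(null_diffs) >= abs(observed))
        rejections += p_value < alpha
    return rejections / n_sim

def min_sample_size(pilot, mde, target_power=0.8, sizes=range(50, 2001, 50)):
    """Smallest per-variant n whose estimated power reaches the target."""
    for n in sizes:
        if power_at_n(pilot, n, mde) >= target_power:
            return n
    return None

# Example: pilot data with a ~10% CTR, sizing for a 5-point absolute uplift.
pilot_clicks = rng.binomial(1, 0.10, size=1000)
print(min_sample_size(pilot_clicks, mde=0.05))
```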
Calculate the experiment duration
- Let's assume the daily traffic is 50 users, the minimum sample size is 2 per variant, and the number of variants is 3. We can plug these numbers into this formula: duration = (minimum sample size per variant × # variants) / daily traffic = (2 × 3) / 50 = 0.12 days, so this toy experiment could fill up within a single day (see the calculation below). Please note that in practice we usually run a test for at least a week to capture the day-of-week effect.
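The same arithmetic in a couple of lines (the variable names are just for illustration):

```python
import math

daily_traffic = 50       # users per day
min_per_variant = 2      # from the bootstrap step above
n_variants = 3

duration_days = (min_per_variant * n_variants) / daily_traffic
print(duration_days)                     # 0.12 days with these toy numbers
print(max(math.ceil(duration_days), 7))  # but run at least a week in practice
```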
Engineering design
Second, we will design the engineering side of the test so that each user sees their assigned blue and the resulting user data is logged into the data warehouse.
- Test setup in UI or via API
Configure the test in a UI (e.g., an experimentation platform) or via an API. Store the test definition, setup, and management metadata in the system configuration; a sketch appears after this list.
- Test deployment
Implement the variant assignment (which variant a user is assigned to), the variant production code (one page with blue #1, another with blue #2, and a third with blue #3), and the parameterization on both the client side and the server side.
- Test instrumentation
Log user behavioral data, variant assignments, and system performance in the warehouse for test analyses and other long-term monitoring.
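Putting the three pieces together, here is a minimal sketch of what the configuration, assignment, and instrumentation could look like. All field names, the test name, and the logging stand-in are hypothetical:

```python
import hashlib
import json
import time

# Hypothetical test definition, as an experimentation platform might store it.
TEST_CONFIG = {
    "test_name": "blue_shade_test",
    "variants": ["blue_1", "blue_2", "blue_3"],
    "traffic_allocation": [1 / 3, 1 / 3, 1 / 3],
    "min_sample_size_per_variant": 2,
}

def assign_variant(user_id: str) -> str:
    """Deterministic hash-based assignment (same scheme as the earlier sketch)."""
    key = f"{TEST_CONFIG['test_name']}:{user_id}"
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return TEST_CONFIG["variants"][bucket % len(TEST_CONFIG["variants"])]

def log_event(event_type: str, user_id: str, variant: str) -> None:
    """Emit an event; in production this would flow into the data warehouse."""
    event = {
        "event_type": event_type,  # e.g., "exposure" or "toolbar_click"
        "test_name": TEST_CONFIG["test_name"],
        "user_id": user_id,
        "variant": variant,
        "timestamp": time.time(),
    }
    print(json.dumps(event))  # stand-in for the real logging pipeline

# Server side: assign the user, render the right blue, and log the exposure.
user = "user_42"
variant = assign_variant(user)
log_event("exposure", user, variant)
```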
Test analysis
After the data is logged, we query the data warehouse and summarize the test results in the table below.
With this table, we can apply the univariate ANOVA and arrive at the following sums of squares.
With the sum of squares calculated, we can summarize them into an ANOVA table.
With the above ANOVA table, we can compare the variation due to the color treatment with the residual variation by forming the F statistic, i.e., the ratio of the treatment mean square to the residual mean square. If the two mean squares are approximately equal (F close to 1), the data is consistent with the CTRs being the same for all three blues; a large F suggests otherwise. Below are the technical details.
Let's set the statistical significance level at 0.01. Using an online calculator, we get p-value = P(F₂,₅ ≥ 19.5) = 0.00435 < 0.01. Therefore, we reject the null hypothesis and conclude that the CTRs are not all the same for the three blues. So the shade of blue does make a difference in this example 💙 What do you think the result of Google's test was?
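We can also reproduce the p-value without an online calculator. The survival function of the F distribution with (2, 5) degrees of freedom returns the same number, and scipy's stats.f_oneway runs the whole test directly from per-user data. The per-user CTR values below are hypothetical stand-ins, not the actual observations from the table:

```python
from scipy import stats

# p-value for the observed F statistic with (2, 5) degrees of freedom
print(stats.f.sf(19.5, dfn=2, dfd=5))  # ~0.00435, matching the calculator

# Alternatively, run the one-way ANOVA directly on per-user CTRs
# (hypothetical values for the three groups of 3, 2, and 3 users):
blue_1 = [0.20, 0.24, 0.22]
blue_2 = [0.30, 0.32]
blue_3 = [0.10, 0.12, 0.14]
f_stat, p_value = stats.f_oneway(blue_1, blue_2, blue_3)
print(f_stat, p_value)
```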
References
Richard A. Johnson and Dean W. Wichern, Applied Multivariate Statistical Analysis, 6th Edition
Maher Qumsiyeh, Using the Bootstrap for Estimating the Sample Size in Statistical Experiments
Ron Kohavi, Diane Tang, and Ya Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing