Multivariate test — Google’s 41 shades of blue

Wendy Hu
5 min read · Sep 20, 2022

Shades of blue

Introduction

Around 2009, Google tested 41 shades of blue for a toolbar on its pages. In this post, we will look at the background of Google’s 41-shades-of-blue experiment. We will also build a simplified end-to-end example that captures the experiment’s technical details in a multivariate test setting.

Background

To design a toolbar, a designer picked a blue that everyone on his team liked. A product manager then ran a test comparing this blue against a different, greener shade, and found that users were more likely to click on the toolbar with the greener blue. Marissa Mayer asked her team to test 41 shades of blue spanning the range between the two blues, the idea being to understand whether different blues produce different click rates.

This raises a philosophical question: who should have more power in shaping products, data science or design, assuming both act ethically? It is a super interesting topic but outside the scope of this post. Here we focus on the data science only.

A simplified end-to-end example

In this example, we will test 3 shades of blue, and let’s assume there are 100 users in total. Since the test involves 3 variants (3 shades of blue), it is no longer a simple A/B test but a multivariate test.

An A/B test vs. a multivariate test

Metric selection

As per Alphabet’s 2022 Q2 financial results, ~58% of Alphabet’s revenue came from the “Google Search & other” segment. Since clicks drive the revenue from search and ads, we will use click-through rate (CTR) as the success metric.

Test design

First, we will design the test using univariate analysis of variance (univariate ANOVA). Our goal is to investigate the effect of color on the click-through rate. The univariate ANOVA model can be decomposed as: click-through rate = overall mean + color treatment effect + residual.
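
In standard one-way ANOVA notation, with i indexing the three blues and j indexing the users assigned to blue i, the decomposition reads:

Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, 3, \quad j = 1, \ldots, n_i

where Y_{ij} is the CTR of user j under blue i, \mu is the overall mean CTR, \tau_i is the treatment effect of color i, and \varepsilon_{ij} is the residual.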

State the null hypothesis and the alternative hypothesis

  • Null hypothesis: CTR for blue #1 = CTR for blue #2 = CTR for blue #3
  • Alternative hypothesis: not all CTRs are the same for blue #1, blue #2 and blue #3

Choose the target population

  • Since we don’t particularly care about a specific segment, e.g., Canada or new users, we can set the target population to the entire population, that is, all 100 users.

Define the unit of randomization

  • Since there is no network effect, it’s appropriate to use user-based randomization to divert individual users to pages with different colors.

Unit of randomization

Calculate the minimum sample size

  • We can use the bootstrap method to calculate the minimum sample size for each variant. The reason we use the bootstrap rather than the well-known sample size formula is that we can’t assume an underlying normal distribution or rely on the central limit theorem.
  • We will not dive deep into the bootstrap details here; if you are interested in learning more about it, please let me know (a rough sketch of the idea follows this list). Let us assume the bootstrap method gives a minimum sample size of 2 per variant (not a realistic number, but it keeps the illustration simple). Therefore we randomly sample 8 users (3 + 2 + 3 across the three variants) from the target population and divert each of them to one of the three pages.
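
The rough idea behind a bootstrap-based sample size calculation: resample a small pilot sample of per-user CTRs to simulate a control group and a lifted group, test each replicate, and pick the smallest per-variant n at which the lift is detected often enough. The sketch below is one reasonable version, not the canonical one; the pilot values and the 0.10 lift are hypothetical, and a Mann-Whitney test is used on each replicate because we avoid normality assumptions:

from scipy import stats
import random

def bootstrap_min_sample_size(pilot_ctrs, lift, alpha=0.01,
                              power_target=0.80, n_boot=1000, seed=7):
    """Smallest per-variant sample size reaching the target power."""
    rng = random.Random(seed)
    for n in range(2, 201):
        rejections = 0
        for _ in range(n_boot):
            control = rng.choices(pilot_ctrs, k=n)
            treated = [x + lift for x in rng.choices(pilot_ctrs, k=n)]
            # Nonparametric test on each bootstrap replicate.
            _, p = stats.mannwhitneyu(control, treated,
                                      alternative="two-sided")
            rejections += p < alpha
        if rejections / n_boot >= power_target:
            return n
    return None  # target power not reached within the search range

pilot = [0.10, 0.15, 0.20, 0.12, 0.18]  # hypothetical pilot CTRs
print(bootstrap_min_sample_size(pilot, lift=0.10))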

Calculate the experiment duration

  • Let’s assume the daily traffic is 50 users, the minimum sample size is 2 per variant, and the number of variants is 3. We can plug these numbers into this formula: duration = (minimum sample size per variant * # variants) / daily traffic = (2 * 3) / 50 = 0.12 days. With these toy numbers the required sample fills up within a day; please note that usually we would still like to run a test for at least a week to capture the day-of-week effect, as in the sketch below.
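
A minimal sketch of this calculation in Python, using the toy numbers above; the min_days=7 floor encodes the at-least-a-week rule of thumb:

import math

def experiment_duration_days(min_sample_per_variant, num_variants,
                             daily_traffic, min_days=7):
    """Days needed to collect the minimum sample, floored at min_days."""
    required_users = min_sample_per_variant * num_variants
    days_for_sample = math.ceil(required_users / daily_traffic)
    # Run at least a week to capture the day-of-week effect.
    return max(days_for_sample, min_days)

print(experiment_duration_days(2, 3, 50))  # -> 7 (sample fills within 1 day)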

Engineering design

Second, we will design the engineering side that surfaces the test, so that each user sees their assigned blue and the user data is logged into the data warehouse.

Architecture diagram for statistical tests

  • Test setup in UI or via API

Configure the test in a UI (e.g., an experimentation platform) or via an API. Store the test definition, setup and management metadata in the system configuration.
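
As an illustration, the test definition could be stored as a record like the one below; every field name and hex value is hypothetical, since real experimentation platforms each have their own schema:

# Hypothetical test definition; field names and hex values are placeholders.
test_config = {
    "test_id": "blue_shade_test",
    "metric": "click_through_rate",
    "variants": [
        {"name": "blue_1", "hex": "#2200C1"},
        {"name": "blue_2", "hex": "#1A0DAB"},
        {"name": "blue_3", "hex": "#1558D6"},
    ],
    "traffic_allocation": [1 / 3, 1 / 3, 1 / 3],  # equal split
    "min_sample_per_variant": 2,
    "significance_level": 0.01,
}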

  • Test deployment

Implement the variant assignment (which variant a user is assigned to), the variant production code (one page with blue #1, another with blue #2, and another with blue #3), and the parameterization on the client side and the server side. A common assignment scheme is sketched below.
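
One common way to implement deterministic, user-based assignment is to hash the user ID together with the test ID and bucket the hash. This is a minimal sketch of that idea, not necessarily how Google implemented it:

import hashlib

VARIANTS = ["blue_1", "blue_2", "blue_3"]

def assign_variant(user_id, test_id="blue_shade_test"):
    """Deterministically map a user to a variant.

    Hashing (test_id, user_id) keeps a user's assignment stable across
    visits and independent across different tests.
    """
    digest = hashlib.md5(f"{test_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(VARIANTS)
    return VARIANTS[bucket]

print(assign_variant("user_42"))  # the same user always gets the same blue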

  • Test instrumentation

Log user behavioral data, variant assignments and system performance in a warehouse for test analyses and other long-term monitoring.
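
For example, each impression or click could be logged as an event record like the following; the schema is hypothetical and real logging pipelines vary:

# Hypothetical event record for the warehouse; schema is for illustration.
event = {
    "timestamp": "2022-09-20T12:34:56Z",
    "user_id": "user_42",
    "test_id": "blue_shade_test",
    "variant": "blue_2",       # from the assignment step above
    "event_type": "click",     # e.g., "impression" or "click"
    "page_latency_ms": 87,     # system performance for monitoring
}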

Test analysis

After the data is logged, we query the data warehouse and summarize the test result in the table below.

Test result

With this table, we can apply the univariate ANOVA and arrive at the following sums of squares.

Univariate ANOVA

With the sums of squares calculated, we can summarize them in an ANOVA table.

ANOVA table

With the above ANOVA table, we can compare the mean square of the color treatment with the mean square of the residual (each mean square is a sum of squares divided by its degrees of freedom). Their ratio is the F-statistic; if it is close to 1, we cannot conclude that the CTRs differ across the three blues. Below are the technical details.

F-statistic

Let’s set the statistical significance level at 0.01. Using an online calculator, we get p-value = P(F(2,5) ≥ 19.5) = 0.00435 < 0.01. Therefore, we reject the null hypothesis and conclude that the CTRs are not all the same for the three blues. So blue does make a difference in this example 💙 What do you think the result of Google’s test was?
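
If you’d rather not use an online calculator, the same one-way ANOVA F-test can be run in a few lines with scipy. The per-user CTR values below are hypothetical stand-ins for the logged data in the test result table, so the printed F and p-value will differ from 19.5 and 0.00435:

from scipy import stats

# Hypothetical per-user CTRs for the 3 + 2 + 3 sampled users.
blue_1 = [0.10, 0.12, 0.11]
blue_2 = [0.20, 0.22]
blue_3 = [0.26, 0.31, 0.29]

# One-way ANOVA: F has (k - 1, n - k) = (2, 5) degrees of freedom here.
f_stat, p_value = stats.f_oneway(blue_1, blue_2, blue_3)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")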

References

Applied Multivariate Statistical Analysis, 6th Edition, Richard A. Johnson and Dean W. Wichern

Using the Bootstrap for Estimating the Sample Size in Statistical Experiments, Maher Qumsiyeh

Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Ron Kohavi, Diane Tang and Ya Xu

Logos and pictures are from various websites — thanks to the designers
