How Google Conducts More/Better/Faster Experiments

Google published a paper around 2010 on setting up experimentation infrastructure to support running more, better, and faster experiments. Here's the link. The following is a cursory attempt at understanding Google's approach.

Overview: Google runs experiments to test changes to user interfaces and to the ML algorithms that affect ranking or content selection. To support this goal, Google needs:

  1. Sufficient infrastructure to support running many experiments
  2. A wide range of tools to support the usage of the infrastructure
  3. Great educational processes to help users make better use of these assets

Defining an experiment:

  1. Representative segment of the population
  2. A change that we are interested in testing (e.g. visible changes such as UI modifications, or non-visible changes such as changes to the underlying ML algorithm that produces the ranking)
  3. A split on the representative segment of the population (test set and the control set)
  4. A response to the change that is of interest; this is the objective (e.g. click-through rate, conversion rate, sales, or other business metrics)
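
Putting the four components together, here is a toy sketch of what an experiment definition might look like (the class and field names are illustrative, not drawn from the paper):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ExperimentDefinition:
    """Toy container for the four components listed above."""
    population_segment: str            # 1. representative segment of the population
    parameter_changes: Dict[str, str]  # 2. the change under test (UI or algorithm parameters)
    treatment_fraction: float          # 3. split between treatment and control
    response_metrics: List[str]        # 4. responses of interest (the objective)

exp = ExperimentDefinition(
    population_segment="1% of English-language web traffic",
    parameter_changes={"result_title_color": "blue"},
    treatment_fraction=0.5,
    response_metrics=["click-through rate", "revenue per query"],
)
print(exp)
```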

At Google, experiments are run to:

  1. Test out new product features
  2. Explore the space around existing features

Current Challenges (As of 2010 at Google?)

Keeping up with the rate of innovation is difficult. Testing as many ideas as possible is necessary, but was not well supported by the existing infrastructure.

Challenges related to experiment infrastructure:

  1. Single-layer experiment structure: every query is in at most one experiment. This is easy to use and flexible, but insufficiently scalable.
  2. Multi-factorial experiment structure: every parameter (factor) can be experimented on independently, and each experimental value for a parameter overlaps with every experimental value for all of the other parameters. Each query will be in N experiments simultaneously, where N is the number of parameters (factors). Even this structure is impractical, since at Google there are thousands of parameters to experiment on and not every factorial combination should be tested (pink text on a pink background, for example).

Goal of a better experiment infrastructure:

  1. More: scalability to run more experiments simultaneously, and flexibility to run experiments with different configurations and sizes while maintaining statistical significance
  2. Better: invalid experiments should not run at all; valid but bad experiments should be caught quickly and disabled; standardized metrics should be made available
  3. Faster: it should be quick and easy to set up an experiment

Proposed solution:

  1. Partition the parameters into subsets. Every subset contains parameters that cannot be varied independently of each other.
  2. Every subset is associated with a layer that contains experiments. Each query would then be in at most N experiments simultaneously, where N is the number of layers.
  3. Each experiment can only modify parameters associated with its layer and the same parameter cannot be associated with multiple layers.
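
As a rough illustration of this layered design, here is a minimal sketch (the parameter names, layer names, and hashing scheme are my own assumptions, not Google's actual implementation): each layer hashes the request or cookie id together with the layer id, so bucket assignments in different layers are independent and each request falls into at most one experiment per layer.

```python
import hashlib

# Illustrative partitioning: parameters that cannot be varied independently
# live in the same layer; a parameter never appears in more than one layer.
NUM_BUCKETS = 1000
LAYERS = {
    "ui_layer": {
        "blue_links":  {"buckets": range(0, 100),   "params": {"link_color": "#0000cc"}},
        "bold_titles": {"buckets": range(100, 200), "params": {"title_weight": "bold"}},
    },
    "ranking_layer": {
        "new_scorer":  {"buckets": range(0, 50),    "params": {"ranking_fn": "v2"}},
    },
}

def bucket_for(request_id: str, layer_name: str) -> int:
    """Hash the request id together with the layer id so that bucket
    assignments in different layers are independent of each other."""
    digest = hashlib.md5(f"{layer_name}:{request_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assign(request_id: str) -> dict:
    """Collect parameter overrides for this request: at most one experiment
    per layer, so with N layers a request is in at most N experiments."""
    overrides = {}
    for layer_name, experiments in LAYERS.items():
        bucket = bucket_for(request_id, layer_name)
        for name, experiment in experiments.items():
            if bucket in experiment["buckets"]:
                overrides.update(experiment["params"])
                break  # only one experiment per layer can claim a request
    return overrides

print(assign("cookie-12345"))
```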

Context around Experimentation Literature:

There are three relevant areas of experimentation:

  1. Single and multi-factorial experiments in the traditional statistical literature.
  2. Running web experiments. Kohavi et al. have covered this area relatively comprehensively.
  3. Interleaved experiments. This is the specific type of experiment focused on evaluating ranking changes.

Structures and key concepts that support this proposed experimentation system

  1. Domain: This is a segmentation of traffic
  2. Layer: This is a subset of the system parameters
  3. Experiment: This is a segmentation of traffic where zero or more system parameters can be given alternate values that change how the incoming request is processed
  4. Launch layers: These differ from experiment layers in that they are contained within the default domain and use a separate partitioning of the parameters. They are used to gradually roll out changes to all users without interfering with existing experiments, and to keep track of these roll-outs in a standardized way.
  5. Diversion types: How traffic is diverted into experiments. This can be done either randomly per request or via cookie mods, so that a given user is diverted consistently across requests (a small sketch follows this list).
  6. Canarying new code: Testing new code on a small amount of traffic to make sure it is not buggy before exposing it to more traffic.
  7. Binaries: Collections of code packaged as executables; in this context, the serving programs that process requests and whose parameters experiments can modify.
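
Here is the promised sketch of the two diversion types (a toy illustration; the bucket count and function names are assumptions): cookie-mod diversion hashes a stable cookie id, so the same user is consistently diverted across queries, while random diversion makes an independent per-request decision.

```python
import hashlib
import random

NUM_MODS = 1000  # assumed number of cookie-mod buckets

def cookie_mod_divert(cookie_id: str, claimed_mods: set) -> bool:
    """Cookie-mod diversion: the same cookie always maps to the same mod,
    so a given user is consistently in (or out of) the experiment."""
    mod = int(hashlib.md5(cookie_id.encode()).hexdigest(), 16) % NUM_MODS
    return mod in claimed_mods

def random_divert(traffic_fraction: float) -> bool:
    """Random diversion: each request is diverted independently,
    with no consistency across a single user's queries."""
    return random.random() < traffic_fraction

# A user diverted via cookie mod stays diverted on every one of their queries:
print(cookie_mod_divert("user-cookie-abc", claimed_mods={1, 2, 3}))
print(random_divert(0.05))
```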

Given this infrastructure, evaluating and launching a typical feature proceeds as follows:

  1. Implement new feature in the right binary (code review, binary push, default values etc.)
  2. Create a canary experiment on a small slice of traffic to ensure that the feature works; if it does not, more code needs to be written or revised.
  3. Create an experiment or a set of experiments to evaluate the feature. Setting up an experiment includes specifying the diversion type, diversion conditions, and the affected system parameters.
  4. Evaluate the metrics from the experiment. Depending on the results, additional iterations may occur by modifying existing experiments or creating new ones.
  5. If the feature in question is deemed launchable, the launch process is initiated:
  6. A launch layer and a launch experiment are created. The launch experiment is gradually ramped up, after which the launch layer is deleted and the default values of the relevant parameters are changed.

Tools and Processes to support the overlapping infrastructure

Tooling: Data File Check Utilities

Automated data checks on input files to prevent useless/broken experiments from taking place. Checks include the following:

Consistency and constraint errors

  1. Uniqueness of ids
  2. Whether the experiment is in the right layer
  3. Whether the layer has enough traffic to support the experiment
  4. Whether the experiment is asking for traffic already claimed by other experiments
  5. Basic experiment design checks (does the experiment have a control, …)
  6. Whether the domain is diverted in the same way as the others
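
A minimal sketch of what a few of these checks might look like, assuming a hypothetical config format where each experiment is a dict with id, layer, traffic_fraction, and control_id fields (these field names are illustrative, not the actual data file schema):

```python
def check_experiment_configs(experiments, layer_budgets):
    """Run a few of the consistency checks listed above over a list of
    hypothetical experiment config dicts; returns a list of error strings."""
    errors = []

    # Uniqueness of ids
    ids = [e["id"] for e in experiments]
    if len(ids) != len(set(ids)):
        errors.append("duplicate experiment ids")

    # Does each layer have enough unclaimed traffic for what is being asked?
    claimed = {}
    for e in experiments:
        claimed[e["layer"]] = claimed.get(e["layer"], 0.0) + e["traffic_fraction"]
    for layer, total in claimed.items():
        if total > layer_budgets.get(layer, 1.0):
            errors.append(f"layer {layer!r} is over-subscribed ({total:.0%} claimed)")

    # Basic design check: does the experiment have a control?
    for e in experiments:
        if not e.get("control_id"):
            errors.append(f"experiment {e['id']!r} has no control")

    return errors

print(check_experiment_configs(
    experiments=[
        {"id": "exp1", "layer": "ui", "traffic_fraction": 0.6, "control_id": "ctl1"},
        {"id": "exp2", "layer": "ui", "traffic_fraction": 0.5, "control_id": None},
    ],
    layer_budgets={"ui": 1.0},
))
```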

Tooling: Real-Time Monitoring & Experiment Analysis

These are tools that capture and track basic metrics as quickly as possible in order to determine whether anything unexpected is happening. Their capabilities include:

  1. Choosing which metrics to track
  2. Setting expected ranges for individual metrics, and guardrails on those metrics
  3. Guardrail communications: how and where alerts are delivered
  4. Shutting down an experiment
  5. Configuring and adjusting experiment parameters
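
A toy sketch of the guardrail idea: track a few metrics, compare them against expected ranges, and flag the experiment for possible shutdown when a metric falls outside its range. The metric names and thresholds below are illustrative assumptions.

```python
def check_guardrails(metrics: dict, expected_ranges: dict):
    """Compare observed metric values against their expected ranges and
    return any violations; a real system would alert the owner and could
    disable the experiment automatically."""
    violations = []
    for name, (low, high) in expected_ranges.items():
        value = metrics.get(name)
        if value is not None and not (low <= value <= high):
            violations.append((name, value, (low, high)))
    return violations

violations = check_guardrails(
    metrics={"ctr": 0.012, "latency_ms": 420},
    expected_ranges={"ctr": (0.02, 0.08), "latency_ms": (0, 300)},
)
for name, value, bounds in violations:
    print(f"GUARDRAIL: {name}={value} outside expected range {bounds}; consider shutting down")
```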

While the tools above enable running simultaneous experiments and expedite the experimentation process, the experiment analysis tool is the other side of the coin.

The experiment analysis tool provides accurate values for the suite of metrics that experimenters examine to evaluate an experiment. Multiple metrics can be combined into a single objective function, but experimenters can also examine a suite of metrics individually.

Key design goals of an experiment analysis tool include:

  1. Correctly computed and displayed confidence intervals
  2. Good UI for ease of use and understanding
  3. Support for slicing: it should be possible to drill down from aggregate numbers into finer-grained breakdowns
  4. Extensibility: it should be possible to add custom metrics
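
As an illustration of the first goal above, here is a minimal sketch of a confidence interval for the difference in click-through rate between experiment and control, using a standard two-proportion normal approximation (a generic textbook method, not necessarily what Google's analysis tool computes):

```python
import math

def ctr_diff_ci(clicks_exp, queries_exp, clicks_ctl, queries_ctl, z=1.96):
    """95% confidence interval for the difference in click-through rate
    between the experiment and control arms (normal approximation)."""
    p_exp = clicks_exp / queries_exp
    p_ctl = clicks_ctl / queries_ctl
    se = math.sqrt(p_exp * (1 - p_exp) / queries_exp +
                   p_ctl * (1 - p_ctl) / queries_ctl)
    diff = p_exp - p_ctl
    return diff - z * se, diff + z * se

low, high = ctr_diff_ci(clicks_exp=5200, queries_exp=100_000,
                        clicks_ctl=5000, queries_ctl=100_000)
print(f"CTR delta 95% CI: [{low:.5f}, {high:.5f}]")
```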

Experiment Design, Sizing, and Pre/Post Periods

Effective Size of an experiment:

N = (1/N_c + 1/N_e)^(-1), where N_c is the number of queries in the control and N_e is the number of queries in the experiment.

Estimating the standard error of the metrics of interest can be a challenge in this setting.
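
A small worked example of the effective-size formula, together with a generic rule-of-thumb power calculation (the significance and power constants below are standard defaults and an assumption on my part, not taken from the paper):

```python
def effective_size(n_control: int, n_experiment: int) -> float:
    """N = (1/N_c + 1/N_e)^-1 from the formula above."""
    return 1.0 / (1.0 / n_control + 1.0 / n_experiment)

def required_effective_size(std_dev: float, detectable_change: float,
                            z_alpha: float = 1.96, z_power: float = 0.84) -> float:
    """Rule of thumb: the standard error of the experiment-minus-control
    difference is roughly s / sqrt(N), so detecting a change of size theta
    at ~5% significance and ~80% power needs N >= ((z_alpha + z_power) * s / theta)^2."""
    return ((z_alpha + z_power) * std_dev / detectable_change) ** 2

N = effective_size(n_control=2_000_000, n_experiment=500_000)
needed = required_effective_size(std_dev=0.15, detectable_change=0.001)
print(f"effective size: {N:,.0f}, required: {needed:,.0f}, sufficiently powered: {N >= needed}")
```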

A pre-period is a period of time prior to starting the experiment during which the same traffic is diverted into the experiment but no changes are made. This ensures that the diverted traffic really is comparable to the control traffic.

A post-period is the same, but after the conclusion of the experiment. It is useful for determining whether there are any learned (carry-over) effects from running the experiment.

Education for adopting the experiment infrastructure

The tools and infrastructure discussed above are the technical requirements for enabling more, better, faster experimentation. On the people side, experiments should:

  1. Be well designed
  2. Have their results understood and disseminated

Experiment council:

This is a council of engineers who review a lightweight checklist that experimenters fill out prior to running their experiments. The checklist questions address:

  1. Basic experiment characterization (what does the experiment test, what are the hypotheses)
  2. Experiment set up (which experiment parameters are varied, what each experiment or set of experiments tests…)
  3. Experiment diversion and triggering (what diversion type and which conditions to use for diversion, what proportion of diverted traffic triggers the experiment)
  4. Experiment analysis (which metrics are of interest, how big of a change the experimenter would like to detect)
  5. Experiment sizing and duration (whether, given the affected traffic, the experiment has sufficient statistical power to detect the desired metric changes)
  6. Experiment design (whether pre- and post-periods are warranted)

A side benefit is that the council is a useful way of disseminating updated best practices around experimentation. The checklist itself is a hosted web application.

Interpreting the data

The other process put in place is a forum where experimenters bring their results to discuss with experts. The goals of this forum are to:

  1. Ensure that the experiment results are valid
  2. Given valid results, make sure that the metrics being looked at form a complete enough set to understand what is happening.
  3. Given a full set of results, discuss and agree on whether the experiment is, overall, a positive or negative user experience, so that decision makers can use the data to determine whether to launch the change.