Accelerating Innovations Through Experimentation: Part I

Zixin Wu
Glassdoor Engineering Blog
13 min read · Jul 24, 2020

Introduction

At Glassdoor, many business decisions are data-driven. Experimentation (a.k.a. A/B testing) is a very popular technique because it can measure the impact of granular changes in a product without mistakenly attributing changes caused by outside factors. A powerful experimentation suite is therefore crucial to speeding up our innovation. It usually consists of two parts: 1) an experiment management system, including traffic allocation; 2) a data pipeline for result analysis and reporting.

We are going to talk about experiment management and traffic allocation here, and leave the second part to a later article. With many experiments running simultaneously company-wide on different devices, system components, and languages, we need a flexible, reliable, and easy-to-use system to manage them with minimal effort. After comparing popular options on the market and considering our previous experience, we decided to design and implement our own experimentation traffic allocation platform to suit our needs. We named it Darwin, as the purpose of experimentation is to select the best variation.

Overview

In our experimentation framework, an application interacts with experiments by changing its behavior based on the value of one or more parameters. In a nutshell, Darwin is designed to answer one question: what is the value of a parameter, given a request sent to the Darwin API? To answer it, we create an experiment configuration. Experiment configuration files are checked into a Git repository for review, approval, and versioning purposes.

Goal #1: experiment configuration should be human readable and human friendly.

As an example, consider an experiment that changes the values of the parameters “fontSize” and “fgColor”. The application reads these values from the Darwin API and renders the UI based on them.
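A minimal sketch of the application side follows. The class and method names (DarwinContext, DarwinRequest, getInt, getString) are hypothetical and illustrative only; the post does not show Darwin’s actual client API.

// Hypothetical usage; class and method names are illustrative, not Darwin's actual API.
DarwinContext ui = darwinClient.getContext("ui_team");

// Request attributes that Darwin may use for filters and for the domain's unit id.
DarwinRequest request = DarwinRequest.builder()
        .attribute("user_id", currentUser.getId())
        .attribute("deviceType", "desktop")
        .attribute("region", "us")
        .build();

// The application only asks for parameter values; it never needs to know
// which experiment or treatment the request was allocated to.
int fontSize = ui.getInt(request, "fontSize", 12);          // 12 is a fallback default
String fgColor = ui.getString(request, "fgColor", "black");

renderHeader(fontSize, fgColor);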

So far we have only mentioned parameters and their values; a complete Darwin configuration contains more elements. The hierarchy is as follows: a context can have multiple domains, and a domain can have multiple experiments and treatments. We can set up an experiment with multiple treatments, and the same treatment can be used in multiple experiments if needed.

Context

Goal #2: usage of Darwin by one team should be independent of other teams.

Darwin is designed to be used by many teams/organizations company-wide. It is impractical to always coordinate with other teams whenever an experiment or a parameter is created (e.g., if parameter “fontSize” is used by multiple teams in their applications, possibly for different purposes). We should allow one team to use Darwin without knowing how other teams use it. Therefore, we created Context as the outermost layer of the structure.

Any element must be contained within a Context, and there is no interaction or dependency across Contexts. It is perfectly fine to create an element (e.g. parameter) with the same name in multiple Contexts, because an application must specify the context name before it can interact with Darwin API to get the value of a parameter. Conceptually, a Context maps to the territory of a team.

However, an application or server can load multiple Contexts as separate, independent objects in memory. This allows multiple teams’ code to run in the same application or server, with each team interacting with its own Context independently. The Darwin compiler (discussed later) also compiles each Context separately.
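For instance (again with hypothetical client names), a single service could hold two teams’ Contexts side by side:

// Hypothetical: two teams' Contexts loaded independently in one JVM.
// "request" is the same kind of object as in the earlier sketch.
DarwinContext searchCtx = darwinClient.getContext("search_team");
DarwinContext uiCtx = darwinClient.getContext("ui_team");

// The same parameter name can exist in both Contexts without any conflict.
int searchPageSize = searchCtx.getInt(request, "pageSize", 20);
int uiPageSize = uiCtx.getInt(request, "pageSize", 10);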

Domain

Goal #3: within a Context, we want the capability to run multiple experiments at the same time for a request sent to the Darwin API, if those experiments are orthogonal.

More specifically, if a treatment (discussed later) of an experiment affects experiment metrics (usually by affecting users’ behavior) equally across all treatments of other experiments, then this experiment can run together with those other experiments. For example, we should allow an experiment that changes font size to run with another experiment that changes the web page background color, under the assumption that these two changes affect users’ behavior independently.

Goal #4: on the other hand, we also want the capability to make certain experiments mutually exclusive, meaning that only one of the experiments in a group can run at a time for a request sent to the Darwin API.

This feature is useful, and usually required, when testing multiple aspects of one application feature. For example, we may want to start a new experiment to test a new Search Ranking algorithm while another Search Ranking experiment is already running, and we don’t want to affect that experiment at all.

The above two goals are achieved by introducing the Domain layer. Every experiment belongs to one and only one domain. All domains are evaluated for a request, and at most one experiment in each domain can run for the request. Note that because a request can be allocated to experiments in multiple domains, we allow a parameter to be defined in only one domain; otherwise we would not know from which domain to return the parameter’s value. The Darwin compiler verifies and enforces this constraint.

A domain definition consists of a name, the total number of segments into which to divide the traffic, and a “unit id”, which is the name of the request attribute used for traffic allocation. For example:

domain ui_domain/100/user_id
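Interpreted as data, this definition could be represented roughly as follows (a hypothetical Java record, not Darwin’s internal model):

// Illustrative representation of the parsed definition above:
// name = "ui_domain", 100 traffic segments, unit id attribute = "user_id".
record DomainDefinition(String name, int totalSegments, String unitIdAttribute) {}

DomainDefinition uiDomain = new DomainDefinition("ui_domain", 100, "user_id");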

Experiment

An experiment specifies an A/B test. It contains one or more treatments (with one of them designated as “control” for analysis purposes), an allocation (how much traffic the experiment can run on), and zero or more filters. Note that an experiment name starts with its domain name followed by a dot.

Filter

Goal #5: we should be able to specify whether a request is eligible for an experiment based on the request’s attributes.

This is frequently required because a feature may be designed to work only for a subset of requests. For example, when we A/B test an English Search Ranking algorithm, we want to run this experiment only on requests coming from U.S. users. Furthermore, even if we allocate all U.S. users’ requests to this experiment, we still want to be able to allocate certain non-U.S. users (e.g. French users) to another experiment in the same domain with a non-U.S. filter (e.g. for A/B testing a French Search Ranking algorithm). This requires that we segment requests by their attributes (e.g. location of the user) before allocating them to an experiment.

An experiment can specify more than one filter (e.g., one for the user’s location and one for the user’s device), and a filter can have multiple values. To be eligible for the experiment, a request’s attributes must match at least one value of every filter. For example, if we define an experiment with the following filters:

deviceType-filter: desktop, mobile

region-filter: us, gb

Then a request with deviceType=desktop and region=gb is eligible for this experiment, while a request with deviceType=desktop and region=fr is ineligible.
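The matching rule can be sketched as follows; the types and class below are illustrative only, since the real Darwin filter model is not shown in this post.

import java.util.Map;
import java.util.Set;

// Illustrative only: a request is eligible if, for every filter,
// the corresponding request attribute matches at least one allowed value.
public final class FilterMatcher {

    static boolean isEligible(Map<String, Set<String>> filters, Map<String, String> request) {
        return filters.entrySet().stream()
                .allMatch(f -> f.getValue().contains(request.getOrDefault(f.getKey(), "")));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> filters = Map.of(
                "deviceType", Set.of("desktop", "mobile"),
                "region", Set.of("us", "gb"));
        System.out.println(isEligible(filters, Map.of("deviceType", "desktop", "region", "gb"))); // true
        System.out.println(isEligible(filters, Map.of("deviceType", "desktop", "region", "fr"))); // false
    }
}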

Allocation

Goal #6: the chance a request is allocated to an experiment should be proportional to the specified allocation of the experiment in a domain.

This implies a request is assigned to an experiment randomly.

Goal #7: the randomizations between experiments in different domains are orthogonal (in other words, independent).

Formally, this means that if one experiment is defined in a domain, it should run on the same proportion of the requests allocated to each experiment in any other domain. For example, if the allocation of experiment A in domain 1 is 75%, then 75% of the requests in an experiment (such as experiment B) in domain 2 should also run experiment A.

Goal #8: the experiment allocation should be deterministic. This means that if the same request is sent multiple times, we should always allocate it to the same experiment(s), unless the allocated experiment’s settings are changed.

We achieve these allocation goals by assigning a unique hashcode to each experiment, computing a hashcode of the request’s “unit id” attribute (specified in the domain definition), and then using the combination of the two hashcodes as the key of a randomization algorithm for allocation. We also validated the allocation by running multiple A/A tests in multiple domains, both in simulation and on live traffic.
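The post does not spell out the exact hash function or segment layout, but the general idea of deterministic, hash-based allocation can be sketched like this (the names and choices below, such as MD5, are illustrative assumptions, not Darwin’s implementation):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: deterministic allocation by hashing an experiment-specific
// seed together with the request's unit id, then taking a modulo over the
// domain's total number of segments.
public final class SegmentAllocator {

    /** Maps a unit id (e.g. a user_id value) to a segment in [0, totalSegments). */
    static int segmentOf(String experimentSeed, String unitId, int totalSegments) {
        byte[] digest = md5((experimentSeed + ":" + unitId).getBytes(StandardCharsets.UTF_8));
        int hash = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                 | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
        return Math.floorMod(hash, totalSegments);
    }

    /** The request runs the experiment if its segment falls in the experiment's assigned range. */
    static boolean isAllocated(String experimentSeed, String unitId,
                               int totalSegments, int segmentStart, int segmentCount) {
        int segment = segmentOf(experimentSeed, unitId, totalSegments);
        return segment >= segmentStart && segment < segmentStart + segmentCount;
    }

    private static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

In a sketch like this, the result depends only on the seed and the unit id, so the same request is always allocated the same way (Goal #8), and using different seeds in different domains keeps the allocations independent (Goal #7).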

Goal #9: after experimentation, a feature can be easily adopted.

Not all traffic has to participate in an experiment. For the traffic not in any experiment in a domain, we can optionally set up a “default” experiment in the domain, which has one and only one treatment. Like normal experiments, default experiments can have filters. When we need to adopt a feature, we move its parameter-value pairs to the treatment of a “default” experiment. All the parameters in the default experiment(s) collectively form a configuration system for the application, which provides more flexibility than setting the values in the application’s code or properties files, because a default experiment can apply to only a subset of traffic through its filters.
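The fallthrough behavior can be sketched as follows; the types and method names are hypothetical, and the real resolution logic lives inside Darwin.

import java.util.Optional;

// Illustrative resolution order within one domain, not Darwin's actual code.
Optional<String> resolve(Domain domain, Request request, String parameter) {
    // 1. If the request is allocated to a regular experiment, use its treatment.
    Optional<Experiment> allocated = domain.allocateExperiment(request);
    if (allocated.isPresent()) {
        return allocated.get().valueOf(request, parameter);
    }
    // 2. Otherwise fall through to the domain's default experiment, which has
    //    exactly one treatment and may have filters of its own.
    return domain.defaultExperiment()
            .filter(def -> def.matchesFilters(request))
            .flatMap(def -> def.valueOf(request, parameter));
}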

Treatment

Goal #10: a behavior of an application can be defined by multiple parameters.

This is often needed because a parameter controls one aspect of the code path, and an application behavior may be a combination of multiple aspects. For example, we can have one parameter for the foreground color and another parameter for the background color. We may want to test only a subset of all color combinations because we know some combinations won’t work out. We could instead create a single parameter that combines the foreground and background colors, but that parameter would have too many combined options, and it is inflexible to create a new parameter whenever we want to test a combination of a new set of aspects.

For this reason, we define a treatment as a container of parameter-value pairs. An experiment must have one or more treatments. Once a request is allocated to an experiment (as described in the previous section), we further allocate the request to one of the treatments of the experiment. Then, when the application asks for the value of a parameter, we return the value defined in the allocated treatment (or return absence if the parameter is not defined in that treatment).

Goal #11: the treatment allocation should be deterministic.

Similar to experiment allocation, we assign a unique hashcode to each treatment (note: the same treatment used in different experiments has different hashcodes) and use that as the key of a randomization algorithm for allocation.

Goal #12: the randomizations between treatments in experiments in different domains are orthogonal.

Combined with goal #7, this ensures that an experiment in a domain does not skew the result of experiments in other domains.

Goal #13: treatments can be reused and extended into new treatments.

This makes creating a new treatment easier if we already have a similar one. It is done through treatment inheritance, very similar to inheritance in object-oriented design, and multiple inheritance is allowed. A child treatment gets all parameter-value pairs from its parent treatment(s) and can override a parameter’s value from its parent(s) by specifying a new value in its own definition. If the parent treatments provide different values for a parameter, the child treatment must override that value.
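As an illustration (expressed with plain Java maps rather than Darwin’s configuration syntax, which is not reproduced here), suppose new_treatment inherits from two hypothetical parents that disagree on fontSize, so it must override that parameter:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: treatment inheritance modeled with plain maps.
public final class TreatmentInheritanceDemo {

    /** Merge the parents' parameters, then apply the child's own overrides. */
    static Map<String, String> effectiveParameters(List<Map<String, String>> parents,
                                                   Map<String, String> ownParameters) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> parent : parents) {
            merged.putAll(parent);
        }
        merged.putAll(ownParameters); // the child's own overrides win
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> baseFont = Map.of("fontSize", "12", "fgColor", "black");
        Map<String, String> greenBg = Map.of("fontSize", "16", "bgColor", "green");

        // Parents disagree on fontSize, so new_treatment must override it.
        Map<String, String> newTreatment =
                effectiveParameters(List.of(baseFont, greenBg), Map.of("fontSize", "14"));

        System.out.println(newTreatment); // contains fontSize=14, fgColor=black, bgColor=green
    }
}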

In the example above, new_treatment effectively has the following parameter values after inheritance:

fontSize: 14

fgColor: black

bgColor: green

Feature verification

Goal #14: we can specify a treatment name when sending a request to an application and observe how the application’s behavior changes under that treatment.

We can set the treatment name (with a domain name) in the “override treatment” parameter of a Darwin API request and the API will skip randomization steps and force the request into the specified treatment. This is very useful for verifying a feature before starting an experiment.
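A short sketch of what that could look like from the caller’s side (parameter and method names are hypothetical, not Darwin’s actual API):

// Hypothetical: forcing a request into a specific treatment for verification.
DarwinRequest verification = DarwinRequest.builder()
        .attribute("user_id", testUser.getId())
        .overrideTreatment("ui_domain", "new_treatment") // domain name + treatment name
        .build();

// Randomization is skipped; parameter values come from new_treatment directly.
int fontSize = ui.getInt(verification, "fontSize", 12);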

Compilation and serving environment

Darwin configuration is specified in a human-friendly syntax. Compilation transforms that syntax into an efficient machine-readable format, after validating the syntax and constraints (e.g. the same parameter cannot be defined in multiple Domains within a Context, and enough segments must be available for all experiments in a Domain). The compiler is implemented as a Maven project and runs on Jenkins. The compilation output is one JSON file per Context, which is uploaded to S3 for applications to download.

Goal #15: traffic allocation should be consistent in different serving environments.

One lesson we learned is that different environments may produce different compilation outcomes and thus different allocations. For example, a change of Java version may change the order in which the elements of a hashmap are iterated. For this reason, we run compilation once in a central location and use the same compilation output in all serving environments. During serving, we only need to compute the hashcode of API requests and allocate requests based on that hashcode. We made sure that the hashcode computation is consistent across environments and that the allocation algorithm (a simple mod operation) is as well.
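To illustrate the kind of non-determinism this avoids (a hypothetical example of the ordering problem, not Darwin’s compiler code): handing out segment ranges while iterating a HashMap can differ across JVMs, whereas iterating in a sorted order is stable.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical: assign each experiment a list of segments in a well-defined
// (sorted-by-name) order so the outcome does not depend on HashMap iteration
// order, which can vary between Java versions.
public final class DeterministicSegmentAssignment {

    static Map<String, List<Integer>> assign(Map<String, Integer> segmentCountByExperiment,
                                             int totalSegments) {
        Map<String, List<Integer>> assignment = new TreeMap<>();
        int next = 0;
        for (Map.Entry<String, Integer> e : new TreeMap<>(segmentCountByExperiment).entrySet()) {
            List<Integer> segments = new ArrayList<>();
            for (int i = 0; i < e.getValue() && next < totalSegments; i++) {
                segments.add(next++);
            }
            assignment.put(e.getKey(), segments);
        }
        return assignment;
    }
}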

Comparison to other available options

There are a few open-source projects in this space, and some are well known. Optimizely is a widely used experimentation platform, but it requires sending visitor requests off-premise, which doesn’t work for our high-traffic use cases. Some smaller open-source projects are either programming-language-specific (e.g. Abba and Alephbet are for JavaScript only, and Vanity is for Rails only) or do not provide experiment allocation (e.g. Sixpack). And of course, LinkedIn, Airbnb, Uber, etc. have their own in-house experimentation frameworks, but those are proprietary and not available to us. In this section, we compare Intuit’s Wasabi, Facebook’s PlanOut, Indeed’s Proctor, and Darwin on the following aspects.

Architecture and Tech stack:

Wasabi: centralized REST service, HTTPD and Jetty servers behind a load balancer, bucket assignments stored in Cassandra and events stored in MySQL

PlanOut: experiment configurations (defined in Python code; or in YAML if in PlanOut4J) are compiled into JSON and loaded into applications’ memory by PlanOut interpreter.

Proctor: Java code is generated from experiment configurations (in JSON) and imported into Java applications.

Darwin: experiment configurations are compiled in a central location and loaded into applications’ memory by a loader. Traffic allocation happens in the applications’ memory, and the result is consistent across environments because the allocation of traffic segments is fixed during compilation.

Client-side (such as Javascript) experiment support:

Wasabi: Out-of-the-box, because it provides a REST service

PlanOut: Not out-of-the-box

Proctor: proctor-pipet provides a REST service

Darwin: Not out-of-the-box, but we are building a wrapper to provide a REST API

API Response time:

Wasabi: SLA: 30ms server-side, partially due to the complexity of the system tiers

PlanOut: not clear

Proctor: not clear

Darwin: Single-digit milliseconds

API access pattern:

Wasabi: the API requires an experiment name; in other words, an application has to know the experiment name to call the API, and it needs to make multiple API calls to participate in multiple experiments.

PlanOut: applications access allocation results by parameter names.

Proctor: applications access allocation results by experiment names.

Darwin: applications access allocation results by parameter names. They do not need to know the assigned experiment name or treatment name, because only parameters affect the behavior of applications.

Experiment eligibility rules:

Wasabi: supported by targeting rules.

PlanOut: not supported out-of-the-box. PlanOut4J embeds eligibility rules in namespace definitions, which makes it inflexible to change the eligibility rules.

Proctor: supported. Rules are defined in an experiment.

Darwin: supported by filters.

Mutually exclusive experiments:

Wasabi: supported, but combining experiment priority, targeting rules, and sampling-rate settings to achieve the desired allocations among mutually exclusive experiments is tricky.

PlanOut: supported by namespaces. Experiments in the same namespace are mutually exclusive.

Proctor: not supported but one can put all buckets of mutually exclusive experiments into one experiment.

Darwin: supported. Experiments in the same domain with filters mapped to overlapping traffic are mutually exclusive. Compiler validates available segments for mutually exclusive experiments.

Treatment (or bucket) concept:

Wasabi: parameters and their values can be specified as one string in “payload” of a bucket. No support for bucket inheritance to extend existing buckets into new buckets.

PlanOut: parameter centric. An experiment returns only one parameter and its value.

Proctor: same as Wasabi.

Darwin: A treatment can contain multiple parameters and their values. Natively supports the data structure of a list of parameter name-value pairs. Supports treatment inheritance and parameter value override.

Making experiment configuration changes:

Wasabi: through the provided Admin UI.

PlanOut: by editing Python (or YAML in PlanOut4J) files. One needs to “add” and “remove” experiments in a way that preserves the order of experiment starts and stops, in order to keep traffic allocation consistent after making changes.

Proctor: by editing JSON files. If a bucket is removed, traffic allocation on the remaining buckets might change.

Darwin: by editing simplified JSON-like files. Darwin compiler automatically figures out the changes for traffic allocation by comparing the version in-use and the new version submitted. No need to “add” or “remove” experiments like in PlanOut.

Analytics:

Wasabi: provides real-time analytics and metrics visualization.

PlanOut: basic logging is provided in the Python implementation.

Proctor: no built-in support.

Darwin: no built-in support. Because our metrics are far more complicated and differ case-by-case, we decided to let developers build their tracking and analytics for their own use cases.

Conclusion

Experimentation traffic allocation can be a very complex process when we need to support many kinds of use cases across multiple teams in the organization. We designed and built Darwin to provide our experiment designers with a powerful, flexible, generic, and comprehensive platform that facilitates their day-to-day operations. Since its launch, Darwin has been adopted by many teams at Glassdoor as a powerful tool for making data-driven decisions.
