Creating an A/B testing system

Manuel Andere
art/work - behind the scenes at Patreon
9 min read · May 30, 2018

This is the third and final post of our A/B testing series. Enjoy!

What we’ve learned so far

In our first A/B testing series post, we defined causal inference and explained why randomized experiments (or A/B tests) allow us to make causal claims. We established a framework that simplifies the design of randomized experiments, which uses three main concepts:

  1. Observation units. The units that experience the treatments. Composed of Identifiability (how you identify the units) and Eligibility (the set of conditions they have to meet to participate in the experiment).
  2. Treatments (aka variants). The experiences that the observation units can potentially have.
  3. Response metric. A measurement you’ll get for each observation unit.

An experiment analysis consists of comparing the response metric among treatment groups to determine if the treatment had an effect on the observation units.

In the second post of the series, we discussed the importance of sizing an experiment, and established a reusable randomization framework for power and sample size calculations.

At this point, you should be able to answer:

  • Who will be part of the experiment?
  • What will the treatments consist of?
  • What metric will we measure?
  • What type and size of effect do we expect our treatment to have on the response metric?
  • How much data do we need if we assume a given effect size and an analysis method?

This post will help you answer the following question: How do I decide which unit gets which treatment? This question is at the core of the experiment implementation stage. In the causal inference literature, the procedure by which assignment is done is called the assignment mechanism. We will not focus on the theoretical aspects of assignment mechanisms, but on how to implement a basic assignment mechanism that works for online experiments.

Simplest assignment mechanisms

The simplest assignment mechanism is called complete randomization. It consists of distributing a fixed set of observation units among treatments at random. For example, you have a list of 100 user_ids, and you randomly assign 30 to treatment A, 40 to treatment B, and 30 to treatment C.
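As a toy illustration (not code from the post), complete randomization is just a shuffle of a known list of identifiers followed by a deterministic split:

    import random

    user_ids = [f"user_{i}" for i in range(100)]  # a fixed, known population
    random.shuffle(user_ids)

    # Split the shuffled list into groups of 30 / 40 / 30
    assignment = {
        "A": user_ids[:30],
        "B": user_ids[30:70],
        "C": user_ids[70:],
    }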

This randomization is very easy if you have a fixed population to experiment on, and if you can identify all the units at the design stage. In many online experiments, these assumptions do not hold. We usually don’t know who will visit a website in the next few days (there might even be new users that we can’t identify at the design stage).

A way to get around this problem is to assign units “as they arrive.” Every time we see an identifier (e.g., a user_id), we draw a treatment at random; in probabilistic terms, each arrival is an independent draw from a categorical distribution over the treatments. The Law of Large Numbers guarantees that the group proportions will be approximately correct. This assignment mechanism is called independent Bernoulli trials. Although it solves the problem of not knowing who (or how many units) will visit, it has a clear limitation. Think about an experiment that consists of showing different versions of a webpage: under the independent Bernoulli trials mechanism, if the same unit loads the page more than once, they might experience a different treatment every time, which could make the unit useless for the experiment (it would not be clear which treatment we should attribute their reactions to).
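To make that limitation concrete, here is a minimal Python sketch (illustrative only, not Patreon’s code) of assigning treatments as units arrive; note that the same user_id can draw a different treatment on every page load:

    import random

    TREATMENTS = ["control", "A", "B"]
    PROPORTIONS = [0.4, 0.3, 0.3]  # desired treatment proportions

    def assign_on_arrival(user_id):
        # A fresh, independent draw every time the unit shows up
        return random.choices(TREATMENTS, weights=PROPORTIONS, k=1)[0]

    assign_on_arrival("user_42")  # might return "A"
    assign_on_arrival("user_42")  # might return "control" on the next page load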

Hash functions to the rescue

Hash algorithms come to the rescue. Let’s take MD5 as an example. The MD5 hash function takes a string of almost any size and converts it into a 128-bit hash value.

md5('my_string') = '3d212b21fad7bed63c1fb560c6a5c5d0'

That hash value is just a seemingly random string of characters. Think of it as a number between one and one duodecillion 😮. We don’t need that many numbers, so let’s keep only the first 10 bits of the hash value (which gives us a number between one and roughly 1,000).

The truncated MD5 mapping guarantees:

  • Determinism: Always maps an input to the same output (otherwise it wouldn’t be a function).
  • Uniformity: Given a new input, it’s equally likely to produce any number from one to 1,000.
  • Anti-continuity: Two similar inputs will most likely lead to very different outputs.
  • Speed: It’s really fast to compute (probably faster than retrieving a value from storage).

Why is the truncated MD5 function useful for us, you might ask? Because we can use it to map any identifier (old or new, known or unknown) to a random uniform number between one and 1,000. Further, if we assign treatments to the 1,000 buckets according to some desired proportions, by transitivity, we’ve built a method that assigns treatments to identifiers in a consistent way, and with the desired probabilities.
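Here is a rough sketch of what that mapping could look like in Python (illustrative; the post truncates the hash to its first 10 bits, while taking the digest modulo 1,000 achieves essentially the same uniform bucketing):

    import hashlib

    NUM_BUCKETS = 1000

    def bucket_for(identifier):
        # Deterministic: the same identifier always lands in the same bucket
        digest = hashlib.md5(identifier.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    bucket_for("user_42")  # always returns the same bucket for this identifier

In practice you would typically also salt the input with the experiment name so that bucket assignments are independent across experiments; that detail is an extension beyond what’s described here.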

More about buckets

Thanks to the bucket layer that separates the universe of observation unit identifiers from the treatment space (the set of possible treatments), any experiment can be treated as a completely randomized experiment (we can randomize over the 1,000 buckets as a fixed and known population).

This bucketing framework is quite powerful because it can be used with any type of identifier. Common identifiers are cookie-based device_ids and user_ids. However, we could also extend it to request_ids, query_ids, emails, or domain-specific identifiers (like campaign_ids at Patreon).

At Patreon, Jonas and Cathy (two of our awesome growth team engineers) created an experiment dashboard where we explore and manage all of our experiments. One of the key things we show for each experiment is the type of identifier (top right of each experiment card).

Other critical features of the system: versions, overrides, and more

Versioning

In online experiments, it is common to ramp treatment proportions up and down. For example, you might want to cautiously roll out to a small proportion of identifiers in order to get a first glance at their reaction. You might also have to roll back the experiment because of a bug, or you might want to change the treatment proportions to get more data for one treatment after you’ve discarded another one. This experiment framework allows for treatment proportion changes through a versioning feature.

A version is a mapping from buckets to treatments. Within a version’s life, each bucket will keep a single treatment. Across versions, we impose one fundamental restriction: a bucket can be assigned to at most one active treatment (an active treatment is any treatment that is not the control). This rule allows us to always fall back to control, and ensures some level of stability in the experiences that users can have.

Stability matters primarily because it leads to a better user experience. Users expect websites to be generally consistent: if they’ve seen a given version of a page or feature in the past, they expect it to behave the same the next time they visit. If their experience changes because they’re in an active treatment of an experiment, they have to make an effort to learn the new version. The least we can do is guarantee that users will have to learn at most one new experience during the test. A second, more analytic reason is that every user has arguably seen the control version at some point in the past, so the group contamination caused by switching between control and an active treatment is slightly less harmful for analysis purposes than switching between two active treatments.

The next table shows how our implementation would handle a sequence of versions. The trick is that we first create a static random bucket order for each experiment. Then we fill the reorganized buckets in order according to the proportions that we want.

At the analysis stage, to have a valid analysis, you should only analyze units that experienced the experiment within a given version. This way you can claim that during that time window they only saw one version, and during their whole life, they’ve seen at most one active version and the control experience.
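Here is a simplified sketch of the reordered-bucket trick (purely illustrative; the names and structure are assumptions, not Patreon’s implementation): shuffle the buckets once per experiment, then fill the shuffled order according to each version’s proportions, leaving the rest on control.

    import random

    NUM_BUCKETS = 1000

    # A static random bucket order, created once per experiment and stored
    bucket_order = list(range(NUM_BUCKETS))
    random.Random("experiment-seed").shuffle(bucket_order)

    def build_version(active_proportions):
        # Map each bucket to a treatment; unfilled buckets stay on control
        mapping = {bucket: "control" for bucket in range(NUM_BUCKETS)}
        cursor = 0
        for treatment, fraction in active_proportions.items():
            count = int(fraction * NUM_BUCKETS)
            for bucket in bucket_order[cursor:cursor + count]:
                mapping[bucket] = treatment
            cursor += count
        return mapping

    v1 = build_version({"A": 0.05})  # cautious 5% rollout of treatment A
    v2 = build_version({"A": 0.20})  # ramp A up to 20%; v1's A buckets stay in A

Because the bucket order is fixed, ramping a treatment up only converts buckets that were previously on control, which is exactly the “at most one active treatment per bucket” restriction described above (a full implementation would also have to enforce that rule when several active treatments change at once).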

Here’s a quick look at what rolling out a new version looks like in Patreon’s experiment dashboard:

Overriding identifiers

As the name suggests, the override feature allows admin users to force a specific identifier to always receive a specific treatment, completely ignoring the randomization process. Overrides were a highly requested feature among developers and product managers: they facilitate development, feature showcasing, and user research, because admins can override their own identifiers depending on what they want to QA or show. We implemented this by bypassing the bucket layer and directly assigning a treatment to the unit. This feature is also managed through Patreon’s experiment dashboard, as shown below.

Restricted bucket space

An additional advantage of the bucket layer is that we can easily block certain parts of the identifier space across experiments. By defining a bucket as blocked for a specific experiment, you guarantee that any identifier mapped to that bucket will get the control version of the experiment.

This makes it easier to have long-term holdout groups: you can just block certain buckets for multiple experiments and guarantee that they’ll see only control experiences. It can also be useful to deal with multiple conflicting experiments within a specific identifier space. Team A could be assigned to buckets 1 to 500 to experiment on their feature, while Team B could be assigned buckets 501 to 1,000.
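A minimal sketch of how blocked buckets could be enforced at assignment time (the blocked range and helper names are assumptions for illustration):

    import hashlib

    NUM_BUCKETS = 1000
    BLOCKED_BUCKETS = set(range(900, 1000))  # e.g. reserve 10% as a long-term holdout

    def variant_for(identifier, bucket_to_variant):
        bucket = int(hashlib.md5(identifier.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        if bucket in BLOCKED_BUCKETS:
            return "control"  # blocked buckets never receive an active treatment
        return bucket_to_variant.get(bucket, "control")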

How does the backend work?

The implementation of this experimentation framework consists of three elements:

  1. A MySQL-backed data model:

Some of the less obvious attributes in the data model (a rough sketch follows this list) are:

  • Registered_at, activated_at, and archived_at. Registration corresponds to the moment the experiment is defined, activation to the moment the first version is rolled out, and archiving to the moment the experiment will no longer be used.
  • Randomization_type defines the type of identifier that will be used for randomization during the experiment.
  • Available_bucket_space indicates buckets that are free to receive an active treatment; any bucket not considered in the available_bucket_space will be assigned the control variant.
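As a rough, hypothetical sketch (not Patreon’s actual schema), a similar MySQL-backed data model might look something like this in SQLAlchemy-style declarations; table and column names beyond the attributes listed above are assumptions:

    from sqlalchemy import Column, DateTime, Integer, JSON, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Experiment(Base):
        __tablename__ = "experiments"
        id = Column(Integer, primary_key=True)
        name = Column(String(255), unique=True, nullable=False)
        randomization_type = Column(String(64), nullable=False)  # e.g. "user_id", "device_id"
        available_bucket_space = Column(JSON)  # buckets open to active treatments
        registered_at = Column(DateTime, nullable=False)  # experiment defined
        activated_at = Column(DateTime)  # first version rolled out
        archived_at = Column(DateTime)  # experiment retired

    class ExperimentVersion(Base):
        __tablename__ = "experiment_versions"
        id = Column(Integer, primary_key=True)
        experiment_id = Column(Integer, nullable=False)
        bucket_to_variant = Column(JSON, nullable=False)  # mapping from bucket to treatment
        rolled_out_at = Column(DateTime, nullable=False)

Similar tables for variants and overrides would round out the model.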

2. A controller, which lets us:

  • Register the attributes of an experiment (and its variants)
  • Update the experiment attributes
  • Archive the experiment
  • Roll out a new version (versions can’t be edited or deleted)
  • Create, modify, or delete an override
  • Retrieve a variant assignment for an identifier by: i) verifying experiment status, ii) checking the experiment-identifier pair for overrides, iii) computing a bucket for the identifier using a hash function, iv) retrieving the latest bucket-variant mapping from the last experiment version, and v) returning the corresponding assignment (see the sketch below).

3. An API that allows on-demand retrieval of experiment information.

We expose the experiment and variant objects through a single experiment resource. We also expose the experiment version as its own experiment version resource. However, we don’t expose a resource for the overrides object. Instead we created an ephemeral resource called ExperimentAssignment, which acts as an interface to retrieve or modify assignments (whether overridden or natural).
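To tie these pieces together, here is a hypothetical sketch of the variant-retrieval flow from step 2 (the object shapes and names are assumptions, not Patreon’s actual controller):

    import hashlib

    NUM_BUCKETS = 1000

    def get_assignment(experiment, identifier, overrides, latest_version):
        # i) verify experiment status
        if experiment.activated_at is None or experiment.archived_at is not None:
            return None
        # ii) check the experiment-identifier pair for an override
        if identifier in overrides:
            return overrides[identifier]
        # iii) compute a bucket for the identifier using a hash function
        bucket = int(hashlib.md5(identifier.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        # iv) retrieve the latest bucket-variant mapping, and v) return the assignment
        return latest_version.bucket_to_variant.get(bucket, "control")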

Wrapping up

After reading this blog post, you should understand the core concepts required to implement an experimentation system, including what an assignment mechanism is, why hashing functions are so powerful, why it’s useful to separate the identifier space from the treatment space, why versioning and treatment overriding are required features, and how conflicting experiments and holdout groups can be handled. I also provided a sneak peek into Patreon’s experiment backend implementation and its experiment management dashboard. I hope this is a good starting point for anyone who is planning to build randomized experimentation infrastructure for their website!

Want to help us solve problems like this every day? Join us!
