A Framework to Design Randomized Experiments (The Right Way)

Part 1 of our A/B Testing Series

Manuel Andere
art/work - behind the scenes at Patreon
10 min read · Mar 8, 2017


Patreon is a small company with around eighty employees, which means that our five-person data science team does a broad range of activities: business analysis (e.g. defining and tracking company KPIs), data requests (e.g. building dashboards for teams), data modeling (e.g. recommendation systems and predictive models), and product evaluation (e.g. designing and analyzing randomized experiments).

Out of these four, we probably spend the most time on product evaluation, so standardizing the way we think about experiments was a must to ensure Patreon could make better product decisions faster.

This post gives an overview of the main concepts that compose our framework (mainly an adaptation of the Potential Outcomes framework that I learned during my graduate program in Statistics). It will be most useful for data analysts or data scientists who want a clearer picture of what randomized experiments (a.k.a. A/B tests) are, and who would appreciate a reusable theoretical framework to design, implement, and analyze experiments in a software/website context.

1. What is causal inference and experiment design?

1.1 What is causal inference?

Imagine you wake up with a headache today, you take some medicine, and within one hour your headache goes away. Did the medicine cause you to feel better? The answer is not obvious, because you don’t know what would have happened if you had not taken the medicine.

That is what causal inference — and ultimately, experiment design — is all about: making statements about events that might cause other events, and estimating the uncertainty associated with those claims.

1.2 Why do randomized experiments help identify causality?

Continuing with the example from before, imagine that, every single time you got a headache in the past 5 years, before taking the medicine, you also slept 15 minutes longer than usual, you drank some extra coffee, and you didn’t receive a call from Betsy: that annoying friend who calls you every day and just talks and talks. How can you tell if it was the medicine, the sleep, the coffee, or your lack of Betsy that relieved your headache? The answer is: you can’t.

How about if you do the following: every time you get a headache, flip a coin to decide whether or not you’ll take the medicine. Note that you shouldn’t make any other decision based on the result of the coin flip. This is where the magic of randomness comes in: if you always flip a coin before deciding to take the medicine, and you make no other decision based on the result, then, in the long run, the group of cases where you took the medicine and the group of cases where you didn’t will be similar in every aspect except for the presence or absence of the medicine! Therefore, it’s “reasonable” to assume that any difference between the groups is an effect of the medicine.

2. The framework: concepts and definitions

The framework that we use at Patreon to think about randomized experiments is surprisingly simple.

You have a target observation unit (in the example above, it’s you), and you want to know the effect that a certain set of treatments (e.g., taking the medicine or not) would have on you. Then you assume that you (the observation unit), after taking the medicine or not (each of the treatments), could end up in one of a set of potential states, which we’ll assume can be well represented by a response metric.

Ideally, you’d like to observe the response metric for your target observation unit after applying all treatments, so that you can claim that any differences were caused by the treatments, but unfortunately, that is impossible, because as Heraclitus said: “No man ever steps in the same river twice, for it’s not the same river and he’s not the same man.” So, what’s the next best thing?

Get a group of observation units that are similar to (or to sound fancier, representative of) your target observation unit, assign them the treatments at random, measure the response metric for each observation unit, compute some summary (usually the average) of the response metric for each treatment group, and compare the summaries, assuming that this comparison represents well the original difference that you wanted to understand for the target observation unit.

To summarize, we talked about three main concepts:

  • observation units: the things that you randomly assign to treatments
  • treatments: all the different experiences that the observation units can have
  • response metric: the measurement that you use to represent the state of each observation unit after they experience a treatment

Here’s a diagram that might help visualize the concepts.

Observation units, treatments and response metric diagram.

The key thing to remember is that for each observation unit, there are different potential states that it could end up in after taking each treatment, but we will only ever be able to observe one of those potential states, represented by a single measurement of the response metric.
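To make the idea of potential states concrete, here is a minimal simulation (an illustrative sketch, not anything from our stack): each unit has two potential outcomes, we only ever observe the one corresponding to the randomly assigned treatment, and the difference in observed group averages still recovers the true average effect.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical potential outcomes for every observation unit: the response it
# would have without the treatment and the response it would have with it.
y_without = rng.normal(loc=10.0, scale=2.0, size=n)
y_with = y_without + 1.5  # the true effect of the treatment is +1.5

# In reality we only ever observe one of the two potential states.
# Randomization decides which one each unit ends up in.
treated = rng.random(n) < 0.5
observed = np.where(treated, y_with, y_without)

# The difference in observed group averages recovers the true effect (~1.5).
print(observed[treated].mean() - observed[~treated].mean())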

3. Randomized experiments in websites

3.1 Why do software (and website) companies use experimentation?

Randomized experiments are not new, and several industries have used them for a long time. For example, the pharmaceutical industry uses experiments to demonstrate to the Food and Drug Administration that its medicines work.

So, why is A/B testing such a hot topic in Silicon Valley if several industries have done it for many years? It could be because Silicon Valley folks (like me, unfortunately) tend to claim that everything they do is new and interesting. More likely, though, it is because designing experiments and capturing data in software companies, and particularly on websites, is an easy and fast way to learn how users engage with the product and to iterate through features toward a better product.

Next, I’ll describe how to map each of the concepts from section 2 onto a website experiment, where the goal is, as mentioned earlier, to understand how users react to product changes. To make things concrete, we’ll walk through each step with a (semi) real Patreon example in which we wanted to understand whether a new creator page layout (on the right) was better at getting users to become Patreon members.

Old (left) and new (right) creator page versions.

3.2 Identifying observation units

We use two concepts to define an observation unit: identifiability and eligibility.

3.2.1 Identifiability

You should start by finding a way to identify your users. If the experiment is visible to logged-in users only, you can probably identify your users with a user_id. However, if the experiment is also visible to logged-out users, you probably want to do it by setting a “permanent” unique user identifier (uuid) in the visitors’ cookies. This will allow you to approximately track the behavior of a specific person as they browse through the site.

In our experiment, logged-out users can see any creator page, so we used a uuid that we put in the browser’s cookies whenever someone interacts with any part of our site. These uuids did not have a one-to-one relationship with users, because cookies vary by browser and device, but we made the judgment call that this identifier was good enough.
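As a rough sketch of what setting a “permanent” uuid in the visitors’ cookies can look like (Flask, the route, and the cookie name here are illustrative assumptions, not a description of Patreon’s actual stack):

import uuid
from flask import Flask, make_response, request

app = Flask(__name__)

@app.route("/<creator>")
def creator_page(creator):
    # Reuse the visitor's identifier if the cookie is already set; otherwise mint one.
    visitor_id = request.cookies.get("patreon_uuid") or str(uuid.uuid4())
    resp = make_response(f"creator page for {creator}")
    # "Permanent" in practice means a long-lived cookie (two years here).
    resp.set_cookie("patreon_uuid", visitor_id, max_age=2 * 365 * 24 * 3600)
    return resp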

3.2.2 Eligibility

The key here is to clearly identify the last thing that all of the users in all of the treatment groups will have in common immediately before they get forked into different experiences (treatments). At Patreon, we’ve found that logging an event for this “last experience” is a good practice, because it allows you to very explicitly say who is participating in the experiment, and therefore who you can extrapolate your results to. We call this the eligibility event.

In our experiment, the eligibility event was receiving a request to load a creator page. So any uuid that requested a creator page during a given week could become an observation unit.

Note that the eligibility event could be something more complicated, like spending 10 seconds within a given page, receiving a specific email, running a specific query, etc.

Also, we could have used something much more general for our eligibility event, like being active anywhere on the site during that week. However, the ideal is to have the eligibility event as close to the real forking point as possible to avoid diluting our observation units. The further the eligibility event is from the real forking point, the less sure you will be that observation units in distinct groups actually had different experiences.

Often there are additional characteristics that you want to consider before making a user eligible to participate in the experiment. At Patreon, we call these eligibility criteria, and we usually log them as properties of the eligibility event.

In our experiment, for example, we wanted to understand the behavior of people in the US only, so, at the moment the creator page request was made, we identified where the user was located and logged a location property on the eligibility event.

Summarizing, in every experiment, you should have clarity on the following:

  • Identifiability
  • Eligibility (eligibility event within a certain time window and eligibility criteria)

3.3 Applying the treatments

The way we assign treatments to observation units at Patreon is by mapping the identifier to a random number between 0 and 1 (you can use a hash function for this to ensure that the number is effectively random but always the same for a given identifier). Then you can redirect the user to a different experience based on the number they get. The key thing to notice here is that you should always randomly assign the treatment at the observation unit level, through the identifier that you picked.

To make analysis easy, we also log the treatment that the user gets as a property of the eligibility event.

In our example, the moment we got a request to load a creator page from within the US, we hashed the uuid to map it to a number between 0 and 1; when the result was above 0.5 we showed the new layout of the creator page, otherwise we showed the old version.
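Here is a minimal sketch of this kind of deterministic assignment (the hashing scheme, the experiment-name salt, and the treatment names are illustrative assumptions, not our exact implementation):

import hashlib

def assign_treatment(identifier: str, experiment: str = "creator_page_layout") -> str:
    # Salt the identifier with the experiment name so that different experiments
    # get independent assignments, then map the hash to a number in [0, 1).
    digest = hashlib.sha256(f"{experiment}:{identifier}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)
    # 50/50 split: above 0.5 sees the new layout, otherwise the old one.
    return "new_creator_page" if bucket > 0.5 else "old_creator_page"

print(assign_treatment("0f3e39a8-ff84-11e6-bc64-92361f002671"))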

To be very specific, the eligibility event you log should look something like this:

{
  'event': 'request_creator_page',
  'uuid': '0f3e39a8-ff84-11e6-bc64-92361f002671',
  'event_time': '2016-11-07 10:04:03',
  'event_properties': {
    'location': 'US',
    'treatment': 'new_creator_page'
  }
}

3.4 Measuring a response metric

If you recall, the response metric is responsible for representing the state of the observation unit after the treatment has been applied. Therefore, you should always be able to measure the response metric for each observation unit. If you can’t measure it for all observation units, then something is not defined correctly.

In our example, we checked whether we had a “Sign-up” event associated with the uuid within one day after we saw its first creator page load event. If so, we assigned a one to that observation unit; otherwise we assigned a zero.
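As a sketch of how this binary response metric can be built (the tables, column names, and example values below are made up for illustration), assuming one row per first creator page load and one row per sign-up:

import pandas as pd

# Made-up example data: the first creator page load per uuid (the eligibility
# event) and the sign-up events we observed.
loads = pd.DataFrame({
    "uuid": ["a", "b", "c"],
    "event_time": pd.to_datetime(["2016-11-07 10:04", "2016-11-07 11:00", "2016-11-08 09:30"]),
})
signups = pd.DataFrame({
    "uuid": ["a", "c"],
    "signup_time": pd.to_datetime(["2016-11-07 18:00", "2016-11-10 12:00"]),
})

# Attach the earliest sign-up (if any) to each uuid.
merged = loads.merge(
    signups.groupby("uuid", as_index=False)["signup_time"].min(),
    on="uuid", how="left")

# Response is 1 if a sign-up happened within one day after the page load, else 0.
delta = merged["signup_time"] - merged["event_time"]
merged["response"] = ((delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(days=1))).astype(int)
print(merged[["uuid", "response"]])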

3.5 Analyzing the results

The final data set for analysis should have one line per observation unit (represented by the identifier), the treatment that the observation unit received, and the response metric that you measured for it.

Analysis data frame for an A/B test, with uuid as the observation unit identifier, two treatments, and a binary response metric.

To make the comparison across groups, you will probably want to summarize the responses with some aggregation, like the average, median, or max. In theory you can go crazy with this, but the crazier you go, the harder it will be to analyze correctly. The most typical analysis here is to use your favorite library (e.g., pandas or plyr) to compute the average and variance of the response metric for every treatment group, construct confidence intervals for each mean, and see if the confidence intervals overlap. Or, even better, you can use a Welch t-test to assess the difference in means between the two groups.
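Here is a minimal pandas/SciPy sketch of that comparison (the data is simulated and the column and treatment names follow the earlier examples, so treat it as an illustration rather than our exact pipeline):

import numpy as np
import pandas as pd
from scipy import stats

# Simulated analysis frame: one row per observation unit (uuid), its treatment,
# and a binary response (signed up within a day, or not).
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "uuid": [f"uuid_{i}" for i in range(n)],
    "treatment": rng.choice(["old_creator_page", "new_creator_page"], size=n),
    "response": rng.binomial(1, 0.05, size=n),
})

# Mean, variance, and a normal-approximation 95% confidence interval per group.
summary = df.groupby("treatment")["response"].agg(["mean", "var", "count"])
half_width = 1.96 * np.sqrt(summary["var"] / summary["count"])
summary["ci_low"] = summary["mean"] - half_width
summary["ci_high"] = summary["mean"] + half_width
print(summary)

# Welch t-test for the difference in means between the two groups.
old = df.loc[df["treatment"] == "old_creator_page", "response"]
new = df.loc[df["treatment"] == "new_creator_page", "response"]
t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")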

3.6 Some important topics we didn’t talk about

We did not talk about analysis in a very formal way; we did not cover the analysis tools that we use; we did not discuss the technical aspects of how we randomize our users or how we log the events; we did not mention anything about making sure that the implementation of the experiment is correct (quality assurance); and we did not talk about how to size your experiment (or determine its run time). We expect to cover some of these topics in other posts of our A/B testing series, so keep an eye out for them!

4. Wrap up

Randomized experiments (a.k.a. A/B tests) are a powerful tool that allows us to make statements about causality. Tech companies use them heavily because gathering data is relatively easy, and experiments let them quickly and objectively learn how users use their products and iterate based on what they observe. The framework we use at Patreon is composed of three main concepts: observation units, treatments, and response metrics. We briefly discussed what each one is, and how to identify and use them in a website experiment context. We hope you enjoyed reading this post!

Reference: As mentioned before, this framework is based on the Potential Outcomes framework, also known as Rubin’s Causal Model. You can find a more formal discussion of it in the book Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction by Guido Imbens and Don Rubin.
