Experimentation Platform in a Day

Dan Frank
Deliberate Data Science
6 min read · Oct 16, 2022

Over the past few years I've had the privilege of building out large-scale experimentation platforms at both Airbnb and Coinbase. As I've talked with Data teams at other companies, one consistent theme I keep hearing is that spinning up a new experimentation platform feels like a big investment. It certainly can be, but I want to dispel the myth that you need a ton to get started. In fact, you already have most of the ingredients, and in most cases there's no need to get fancy.

So let’s get to it and build an experimentation platform in a day.

Photo by Federico Respini on Unsplash

Configuration

First we'll need a way to configure our experiments, e.g. their variants and their sizes. As experiments control the behavior of your product, they can be thought of as just another piece of (dynamic) configuration. Likely you already have something for this purpose! But let's say you don't; given we only have a day, we'll just hardcode it.
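
To make the rest of the walkthrough concrete, here's a minimal sketch of what a hardcoded configuration might look like, assuming a Python service; the experiment name and variant split are purely illustrative.

```python
# Hypothetical hardcoded experiment registry: experiment name -> variant traffic shares.
# Shares for each experiment should sum to 1.0.
EXPERIMENTS = {
    "new_checkout_flow": {
        "variants": {"control": 0.5, "treatment": 0.5},
    },
}
```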

Delivery

Experiment delivery is the process of getting the right variant to our experiment subjects (e.g. users).

Photo by Leone Venter on Unsplash

Randomization

The simplest design we'll support on day one is a Randomized Controlled Trial (RCT). To avoid the complexity, performance, and infrastructure costs of maintaining state, we'll implement this via a stateless hash function, which gives us both the randomness and the consistency we need for an RCT.
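
Here's a rough Python sketch of what that stateless assignment could look like. The hashing scheme and the `_log_exposure` helper (covered in the next section) are illustrative stand-ins, not the original code.

```python
import hashlib

def get_variant(experiment: str, subject_id: str) -> str:
    """Deterministically map a subject to a variant for a given experiment."""
    # Hashing experiment + subject gives consistency (same subject, same variant)
    # and independence across experiments.
    digest = hashlib.sha256(f"{experiment}:{subject_id}".encode()).hexdigest()
    bucket = int(digest, 16) / float(1 << 256)  # uniform in [0, 1)

    cumulative = 0.0
    for variant, share in EXPERIMENTS[experiment]["variants"].items():
        cumulative += share
        if bucket < cumulative:
            _log_exposure(experiment, subject_id, variant)
            return variant
    return "control"  # guard against floating-point rounding at the boundary
```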

Exposure Logging

As hinted at above with _log_exposure, we'll need to keep track of "who saw what and when" in our experiment, i.e. an exposure log. These logs need to be durable and stored for analytic use so that we can measure whatever is relevant for a given experiment. I'm going to assume that you have some way of transporting such logs into a Data Warehouse (e.g. BigQuery|Snowflake|Redshift via Kafka|Kinesis|PubSub, etc.). If you don't, you may need to go down a level in the Data hierarchy of needs.
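
For illustration, `_log_exposure` might be as simple as emitting a structured event to whatever transport you already have; the field names here are just one possible schema.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("exposures")

def _log_exposure(experiment: str, subject_id: str, variant: str) -> None:
    """Record 'who saw what and when' so it can land in the Data Warehouse."""
    event = {
        "experiment": experiment,
        "subject_id": subject_id,
        "variant": variant,
        "exposed_at": datetime.now(timezone.utc).isoformat(),
    }
    # Illustrative transport: in practice this would go to Kafka/Kinesis/PubSub
    # or an event-logging pipeline that lands in the warehouse.
    logger.info(json.dumps(event))
```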

But at the end of the day, this log would land as a table — let's call it xp.exposures — that looks like:
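
(Columns and sample values here are illustrative.)

```
experiment        | subject_id | variant   | exposed_at
------------------+------------+-----------+--------------------
new_checkout_flow | u_123      | treatment | 2022-10-01 12:00:05
new_checkout_flow | u_456      | control   | 2022-10-01 12:00:09
```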

Implementation

Now we have all the pieces to actually modify production code with our experiment and a small switch:
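
As a sketch, continuing the hypothetical checkout example (the render functions and user object are placeholders for your existing code paths, not from the original post):

```python
def render_checkout(user):
    # The experiment "switch": behavior is chosen by the assigned variant,
    # and calling get_variant also logs the exposure.
    if get_variant("new_checkout_flow", user.id) == "treatment":
        return render_new_checkout(user)   # placeholder for the new code path
    return render_old_checkout(user)       # placeholder for the existing code path
```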

Analysis

At this point we've implemented an RCT, and we know "who saw what and when". Good news: our production engineering pieces are done! Next we'll conduct the simplest possible analysis: t-tests from aggregated data (i.e. sufficient statistics). We'll implement it in SQL for clarity and scalability, since this means the entire computational pipeline never needs to leave our Data Warehouse.

Photo by Isaac Smith on Unsplash

Measurement

In addition to exposures, we'll assume that other core business events have landed as fact tables. We can use this data to form the measurements and outcomes of our experiment, e.g. fact.purchases, which might look like:
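
(Again, the schema and values are illustrative.)

```
subject_id | purchased_at        | revenue
-----------+---------------------+--------
u_123      | 2022-10-02 08:15:00 |   29.99
u_456      | 2022-10-03 19:42:00 |   12.50
```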

Assignment

Since our randomization is via stateless hashing, we may have multiple exposures per subject. So first we need to deduplicate our exposure log so that we have one row per subject, experiment. Additionally, we'll keep track of the first exposure timestamp for attribution (next step):
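
A sketch in SQL, with xp.assignments as an assumed name for the deduplicated table:

```sql
-- One row per (subject, experiment); keep the first exposure for attribution.
-- Grouping by variant is safe because the stateless hash makes it constant per subject.
CREATE TABLE xp.assignments AS
SELECT
  experiment,
  subject_id,
  variant,
  MIN(exposed_at) AS first_exposed_at
FROM xp.exposures
GROUP BY experiment, subject_id, variant;
```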

Attribution

Here we attribute outcome events to subjects so long as they occur after the subject was first exposed to an experiment. Note that this is a “fanout join” as a single purchase may be attributed to multiple experiments.
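
Something along these lines, where xp.attributed_purchases is an assumed intermediate table:

```sql
-- Fanout join: one purchase can match several experiments the subject was exposed to.
CREATE TABLE xp.attributed_purchases AS
SELECT
  a.experiment,
  a.variant,
  a.subject_id,
  p.revenue
FROM xp.assignments AS a
JOIN fact.purchases AS p
  ON p.subject_id = a.subject_id
 AND p.purchased_at >= a.first_exposed_at;
```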

Subject Level Fact Aggregation

Next we aggregate to the subject level, once again with a single row per subject, experiment:
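
One possible version, keeping zero-purchase subjects via a left join:

```sql
-- One row per (subject, experiment): total attributed revenue (0 if no purchases).
CREATE TABLE xp.subject_outcomes AS
SELECT
  a.experiment,
  a.variant,
  a.subject_id,
  COALESCE(SUM(p.revenue), 0) AS revenue
FROM xp.assignments AS a
LEFT JOIN xp.attributed_purchases AS p
  ON p.experiment = a.experiment
 AND p.subject_id = a.subject_id
GROUP BY a.experiment, a.variant, a.subject_id;
```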

Summary Statistics

From these per-subject, per-experiment outcomes we can now compute summary statistics:
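
For a t-test we only need counts, means, and standard deviations:

```sql
-- Sufficient statistics for a t-test, one row per (experiment, variant).
CREATE TABLE xp.summary_stats AS
SELECT
  experiment,
  variant,
  COUNT(*)        AS n,
  AVG(revenue)    AS mean,
  STDDEV(revenue) AS stddev
FROM xp.subject_outcomes
GROUP BY experiment, variant;
```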

Hypothesis Testing

Finally, we can perform our t-test by comparing our treatment stats and control stats:
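
A sketch of the comparison query, assuming a TTEST_IND_FROM_STATS UDF is available (see the note below if it isn't):

```sql
-- Self-join the summary stats to compare treatment vs. control within each experiment.
-- TTEST_IND_FROM_STATS is a UDF you would need to define in your warehouse.
SELECT
  t.experiment,
  t.mean - c.mean AS effect,
  TTEST_IND_FROM_STATS(t.mean, t.stddev, t.n,
                       c.mean, c.stddev, c.n) AS p_value
FROM xp.summary_stats AS t
JOIN xp.summary_stats AS c
  ON t.experiment = c.experiment
WHERE t.variant = 'treatment'
  AND c.variant = 'control';
```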

Ideally, your Data Warehouse will let you create a UDF that implements TTEST_IND_FROM_STATS, but if necessary this step can be done after the data has been extracted, since it is extremely compressed with only one row per experiment, variant. The scipy function of the same name can be used as a reference.
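
If you do the last step outside the warehouse, a sketch using scipy might look like this (the `treatment` and `control` dicts stand in for the two rows pulled from xp.summary_stats; their values are illustrative):

```python
from scipy.stats import ttest_ind_from_stats

# One row per (experiment, variant) is tiny, so extracting it from the warehouse is cheap.
treatment = {"mean": 31.2, "stddev": 12.4, "n": 10_000}  # illustrative values
control   = {"mean": 29.8, "stddev": 12.1, "n": 10_000}

t_stat, p_value = ttest_ind_from_stats(
    mean1=treatment["mean"], std1=treatment["stddev"], nobs1=treatment["n"],
    mean2=control["mean"],   std2=control["stddev"],   nobs2=control["n"],
    equal_var=False,  # Welch's t-test
)
print(t_stat, p_value)
```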

Presentation

Again we’ve only got a day, so the presentation is going to suffer. The above queries can be run manually and the results presented from a notebook or slide deck.

Where do we go from here

Phew! That was a lot for a single day. But all the pieces are there and we CAN run experiments. Of course there's more to do… Here are some of the biggest limitations we're left with after day one.

Photo by Shane on Unsplash

Configuration

Hardcoded experiments are difficult to change, so we'll want to integrate with whatever drives dynamic configuration in your organization more broadly, so that anyone can add and update experiments as needed without code changes.

Delivery

Additional targeting mechanisms will be necessary to help narrow the eligible set of users for a given experiment. As of day one, the eligible set is simply “whoever causes this line of code to run”. By pairing our experiments with a targeting system, we can run experiments on users in a specific country or on a specific platform.

Probably the easiest example of additional targeting is a slow rollout. By layering on another hash function with a "rollout percentage", we can more carefully deliver experiments to a smaller population and monitor performance before more of our user base begins seeing the experiment.
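
A sketch of how that extra layer could work, reusing the hashing idea from earlier (the salt and function name are illustrative):

```python
import hashlib

def is_rolled_out(experiment: str, subject_id: str, rollout_pct: float) -> bool:
    """Gate delivery to a fraction of subjects, independently of variant assignment."""
    # A distinct salt keeps rollout buckets uncorrelated with variant buckets.
    digest = hashlib.sha256(f"rollout:{experiment}:{subject_id}".encode()).hexdigest()
    return int(digest, 16) / float(1 << 256) < rollout_pct
```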

Analysis

The queries above are agnostic to "purchases" and as such are eminently templatizable for other outcomes. By building a semantic layer, we can drive the rendering and execution of these queries for any outcome we care about in a consistent way, and also automate runs via a scheduler, e.g. Airflow.

We are also doing the baseline statistical analysis, and there are many updates to statistical methodology that could be implemented (variance reduction, sequential testing, etc.)

Presentation

Experimentation is a technical subject, but its customers should not be expected to have deep knowledge of statistics. Therefore we need to present these results in a way that is accessible to all decision makers. Some companies/tools (e.g. hex.tech or streamlit) are trying to turn notebooks into usable apps, which would be a good intermediate step, but in the long term a fully developed UI will really help with broader adoption of experimentation practices.

In Closing

I hope this description shows that we need not wait weeks or months to get started with experimentation. Most of the pieces are likely already present. By starting simple and delivering fast, we can prove the value of additional investment in experimentation platforms.

On day two and beyond, happy experimenting!
