Building a System for Online Experiments From Scratch

How to keep sanity when setting up your experimentation methodology at a startup.

Anna Pershukova
Geek Culture
7 min read · May 13, 2021



A/B testing and data-driven product development are all over the tech industry and have become a standard decision-making tool in many companies. It’s a methodology that can provide you with evidence to support your ideas, or guard your systems if something goes wrong with an innovation.

While the general concept of controlled experiments is quite simple, and plenty of material about it is available, in practice there are many definitions and decisions to be made to ensure experiments work effectively. One of the challenges I had when I started working on an online experimentation methodology at a startup was defining the minimal set of functionality needed to operate and scale experiments, while justifying its priority to the engineering team.

In this post, I won’t go deep into the nuances of experiment methodology. Instead, I’ll share four basic challenges that need to be solved when designing, running, and analyzing experiments, and suggest possible solutions.

Challenge 1: defining experimentation units

Before launching an experiment, you’ll need to define your units of experimentation and make sure you can identify them over a reasonable period of time.

Why is it important?

Accurate identification will affect your ability to serve the same variant to the same person throughout the experiment, connect experiment results to the experience served, and calculate longer-term outcomes correctly.

What’s challenging about it?

In many cases, your ideal experimentation unit is the person using your technology. However, that’s not always something you can measure directly. In practice, you can identify unique app downloads or devices, and write and read cookies. It’s important to be aware that in the real world, different people can use the same device, the same person can download and uninstall an app several times, and users can delete your cookies.

What can you do?

Different combinations of product, engineering, and algorithmic solutions can improve unit quality, and the right mix is usually specific to your domain. But here are some examples of different approaches to this problem:

  • Product: encourage users to create and log in to individual profiles, even if they connect from different devices,
  • Engineering: identify installations from the same device by saving a device uuid (see the sketch after this list),
  • Algorithm: calculate the probability that two visits belong to the same person based on a combination of different parameters, e.g. mouse movements, frequently visited pages, IP address, etc.
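For the engineering approach, here is a minimal Python sketch of generating a device identifier once and reusing it on every subsequent launch; the file path and function name are illustrative placeholders for whatever local storage your platform provides:

```python
import uuid
from pathlib import Path

# Hypothetical location where the app keeps its local state.
DEVICE_ID_FILE = Path("device_id.txt")

def get_device_id() -> str:
    """Return a stable device identifier, creating it on first launch."""
    if DEVICE_ID_FILE.exists():
        return DEVICE_ID_FILE.read_text().strip()
    device_id = str(uuid.uuid4())  # generated once, reused on later launches
    DEVICE_ID_FILE.write_text(device_id)
    return device_id
```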

Challenge 2: randomly allocating to experiment variants

Once you’re aligned on units, you can start allocating users to the experiment variants.

Why is it important?

You’ll need a solution that ensures allocation is random and supports the testing methodology. You’ll also want to save the allocated variant for each uuid, both to analyze results and to serve the same experience if the user comes back to the system again.

What’s challenging about it?

Depending on the read/write vs. compute cost in your system, you might prefer accessing the database in real time to get the variant, or calculating it on the fly in a pseudo-random manner.

What can you do?

Solution 1. Let’s say we have a uuid that we want to allocate to one of two possible experiment variants. The simple solution would be to generate a random number, 0 or 1, and serve the control variant to those who got 0 and the treatment variant to those who got 1.
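As an illustration, here’s a minimal Python sketch of this approach, assuming a simple in-memory dictionary in place of a real database:

```python
import random

# Hypothetical in-memory store; in practice this would be a database table.
assignments = {}

def allocate(user_uuid: str) -> str:
    """Randomly assign a uuid to a variant once, then always return the saved value."""
    if user_uuid not in assignments:
        assignments[user_uuid] = random.choice(["control", "treatment"])
    return assignments[user_uuid]
```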

Solution 2. Another approach is to generate the variant “on the fly” from unique user parameters, without accessing previously assigned values. For example, we can hash the uuid and allocate users whose hash ends in an even digit to control, and users whose hash ends in an odd digit to treatment.
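Here’s a minimal sketch of that hash-based allocation, assuming the uuid is available as a string; the specific hash function is just one possible choice:

```python
import hashlib

def allocate_by_hash(user_uuid: str) -> str:
    """Deterministically map a uuid to a variant, with no stored state."""
    digest = hashlib.sha256(user_uuid.encode()).hexdigest()
    last_digit = int(digest[-1], 16)  # last hex digit of the hash
    return "control" if last_digit % 2 == 0 else "treatment"
```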

Important to keep in mind! Using the last digit of the raw uuid, without hashing, is not a best practice for allocation. You never really know what kind of pattern might be hiding in your data. It can potentially introduce bias, and random assignment is a prerequisite for a reliable controlled experiment. So it’s always better to be on the safe side and randomize properly.

Challenge 3: running more experiments faster

Once you’ve got your first experiment up and running, it’s time to think about scale.

Why is it important?

If you’re investing all this effort to build foundations for data-driven decision-making, you’ll need to build for scale. You’re not going through all this to run a single experiment, right?

What’s challenging about it?

Running one experiment at a time limits the number of ideas you can test in a given period. In practice, one experiment can take anywhere from several weeks to several months, which caps your number of experiment-driven decisions at roughly 6 to 24 per year. That’s not enough if you want to move fast and evaluate all your ideas with data.

Waiting for the results of one experiment before starting a new one won’t allow you to build a scalable data-driven decision-making methodology.

Also, it’s important to remember that you’ll need a minimum number of users in each experiment to draw relevant conclusions.

What can you do?

Solution 1. You can run different experiments on different groups of users. This helps if the ideas you want to evaluate are aimed at different segments of users. It may sound counterintuitive, but if you want to test several ideas on the same segment of users, this approach can actually slow you down even more.

To illustrate this, let’s say we want to run two experiments simultaneously, we have 100 relevant users per week, and each experiment needs a minimum of 200 users. If we allocate 50 users per week to the first experiment and the other 50 to the second, we won’t see results for either experiment earlier than in 4 weeks. If we run one experiment after another instead, each experiment gets the full 100 users per week, and we can make at least one decision after 2 weeks.

Solution 2. Run independent experiments on the same segment where users can participate in several experiments at a time. To make this work you’ll need to randomize independently for each experiment.

If we take the randomization options from challenge 2: in the first case, you’ll need to generate a random number and store the user’s variant for each experiment you’re running.

As a result, you’ll get a record of allocations that you’ll need to access whenever you serve this experience.

In the second case, you can add the experiment identifier to the user identifier and hash them together to compute a consistent variant every time you serve the experience.

Example of user experiment hashing
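Here is a minimal Python sketch of that idea, building on the hashing approach from challenge 2; the experiment names and uuid below are purely illustrative:

```python
import hashlib

def allocate(user_uuid: str, experiment_id: str) -> str:
    """Hash the user and experiment identifiers together so that
    randomization is independent across experiments."""
    key = f"{user_uuid}:{experiment_id}"
    digest = hashlib.sha256(key.encode()).hexdigest()
    return "control" if int(digest[-1], 16) % 2 == 0 else "treatment"

# The same user can land in different variants of different experiments:
user = "3f2b9c1e-5a77-4d2e-9c1a-7b8f0e6d4a21"
print(allocate(user, "new_onboarding"))
print(allocate(user, "checkout_redesign"))
```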

Challenge 4: identifying the segments with the best chance to show an effect

Why is it important?

Although it’s not uncommon for an experiment to show no effect or a very small one, calculating outcomes over users who had limited or no exposure to the treatment experience can lead you to falsely claim that the experiment failed.

What’s challenging about it?

Sometimes the idea you want to test impacts only users who reached a milestone further down the funnel, such as not churning within a week or arriving at the checkout page. This can be a problem if you allocate users to variants immediately after they join, so you’ll need a way to identify affected, or potentially affected (in the control group), users.

For example, say you want to experiment with a promotion sent a week after a user first interacted with your system. If your first-week churn is 50%, only half of the users will even have the potential to see your message.

What can you do?

Try narrowing down the allocation to the most granular characteristics you can identify in both treatment and control groups.

Let’s say we have 100 users joining per week, and we’ve randomly assigned 50 users to treatment and 50 to control. If 5 users completed an order in the treatment group and 2 in the control group, the initial effect of our promotion measured over all joined users would be 5/50 - 2/50 = 10% - 4% = 6 percentage points.

Comparing experiment results only for users who were retained for 1 week will give you a more accurate picture of the effect. If we account for the users who churned and never had the potential to receive our promotion (assuming the 50% churn applies to both groups), the effect becomes 5/25 - 2/25 = 20% - 8% = 12 percentage points.
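To make the arithmetic concrete, here’s a small Python sketch with the numbers from this example; the 50% churn is an assumption applied equally to both groups:

```python
# Numbers from the example above.
allocated = {"treatment": 50, "control": 50}  # users randomly assigned per group
orders    = {"treatment": 5,  "control": 2}   # completed orders per group
churn     = 0.5                               # assumed first-week churn, same in both groups

def conversion(group: str, only_retained: bool) -> float:
    denominator = allocated[group] * (1 - churn) if only_retained else allocated[group]
    return orders[group] / denominator

# Diluted effect: measured over everyone who joined, including users who never saw the promotion.
effect_all = conversion("treatment", False) - conversion("control", False)
# Effect measured only over users still around after a week, who could see the promotion.
effect_retained = conversion("treatment", True) - conversion("control", True)

print(f"Effect among all joined users: {effect_all:.0%}")            # 6%
print(f"Effect among 1-week-retained users: {effect_retained:.0%}")  # 12%
```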

Important to keep in mind! This logic can be tricky. For example, you might argue that only people who opened the message are really affected. But in the control variant, you won’t be able to identify a similar group of users, because you never sent them the promotion.

A similar problem occurs if you try to isolate the segment of users most exposed to the experiment experience. You might think that comparing “heavy” users would give you better insight into the effect you generate. But if one of your experiences converts users into “heavy” users more effectively, such an analysis will be biased. If you want to use this strategy, you’ll need a way to define “heavy” users before they enter the experiment.

So make sure you select a criterion that you can identify in all groups and that doesn’t create bias between treatment and control.

Conclusion

Designing, running, and analyzing online experiments can be nuanced. I believe this post will help you facilitate a discussion with your engineering team, plan your experiment infrastructure, and get valuable results.

What other challenges did you face when setting up your online experimentation system and methodology? Let me know in the comments!

And if you want to learn more about online controlled experiments at scale, check out Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
