A quick business guide to randomized experiments

An intuitive step-by-step discussion on experimental design for business professionals

Robson Tigre
7 min read · Nov 6, 2023

🔍 Experiments aren’t just for data scientists; for business professionals, understanding experimental design is crucial to unlocking data-driven insights. Here’s a post on designing randomized experiments in a business setting, tailored for a non-technical audience.

Formulate your hypothesis as a sentence

Experiments are the backbone of data-driven decision-making. They serve to answer critical business questions, typically framed as hypotheses. For instance, consider a delivery platform (like Rappi, Uber Eats, or iFood) evaluating user behavior:

  • Hypothesis: Does getting a client to start doing groceries via the app increase the overall value and frequency of their subsequent purchases?

This question — which we could also write as a statement — stems from the notion that grocery shopping can create a routine, potentially spilling over into other categories. For instance, after doing groceries through the app I may find it convenient to also start ordering more food delivery and drugstore items.

Yet, it’s crucial to differentiate a hypothesis from mere intuition. So support your hypothesis with data — be it internal analytics or market surveys — to avoid falling into the HiPPO trap (Highest Paid Person’s Opinion).


Moreover, it’s vital to ensure that your hypothesis is actionable and relevant to the business strategy. 💡 It may not make sense to test the effectiveness of “levers” if we are not able to push them in real conditions.

Identify the metrics you are interested in and design the intervention that may alter them

Your hypothesis should dictate the metrics you track. If you’re testing whether grocery shopping influences other purchasing habits, your intervention might involve targeted discounts via push notifications. Remember to select metrics that are both measurable and directly linked to your hypothesis, such as the frequency of purchases or average transaction value excluding groceries.

The method used to encourage “doing groceries through the app” should affect only the variable of interest, so that the causal relationship remains clear — we say it is designed to exogenously alter the variable of interest. One common practice among digital companies is to send push notifications with offers or discounts 💬💸.

In this example, we want to induce “doing groceries” through the app in the first stage, hoping to observe that this action increases the subsequent frequency and value of purchases in the second stage 🛍️. This is the path from cause to effect that we are trying to validate with an experiment.

Ensure these metrics are accurately measurable and directly related to the hypothesis. Keep in mind that, for the statistical computations, the variance of these metrics and how strongly they are likely to respond to the treatment will affect the minimum sample size needed for the experiment, but this topic is worth its own blog post.

Establish Priors and Determine the Minimum Detectable Effect (MDE)

Here we are discussing a two-step interaction: first, clients are prompted via a push notification and choose whether or not to comply — i.e., whether to do groceries through the app. This type of opt-in experiment usually involves a lot of friction, and that’s why I am discussing it here.

Given this opt-in approach, we must estimate the expected share of clients who will respond to the push notification and make a purchase. We call this the conversion rate; let’s say we expect 5% of the clients to answer the push and start doing groceries online.

We must also estimate the magnitude of the effect on purchases for those clients. For instance, we may anticipate a 5% increase in the total monthly value of purchases for the clients who started doing groceries because of our push campaign.

Based on these parameters (a conversion rate of 5% and an effect on the compliers of 5%), we can establish the smallest effect size that the experiment is designed to detect in a statistically significant manner: the minimum detectable effect (MDE).
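To make this concrete, here is a back-of-the-envelope calculation in Python using only the two numbers above. It assumes the lift shows up only among the clients who comply with the push, so the effect measured over the whole treatment group is much smaller than 5%:

```python
# Back-of-the-envelope dilution of the effect over the whole treatment group,
# assuming the lift shows up only among the clients who comply with the push.
conversion_rate = 0.05      # expected share of clients who answer the push
effect_on_compliers = 0.05  # expected lift in purchase value among compliers

diluted_effect = conversion_rate * effect_on_compliers
print(f"Average effect over the full treatment group: {diluted_effect:.2%}")  # 0.25%
```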

With these elements in hand, the next step is to set the significance level and the power of the experiment:

  • Choose a significance level, which indicates the probability of concluding that there is an effect when there is none (type I error). We usually set it to 5% (α = 0.05).
  • Choose the statistical power, which indicates the probability of correctly rejecting the null hypothesis of no effect (or, in less “statistically precise” words, of detecting an effect when there is one). We usually set it to 80% (1 − β = 0.8).

The choice of these values should be informed by the context of the experiment and the potential cost of errors. In some cases, it might be adequate to set the statistical power to 90%.

Perform Sample Size Calculations, select the Sample, and randomize the Treatment Assignment

Before you can execute the experiment, you’ll need to calculate the minimum number of clients (sample size) needed to detect your anticipated effect. Tools like G*Power can aid this process, while some people (me included) use their own scripts to perform simulations.
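For those who prefer scripts, here is a minimal sketch of this calculation in Python with statsmodels. The baseline figures (average monthly purchase value of 200 and standard deviation of 150 per client) are hypothetical placeholders; plug in your own numbers.

```python
# A minimal sample size sketch using statsmodels, with hypothetical baseline values.
from statsmodels.stats.power import TTestIndPower

baseline_mean = 200.0   # hypothetical average monthly purchase value per client
baseline_std = 150.0    # hypothetical standard deviation of that metric

conversion_rate = 0.05          # expected share of clients who comply with the push
effect_on_compliers = 0.05      # expected lift among compliers
diluted_effect = conversion_rate * effect_on_compliers * baseline_mean  # = 0.50 in currency

cohens_d = diluted_effect / baseline_std   # standardized effect size

n_per_group = TTestIndPower().solve_power(effect_size=cohens_d, alpha=0.05, power=0.80)
print(f"Minimum clients per group: {n_per_group:,.0f}")
```

Under these hypothetical numbers the required sample is well over a million clients per group, which is exactly the price of the dilution discussed above: low compliance makes opt-in designs very demanding in terms of sample size.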


Now that you know how many clients you would need to run the experiment, choose a sample from your active users, possibly targeting specific client segments:

  • For example, initially, you may choose clients who have searched for groceries in the app within the past 30 days.

Then randomly assign individuals to receive the treatment (i.e., offers and discounts) or to be part of the control group (business as usual). This random assignment is crucial to reduce bias and ensure the validity of the experiment. You may want to assign less than half of the sample to the treatment group and more than half to the control group (or the other way around) — if so, this changes the minimum sample size needed, so go back to the sample size calculation.
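As a minimal sketch, the assignment itself can be a few lines of Python. The client IDs and the 50/50 split below are placeholders; a fixed random seed keeps the draw reproducible and auditable.

```python
# A minimal randomization sketch with placeholder client IDs and a 50/50 split.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2023)                    # fixed seed for reproducibility
clients = pd.DataFrame({"client_id": range(10_000)})      # placeholder eligible sample

treatment_share = 0.5                                     # change for an unequal split,
n_treated = int(round(treatment_share * len(clients)))    # but then redo the sample size calc

assignment = np.zeros(len(clients), dtype=int)
assignment[:n_treated] = 1
clients["treatment"] = rng.permutation(assignment)        # shuffle 1s (treatment) and 0s (control)

print(clients["treatment"].value_counts(normalize=True))
```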

Some cautions before implementing the experiment

After selecting the sample, remember to collect baseline data to understand the initial conditions of the clients before the intervention:

  • Are they similar in terms of the number of monthly transactions, average value of transactions, app accesses, etc.?

Besides helping you verify that both groups were similar ex-ante, these baseline characteristics can also be included as controls in the ex-post regression analysis, which may increase the precision of the estimated effect.
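As a minimal sketch of such a balance check, assuming a DataFrame with the treatment dummy and a few baseline metrics (the column names and simulated data below are placeholders):

```python
# A minimal baseline balance check with simulated placeholder data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=42)
n = 10_000
clients = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),                # placeholder assignment
    "monthly_transactions": rng.poisson(4, size=n),         # placeholder baseline metrics
    "avg_transaction_value": rng.gamma(2.0, 50.0, size=n),
})

for metric in ["monthly_transactions", "avg_transaction_value"]:
    treated = clients.loc[clients["treatment"] == 1, metric]
    control = clients.loc[clients["treatment"] == 0, metric]
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)  # Welch's t-test
    print(f"{metric}: p-value for equal baseline means = {p_value:.3f}")
```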

Moreover, remember to develop a plan to assess the fidelity of the implementation and stop the experiment in case of unexpected results. For instance:

  • Use software to track whether participants receive notifications and whether they are browsing the intended sections of the app (e.g. https://amplitude.com/).
  • Ensure that all team members who are interacting with participants (e.g., customer relationship management, customer support, etc.) are trained to deliver the program consistently.
  • Define criteria for interim analysis and potential stopping rules in case of unexpected outcomes or adverse effects (see guardrail metrics).

Implement the experiment, monitor it, and perform the post-experiment evaluation

Now it’s time to integrate the experiment you designed with the operational environment. This stage is done with the team responsible for the segment. In many companies, each segment has its own developers, product managers, and CRM specialists, and they should understand the intuition behind each step of the experiment so it can be implemented as designed.

Once the selected clients have been notified, you should already be able to monitor the experiment. This is usually done with analytical tools like dashboards to track the behavior of individuals throughout the experiment. This step enables us to learn when to strengthen the communication strategy (i.e., pushes, SMS, e-mail) or even stop the experiment in case of adverse effects.

Finally, we should carry out the post-experiment evaluation. After the experiment period has ended, we can evaluate its success, identify lessons learned, and document recommendations for future experiments. It’s crucial to evaluate not only the outcomes but also the process of the experiment to inform future research.
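To illustrate what the core of that evaluation can look like, here is a minimal sketch of the ex-post regression, with simulated placeholder data: the coefficient on the treatment dummy estimates the intent-to-treat effect, and the baseline control mainly tightens the confidence interval.

```python
# A minimal ex-post estimation sketch with simulated placeholder data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=7)
n = 10_000
clients = pd.DataFrame({
    "treatment": rng.integers(0, 2, size=n),             # placeholder assignment
    "baseline_value": rng.gamma(2.0, 100.0, size=n),     # placeholder pre-period spend
})
# Placeholder outcome with a small simulated treatment effect, for illustration only
clients["post_value"] = (
    0.8 * clients["baseline_value"]
    + 0.5 * clients["treatment"]
    + rng.normal(0, 30, size=n)
)

# The coefficient on `treatment` is the intent-to-treat (ITT) effect
model = smf.ols("post_value ~ treatment + baseline_value", data=clients).fit(cov_type="HC1")
print(model.summary().tables[1])
```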

The knowledge you gain is invaluable, even if the experiment disproves your hypothesis.

This experiment should be reproducible by other team members (although it will likely yield different results), and all the knowledge generated in this round should be available in a common company repository.

By the way, don’t get frustrated if at the end of the experiment you don’t find exciting figures — your case is the rule, rather than the exception, as shown in Ron Kohavi’s post.

I hope that this discussion has enriched your understanding and sparked further curiosity. If there are aspects or nuances that you believe I should have explored, I invite you to reach out. 😊


Robson Tigre

Research Economist and Data Scientist at a fintech. Causal inference enthusiast. Cross-areas: marketing, pricing, retention, and accounts