Analyzing Randomized Block Design and Uplift Campaigns with Python

CVS Health Tech Blog · Aug 30, 2023

By: Audrey Lee & Mathieu d’Acremont (from the Enterprise Digital Analytics department at CVS Health®)

Introduction

A randomized block design is an experimental design in which subjects sharing similar characteristics are grouped together into ‘blocks’ and treatment assignment is randomized within each block. This design is widely used across industries. For example, in a clinical trial of a drug to prevent a disease, pre-existing conditions can be strong predictors of the outcome; grouping participants with similar health conditions and risks ensures that the treatment and control groups are comparable. When measuring the effectiveness of different teaching methods, such as lecture-based or project-based learning, a randomized block design helps identify which methods work best for distinct types of students by comparing the treatment effect between blocks. Another example is A/B testing in marketing, where personas are created based on an uplift score: randomizing customers between the A and B groups within each persona (block) improves the design.

While a randomized block design is useful to reduce variability and increase statistical power through grouping (Reichardt, 2019), it brings some difficulties in measurement. In particular, when the test and control split ratios differ across blocks, it is difficult to determine the true impact of the intervention: the blocking factor becomes a confounder, and observed differences in the outcome cannot be attributed directly to the treatment.

This blog post compares different methodologies to address the challenges of measuring an experiment with blocked randomization. They can be applied to test the effectiveness of a marketing campaign, a medical intervention, or the behavior change initiatives that improve health outcomes at CVS Health. We compare the Average Treatment Effect (ATE) derived from different evaluation methodologies, including a simple mean comparison, regression, and the Weighted Average Treatment Effect (Weighted ATE). Using simulated data from an uplift and a marketing campaign, we show the advantages of using the Weighted ATE over other methods (see Figure 1) and how it can resolve Simpson’s paradox (Ameringer, Serlin, & Ward, 2009):

Figure 1: Estimates of the treatment effect in a simulated uplift campaign with varying Test to Control sample size ratio. Only the Confidence Interval of the Weighted Average Treatment Effect (Weighted ATE) includes the true effect (vertical dashed line).

Uplift Campaign Analysis

Data Simulation

The different methods to analyze a randomized block design will be applied first to a simulated uplift campaign. In this simulation, we have two business segments, Public and Private, and two personas, Refractory and Persuadable, thus making up four blocks. The personas come from a hypothetical uplift model. The two Persuadable blocks have an ATE of 4 percentage points (pp) and the Refractory ones have no treatment effect with an ATE of 0pp. Within each block, customers are randomized into a Treated and a Control group (Figure 2).

Figure 2: Simulated campaign design with 80/20 Treated/Control in the Persuadable persona and 50/50 assignment in the Refractory persona.

To optimize Return-On-Investment (ROI), the proportion of customers assigned to the treatment is higher in the Persuadable blocks (80%) than in the Refractory ones (50%). Indeed, if there is an outreach cost, the ROI will be higher when more customers are assigned to the Treated groups in the blocks that are expected to have a higher ATE. In this simulated campaign, the outcome is binary: 1 for customers who subscribed to a service, 0 otherwise.

The Python module ‘blockeval’ has been written to support the analysis of randomized block design and is available on the CVS GitHub repository. The module is installed with:

pip install blockeval

The code in this blog post can be found in the example notebook. First, the required modules are imported:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf  # used for the OLS and mixed-effect regressions below
from blockeval.analysis import *
from blockeval.utils import campaign_simulation

Then the outcomes of 400,000 customers (100,000 per block) are simulated for the four blocks presented in Figure 2. The blocks have the same sample size, but the treatment probabilities and the treatment effects are higher in the ‘public_persuadable’ and ‘private_persuadable’ blocks:

uplift_data = campaign_simulation(
blocks = ['public_refractory', 'public_persuadable', 'private_refractory', 'private_persuadable'],
block_sizes = [100000, 100000, 100000, 100000],
treatment_probas = [0.5, 0.8, 0.5, 0.8],
control_means = [0.15, 0.15, 0.15, 0.15],
treatment_effects = [0, 0.04, 0, 0.04]
)

The ‘segment’ and ‘persona’ columns are then derived from the blocks. To control how the results will be displayed, these new columns are converted to ordered categorical variables. The first category of the segment (Public) and persona (Refractory) will be used as the reference level when comparing groups:

# creating categorical columns
uplift_data['segment'] = np.where(uplift_data['block'].str.contains('public'), 'public', 'private')
uplift_data['segment'] = pd.Categorical(uplift_data['segment'], categories=['public', 'private'], ordered=True)
uplift_data['persona'] = np.where(uplift_data['block'].str.contains('refractory'), 'refractory', 'persuadable')
uplift_data['persona'] = pd.Categorical(uplift_data['persona'], categories=['refractory', 'persuadable'], ordered=True)

Below is an extract of the campaign data. The ‘treatment’ indicator is set to 1 for the Treated group and to 0 otherwise:

uplift_data.head()

Simple Mean Comparison

Now that the uplift campaign has been simulated, several methods will be applied to the dataset to calculate the overall and block-specific treatment effects. The first method is a simple mean comparison. When calculating the means, we see that the outcome is higher in the Treated group (17.50%) compared to the Control group (14.91%). The difference gives an overall ATE of 2.58pp (Formula 1):

# mean outcome per treatment group plus their difference (Formula 1)
s = uplift_data.groupby('treatment')['outcome'].mean()
df = pd.DataFrame(pd.concat([s, s.diff()[[1]]])).reset_index(drop=True).transpose()
df.columns = ['control', 'treated', 'difference']
df

Because the blocks have the same size with 100,000 customers each, we expect the ATE across blocks to be 2pp, the average between 4pp (Persuadable) and 0pp (Refractory). The observed 2.58pp difference is higher than the expected 2pp ATE, but without a confidence interval, it is difficult to judge if the estimation is biased or not (Figure 1, first row).

Regressions Give Conflicting Results

The second approach to estimate the ATE is to fit a regression to the entire campaign and include the treatment indicator as an independent variable. We chose a linear over a logistic regression because it has several advantages when estimating treatment effects (see Gomila, 2021):

regression = smf.ols('outcome ~ treatment', data=uplift_data).fit()
regression.summary().tables[1]

The regression gives an ATE of 2.58pp, the same as the simple mean comparison. The 95% CI of [2.3pp, 2.8pp] does not include the true ATE of 2pp (Figure 1, second row). Thus, the simple mean comparison and the regression overestimate the ATE. It is worth noting that, like regression, a t-test would also give the wrong p-value.

The results are off because the treatment effects and the Treated/Control sample size ratios vary between blocks, as is often the case with uplift campaigns. The treatment probability and the outcome average are both higher in the Persuadable blocks than in the Refractory blocks. In terms of causal inference, we can see that being Persuadable has an effect on both the treatment and the outcome, and thus the persona becomes a confounder (Figure 3).

Figure 3: The Persuadable persona has an impact on both the treatment and the outcome as shown by the correlations. Therefore, the persona acts as a confounder when measuring the treatment effect.

In an attempt to control for the confounding persona, the ‘block’ is added as a covariate in the regression:

regression_block = smf.ols('outcome ~ treatment + C(block)', data=uplift_data).fit()
regression_block.summary().tables[1]

This time, the regression underestimates the ATE with a value of 1.7pp and a 95% CI of [1.4, 1.9] (see Figure 1, third row).

The regression that includes the block gives a biased estimate because the implicit weights used to average the ATE across blocks are not the block sizes, but the block sizes multiplied by the variance of the treatment assignment (Facure, 2022). Because this variance is lower in the Persuadable blocks (0.8*0.2=0.16) than in the Refractory ones (0.5*0.5=0.25), the latter get more weight, which results in an underestimation of the ATE.
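To make this concrete, here is a minimal sketch (not part of blockeval) that approximately reproduces the implicit OLS weighting from the simulated data: each block’s ATE is weighted by the block size times the treatment assignment variance p*(1-p), which lands close to the ~1.7pp coefficient of the block regression.

# Sketch of the implicit OLS weights: block size x treatment assignment variance.
# Assumes the 'uplift_data' DataFrame simulated above.
grp = uplift_data.groupby(['block', 'treatment'])['outcome'].mean().unstack()
block_ate = grp[1] - grp[0]                                 # ATE per block
n_block = uplift_data.groupby('block').size()               # block sizes
p_treat = uplift_data.groupby('block')['treatment'].mean()  # treatment probability per block
ols_weights = n_block * p_treat * (1 - p_treat)
implied_ate = (ols_weights * block_ate).sum() / ols_weights.sum()
print(round(implied_ate, 4))  # close to the ~1.7pp coefficient of the block regression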

Block Summary

To see if we can recover the true 2pp ATE, we first calculate the ATE conditioned by block (Formula 2). The get_block_summary() function was written to facilitate these calculations. The ‘keep_cols’ argument is used to keep information about the segment and the persona in the summary which will be useful later for group comparisons:

get_block_summary(uplift_data, keep_cols=['segment', 'persona'])

We can see from the ‘eff’ column that the ATEs are close to 4pp in the two Persuadable blocks and close to 0pp in the Refractory ones. It is not exactly 4pp because the outcomes are simulated with a Bernoulli distribution. The four blocks have the same size (100,000) and the treatment probability is higher in the Persuadable blocks (80% vs 50%). Thus, the summary matches the parameters used for the simulation (Figure 2).

Weighted Average Treatment Effect

The overall treatment effect we are looking for is the weighted average between the Persuadable and Refractory ATEs (4pp and 0pp). In our case, it is simply the average (2pp) because the blocks were given equal size to simplify the example. To estimate the Weighted ATE we wrote a function that takes the campaign data as input. The function uses the block sizes as the weights and calculates the Weighted ATE (Formula 3) and its variance (Formula 4). With the variance, it is possible to calculate the p-value and the 95% CI:

weighted_avg_test(uplift_data)

The estimated ATE in the ‘eff’ column is 2.13pp with a CI of [1.88, 2.38]. This time the CI includes the true underlying 2pp ATE (see Figure 1, row 4). This confirms that the result of the weighted average approach is valid. The output also shows 8,515 ‘incremental’ conversions which are obtained by multiplying the ‘eff’ by the ‘group size’.
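As a sanity check, the Weighted ATE and its variance can also be computed by hand from pandas aggregations. The sketch below is an illustration under the assumption that blockeval follows Formulas 3 and 4; weighted_avg_test() remains the reference implementation, and weighted_ate_by_hand() is a helper name introduced here only for illustration:

# Hand-rolled Weighted ATE (Formulas 3 and 4), for illustration only
def weighted_ate_by_hand(df):
    stats = df.groupby(['block', 'treatment'])['outcome'].agg(['mean', 'var', 'count'])
    means = stats['mean'].unstack()                                 # control (0) and treated (1) means per block
    ate_b = means[1] - means[0]                                     # ATE per block (Formula 2)
    var_b = (stats['var'] / stats['count']).unstack().sum(axis=1)   # variance of each block ATE
    w = df.groupby('block').size() / len(df)                        # block sizes as weights
    return (w * ate_b).sum(), (w ** 2 * var_b).sum()                # Weighted ATE and its variance

eff, var = weighted_ate_by_hand(uplift_data)
se = np.sqrt(var)
print(f"Weighted ATE: {eff:.4f}, 95% CI: [{eff - 1.96 * se:.4f}, {eff + 1.96 * se:.4f}]")

The result should land close to the weighted_avg_test() output above.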

The group_by argument can be specified to calculate the Weighted ATE for groups of blocks. In this example, blocks are grouped by persona:

weighted_avg_test(uplift_data, group_by=['persona'])

For the Refractory persona, the Weighted ATE is 0.15pp with a 95% CI of [-0.16, 0.47], which includes the true underlying value of 0pp. For the Persuadable persona, it is 4.10pp with a 95% CI of [3.70, 4.50], which includes the true underlying value of 4pp.

As illustrated with these results, mean comparison, t-test, and regression will give conflicting estimates that do not match the true underlying treatment effect. It is thus important to use Weighted ATE instead when a campaign contains groups with a different Treated to Control size ratio. We will show later that the Weighted ATE approach can also resolve Simpson’s paradox.

Comparing Effects Between Groups

The last results showed how to calculate the treatment effect after pooling blocks into groups. We now show how to compare the treatment effect between groups of blocks, for example Persuadable vs Refractory blocks. The function comparison_test() with the argument compare_along='persona' calculates the Weighted ATE difference between the Persuadable and Refractory blocks:

comparison_test(uplift_data, compare_along='persona')

The result shows a difference of 3.95pp (‘eff_delta’) with a 95% CI of [3.44pp, 4.45pp]. This CI includes the expected difference of 4pp. Overall, using the Weighted ATE allows data scientists to compare the treatment effect between any blocks or groups of blocks.
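The same comparison can be approximated by hand by taking the difference of the two per-persona Weighted ATEs and adding their variances. This is a back-of-the-envelope check reusing the weighted_ate_by_hand() sketch introduced above, under the assumption that the two persona groups are independent:

# Difference of two independent Weighted ATEs; the variances add
eff_p, var_p = weighted_ate_by_hand(uplift_data[uplift_data['persona'] == 'persuadable'])
eff_r, var_r = weighted_ate_by_hand(uplift_data[uplift_data['persona'] == 'refractory'])
delta = eff_p - eff_r
se_delta = np.sqrt(var_p + var_r)
print(f"eff_delta: {delta:.4f}, 95% CI: [{delta - 1.96 * se_delta:.4f}, {delta + 1.96 * se_delta:.4f}]")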

Bootstrapping and Permutation Tests

The functions provided in this blog post to calculate and compare the treatment effects rely on the assumption that the distribution of the sample means is normally distributed. For campaigns with small sample size (n < 30 for the Treated or Control groups within blocks), skewed outcomes, and/or outliers, we provide functions to bootstrap the Weighted ATE and do a permutation test. This provides confidence intervals and p-values that rely on less restrictive assumptions. In this example, the campaign data is resampled 2,000 times and blocks are grouped by personas to get the Persuadable and Refractory treatment effects:

weighted_avg_bootstrap(uplift_data, group_by=['persona'], n_bootstrap=2000)

The estimates and confidence intervals are very close to the results of the weighted_avg_test() function, which calculates the CI and p-value with mathematical formulas.

Bootstrapping can also be used to compare groups of blocks:

comparison_bootstrap(uplift_data, compare_along='persona', n_bootstrap=2000)

Results are very close to the output of comparison_test(). For small sample sizes and/or skewed outcome distributions, the results might differ, and it is recommended to use the bootstrapping functions.
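For intuition, the fixed-block bootstrap can be sketched in a few lines: individuals are resampled with replacement within each block-by-treatment cell (so every block is present in every iteration) and the Weighted ATE is recomputed. This only illustrates the resampling scheme described above, reusing the weighted_ate_by_hand() helper; it is not the blockeval implementation:

# Minimal fixed-block bootstrap sketch: resample within each block x treatment cell.
# 2,000 iterations over 400,000 rows can take a minute; reduce for a quick check.
rng = np.random.default_rng(42)
boot_effs = []
for _ in range(2000):
    resampled = uplift_data.groupby(['block', 'treatment']).sample(frac=1.0, replace=True, random_state=rng)
    boot_effs.append(weighted_ate_by_hand(resampled)[0])

ci_low, ci_high = np.percentile(boot_effs, [2.5, 97.5])
print(f"Bootstrap 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")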

Using Other Weights

Data scientists often have to report the ATE along with the number of incremental conversions driven by the intervention. Multiplying the ATE by the block size gives the number of incremental conversions that would be expected if both the Treated and Control groups had been exposed to the treatment. This is a way to extrapolate the impact of the campaign if scaled up to the entire target population.

Because in reality the Control group was not exposed to the treatment, another option is to calculate the conversions for the Treated group only. In this case, the ATE is multiplied by the Treated group size instead of the block size to get the incremental conversions. When calculating the Weighted ATE, the Treated group sizes are used as weights instead of the block sizes (Formulas 5 & 6). This can be done by setting use_treated_weights=True in the weighted_avg_test() function:

weighted_avg_test(uplift_data, use_treated_weights=True)

The treatment effect is now 2.58pp compared to 2.13pp previously. The number of incremental conversions is now 6,719 compared to 8,515 previously (see the ‘Weighted Average Treatment Effect’ section). The incremental number is much smaller because the treatment effect is multiplied by the Treated group size instead of the block size (Treated + Control). The use_treated_weights option can also be set for the comparison_test() and bootstrap functions.
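For completeness, the treated-weights variant (Formulas 5 and 6) can be sketched the same way, swapping the block sizes for the Treated group sizes. Again, this is an illustrative sketch under the assumptions stated earlier, not the blockeval code:

# Hand-rolled Weighted ATE using Treated group sizes as weights (Formulas 5 and 6)
stats = uplift_data.groupby(['block', 'treatment'])['outcome'].agg(['mean', 'var', 'count'])
means = stats['mean'].unstack()
ate_b = means[1] - means[0]                                     # ATE per block
var_b = (stats['var'] / stats['count']).unstack().sum(axis=1)   # variance of each block ATE
n_treated = stats['count'].unstack()[1]                         # Treated group size per block
w = n_treated / n_treated.sum()
eff_treated = (w * ate_b).sum()
incremental = (n_treated * ate_b).sum()                         # incremental conversions in the Treated groups
print(f"Weighted ATE (treated weights): {eff_treated:.4f}, incremental conversions: {incremental:.0f}")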

Addressing Simpson’s Paradox

Simpson’s paradox arises in a study when the result in the overall sample seems to contradict the results found in sub-samples. For instance, the overall sample shows a positive treatment effect while most or all of the blocks show a negative effect. The contradiction comes from the fact that the blocks act as a confounding variable (see Figure 3). It can be resolved by using the Weighted ATE approach, which controls for the blocks.

We illustrate the paradox using the example presented by Tom Grigg (Grigg, 2018). In this example, a marketing team runs a survey and asks a random sample of men and women to say if they like a flavored drink or not (binary outcome). In the control group, participants are asked to taste a Passionate Peach drink. In the treatment group, they are asked to taste a Sinful Strawberry drink. The design of the survey is similar to a randomized block design (Figure 4).

Figure 4: Design of a fictional marketing survey.

In this design, men have a higher probability of being assigned to the treatment group (56%) compared to women (25%). Thus, the block (gender) will act as a confounder. We first recreate the design in Figure 4 and generate random outcomes:

marketing_data = campaign_simulation(
blocks = ['men', 'women'],
block_sizes = [1600, 400],
treatment_probas = [900/1600, 100/400],
control_means = [600/700, 150/300],
treated_means = [760/900, 40/100]
)

The design is then summarized by block:

get_block_summary(marketing_data)

We can see that for both men and women the ATEs (‘eff’ column) are negative (-0.25pp and -9.33pp), meaning that participants give more favorable answers to the Peach flavor. Now we calculate the treatment effect with a regression that ignores gender:

marketing_regression = smf.ols('outcome ~ treatment', data=marketing_data).fit()
marketing_regression.summary().tables[1]

The results show an overall positive ATE (+5.60pp) which contradicts the negative effects found within the blocks. The code below calculates the Weighted ATE:

weighted_avg_test(marketing_data)

The Weighted ATE is -2.07pp, a result in line with the negative effects found separately in men and women. This example shows how the Weighted ATE can resolve Simpson’s paradox.

Other Evaluation Strategies

Regression With Interaction Effect

Another approach to compare the treatment effect between blocks is to fit a regression with interactions between the treatment indicator and the blocks:

regression_interact = smf.ols('outcome ~ treatment * persona', data=uplift_data).fit()
regression_interact.summary().tables[1]

Refractory was defined as the first category of the ordered ‘persona’ column (see the ‘Data Simulation’ section) and is used by the smf.ols() function as the reference level for the comparison. The interaction thus shows the Persuadable minus the Refractory ATE.

The ATE delta of 3.95pp with a 95% CI of [3.4pp, 4.5pp] matches the values found with the Weighted ATE approach (see the ‘Comparing Effects Between Groups’ section). While a regression with interaction effects is helpful to compare groups, it doesn’t directly provide the overall treatment effect. The Weighted ATE approach provides the within-group, between-group, and overall treatment effects.

Mixed-Effect Regression

In all the analyses so far, the blocks were treated as fixed effects. This is the appropriate methodology for an analysis that aims at generalizing the results to future iterations of a campaign that uses the same definitions for segments and personas. This absence of variability in the block definition was reflected in the Weighted ATE bootstrapping presented above: at each bootstrap iteration, a subset of individuals within each of the Treated and Control groups is randomly selected, but all the blocks are included. Thus, the variability in the Weighted ATE estimates comes from randomly selecting individuals within blocks, not from randomly selecting blocks.

If, instead, the data scientists had randomly chosen the blocks from a larger set of segments and personas and if the goal was to generalize the campaign results to this larger set, a mixed-effect analysis would be the correct approach. This approach is illustrated below with a mixed-effect regression that allows for random fluctuation across block intercepts and treatment effects:

mixed_model = smf.mixedlm('outcome ~ treatment', uplift_data, groups=uplift_data['block'], re_formula="treatment").fit()
mixed_model.summary()

The results show a treatment effect of 2.1pp with a 95% CI of [-34.3pp, 38.5pp] (p=.91). The Weighted ATE (fixed-effect) point estimate was the same, but its CI of [1.88, 2.38] was narrower and the effect was significant (p<.01, see the ‘Weighted Average Treatment Effect’ section). For the random-effect version of bootstrapping, a subset of both individuals and blocks would be selected at each iteration. This additional variability explains why the results are significant in the fixed-effect context but not in the random-effect one. The key is to align the analysis with the design; both are fixed in the uplift campaign example discussed in this blog.

Conclusion

In conclusion, randomized block designs can help reduce variability in statistical tests and increase reliability. However, measuring the overall ATE of randomized block designs can be challenging, particularly when there are varying treated and control split ratios across the blocks.

In this blog, we show how data scientists can tackle these challenges. We simulated a randomized block design experiment to mimic a marketing campaign like the ones we are working on at CVS Health. We then demonstrated how to apply different evaluation methodologies, including the t-test, regression, the Weighted ATE, bootstrapping, and permutation tests, and compared the effects derived from these different approaches.

Through this analysis, we showed that the t-test and regression, which are often used to estimate the treatment effect, can produce unexpected results and overestimate or underestimate the true effect. Therefore, data scientists should use caution when interpreting randomized block design results and consider the Weighted ATE approach. Bootstrapping and permutation tests built on top of the Weighted ATE bring additional benefits, as they do not rely on normality and other distributional assumptions. The functions developed for this blog post will give valid confidence intervals and p-values when the block and treatment levels are fixed and selected by the experimenter, as in most marketing and health-promoting campaigns. We hope that this guide provides valuable insights into analyzing randomized block designs in Python and will be useful for data scientists and researchers in various fields.

References

Ameringer, S., Serlin, R. C., & Ward S. (2009). Simpson’s paradox and experimental research. Nurs Res., 58(2), 123–127. [PDF].

Reichardt, C. S. (2019). Quasi-Experimentation: A guide to design and analysis. New York: The Guilford Press.

Gomila, R. (2021). Logistic or linear? Estimating causal effects of experimental treatments on binary outcomes using regression analysis. Journal of Experimental Psychology: General, 150(4), 700–709. [PDF].

Facure, M. (2022). Causal inference for the brave and true. [EBook].

Grigg, T. (2018). Simpson’s paradox and interpreting data. [Medium].

Formulas

Formula 1: Average Treatment Effect (ATE):

$$\text{ATE} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$$

Formula 2: ATE by block:

$$\text{ATE}_b = \bar{Y}_{\text{treated},\,b} - \bar{Y}_{\text{control},\,b}$$

Formula 3: Weighted ATE based on block sizes:

$$\text{ATE}_w = \frac{\sum_b n_b \,\text{ATE}_b}{\sum_b n_b}$$

Formula 4: Variance of the Weighted ATE based on block sizes:

$$\text{Var}(\text{ATE}_w) = \frac{\sum_b n_b^2 \,\text{Var}(\text{ATE}_b)}{\left(\sum_b n_b\right)^2}, \quad \text{Var}(\text{ATE}_b) = \frac{s^2_{\text{treated},\,b}}{n_{\text{treated},\,b}} + \frac{s^2_{\text{control},\,b}}{n_{\text{control},\,b}}$$

Formula 5: Weighted ATE based on treated group sizes:

$$\text{ATE}_{w,T} = \frac{\sum_b n_{\text{treated},\,b} \,\text{ATE}_b}{\sum_b n_{\text{treated},\,b}}$$

Formula 6: Variance of the Weighted ATE based on treated group sizes:

$$\text{Var}(\text{ATE}_{w,T}) = \frac{\sum_b n_{\text{treated},\,b}^2 \,\text{Var}(\text{ATE}_b)}{\left(\sum_b n_{\text{treated},\,b}\right)^2}$$

© 2023 CVS Health and/or one of its affiliates. All rights reserved.
