How to compare two treatments?

by Amélie Héliou and Thibaud Rahier

Published in Criteo Tech Blog · 8 min read · Jan 12, 2021

The goal of this blog post is to give insights on how to determine if a given dataset is suitable to answer the question “For a given (group of) user(s), is their expected outcome higher when they are assigned to treatment A or treatment B?”.

We will see that treatment effect estimation relies on multiple assumptions about the relationships between user, treatment assignment, and outcome. Statistically testing these assumptions is beyond the scope of this blog post. Instead, we assume from here on that the data-generating mechanism is known.

NB1: The theoretical field behind this blog post is called “causality” [1–2], and more precisely “individual treatment effect estimation” [3–4]. We avoid theoretical terms and considerations to focus on practical insights.

NB2: We focus on explaining biases that might arise and how to deal with them. For mathematical and statistical comparisons, see this blog post.

Why do empirical averages not always work?

This seemingly naïve question is a particularly important one. It is extremely hard to resist the temptation to compare empirical averages. However, selection bias (and other biases) often makes two empirical averages not directly comparable.

Let us illustrate this with the famous Simpson’s paradox. The usual example concerns kidney stone treatments. There are two treatments, T1 and T2, and two outcomes: recovery or not.

We look at the recovery status of 500 patients who were assigned treatment T1 and of 500 patients who were assigned treatment T2.

T1 seems better... But there is a selection bias! We have two groups of equal size, but the treatment was not randomly assigned. T2, an invasive surgical procedure, was preferentially assigned to patients with big kidney stones (who have a lower overall probability of recovery), and the lighter treatment T1 was preferentially assigned to patients with small kidney stones (who have a higher intrinsic probability of recovery).

If we look at the data broken down by kidney stone size, we obtain:

Recovery rates broken down by kidney stone size

The conclusion is thus inverted: T2 now looks better for both small and big stones. The interesting thing about this paradox is that we could imagine another characteristic (age, diabetes, ...) that influences both the treatment assignment and the outcome, and that could reverse the conclusion yet again.

We say that the size of the stone is a confounder between the treatment assignment and the outcome because it impacts both. Such confounders often introduce selection bias, which makes the empirical means unreliable.
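
The reversal can be reproduced with a few lines of Python. This is a minimal sketch: the counts below are hypothetical, chosen only to match the pattern described in the text (500 patients per treatment, T1 mostly given to small stones), and are not the figures from the original tables.

```python
# Hypothetical counts (NOT the figures from the original tables):
# (treatment, stone size) -> (number of patients, number recovered)
counts = {
    ("T1", "small"): (400, 340),
    ("T1", "big"): (100, 60),
    ("T2", "small"): (100, 90),
    ("T2", "big"): (400, 280),
}


def recovery_rate(treatment, size=None):
    """Recovery rate for a treatment, optionally restricted to one stone size."""
    cells = [
        cell
        for (t, s), cell in counts.items()
        if t == treatment and (size is None or s == size)
    ]
    patients = sum(n for n, _ in cells)
    recovered = sum(r for _, r in cells)
    return recovered / patients


# Aggregated comparison: T1 looks better.
overall = {t: recovery_rate(t) for t in ("T1", "T2")}

# Comparison within each stone size: T2 is better in both strata.
by_size = {
    (t, s): recovery_rate(t, s) for t in ("T1", "T2") for s in ("small", "big")
}

print(overall)
print(by_size)
```

With these made-up counts, the aggregated rates favor T1 while both per-size rates favor T2, because stone size drives both the assignment and the recovery.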

How to do it in practice?

We will consider different cases, depending on whether confounders exist and whether the treatment assignment is random. All cases are summarized in this figure.

Graphs of the possible cases. We will only study cases 1, 2, and 3.

We observe the treatment (T), the outcome of interest (O), and the features of the user (X) — in the following the term “user” will refer both to the user and to the instance of X that represents them. U represents hidden confounders that we do not observe.

We will only consider cases 1, 2, and 3, where there are no hidden confounders. The presence of hidden confounders adds considerable difficulties that go beyond the scope of this blog post.

Support assumption: Depending on the case, we will consider the outcome of each treatment either for a given user or in expectation over a test set. When we use models to predict the outcome given treatment and user, the models need to generalize to all users in the test set. This requires that users similar to each test user were observed under each treatment, which is what we call the support assumption. It does not always hold, e.g., when some users have never been assigned to one of the treatments.
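
For a discrete feature, the support assumption can be checked directly. The sketch below is only an illustration under that simplifying assumption; the function name `missing_support` and the toy data are hypothetical, and continuous or high-dimensional features would need a notion of similarity between users instead of exact matching.

```python
def missing_support(train_rows, test_features, treatments=("T1", "T2")):
    """Return the test feature values that were not observed under every treatment.

    train_rows: iterable of (feature, treatment) pairs seen in the training data.
    test_features: iterable of feature values on which we want to compare treatments.
    """
    seen = {}
    for feature, treatment in train_rows:
        seen.setdefault(feature, set()).add(treatment)
    required = set(treatments)
    return sorted(
        {x for x in test_features if not required <= seen.get(x, set())}
    )


# Hypothetical data: no patient with a big stone ever received T1.
train_rows = [("small", "T1"), ("small", "T2"), ("big", "T2")]
test_features = ["small", "big", "big"]

print(missing_support(train_rows, test_features))
```

Here the value "big" is flagged: a model trained on this data has no basis for predicting the outcome of T1 on big stones.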

a) Randomized treatment (cases 1 and 2)

A natural way to ensure there are no hidden confounders between treatment assignment and outcome (T and O) is to randomize the treatment (cases 1 and 2), thus avoiding the risk of selection bias.

However, having a randomized treatment does not guarantee the reliability of empirical means for treatment comparison.

Let us look at an example in online advertising. We consider a pool of 1000 users, which we randomly divide into two groups A and B. We then show different ads to group A and group B, and look at the number of sales (or any outcome of interest) which were made by users in both groups.

As long as we count the number of sales from the beginning of the test, everything is fine (case 1: the treatment assignment T, i.e., group A or B, is independent of the user features X).

However, the longer the test lasts, the more the users from groups A and B will differ: they were not exposed to the same ads, so their features have evolved differently. It is still possible to learn a model on group A that predicts a user’s probability of sale, and a similar model on group B. But we can no longer compare the average outcome or the average prediction in each group, because the two groups differ in distribution (case 2: the treatment assignment T, i.e., group A or B, influences the user features X).

For example, imagine we started a randomized AB-test for online advertising a month ago, with 100 users in each group. Group A was shown coupon ads with a significant promotion, and group B was shown ads without promotion. Every month, 70% of the users in group A who do not yet have the product buy it, against 40% in group B.
After one month, the share of users who have the product is very different (70% in group A, 40% in group B). If we only look at the results of the second month, group A has 30*0.7 = 21 new buyers and group B has 60*0.4 = 24 new buyers. Group B looks better only because the share of users who already have the product differs between the two groups.
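
The arithmetic of this example can be replayed in code. This is a small sketch in expectation (no randomness), assuming 100 users per group, which is what the figures 30 and 60 imply; the function name `simulate` is ours.

```python
def simulate(group_size, monthly_rate, months):
    """Expected (users without the product, new buyers) for each month."""
    without_product = float(group_size)
    history = []
    for _ in range(months):
        buyers = without_product * monthly_rate
        history.append((without_product, buyers))
        without_product -= buyers
    return history


group_a = simulate(100, 0.7, months=2)  # coupon ads
group_b = simulate(100, 0.4, months=2)  # ads without promotion

for name, history in (("A", group_a), ("B", group_b)):
    eligible, buyers = history[1]  # second month only
    print(
        f"group {name}: {buyers:.0f} new buyers among {eligible:.0f} users "
        f"without the product (rate {buyers / eligible:.2f})"
    )
```

Counting raw buyers in the second month favors group B, while the purchase rate among users who can still buy favors group A: the two groups no longer have the same distribution of features.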

A way to circumvent this problem is to learn a model on population A, another on population B, and to use a common test set to compare the predictions of both models. This way, the comparison is fair because both averages are taken over the same population (provided the support assumption is met).
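
As a rough illustration of this idea, the sketch below uses the second-month figures of the example (21 buyers among 30 non-owners in group A, 24 among 60 in group B). The "model" is deliberately trivial, an empirical purchase rate per feature value, and the common test set (half owners, half non-owners) is a hypothetical choice; a real setting would use a learned model and a held-out population.

```python
def fit_rate_model(rows):
    """Toy 'model': empirical outcome rate for each feature value.

    rows: iterable of (feature, outcome) pairs with outcome in {0, 1}.
    """
    totals, positives = {}, {}
    for feature, outcome in rows:
        totals[feature] = totals.get(feature, 0) + 1
        positives[feature] = positives.get(feature, 0) + outcome
    return {f: positives[f] / totals[f] for f in totals}


def average_prediction(model, test_features):
    """Average predicted outcome over a test set (KeyError if support is missing)."""
    return sum(model[f] for f in test_features) / len(test_features)


# Second month of the example: owners cannot buy again, non-owners may buy.
group_a = [("owner", 0)] * 70 + [("non_owner", 1)] * 21 + [("non_owner", 0)] * 9
group_b = [("owner", 0)] * 40 + [("non_owner", 1)] * 24 + [("non_owner", 0)] * 36

# Naive comparison: each average is taken over a different population.
naive_a = sum(o for _, o in group_a) / len(group_a)
naive_b = sum(o for _, o in group_b) / len(group_b)

# Fair comparison: one model per group, both evaluated on the same test set.
model_a = fit_rate_model(group_a)
model_b = fit_rate_model(group_b)
common_test = ["owner"] * 50 + ["non_owner"] * 50  # hypothetical test population
pred_a = average_prediction(model_a, common_test)
pred_b = average_prediction(model_b, common_test)

print(naive_a, naive_b)
print(pred_a, pred_b)
```

The naive averages favor group B, whereas the predictions averaged over the common test set favor group A. The `KeyError` raised for an unseen feature value is a crude reminder that the support assumption must hold.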

To summarize, randomized treatment works perfectly (at the individual and population levels) when we start observing outcomes at the time of treatment assignment (case 1). If we gather users’ features and outcomes after the treatment assignment (case 2), randomized treatment still works at the individual level; at the population level, one has to be careful when choosing the population over which the predictions are averaged (support assumption).

b) Non-randomized treatment (case 3)

If the treatment assignment is not randomized, we can only compare the two treatments if we observe all the causes of the treatment assignment (i.e., U does not exist). Otherwise, a hidden cause becomes a hidden confounder if it also affects the outcome, and we are at risk of encountering Simpson’s paradox.

Assuming we observe all the confounders between the treatment assignment and the outcome (i.e., they are all contained in X), we have three choices.

Break down the results by the modalities of the causes (direct modeling)

For Simpson’s paradox, this means comparing the recovery rates of T1 and T2 among patients with small stones on the one hand, and among patients with big stones on the other. This is easy to do when the confounder (the size of the kidney stone in this example) is discrete with few modalities.

When the confounders are more complex (continuous or with many modalities), a model must be learned to predict the outcome from the treatment and the confounders. At inference time, the treatment comparison must be done conditionally on all the confounders.
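
A minimal sketch of direct modeling on a discrete confounder follows. The counts are hypothetical (not the figures from the original tables), and the "outcome model" is simply the empirical recovery rate in each cell; with richer confounders it would be replaced by a learned model. To get a population-level comparison, the per-size predictions are averaged over the stone-size distribution of the whole population, which is one possible choice of reference population.

```python
# Hypothetical counts: (treatment, stone size) -> (patients, recovered)
counts = {
    ("T1", "small"): (400, 340),
    ("T1", "big"): (100, 60),
    ("T2", "small"): (100, 90),
    ("T2", "big"): (400, 280),
}
treatments = ("T1", "T2")
sizes = ("small", "big")


def outcome_model(treatment, size):
    """Estimated recovery probability given treatment and stone size."""
    patients, recovered = counts[(treatment, size)]
    return recovered / patients


# Share of each stone size in the whole population (both treatments pooled).
total = sum(patients for patients, _ in counts.values())
size_share = {
    s: sum(counts[(t, s)][0] for t in treatments) / total for s in sizes
}


def adjusted_rate(treatment):
    """Recovery rate if the whole population received `treatment`."""
    return sum(size_share[s] * outcome_model(treatment, s) for s in sizes)


for t in treatments:
    print(t, [outcome_model(t, s) for s in sizes], adjusted_rate(t))
```

The conditional predictions give the comparison for a given kind of patient, and the adjusted rates give the comparison at the population level, both under the assumption that stone size is the only confounder.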

Reweight each outcome by the probability of receiving the treatment (inverse propensity scoring [5])

This consists of dividing each user’s outcome by the probability that they were assigned the treatment they received. For Simpson’s paradox, inverse propensity scoring gives a recovery rate of:

And thus, we circumvent Simpson’s paradox and rightfully conclude that T2 is better.

When the confounder is discrete and has few modalities, the probability that each user receives a given treatment is computed as an empirical frequency on the observed data.

When the confounder is continuous or has many modalities, we need to learn a model that predicts the probability of assigning each treatment to a user.

This method debiases the comparison between treatments at the population scale, but it cannot be used for individual predictions.
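
Inverse propensity scoring with a discrete confounder can be sketched as follows. The counts are hypothetical (not the figures from the original tables), and the propensities are empirical frequencies, as described for the discrete case; the function names are ours.

```python
# Hypothetical counts: (treatment, stone size) -> (patients, recovered)
counts = {
    ("T1", "small"): (400, 340),
    ("T1", "big"): (100, 60),
    ("T2", "small"): (100, 90),
    ("T2", "big"): (400, 280),
}

# Expand the counts into one (size, treatment, outcome) row per patient.
rows = []
for (treatment, size), (patients, recovered) in counts.items():
    rows += [(size, treatment, 1)] * recovered
    rows += [(size, treatment, 0)] * (patients - recovered)


def propensity(treatment, size):
    """Empirical probability of receiving `treatment` given the stone size."""
    same_size = [t for s, t, _ in rows if s == size]
    return same_size.count(treatment) / len(same_size)


def ips_rate(treatment):
    """IPS estimate of the recovery rate if everyone received `treatment`."""
    propensities = {s: propensity(treatment, s) for s in ("small", "big")}
    weighted = sum(o / propensities[s] for s, t, o in rows if t == treatment)
    return weighted / len(rows)


print(ips_rate("T1"), ips_rate("T2"))
```

Patients who received an unlikely treatment for their stone size get a large weight, which compensates for their under-representation. With these made-up counts, the reweighted rates favor T2.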

The best of both worlds (doubly robust estimator) [6]

This estimator combines the last two methods. It is particularly useful for high-dimensional or continuous confounders, since it has better variance guarantees.
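
One common form of doubly robust estimator adds to the outcome model's prediction a propensity-weighted correction of its error on the users who actually received the treatment. The sketch below implements that form on hypothetical counts (not the figures from the original tables); all function names are ours. The deliberately wrong constant model `crude` is there to show why the estimator is called doubly robust: in this toy example the estimate is unchanged when only one of the two models is wrong.

```python
# Hypothetical counts: (treatment, stone size) -> (patients, recovered)
counts = {
    ("T1", "small"): (400, 340),
    ("T1", "big"): (100, 60),
    ("T2", "small"): (100, 90),
    ("T2", "big"): (400, 280),
}
treatments = ("T1", "T2")

# One (size, received treatment, outcome) row per patient.
rows = []
for (treatment, size), (patients, recovered) in counts.items():
    rows += [(size, treatment, 1)] * recovered
    rows += [(size, treatment, 0)] * (patients - recovered)


def fitted_outcome(size, treatment):
    """Outcome model: empirical recovery rate in the (treatment, size) cell."""
    patients, recovered = counts[(treatment, size)]
    return recovered / patients


def fitted_propensity(size, treatment):
    """Propensity model: empirical share of `treatment` for this stone size."""
    return counts[(treatment, size)][0] / sum(
        counts[(t, size)][0] for t in treatments
    )


def crude(size, treatment):
    """Deliberately wrong model that ignores its inputs."""
    return 0.5


def doubly_robust_rate(treatment, outcome_model, propensity_model):
    """Model prediction plus propensity-weighted residual, averaged over users."""
    total = 0.0
    for size, received, outcome in rows:
        prediction = outcome_model(size, treatment)
        total += prediction
        if received == treatment:
            total += (outcome - prediction) / propensity_model(size, treatment)
    return total / len(rows)


for t in treatments:
    print(
        t,
        doubly_robust_rate(t, fitted_outcome, fitted_propensity),
        doubly_robust_rate(t, fitted_outcome, crude),  # wrong propensity model
        doubly_robust_rate(t, crude, fitted_propensity),  # wrong outcome model
    )
```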

What about filters?

We often want to filter out some data that are “useless”. In the case of online advertising, this can be, for example, users that never saw an ad.

This is very tricky: filtering data almost surely introduces bias. Depending on which kind of data you have (case 1, 2, or 3), there are different things to check.

Case 1 (X independent of the treatment): check that the treatment is still independent of the features after filtering, i.e., you should not be able to predict the treatment from the features.

Cases 2 and 3: the support assumption must still be met.
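
For case 1 with a discrete feature, a simple way to look for such a dependence is to compare the share of each group within every feature value before and after filtering. The sketch below uses made-up data and a made-up filter; in practice, small deviations occur by chance, so a statistical test or a classifier trained to predict the treatment from the features would be needed to decide.

```python
def treatment_share(rows, treatment="A"):
    """Share of users assigned to `treatment` within each feature value."""
    totals, treated = {}, {}
    for feature, assigned in rows:
        totals[feature] = totals.get(feature, 0) + 1
        treated[feature] = treated.get(feature, 0) + (assigned == treatment)
    return {f: treated[f] / totals[f] for f in totals}


# Hypothetical randomized test: 250 users per (feature, group) cell.
users = [(f, g) for f in ("active", "inactive") for g in ("A", "B")] * 250

# Hypothetical filter "keep users who saw an ad": suppose it happens to drop
# 200 inactive users of group B and nobody else.
dropped = 0
filtered = []
for feature, group in users:
    if (feature, group) == ("inactive", "B") and dropped < 200:
        dropped += 1
        continue
    filtered.append((feature, group))

before = treatment_share(users)
after = treatment_share(filtered)
print(before)  # balanced in every feature value
print(after)   # group A over-represented among inactive users
```

After this filter, knowing that a user is inactive makes group A much more likely, so the treatment is no longer independent of the features and empirical averages are no longer comparable.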

Conclusion

Comparing the effects of two treatments is often tricky: empirical averages can lead to the wrong conclusion. One must be careful and aware of potential biases. Good knowledge of the data-generating mechanism allows us to understand what the potential biases are and to circumvent them.

[1] Pearl, J. (2009). Causality. Cambridge University Press.

[2] Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference. The MIT Press.

[3] Radcliffe, N. J., & Surry, P. D. (2011). Real-world uplift modelling with significance-based uplift trees. Stochastic Solutions.

[4] Diemert, E., Betlei, A., Renaudin, C., & Amini, M.-R. (2018). A large scale benchmark for uplift modeling. In Proceedings of the AdKDD and Target Ad Workshop.

[5] Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika.

[6] Funk, M. J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M. A., & Davidian, M. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology.