Why Spillover Effects Bias Your AB Testing Results and Ways to Overcome Them

Weonhyeok Chung
5 min read · Oct 15, 2022


On many platforms, one user’s action influences other users’ actions. For example, if I use certain products, the people connected to me become more likely to use them too. In AB testing, if I am treated and some of my friends are not, the measured effect is underestimated, since the purchase rates of both the treatment group and the control group increase.

In this post, I discuss why spillover effects bias your AB testing results and ways to overcome them. The contents are based on chapter 22 of “Trustworthy Online Controlled Experiments” [2], but I have added my own detailed explanations and examples.

Photo by Maria Lupan on Unsplash

I. Introduction

Spillover (also called leakage or interference) occurs when one unit’s action affects another unit. Suppose the unit of analysis is the user and I receive a “People You May Know” notification. I am then more likely to send an invitation to the suggested person. If the person accepts my invitation, we both become friends. Now suppose I am in the treatment group and my new friend is in the control group. In this two-person example, both of us gain one friend, so the number of friends in the treatment group minus the number of friends in the control group is zero. The estimated effect therefore understates the true effect.
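As a minimal sketch of this bias, the following simulation pairs each treated user with a control friend. The invitation rates (0.3 treated, 0.1 untreated) are made up, and for simplicity every invitation is accepted. An accepted invitation adds one friend on both sides, so the naive difference in means is roughly zero even though the true effect is positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rates: with the notification a user sends an invite with
# probability 0.3; the counterfactual (untreated) rate is 0.1.
n_pairs = 10_000
invited = rng.random(n_pairs) < 0.3    # treated user sends an invite
baseline = rng.random(n_pairs) < 0.1   # what the same user would do untreated

# Outcome: new friends gained. An accepted invite (assumed always accepted
# for simplicity) adds one friend to BOTH the treated user and the control
# friend: that is the spillover.
treated_outcome = invited.astype(int)
control_outcome = invited.astype(int)  # control gains the same friend

naive = treated_outcome.mean() - control_outcome.mean()
true_effect = invited.mean() - baseline.mean()
print(f"naive diff-in-means: {naive:.2f}")        # ~0.00
print(f"true effect:         {true_effect:.2f}")  # ~0.20
```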

This concept is important because most analyses assume SUTVA (the Stable Unit Treatment Value Assumption): one unit’s treatment assignment does not affect another unit’s outcome. In e-commerce, such spillovers are rare. Still, when a product is popular it can go out of stock, so spillover can occur.

II. Examples

Facebook’s “People You May Know” and Zoom calls are examples of direct connections: people are directly connected to each other.

In the case of Airbnb, the problem is inventory. Suppose Airbnb promotes a 50% discount and we run an experiment to study the effect of the promotion on the purchase rate, where group A receives the promotion in a certain city but group B does not. When people from group A rent apartments, the inventory available to group B shrinks. If the most cost-effective apartments sell out first, the apartments left for group B are less cost-effective, which lowers group B’s purchase rate. Thus, the estimated effect of the promotion can be overestimated relative to the true effect.
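A minimal sketch of this inventory mechanism, with made-up purchase propensities and stock levels; as a simplification, the treated arm shops first and depletes the shared cost-effective stock:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000    # users per arm (hypothetical)
remaining = 800  # shared units of cost-effective inventory (hypothetical)

# Propensities: 0.6 with the 50% promotion, 0.4 without; once the
# cost-effective stock is gone, everyone drops to 0.2.
rates = {}
for arm, p_base in [("treatment", 0.6), ("control", 0.4)]:
    bought = 0
    for _ in range(n):
        p = p_base if remaining > 0 else 0.2
        if rng.random() < p:
            bought += 1
            remaining = max(0, remaining - 1)
    rates[arm] = bought / n

# Control is depressed by depletion, so the measured lift exceeds the
# true lift of 0.6 - 0.4 = 0.2 that would hold absent depletion.
print(rates)
print(f"measured lift: {rates['treatment'] - rates['control']:.2f}")
```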

Similar to the inventory problem in the Airbnb case, mobility companies such as Uber or Lyft face an analogous issue. When we promote discounts to group A in a certain region, supply tightens because of the higher demand. Surge prices for the group without the promotion then rise, reducing demand in the control group. Again, the estimated effect can be larger than the true effect.

In an auction marketplace such as eBay, when one group wins an auction, the other group loses it. For this reason, when one group receives a treatment that helps it win auctions, the other group becomes more likely to lose them. Thus, the treatment effect can be overestimated.

A vendor’s pay-per-click budget is limited. For example, suppose a vendor shares a budget across two products listed on an e-commerce website. When one of those products is treated by an algorithm that increases its exposure in rankings, the budget remaining for the control-group product decreases. Thus, the treatment effect can be overestimated.

Relevance model training can also violate SUTVA. Because the algorithm retrains on data generated under the treatment, the effect amplifies: when a certain product has been exposed for certain keywords, the algorithm learns those keywords and the exposure increases further. In my view, this feedback loop is a very interesting aspect of current recommendation systems and a promising area to study and apply.

When the treatment and control groups share a server and the treatment introduces latency, the latency affects the control group as well. This is one reason guardrail metrics matter.

When a page view is the unit of observation, within-user latency improves over time because of browser caching. This also violates SUTVA, since later observations are affected by earlier ones.

III. Practical Solutions

An important metric for social network services such as LinkedIn is “downstream impact.” On LinkedIn, we can distinguish users who create posts from users who read them; a key metric is then “total feedback received,” and we can analyze the experiment at the level of creators.

A past experiment can also be used as an instrument (the IV strategy from causal inference). For example, the analyst can use notification queueing (messaging queues that surface new jobs, promotions, job anniversaries, and birthdays) as a natural experiment. If user A receives a notification celebrating friend X’s promotion in her feed, she is more likely to send a message to X than users who did not receive it. If the notification is randomized, the analyst can use it as an instrument to study the causal effect of the message on user engagement.
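Here is a minimal sketch of the two-stage least squares (IV) estimator on simulated data; the effect sizes and the unobserved “sociability” confounder are hypothetical. The randomized notification plays the role of the instrument for the endogenous message:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical data. z: randomized notification (instrument), d: whether
# the user sends a message (endogenous), y: later engagement. Unobserved
# "sociability" drives both messaging and engagement.
z = rng.integers(0, 2, n).astype(float)
sociability = rng.normal(size=n)
d = ((0.8 * z + sociability + rng.normal(size=n)) > 0.5).astype(float)
y = 0.5 * d + sociability + rng.normal(size=n)  # true effect of a message: 0.5

# Naive OLS of y on d is biased upward by sociability.
print(sm.OLS(y, sm.add_constant(d)).fit().params[1])

# Two-stage least squares: regress d on the instrument, then y on fitted d.
d_hat = sm.OLS(d, sm.add_constant(z)).fit().fittedvalues
print(sm.OLS(y, sm.add_constant(d_hat)).fit().params[1])  # ~0.5
```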

When the advertisement budget is the problem, we can run experiments by splitting identical budgets for each product. Likewise, when server traffic for the treatment and control groups is the problem, we can split traffic 50:50 between the variants.

In the case of Uber, geo-based randomization can be a solution; research papers using Uber data have adopted clustered randomization.
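A minimal sketch of geo-based randomization: assign an entire region to one arm by hashing a region ID, so that riders and drivers in the same market share one experience (the salt and city names are hypothetical):

```python
import hashlib

def region_arm(region_id: str, salt: str = "promo-2022") -> str:
    """Deterministically assign a whole region to one arm so that every
    user in the same market gets the same experience."""
    digest = hashlib.sha256(f"{salt}:{region_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

for city in ["austin", "boston", "chicago", "denver"]:
    print(city, region_arm(city))
```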

We can also randomize over time: run the experiment in alternating time periods and analyze it with time-series models. In Bojinov and Shephard (2017) [1], a hedge fund runs a human-versus-algorithm experiment to test which method yields higher returns on index future options.
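A minimal sketch of time-based randomization (often called a switchback design): every request inside one time block receives the same arm, and arms are re-randomized across blocks. The block length and seed are hypothetical choices.

```python
import numpy as np
from datetime import datetime, timedelta

def switchback_arm(t: datetime, start: datetime,
                   block_hours: int = 4, seed: int = 42) -> str:
    """Randomize by time block: every event inside one block gets the
    same arm, re-drawn independently for each block."""
    block = int((t - start).total_seconds() // (block_hours * 3600))
    rng = np.random.default_rng(seed + block)  # reproducible per-block draw
    return "treatment" if rng.random() < 0.5 else "control"

start = datetime(2022, 10, 1)
for h in range(0, 24, 4):
    t = start + timedelta(hours=h)
    print(t, switchback_arm(t, start))
```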

For LinkedIn or Meta, clustered randomization within the network is plausible. However, it is difficult to find completely isolated clusters: the number of large, well-separated clusters is small, and as we increase the number of clusters, they become more likely to be connected to one another.
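A minimal sketch of network-cluster randomization using networkx: detect densely connected communities, randomize whole communities, and check how many edges still cross arms as a measure of residual spillover risk. The karate-club graph stands in for a real social network.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                   # stand-in for a real network
clusters = greedy_modularity_communities(G)  # densely connected groups

random.seed(0)
assignment = {}
for cluster in clusters:
    arm = random.choice(["treatment", "control"])  # randomize whole clusters
    for node in cluster:
        assignment[node] = arm

# Diagnostic: edges crossing arms are the residual spillover risk.
crossing = sum(assignment[u] != assignment[v] for u, v in G.edges())
print(f"{crossing / G.number_of_edges():.0%} of edges cross arms")
```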

An analyst can also distinguish the “ego” (the center of a cluster) from the “alters” (the edges of the cluster) and run the experiment at the ego level. As I understand it, the ego is an influencer; there are influencers in the developer community and in data-science communities. Medium, for instance, could run experiments based on influencers associated with different keywords.
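A minimal sketch of an ego-cluster design on the same toy graph: take high-degree nodes as egos, randomize treatment at the ego level, and measure outcomes on each ego’s alters, keeping the alter sets disjoint.

```python
import random
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a real network
random.seed(1)

# High-degree nodes serve as candidate egos; alter sets stay disjoint so
# each alter's outcome is attributable to exactly one ego's treatment.
experiment = []
used = set()
for ego in sorted(G.nodes, key=G.degree, reverse=True):
    alters = set(G.neighbors(ego)) - used - {ego}
    if ego in used or not alters:
        continue
    arm = random.choice(["treatment", "control"])  # randomize at the ego level
    experiment.append((ego, arm, alters))
    used |= alters | {ego}

for ego, arm, alters in experiment:
    print(f"ego {ego}: {arm}, outcomes measured on {len(alters)} alters")
```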

IV. Takeaways

(1) Many tech companies operate “many-to-many” matching systems (many sellers and many buyers). Limited inventory, as with Uber/Lyft or Airbnb, can violate SUTVA: when a certain group of buyers is treated, the supply available to the control group decreases, so the treatment effect is overestimated.

(2) On platforms where “mutual friendship” is likely, treating one group (with recommended friends, for example) also raises the outcome of the control group. Thus, the effect is underestimated.

(3) When users share resources, such as servers or advertisement budgets, the control group can be starved of those resources. Thus, the effect can be overestimated.

References

[1] Bojinov and Shephard (2017) “Time Series Experiments and Causal Estimands: Exact Randomization Tests and Trading”.

[2] Kohavi, Ron, Diane Tang, and Ya Xu (2020) “Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing”, Cambridge University Press.
