Cracking Correlated Observations In A/B Tests With Mixed Effect Models
A/B testing is the workhorse for many of our product decisions. A/B testing seemed easy to me at first glance: split up observations, compute a metric, then compare! But whether it’s problems with early stopping, dealing with multiple comparisons, or deciding between a frequentist or Bayesian framework, there are a lot of nuances to consider.
Many statistical tests come with rigid assumptions, and it is easy to violate them when faced with business problems. I want to talk about one of the mistakes I made surrounding correlated observations and how mixed effect modeling using the lme4 R package provides a solution.
Example Of The Problem
Let’s imagine a company that builds a personal calendar app to help people manage their tasks. One thing the company has noticed is that some people tend to queue up tasks for themselves but don’t end up completing them on time. To address this problem, the company designed some UI changes to help people organize. Before shipping the change out, they want to perform an A/B test to make sure the UI change is actually increasing the chance tasks will be completed on time.
In order to maintain a consistent user experience, a user should have the same UI throughout the experiment. In other words, the treatment is randomly assigned by user. At the same time, the company wants to measure the effects of the UI at the task level. Additionally, different users can behave differently when using the app (as we will see, this causes a problem later). For example:
- There are some power users who create a ton of tasks in the app (high engagement) while others use it sparingly (lower engagement).
- Some users tend to complete their tasks on time while others are late more often.
A natural way of analyzing the experiment would be to treat each task as an observation (1 or 0 depending on whether it is late or not) and perform a chi-square test, counting a task as part of the treatment group if the corresponding user is part of the treatment group. However, I will demonstrate that because treatment was assigned at the user level and the analysis is done at the task level, this can lead to high false positive rates in the analysis. Let’s simulate some data and perform some A/A tests to get this point across.
For those of you who don’t know, an A/A test is a way of sanity checking your analysis. You look at two identical groups (simulated in our case) and see what your analysis says about their difference. With a chi-square test at 5% significance (false positive rate), we should reject the null about 5 out of 100 times.
When simulating our task data, we should take into account the user differences outlined above, so we simulate as follows. For each user (about 5000 in my simulation):
- Randomly assign them to treatment or control.
- Draw from a distribution to determine the user’s “base rate” for how often they complete their tasks on time (fig 1).
- Draw from a distribution to determine how many tasks the user will create over the test period (fig 2).
- Sample from a Bernoulli distribution (with the user rate) to determine for each task if it was completed on-time or not.
Note that the treatment assignment is just a label. After a user gets their assignment, it doesn’t change how often their tasks are marked late in any way.
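The steps above can be sketched in a few lines of R. This is only a minimal sketch: the Beta distribution for the on-time rate and the negative binomial for the task count are assumptions standing in for the distributions shown in fig 1 and fig 2, and the column names are made up for illustration.

```r
set.seed(42)

n_users <- 5000

# One row per user. The distribution choices below are assumptions,
# not the exact ones behind fig 1 and fig 2.
users <- data.frame(
  user         = seq_len(n_users),
  treatment    = rbinom(n_users, size = 1, prob = 0.5),   # a label only, no effect
  on_time_rate = rbeta(n_users, shape1 = 2, shape2 = 2),  # user's "base rate"
  n_tasks      = 1 + rnbinom(n_users, size = 1, mu = 4)   # at least one task each
)

# Expand to one row per task, then draw each task's outcome from a
# Bernoulli with the owning user's rate (late = 1, on time = 0).
tasks <- users[rep(seq_len(n_users), times = users$n_tasks),
               c("user", "treatment", "on_time_rate")]
tasks$late <- rbinom(nrow(tasks), size = 1, prob = 1 - tasks$on_time_rate)
rownames(tasks) <- NULL

head(tasks)
```

With these assumed parameters a user creates about five tasks on average, so 5000 users yield roughly 25,000 task rows.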
After generating 100 of these datasets, 23 out of 100 of them show a significant difference… What happened? There wasn’t any true effect. It should have been about 5 out of 100! Something must be wrong with how we are analyzing the experiment.
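To make the A/A check concrete, here is a rough, self-contained sketch of the loop. The helper `simulate_tasks` and its distribution parameters are illustrative assumptions rather than the exact simulation behind the numbers above, so the rejection rate will not match 23 out of 100 exactly.

```r
set.seed(1)

# Simulate one A/A dataset at the task level. The Beta on-time rates and
# negative-binomial task counts are assumed for illustration.
simulate_tasks <- function(n_users = 5000) {
  treatment <- rbinom(n_users, 1, 0.5)        # label only, no effect
  on_time   <- rbeta(n_users, 2, 2)           # per-user base rate
  n_tasks   <- 1 + rnbinom(n_users, size = 1, mu = 4)
  idx       <- rep(seq_len(n_users), times = n_tasks)
  data.frame(
    user      = idx,
    treatment = treatment[idx],
    late      = rbinom(length(idx), 1, 1 - on_time[idx])
  )
}

# Naive analysis: chi-square test that treats every task as independent.
p_values <- replicate(100, {
  d <- simulate_tasks()
  chisq.test(table(d$treatment, d$late))$p.value
})

# Share of A/A datasets flagged as significant at the 5% level.
mean(p_values < 0.05)
```

Under these assumptions the share should come out well above the nominal 0.05, because tasks from the same user share that user's base rate.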
I ran into a similar problem as above when I was working on a problem at Convoy. Eventually I stumbled on a post from the Wikimedia Foundation which succinctly states the problem. In short, since we are using a chi-square test we are assuming the observations in each group are independently and identically distributed (IID), but they aren’t! Two observations from the same user are highly correlated since their outcomes are influenced by the “base rate”. Because observations coming from the same user are correlated, our effective sample size is smaller than the number of tasks, which leads to a high false positive rate. This problem is particularly pronounced when you have many “power users” who generate many correlated observations.
Ellery, who wrote the post above, solved his problem by aggregating his observations to the same level that treatment was assigned (in our case aggregating up to the user level). For him, there weren’t that many repeated observations for each user, but for us there are (in the simulations there were about 25,000 tasks but 5000 users). Aggregating would waste a ton of data, but treating each observation as IID is statistically invalid. There should be a compromise where each observation tells us something about the user AND the treatment.
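One way to make the loss of effective sample size concrete is the standard design-effect approximation for clustered data. This is a rough sketch, assuming every user contributes the same number of tasks m and that ρ is the correlation between two tasks from the same user (n, m, and ρ are symbols introduced here for illustration):

```latex
n_{\text{eff}} = \frac{n}{1 + (m - 1)\,\rho}
```

For example, with n = 25,000 tasks, m = 5 tasks per user, and an illustrative ρ = 0.2, the denominator is 1 + 4 × 0.2 = 1.8, so the data carry about as much information as 25,000 / 1.8 ≈ 13,900 independent tasks. When task counts vary a lot across users, as with power users, the inflation is typically larger than this equal-size formula suggests.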
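For comparison, the aggregation approach might look like the sketch below. The simulated data, the column names, and the choice of a Welch t-test on per-user late rates are all assumptions for illustration, not necessarily what the Wikimedia post used.

```r
set.seed(7)

# Small simulated task table (assumed distributions; treatment is a pure label).
n_users   <- 2000
treatment <- rbinom(n_users, 1, 0.5)
on_time   <- rbeta(n_users, 2, 2)
n_tasks   <- 1 + rnbinom(n_users, size = 1, mu = 4)
idx       <- rep(seq_len(n_users), times = n_tasks)
tasks <- data.frame(
  user      = idx,
  treatment = treatment[idx],
  late      = rbinom(length(idx), 1, 1 - on_time[idx])
)

# Collapse to one row per user: that user's share of late tasks.
per_user <- aggregate(late ~ user + treatment, data = tasks, FUN = mean)

# Users were randomized independently, so a two-sample test on the
# per-user rates does not violate the independence assumption.
t.test(late ~ treatment, data = per_user)
```

Note that every user counts equally here, whether they created one task or fifty, which is exactly the information loss described above.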
Mixed Effect Modeling And The lme4 Package
We can address the issue raised above by creating a statistical model that accounts for the grouped nature of the data. We could train a simple logistic regression model
$$\log\frac{P(i)}{1 - P(i)} = \alpha + \beta\,T(i)$$
where
- α is the intercept
- β is the coefficient on treatment
- T(i) designates whether or not task i was in the treatment group
- P(i) is the probability task i was marked late
But this would still not take into account the user effects on tasks. We could add a categorical variable marking which user generated the task, but this model would simply learn to predict the outcome based on the user alone and wouldn’t tell us anything about treatment. Alternatively, we can write
$$\log\frac{P(i)}{1 - P(i)} = \alpha + \beta\,T(i) + \gamma(i, j)$$
where 𝛾(i, j) is a random effect for the user j who created task i. Since β is not random, it is called a fixed effect. Together, this is called a mixed effects model (this section only touches on mixed models briefly; you can check out this excellent blog post from Stitch Fix to learn more). Adding the random effect allows the model to explain part of the outcome of the task from the user while also explaining part of the outcome from treatment. The resulting treatment fixed effect gives us an interpretable value for the on-time impact the UI has on a given task.
Training models like above is easy to do in R using the lme4 package. We would write:
m <- glmer(late ~ treatment + (1 | user), data=df, family=binomial)
Where the (1 | user) syntax represents a random intercept for each level of user, late is the Bernoulli variable representing whether or not the task was marked late, and treatment is the fixed effect for the treatment. After fitting one model to each of the 100 datasets above and checking whether the treatment coefficient was significant, we get a false positive rate of 4%, which is around what we expected from the beginning!
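As a sketch of how the significance check could be done, the snippet below fits the model on a small simulated dataset and reads the Wald p-value for treatment from the fixed-effects table. The simulated data and its parameters are assumptions for illustration; only the glmer call mirrors the one above.

```r
library(lme4)

set.seed(3)

# Small simulated A/A dataset (assumed distributions, treatment has no effect).
n_users   <- 500
treatment <- rbinom(n_users, 1, 0.5)
on_time   <- rbeta(n_users, 2, 2)
n_tasks   <- 1 + rnbinom(n_users, size = 1, mu = 4)
idx       <- rep(seq_len(n_users), times = n_tasks)
df <- data.frame(
  user      = factor(idx),
  treatment = treatment[idx],
  late      = rbinom(length(idx), 1, 1 - on_time[idx])
)

m <- glmer(late ~ treatment + (1 | user), data = df, family = binomial)

# Fixed-effects table: estimate, standard error, Wald z, and p-value.
fixed <- coef(summary(m))
fixed["treatment", c("Estimate", "Pr(>|z|)")]
```

Repeating this over many simulated datasets and counting how often the p-value falls below 0.05 gives the false positive rate of the mixed model.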
If you wanted to make your inference more powerful, you could add other features as well such as length of the task, description, or demographics of the user. This improves power by accounting for the possible imbalance of different types of users when you assign treatment.
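Adding a covariate only requires extending the formula. In the sketch below, `task_length` is a hypothetical task-level feature, and the data-generating process (longer tasks are more often late, no treatment effect) is assumed purely for illustration.

```r
library(lme4)

set.seed(11)

n_users        <- 300
n_tasks        <- 1 + rnbinom(n_users, size = 1, mu = 4)
idx            <- rep(seq_len(n_users), times = n_tasks)
user_intercept <- rnorm(n_users, sd = 1)     # per-user shift on the logit scale

df <- data.frame(
  user        = factor(idx),
  treatment   = rbinom(n_users, 1, 0.5)[idx],
  task_length = rexp(length(idx), rate = 1)  # hypothetical covariate
)

# Assumed outcome model: longer tasks are more likely to be late.
df$late <- rbinom(
  nrow(df), 1,
  plogis(-0.5 + 0.8 * df$task_length + user_intercept[idx])
)

# Same random intercept per user, with the extra fixed effect added.
m_cov <- glmer(late ~ treatment + task_length + (1 | user),
               data = df, family = binomial)
coef(summary(m_cov))
```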
Lastly, an attentive reader might have thought, “wait, why couldn’t we treat all the coefficients as random with a Bayesian model?” The answer is, you can, but it takes a little more tweaking and work. Gelman and Hill wrote a very useful book where they suggest getting a sense of things using the lme4 package and then later using software such as Stan or PyMC. In any case, using a mixed effect model can address the correlated data in your experiment and help you go back to innovating your product!