Controlling Influence Between Groups in A/B Testing — Interrupted Time Series Design

Sergei Sergeev
Mayflower team
Jul 12, 2023 · 17 min read

Hello everyone, my name is Sergei, and I am a Data Scientist at Mayflower. I build recommendation and personalisation systems. Most, if not all, of these systems and their improvements require thorough online testing before being deployed to production. But due to the nature of the tasks and the data, the usual A/B testing may not be enough and may even be misleading.

So, I want to discuss the so-called Interrupted Time Series (ITS) design.

It’s one of the ways to measure treatment effects (e.g., in A/B testing). This approach might be especially useful if you suspect that the treatment group (the group where you test your new feature) affects the control group.

It’s also useful for measuring the magnitude of such influence between groups. The idea behind the ITS design is simple and intuitive. In the simplest case, you compare the past, before the intervention, with the present. In other words, you use your past data as the control group.

1. When to expect influence?

Before diving into the details of the method, let’s review the problem of influence between groups.

  • One of the most obvious cases is social network data. If we want to test the effect of the intervention on one group and select users independently at random, the groups will not be independent. This is because users in both groups interact with each other, and changes in behavior in one group might significantly affect behavior in the other.

But similar network effects might emerge even if there is no explicit interaction between users.

  • For example, in online retail, we may say that users who buy the same item are connected. Although this connection will not be a problem in many cases, there are situations where it is. Say we want to test a new sorting algorithm in a catalog, and in both the old and the new algorithm an item’s position is correlated, among other things, with its number of purchases. Suppose the new model can find very relevant items that the old one could not, and users in the treatment group start buying them. These newly surfaced items may then appear in the top positions of the catalog in both groups, and the measured effect will have a negative bias. So even if users were picked independently, their interactions with items make the groups dependent.
  • Another example is food delivery services. We are testing a new feature that makes delivery faster and customers love it. There may be an increase in the number of orders in the treatment group, which may lead to an overload of restaurants. This will affect both groups, so in this case, customers are connected through the restaurants when they make orders. Because the treatment group has better delivery conditions, fewer users from this group will change their minds and cancel orders. As a result, there will be a positive bias in the measured effect.

On the other hand, there is obviously no influence of the treatment on the past. So by measuring the user behavior in time before and after the intervention, we can measure the effect.

Moreover, we can measure not only the effect itself but also how it changes over time. This is important because it allows us to differentiate between an actual treatment effect and a novelty effect.

2. A/B Test as Linear Regression

Let’s discuss how it all works and how it relates to the usual A/B testing.

Suppose we are testing how revenue per user changes if we show them a shiny new button instead of an old boring one. Often we want to know whether the average user will bring us more money, so ideally we want to compare the expected values of revenue if all users see one button or the other.

In reality, we can’t compare them. One of the reasons for it is that we can’t show both buttons simultaneously to a given user. Thus, we compare the next best things — estimates of expectations, namely averages.

So we collect data from both groups with cool and boring buttons, average results, and compare them using some statistical tests. Instead, we could do something a bit different.

We can use the Ordinary Least Squares (OLS) algorithm and fit a simple linear regression of the form y = α + β·x.

Remember that when we use linear regression in the form above, what we get is an estimate of the expected y given x. It means that if we collect data and fit a model using the OLS algorithm, then, when we plug in a particular value of x, we get the average y for that x.

In linear regression, this number x represents some property of a user. So we get an average y (metric) for a user with this particular property. But this is exactly what we want from A/B testing. We want to get averages for users with properties that are in control or treatment groups. The easiest way is to denote x = 0 for the control group and x = 1 for the treatment group.

In this case, we get y = 𝛼 for the control group.

And y = 𝛼 + β for the treatment group.

The effect is the difference between averages which is a coefficient β.

As a bonus of using well-known and understood OLS, we get confidence intervals and p-values for all the coefficients, including β for free.
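To make this concrete, here is a minimal sketch of an A/B test run as an OLS regression with statsmodels. The data is simulated, and the column names (`revenue`, `treatment`) are just placeholders for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 10_000

# 0 = control (boring button), 1 = treatment (shiny button)
treatment = rng.integers(0, 2, n)
# simulated revenue: baseline 10, a true effect of +0.5, plus noise
revenue = 10 + 0.5 * treatment + rng.normal(0, 5, n)
df = pd.DataFrame({"revenue": revenue, "treatment": treatment})

# y = alpha + beta * x, where x is the group dummy
model = smf.ols("revenue ~ treatment", data=df).fit()

print(model.params["treatment"])          # estimated effect (beta)
print(model.conf_int().loc["treatment"])  # its 95% confidence interval
print(model.pvalues["treatment"])         # its p-value
```

The coefficient of `treatment`, its confidence interval, and its p-value match what a classic two-sample t-test with pooled variance would give, which is exactly the point: the test and the regression are the same thing.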

And that’s not all! Since we established that we can view the treatment effect as a coefficient of regression, we can use all the power of OLS and its generalizations to analyze testing.

One of the most useful applications is that we can use other properties of users within the same framework, not just the property of being in a treatment or control group. For example, the user’s age, income, education, country, or any other available data.

The only restriction is that these properties must be independent of treatment. Why is this useful? Because it increases the power of your tests. Let’s see how it works in the context of regression.

Suppose again that we are measuring the average money that we get from groups A and B. Again, the usual A/B test would be equivalent to a regression with just one factor, y = α + β·x.

But if we have additional data, we can incorporate it into the regression as well, in the form of additional variables:

y = α + β·x + γ₁·z₁ + γ₂·z₂ + …

If these new variables are good predictors of the outcome, they will explain away a lot of the variance. And the smaller the intrinsic variance of the metric, the easier it is to detect changes in this metric produced by the intervention.

Let’s see intuitively why this happens.

Suppose the additional variable is the user’s income. It is reasonable to expect that users with greater income pay more. But at each particular level of income, the variance of the metric is lower, because a lot of the overall variance of the metric is due to the variance of income.

What happens is that you effectively compare the treatment and control groups at each level of income. Because the variance at each level is lower, it is easier to detect changes due to the intervention. By adding new variables, you effectively create new, smaller and more homogeneous subgroups, each having even smaller variance.

By the way, if you are familiar with variance reduction techniques such as CUPED or post-stratification, this is essentially it.
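Here is the previous sketch extended with one such covariate. Again, the data is simulated, and `income` is just an assumed example of a pre-treatment variable; the point is only to show that a good predictor of the outcome shrinks the standard error of the treatment coefficient:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000

treatment = rng.integers(0, 2, n)
income = rng.normal(50, 15, n)  # independent of the treatment assignment
# revenue depends on income much more strongly than on the treatment
revenue = 10 + 0.5 * treatment + 0.2 * income + rng.normal(0, 5, n)
df = pd.DataFrame({"revenue": revenue, "treatment": treatment, "income": income})

simple = smf.ols("revenue ~ treatment", data=df).fit()
adjusted = smf.ols("revenue ~ treatment + income", data=df).fit()

# the covariate explains away part of the variance, so the standard error of
# the treatment coefficient shrinks and the test becomes more powerful
print(simple.bse["treatment"], adjusted.bse["treatment"])
```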

It’s also worth noting that the connection between hypothesis testing and linear regression is much deeper than described above. An A/B test is just the simplest example of causal inference. There are many situations where we cannot properly divide the population into treatment and control groups (especially in the economic and social sciences) but still want to measure the effect of a treatment, i.e., the causal impact of our intervention. For such cases there are many quasi-experimental techniques, a lot of them based on linear regression. For more details and insights I highly recommend the excellent online book Causal Inference for The Brave and True¹.

3. Time for time!

Ok, so far, so good. We have our metric and the variables describing users’ properties. The metric depends on these variables. We know how and why to use these variables to calculate the treatment effect. Now it’s time to bring time into the picture. It will be just one more generalisation.

When we consider a process in time, we have to assume that the outcome depends not only on external variables but also on itself in the past. For example, if today is Thursday, it is reasonable to expect that today’s revenue might be somehow correlated with yesterday’s revenue and with the revenue of the previous Thursday. We might end up with something like:

y(t) = α + φ₁·y(t−1) + φ₇·y(t−7) + ε(t)

This is called autoregression, because it is a regression of the series on itself (in the past).

Actually, the outcome may also depend on random noise in the past; such a dependence is called a moving average process.

I will not go deep into detail here, but the intuition is that we can treat these temporal factors as a kind of variable in the regression. However, to get correct error bounds on the estimates, some generalisations have to be made.

The generalisation of linear regression that takes into account these kinds of dependencies on the past, as well as independent variables, is called the SARIMAX model (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables). It is also very well studied and understood, and it can also produce rigorous confidence intervals and p-values.

There are, however, a few notable and useful differences:

  • We have to take the trend into account. For example, revenue might increase with time just because of inflation
  • We make comparisons with the past, so the treatment variable becomes 0 before the intervention and 1 after
  • Now we can detect not only the effect on average but the effect on trend as well. So we have to add a new variable for it
  • As data points are not users anymore but periods (e.g., days), the interpretation of independent variables also changes. These are now properties of periods as well. For example, instead of user income, we may consider the average income of users who visited our site each day. Or even weekday dummy variables for each day.

Here is an example of a process in time that abruptly changes after the intervention:

Image from a review² of different methods for Interrupted Time Series studies

So, just as linear regression may (and frankly should) be considered the basis for A/B testing, SARIMAX is the basis for detecting and measuring the influence of an intervention over time. And this approach of comparing metrics at different time intervals, before and after the intervention, is called the Interrupted Time Series design.

The formula that describes this quasi-experimental setup looks like this:

y(t) = β₀ + β₁·t + β₂·x(t) + β₃·(t − T)·x(t) + ε(t)

Before the intervention, when x = 0, the behavior is mostly defined by the level β₀ and the trend β₁. After the intervention at the moment T, when x = 1, the new level becomes β₀ + β₂ and the new trend becomes β₁ + β₃. The (t − T) factor makes the trend-change term count time from the moment of the intervention.
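Here is a rough sketch of how this formula maps onto code with statsmodels’ SARIMAX. Everything here is simulated: the series, the intervention day T, and the model order (1, 0, 0) are assumptions picked purely for illustration; on real data you would choose them from the data itself (see the discussion of the AIC below).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(7)
n, T = 120, 80                    # 120 daily points, intervention on day 80
t = np.arange(n)
x = (t >= T).astype(float)        # 0 before the intervention, 1 after

# simulated metric: level 100, trend 0.3, level jump +8 and trend change +0.2 after T
y = 100 + 0.3 * t + 8 * x + 0.2 * (t - T) * x + rng.normal(0, 3, n)

exog = pd.DataFrame({
    "trend": t,                   # beta_1
    "level_change": x,            # beta_2
    "trend_change": (t - T) * x,  # beta_3
})

# AR(1) errors plus an intercept (beta_0); the ITS terms enter as exogenous variables
model = SARIMAX(y, exog=exog, order=(1, 0, 0), trend="c")
res = model.fit(disp=False)

print(res.summary())                       # coefficients, CIs and p-values
print(res.conf_int().loc["level_change"])  # e.g. the 95% CI of the level change
```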

Before going to concrete cases, one last thing I want to discuss briefly is how to select features (or exogenous variables) and parameters for the SARIMAX model.

As was mentioned before, the main condition for exogenous variables is that they have to be independent of the treatment. As for the SARIMAX parameters, you do not want to overfit your model. That means you don’t want to construct a model so complicated and powerful that it explains everything in your observed data, including the random noise.

In that case, if you use your model for predictions, you will get poor results. And even if our aim is not to predict anything, a model with better predictive power is clearly more trustworthy.

One of the methods to detect overfitting is cross-validation, leave-one-out (LOO) cross-validation in particular. For each data point, we construct a model using all data points except this one and use the model to predict it.

By measuring the average error over the whole dataset, we estimate the model’s generalisation ability. Although this method has some cons, one nice thing about it is that for linear regression we can compute an exact analytical estimate without refitting the model for each point in the dataset.
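For reference, this is the well-known leave-one-out shortcut for OLS: the LOO residuals can be obtained from the ordinary residuals and the leverages (the diagonal of the hat matrix), with no refitting. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1, n)

# ordinary OLS fit and residuals
beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta

# leverages: the diagonal of the hat matrix H = X (X'X)^-1 X'
leverages = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# exact leave-one-out residuals, without refitting the model n times
loo_residuals = residuals / (1 - leverages)
print(np.mean(loo_residuals ** 2))  # LOO estimate of the prediction error
```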

Things become more complicated when dealing with temporal data, but fortunately there is a criterion that is asymptotically equivalent to LOO: the Akaike Information Criterion (AIC).

Roughly speaking, the lower its value for a model on the same data, the better the model. So when you try to select how many exogenous, past, or seasonal factors to consider, look at this criterion value.
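A sketch of what that selection might look like in practice, reusing `y` and `exog` from the SARIMAX sketch above (the small grid of orders is, of course, just an example):

```python
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

# fit a few candidate models on the same data and keep the one with the lowest AIC
candidates = []
for p, q in itertools.product(range(3), range(3)):
    res = SARIMAX(y, exog=exog, order=(p, 0, q), trend="c").fit(disp=False)
    candidates.append(((p, 0, q), res.aic))

best_order, best_aic = min(candidates, key=lambda c: c[1])
print(best_order, best_aic)
```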

4. Examples

Ok, now let’s look at some cases.

First, we used this approach post factum to test the influence between groups during a test of the recommender system. The idea to test this influence came to us when we were planning the design of an A/B test of a Multi-Armed Bandits (MAB) algorithm.

In short, this is a kind of sorting algorithm that will affect the whole site. Our current sorting method is related to popularity, so influence between groups during testing is expected.

It is hard to estimate such influence in advance, so we decided to measure the influence between groups during the test of the recommender algorithm. The recommender is a smaller, although very popular, section of our site. The idea was that if we could detect a significant influence in this case, such influence in a larger-scale test of MAB would be inevitable.

4.1 Basic example with one group

First, let’s look at a basic example.

This is just a treatment group before and after the intervention, which means with old and new recommender algorithms.

For this and the next examples, I will use two metrics: one is related to users with a desired behavior and I will call it the “number of users”. The other is related to user spending and I will call it “money”.

It is interesting to note that one additional feature of this approach is that we can compare not only averages (like money spent by a user) but also counts and totals, such as the daily number of users and the daily amount of money.

As you can see, the number-of-users metric has a much smaller variance. This is intuitively expected, because the uncertainty of the money metric comes both from the number of users (which is more or less the variance of the user metric) and from the amount of money each user spends.

We can expect that it will be harder to detect an effect on the money metric due to its higher variance.

You have probably already noticed that there may be some change in behavior after the intervention. To be sure, we have to test this change for statistical significance. We have to account for the dependence of the metric on its own past, that is, autoregression. It is also better to reduce some variance by introducing exogenous factors; in this case, they are simply dummy factors for weekdays and a quadratic factor for the day of the month.
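As a sketch of what such exogenous factors might look like in pandas (the date range is invented, and including both the day of the month and its square is just one way to realize a quadratic factor):

```python
import pandas as pd

dates = pd.date_range("2023-01-01", periods=120, freq="D")

# dummy variables for weekdays (drop one to avoid collinearity with the constant)
weekday = pd.Series(dates.dayofweek, index=dates)
exog = pd.get_dummies(weekday, prefix="wd", drop_first=True).astype(float)

# quadratic factor for the day of the month
day = dates.day.to_numpy()
exog["day_of_month"] = day
exog["day_of_month_sq"] = day ** 2

# these columns are then concatenated with the intervention terms
# (level change, trend change) and passed as `exog` to SARIMAX
```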

What we see here are the metrics for users and money, respectively, after controlling for all factors, including the time-related ones, apart from the effect.

By the way, in this case and the next one, the trend does not change after the intervention. I also want to stress that the factors are split into “effect” and “others” purely for visualization purposes.

There is no actual distinction between them during the calculations: all factors are estimated jointly in a single model. This is important, because otherwise you risk getting very wrong estimates of your effect.

Now, we see that for the user metric the 95% confidence interval does not include zero (or, equivalently, the estimate of the metric without the intervention). As for the money metric, the interval does contain zero, so the effect is not significant. The reason for this, apart from a genuine absence of an effect, may be that, because of the high variance, the method is not powerful enough.

4.2 Intergroup influence

Ok, so it was an example of the most basic usage of this approach. Now, let’s see a more complex one.

We have two groups, test and control, both changing with time. If there is an influence, we can expect that the metrics of the control group will change after the start of the intervention, perhaps even in a statistically significant way. Let’s look at the treatment group first.

This is the picture of the evolution of the metrics before the intervention (orange), during the test (blue), and after it (green), when we decided to use the treatment variant for all users. We can see that there is a statistically significant change after the experiment, i.e., after the control group was exposed to the new recommender algorithm. This means that the intergroup influence is significant.

This can be confirmed if we look at the control group.

Indeed, during the experiment (blue), the number of users increased significantly. Because the SARIMAX model was used to estimate the effects of the intervention, we can be pretty sure that this is not a random temporal fluctuation of the metric but a causal effect of the intervention.

One important consequence is that in the usual A/B test we measured the difference between the groups, but this difference is not the actual effect; there is a negative bias in this case. The actual effect can be estimated as the difference between the post- and pre-experiment levels.

Another notable fact is that estimations for both groups are basically the same, which means that the method is pretty consistent. After the test, both groups became equal again, as expected.

This is also an illustration of conducting multiple tests simultaneously using this method. It can be done as long as you start your tests at different times.

This is a property of linear regression. As long as your factors are not totally collinear, you can estimate their coefficients. At the first point in time, we have our first intervention, which is turning on the new algorithm for group B. Because the test was successful, we did not turn it off, so the intervention continued. It is present during the second intervention, which is turning on a new algorithm for group A.
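A sketch of what the exogenous part might look like with two overlapping interventions (the lengths and dates are invented): each test simply contributes its own level-change (and, if needed, trend-change) column, and the coefficients are estimated jointly as long as the columns are not perfectly collinear.

```python
import numpy as np
import pandas as pd

n, T1, T2 = 180, 60, 120
t = np.arange(n)
x1 = (t >= T1).astype(float)  # first intervention: new algorithm for group B
x2 = (t >= T2).astype(float)  # second intervention: new algorithm for group A

exog = pd.DataFrame({
    "trend": t,
    "level_change_1": x1,
    "trend_change_1": (t - T1) * x1,
    "level_change_2": x2,
    "trend_change_2": (t - T2) * x2,
})
# exog is then passed to SARIMAX exactly as in the single-intervention sketch
```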

Here we are measuring one effect in the presence of another. So, in conclusion, for this case, we see that we can measure the influence by measuring the effect of the intervention on a control group.

4.3 Switchback

As a final example, I want to show a somewhat different experiment design where this approach might be used: an example from econometrics of implementing some policy. I will not go into detail about the policy itself; I just want to show another way to find and measure the effect.

Image from a chapter about ITS in a book on causal inference³

Here, the policy was implemented at some point in time, then cancelled, and finally reimplemented. So there are two points where we can detect and measure the effect, which gives a more reliable estimate. This design may also be very useful in situations where the intervention changes the metric gradually but its cancellation has an abrupt, easily detectable effect. For example, the MAB algorithm may take some time to converge after implementation.

That’s why we decided to use this particular method for our MAB test.

5. Conclusion

That was a review of the Interrupted Time Series design, a method that can be used as an addition to, or sometimes an alternative to, other methods of measuring treatment effects, such as the usual A/B testing, synthetic control, etc. In conclusion, I want to mention two things.

First, the main risks that you should be aware of when using this method:

  • You have to have enough observation points. The absolute minimum is 8 before and 8 after the intervention, but it is usually recommended to have at least 100 in total.
  • In contrast with the usual A/B test, increasing the number of users in the sample will not increase the power. You can try to divide your time intervals into smaller ones, but this might increase the variance and the number of autoregressive factors, and your model may become more prone to overfitting. Or not. It may actually be a viable option, but you have to spend a lot of time researching it; in short, this is usually hard to automate.
  • Another risk is that there might be an unexpected event of unknown duration that affects the results. This is especially risky if there is only a treatment group, so you may need to perform a separate analysis to detect such events.
  • The last one is that even though we can conduct multiple tests simultaneously, the more tests we conduct, the harder it is to detect their effects. This is again a property of linear regression: the effect and trend factors of different tests are partially correlated, so the variance of their coefficients increases.

And the final point I want to discuss is alternative approaches to deal with the influence between groups caused by network effects. In short, the idea is that we find clusters in a network and use these clusters as units in randomization.

Image from an article⁴ by Meta Research about dealing with network effects

This is a very general and robust approach and I think this is what you should aim for when building a testing platform to incorporate network effects. But it is expensive and takes a lot of time to implement. So if you suspect that there is a significant influence between groups in your tests, you may want to start implementing this method.

In the meantime, you can use the Interrupted Time Series design as a quick, cheap, and pretty reliable alternative.

References

Flavio Regis de Arruda (2022). Interrupted Time Series (ITS) in Python. xboard.dev

[1] Matheus Facure Alves (2022). Causal Inference for The Brave and True. matheusfacure.github.io

[2] Turner, S.L., Karahalios, A., Forbes, A.B. et al. (2021). Comparison of six statistical methods for interrupted time series studies: empirical evaluation of 190 published series. BMC Med Res Methodol 21, 134. https://doi.org/10.1186/s12874-021-01306-w

[3] Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin and Company.

[4] Brian Karrer, Liang Shi, Monica Bhole (2021). Testing product changes with network effects. Meta Research (facebook.com)
