Increasing power in new-user experiments

Evgeniy Vasilyev
SmartNews, Inc
8 min read · Jun 24, 2023

Introduction

A/B testing is a valuable tool used by many companies, including SmartNews, to optimize their products. By randomly assigning users to control and treatment groups, this method allows us to compare different feature versions and measure their impact. However, when the difference between the treatment and control groups, known as the effect size, is small, detecting statistically significant results becomes challenging because the test's power is low.

There are techniques that increase the power of a test and make it possible to detect even smaller effects, including the CUPED and difference-in-differences (DID) methods that are widely used in the tech industry. Unfortunately, these approaches don't work well for new-user experiments, and in this blog post I will describe how we approach that problem at SmartNews.

Linear methods for variance reduction

Let's begin by examining the principles behind CUPED. Say we would like to measure the effect on a target metric Y. To do that, we take some feature (covariate) X which is known before the experiment and was not impacted by it, and construct a new metric as the difference between the target metric Y and the covariate X:

Ŷ = Y − 𝜃·(X − E[X])

where 𝜃 is a coefficient proportional to the covariance cov(Y, X) (in the case of DID it equals 1). We then run the statistical test on the new variable Ŷ, which has the same mean as the original variable (an unbiased estimate) but a smaller variance, so the power of the test increases. The equation below shows the statistical test for the adjusted variable Ŷ, where 𝜏 is the estimated effect size and W is a binary variable indicating assignment to the test group:

Ŷ = c + 𝜏·W + ε
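
To make this concrete, here is a minimal numeric sketch of CUPED on simulated data (the numbers and variable names are illustrative, not from our systems):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Pre-experiment covariate X (e.g., revenue before the experiment) and a
# target metric Y correlated with it; W is the random test assignment.
x = rng.normal(10.0, 3.0, n)
w = rng.integers(0, 2, n)
y = 0.8 * x + rng.normal(0.0, 2.0, n) + 0.1 * w  # true effect = 0.1

# CUPED adjustment: theta = cov(Y, X) / var(X); centering X keeps the mean
# of the adjusted metric equal to the mean of Y (unbiased).
theta = np.cov(y, x)[0, 1] / x.var(ddof=1)
y_hat = y - theta * (x - x.mean())

print(f"var(Y)     = {y.var():.2f}")      # ~9.8
print(f"var(Y_hat) = {y_hat.var():.2f}")  # ~4.0, a much tighter test
print(f"effect estimate = {y_hat[w == 1].mean() - y_hat[w == 0].mean():.3f}")
```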

Variance reduction for new-user experiments

It can be shown that the higher the correlation between the target metric Y and the covariate X, the greater the variance reduction and the larger the power gain: with the optimal 𝜃, the variance of the adjusted metric is var(Ŷ) = (1 − ρ²)·var(Y), where ρ is the correlation between Y and X. Hence, it is common practice to use the pre-experiment value of the target metric itself as the covariate X in the CUPED method. For instance, if our target metric is revenue, we can use the revenue before the experiment as the covariate. However, it is not always possible to select a covariate X this way. For example, in new-user experiments we don't have the target metric before the experiment, as users have just joined the platform.

At the same time, we might have some useful information about new users before the experiment that can be used for variance reduction, such as device information, marketing channel, region, and possibly age/gender. These features, though, often exhibit low correlation with the target metric and do not significantly reduce variance when plugged into the linear CUPED equation. Furthermore, many of these features are categorical and may have high cardinality (e.g., location or device model), which makes them challenging to use in the CUPED framework. So does that mean there is no good way to increase power in new-user experiments?

Nonlinear methods for variance reduction

Fortunately, those problems can be partially addressed with novel ML-based variance reduction techniques introduced in the last 3–4 years. The main idea behind these approaches is to replace the covariate X, for instance in the CUPED framework, with some nonlinear function of several variables g(·). For example, we can use any ML model (XGBoost, a neural network) that predicts the target metric Y from a set of covariates.

One such approach has been described in a research paper; a similar approach has been used by DoorDash, which they call CUPAC. In both cases the covariate is replaced with the model's prediction:

Ŷ = Y − 𝜃·(g(X₁, …, Xₖ) − E[g(X₁, …, Xₖ)])

Similarly to CUPED, the higher the correlation between the new covariate g(·) and the target Y, the more variance reduction we get. And because we are using several features and a nonlinear function (an ML model), we will likely get a higher correlation than in the CUPED method. That is especially helpful given the weaker predictors available in new-user experiments.
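
The snippet below sketches this substitution in the CUPED form: a model is trained on historical (pre-experiment) users, so it cannot absorb the treatment effect, and its prediction g(X₁, …, Xₖ) serves as the covariate. The model class, features, and numbers are illustrative assumptions, not our production setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def make_data(n, effect=0.0):
    """Simulate weak pre-experiment covariates and a target metric."""
    features = rng.normal(size=(n, 5))
    w = rng.integers(0, 2, n)
    y = (features @ np.array([0.5, 0.4, 0.3, 0.2, 0.1])
         + rng.normal(0.0, 1.0, n) + effect * w)
    return features, w, y

hist_features, _, hist_y = make_data(50_000)         # historical users
exp_features, w, y = make_data(50_000, effect=0.05)  # experiment users

# g() is any ML model predicting the target from the covariates.
model = GradientBoostingRegressor().fit(hist_features, hist_y)
g = model.predict(exp_features)

# Same CUPED adjustment as before, with g(X) in place of a single X.
theta = np.cov(y, g)[0, 1] / g.var(ddof=1)
y_hat = y - theta * (g - g.mean())
print(f"variance reduction: {1 - y_hat.var() / y.var():.1%}")
```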

A slightly different approach, described in Meta's paper, suggests incorporating the nonlinear model into the full regression equation, rather than within the CUPED framework; in its simplest form:

Y = c + 𝜏·W + β·g(X₁, …, Xₖ) + ε

The full regression approach offers several advantages over the two-step CUPED approach, especially in scenarios involving imperfect randomization, as illustrated in this excellent article. Based on these considerations, we at SmartNews have chosen to incorporate the ML model into the full regression equation.

Approach

Our approach involves several key steps:

  • Gather the covariates available for new users before the experiment, such as device information. It is crucial that these covariates remain unaffected by the experiment to avoid introducing a backdoor path that could lead to incorrect estimates.
  • Since most features are categorical, they need to be processed before being used in the ML model, e.g. with one-hot encoding. The approach we selected is to leverage CatBoost, as it handles categorical features automatically and effectively.
  • Build an ML model with CatBoost for each target variable to be tested. To mitigate bias from overfitting, we employ a cross-fitting procedure for predictions, so that each user's prediction comes from a model that never saw that user. More details on this procedure can be found here.
  • Incorporate the ML predictions into the full regression equation and estimate the treatment effect, as sketched in the code below.
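
Below is a condensed sketch of these steps, combining CatBoost, 5-fold cross-fitting, and the full regression via statsmodels. The function signature, column conventions, and hyperparameters are illustrative assumptions rather than our production code:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold

def ml_vr_effect(df, target, treatment, cat_features, num_features):
    """Estimate a treatment effect with ML variance reduction."""
    X = df[cat_features + num_features]
    y = df[target].to_numpy()

    # Cross-fitting: each user's prediction comes from a model that never
    # saw that user, mitigating bias from overfitting.
    g = np.zeros(len(df))
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X):
        model = CatBoostRegressor(iterations=300, verbose=False,
                                  cat_features=cat_features)
        model.fit(X.iloc[train_idx], y[train_idx])
        g[test_idx] = model.predict(X.iloc[test_idx])

    # Full regression Y = c + tau*W + beta*g(X) + eps; the coefficient
    # on W is the estimated treatment effect.
    design = sm.add_constant(pd.DataFrame({"W": df[treatment].to_numpy(),
                                           "g": g}))
    fit = sm.OLS(y, design).fit(cov_type="HC1")  # robust standard errors
    return fit.params["W"], fit.conf_int().loc["W"]
```

The coefficient on W and its confidence interval then replace the point estimate and CI from the regular statistical test.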

Experimentation

To evaluate the effectiveness of the approach described above, we conducted a series of experiments comparing it with the regular statistical test for metrics such as time spent, session count, revenue, and active days in new-user experiments. We compared results in terms of the confidence interval (CI) width gain; in other words, we measured how much the CI width was reduced when the method with ML variance reduction (ML VR) was applied. We also assessed how much smaller a sample size the ML VR method would need to achieve the same CI width as the regular statistical method. Furthermore, we implemented the CUPED method for comparison purposes. As discussed above, using CUPED in new-user experiments is tricky because most covariates are high-cardinality categorical features that need preprocessing; to simplify the task, we selected a single covariate with low cardinality, typically the platform, which has only two values, Android or iOS.

The results indicate that both the CUPED and ML VR methods yielded some sample size reduction compared to the regular statistical test. However, the reduction achieved with CUPED was negligible, averaging 1.5%. The ML VR approach, on the other hand, demonstrated superior performance in terms of model accuracy and sample size reduction. Although its accuracy for new users was still relatively low, primarily because metrics like time spent are hard to predict from registration features alone, it reduced the required sample size by 12% on average, and in some cases by as much as 30%. This translates to significant time savings, equivalent to days or even weeks that would otherwise be required to run the experiment.

Automatic pre-bias adjustment

Another valuable feature of this approach is automatic pre-bias adjustment across many features. In many experiments, pre-bias exists due to factors like imperfect randomization, and detecting and adjusting for such biases is challenging. To illustrate this point, let's consider a real use case of ours (with modified numbers): an onboarding test for new users aimed at improving the opt-in rate. Initially, when we conducted a regular statistical test, the results appeared to be positive. However, when we applied the ML VR method, the results turned out to be neutral. To investigate this discrepancy, we examined the feature importance output of the ML model, which is readily available since we use the gradient boosting algorithm CatBoost.

Notably, the platform feature exhibited significantly higher importance than the other features. Therefore, we decided to break the results down by platform.

We observed that when we grouped the results by platform, there was no real difference in opt-in rates between the test and control groups. When we applied a regular statistical test to the entire dataset, however, the significant result was caused by the following factors:

  • Opt-in rates for Android were higher than for iOS
  • By random chance, the test group had more Android users and fewer iOS users than the control group

Fortunately, the ML-based approach automatically captured the pre-bias caused by the different shares of Android users and corrected for it. It may seem sensible to examine the distribution by platform in an opt-in rate experiment, since it is known that Android and iOS have different rates. But pre-biases can be caused by many covariates, such as gender, age, marketing channel, or region, and it is challenging to check and adjust for pre-biases across all features, especially when running hundreds of experiments, which is usually the case for product companies. The ML VR approach addresses this issue automatically, as long as the covariate is added as a feature to the ML model; the simulation below illustrates the mechanics. It is important to note that certain covariates should not be included in the ML model if we believe the experiment could directly affect them (for example, if we treat Android and iOS users differently in the experiment, platform should not be included as a covariate). Including such features could introduce a backdoor path and lead to an incorrect estimate of the treatment effect. Careful consideration must be given to feature selection to ensure the validity of the results.
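
To illustrate the mechanics with made-up numbers, the simulation below creates a chance platform imbalance and no real treatment effect: the naive test tends to look significant, while a regression that includes platform as a covariate stays neutral:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
n = 20_000
w = rng.integers(0, 2, n)

# By chance, the test group has a higher share of Android users.
android = rng.random(n) < np.where(w == 1, 0.60, 0.50)

# Opt-in depends on platform only; the treatment has no real effect.
opt_in = (rng.random(n) < np.where(android, 0.35, 0.15)).astype(int)

df = pd.DataFrame({"W": w, "android": android.astype(int), "opt_in": opt_in})

naive = stats.ttest_ind(df.opt_in[df.W == 1], df.opt_in[df.W == 0])
adjusted = smf.ols("opt_in ~ W + android", data=df).fit()
print(f"naive p-value:    {naive.pvalue:.4f}")           # usually 'significant'
print(f"adjusted p-value: {adjusted.pvalues['W']:.4f}")  # typically neutral
```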

Conclusion

The ML-based method for variance reduction proves to be highly beneficial for new-user experiments. By incorporating the weak features that are available, this method enhances the test's power and reduces the required sample size by 5%–20%. Additionally, it automatically adjusts for pre-biases, which is a challenging and important task in experimentation.

The application of this method is not limited to new users; it also yields excellent results in experiments on existing users. Due to its nonlinear nature and ability to incorporate multiple features, it outperforms CUPED in terms of variance reduction. At SmartNews, we leverage this ML-based method in our experimentation platform for both new and existing users.
