The Significance of A/B Testing and Power Analysis in Fraud Detection

Vered Shapovalov
Published in Riskified Tech
Feb 22, 2023


Evaluating new models is an integral part of data science work. It provides us with the means to ensure we are doing things right and justifies the long and costly model training process. However, there is no one way to evaluate models that is always correct.

In this post, I will focus on the fraud detection domain, specifically on cases where we constantly retrain and replace multiple models.

In the first part, I will present possible approaches for collecting data to compare models and discuss their advantages and disadvantages in the context of fraud detection. I will show that A/B testing has powerful advantages over other options.

In the second part, I will address cases where we cannot afford to perform A/B testing for all model replacements. I will present a method called power analysis that can be used as a foundation for a methodology to choose the best candidates for A/B testing. I will also discuss the parameters of the power analysis and how we can specify their values.

Collecting data

Option 1: Using historical data

Using historical data to compare models is a straightforward and relatively inexpensive approach, but it has its limitations.

  • Unknown labels: Our historical data is shaped by the old model’s decisions. When the old model decides that an order is fraudulent, we decline the order. However, we rarely get real-world feedback that tells us whether the decision to decline was correct. Hence, we lack labels for an important part of our data: orders suspected of being fraudulent. When the new model disagrees and wants to approve previously declined orders, it is challenging to assess the correctness of this decision.
  • Real-life configurations: At Riskified, as in other ML-based decision engines, the ML model does not stand alone. It is part of a complete system that helps us make the right decisions. It is very challenging to recreate this system on historical data and mimic the real-life performance of the new model.

Option 2: Data before and after model replacement

To overcome the two disadvantages of option 1, we need to compare the models in production. In this option, we evaluate the old model’s performance on data before the replacement and the new model’s performance on data after the replacement.

In this manner, we do not deal with unknown labels and have real-life configurations on both models. However, when comparing models, we want to compare “apples to apples”. Next, I will demonstrate with an example that comparing models using data collected at different times can be problematic.

Let y be the status of an order, where y = 1 is a fraudulent order and y = 0 is a good order. Let ŷ be the model’s decision regarding the order (ŷ = 1: decline the order, ŷ = 0: approve the order), and let X be the features in the model.

Suppose that the metric we try to optimize is the ratio of losses due to false negatives (i.e., ŷ = 0 | y = 1) to revenues due to true negatives (i.e., ŷ = 0 | y = 0):

[P(ŷ = 0 | y = 1) · N(y = 1)] / [P(ŷ = 0 | y = 0) · N(y = 0)]

Here, N(y = 1) is the number of fraudulent orders in the population, and N(y = 0) is the number of good orders in the population. The above metric can decrease simply by increasing the number of good orders, N(y = 0), without any change to the performance of the model (which is expressed by P(ŷ = 0 | y = 1) and P(ŷ = 0 | y = 0)). So if the population happens to contain more good orders after the replacement, the new model looks better even if it performs exactly like the old one.
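A small numeric sketch of this effect, assuming the metric is [P(ŷ = 0 | y = 1) · N(y = 1)] / [P(ŷ = 0 | y = 0) · N(y = 0)]. The probabilities and order counts below are invented for illustration only; they are not Riskified figures.

```python
def loss_to_revenue_ratio(p_fn, p_tn, n_fraud, n_good):
    """Approved fraudulent orders divided by approved good orders.

    p_fn = P(y_hat = 0 | y = 1), p_tn = P(y_hat = 0 | y = 0).
    """
    return (p_fn * n_fraud) / (p_tn * n_good)


# Hypothetical model performance, held fixed in both periods.
P_FN = 0.10
P_TN = 0.95

# Only the number of good orders changes between the two periods.
before = loss_to_revenue_ratio(P_FN, P_TN, n_fraud=1_000, n_good=50_000)
after = loss_to_revenue_ratio(P_FN, P_TN, n_fraud=1_000, n_good=100_000)

print(f"before replacement: {before:.5f}")
print(f"after replacement:  {after:.5f}")
```

The metric is halved in the second period although the model's conditional error rates did not move, which is exactly the "apples to oranges" problem of comparing different time windows.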

Option 3: A/B testing

In an A/B test scenario, we run both models simultaneously in production. We split the orders between the old and new models in a randomized fashion. In this manner, we avoid the pitfalls of the previous two options.
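One common way to implement a randomized split is to hash a stable order identifier into a bucket. The sketch below is an assumption about how such a split could look, not a description of Riskified's system; `assign_model`, the `salt` value, and the order ID format are all hypothetical.

```python
import hashlib


def assign_model(order_id: str, new_model_share: float = 0.5,
                 salt: str = "ab-test-1") -> str:
    """Deterministically route an order to the 'old' or 'new' model.

    The salt (hypothetical) keeps assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{salt}:{order_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if bucket < new_model_share else "old"


if __name__ == "__main__":
    for oid in ("order-1001", "order-1002", "order-1003"):
        print(oid, "->", assign_model(oid))
```

Hashing instead of drawing a random number at decision time means the same order always lands in the same arm, which makes the assignment reproducible when analyzing the test later.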

As both models are in production, we don’t have unknown labels and use real-life configurations. In addition, because the models run simultaneously, we exclude the effect of a changing environment over time. These advantages are powerful, making A/B testing the preferable option.

However, A/B testing has a big drawback: it is an expensive tool. Maintaining an extra model in production is complicated, costly, and might expose us to mistakes. Another potential loss comes from keeping the old model in service for an additional period, even though it might be outdated or missing new features (which is why we want to replace it in the first place).

A/B test when training many models

A/B testing has powerful advantages over the other options, but it is the most expensive method. If one has a small number of model replacements, I suggest using the A/B test option every time.

However, at Riskified, the number of model replacements is far from small. This is attributed to two features of our domain:

(1) Frequent shifts in data distribution: Fraudulent behaviors are constantly changing and improving. To combat this, we are forced to adapt regularly and rapidly by continually retraining our models on up-to-date data, and replacing our old models with new ones.

(2) Multiple models for multiple subpopulations: Riskified’s population includes dozens of subpopulations, each with its own unique characteristics and a dedicated model. This means we are regularly retraining, evaluating, and replacing dozens of models.

We replace and evaluate multiple models, but we can’t afford to perform A/B testing on so many replacements. Hence, we must develop a technique that guides us toward the model replacements that are the best candidates for A/B testing.

A tool that can help us with that is power analysis.

Power analysis

As previously established, one should choose with caution which model replacement to evaluate using the A/B test option. We suggest developing a methodology based on power analysis for narrowing down the number of candidates or for ranking them. So what is power analysis?

Power analysis is a way to estimate how well a statistical test can detect the minimal difference of interest. A good analogy for power analysis is a puzzle made of four pieces:

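(1) Statistical power (a.k.a. 1 − β): the probability that the test detects an effect that really exists

(2) Significance level (a.k.a. α): the probability of a type 1 error, i.e., declaring an effect that does not exist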

(3) Sample size

(4) Effect size

The puzzle has three degrees of freedom. You are free to vary three pieces, but once their values are set, a power analysis tool can calculate the value of the fourth piece. For example, if you provide statistical power, significance level, and sample size, the power analysis tool can calculate the minimal effect size that can be detected by the statistical test (given that the effect exists 🙂). If you provide statistical power, significance level, and minimal effect size, the power analysis tool can calculate the sample size you need.
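To make the four-piece puzzle concrete, here is a minimal sketch under stated assumptions: a two-sided, two-sample z-test on a standardized mean difference, equal group sizes, and the normal approximation (the negligible opposite tail is ignored). The function names are illustrative and do not come from any library; a dedicated power analysis package would typically use more exact distributions.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Power, given effect size, sample size per group, and alpha."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


def minimal_effect_size(n_per_group, alpha=0.05, power=0.80):
    """Smallest detectable effect, given sample size, alpha, and power."""
    z_sum = _Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)
    return z_sum * sqrt(2 / n_per_group)


def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """Sample size per group, given effect size, alpha, and power."""
    z_sum = _Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)
    return ceil(2 * (z_sum / effect_size) ** 2)


if __name__ == "__main__":
    n = required_sample_size(effect_size=0.2)
    print("required n per group:", n)
    print("power at that n:", round(achieved_power(0.2, n), 3))
    print("minimal effect at that n:", round(minimal_effect_size(n), 3))
```

Each function fixes three pieces and solves for the fourth, so calling them in sequence should give back roughly the values you started from.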

Power analysis in practice: specifying values for power analysis parameters

When deciding on statistical power and significance level, it is customary to choose 5% for the significance level and 80% for the statistical power. In the following, we will assume that the significance and power were assigned these values and discuss sample size and effect size.

Calculating effect size for given sample size:

In many cases, the sample size is a function of operations or logistics. For example, if the duration of your A/B tests is limited to one month, then your monthly volume is an upper bound on your sample size.

If we have a value for the sample size (on top of values for alpha and power as discussed above), we can use power analysis to calculate the minimal effect size that can be discovered in a statistical test.

Knowing the minimal detectable effect size enables us to eliminate candidates for which it is too high, i.e., candidates whose test could only detect an improvement larger than we can realistically expect. “Too high” is a number that depends on the case at hand. Note that the volume differs between subpopulations; clearly, models of subpopulations with high volume will have an advantage here.
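As an illustration of this elimination step, the sketch below computes the minimal detectable effect for a few subpopulations and drops those above a threshold. It assumes a two-sided two-sample z-test on a standardized effect, a 50/50 split of one month of traffic, and the normal approximation; the subpopulation names, volumes, and threshold are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

ALPHA, POWER = 0.05, 0.80
Z_SUM = NormalDist().inv_cdf(1 - ALPHA / 2) + NormalDist().inv_cdf(POWER)

# Hypothetical monthly order volume per subpopulation.
monthly_volume = {"subpop_a": 200_000, "subpop_b": 20_000, "subpop_c": 1_500}

# Hypothetical cut-off: the largest effect we consider realistic.
MAX_ACCEPTABLE_EFFECT = 0.05


def minimal_detectable_effect(total_orders):
    """Minimal standardized effect when orders are split evenly in two arms."""
    n_per_group = total_orders / 2
    return Z_SUM * sqrt(2 / n_per_group)


mde = {name: minimal_detectable_effect(v) for name, v in monthly_volume.items()}
kept = [name for name, d in mde.items() if d <= MAX_ACCEPTABLE_EFFECT]

for name, d in mde.items():
    print(f"{name}: minimal detectable effect = {d:.4f}")
print("candidates kept:", kept)
```

With these made-up numbers, the low-volume subpopulation is eliminated because a one-month test could only detect an effect larger than the threshold.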

A note on effect size:

Effect size can be many things; it depends on the statistical problem at hand. If we are in a regression problem, the effect size can be the correlation between X and Y. If we perform a t-test to establish the significance of the difference between means, it is common to use the standardized mean difference (SMD), i.e., the difference between means divided by the standard deviation (e.g., Cohen’s d, Glass’s Delta).
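For reference, the two standardized mean differences mentioned above can be computed as follows. This is a minimal sketch: Cohen's d divides by the pooled standard deviation, while Glass's delta divides by the standard deviation of the control group only. The sample values are invented.

```python
from math import sqrt
from statistics import mean, stdev, variance


def cohens_d(control, treatment):
    """Mean difference divided by the pooled standard deviation."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = (
        (n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)
    ) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def glass_delta(control, treatment):
    """Mean difference divided by the control group's standard deviation."""
    return (mean(treatment) - mean(control)) / stdev(control)


if __name__ == "__main__":
    old_model = [2.0, 4.0, 6.0]  # hypothetical per-order metric, old model
    new_model = [2.0, 5.0, 8.0]  # hypothetical per-order metric, new model
    print("Cohen's d:    ", round(cohens_d(old_model, new_model), 4))
    print("Glass's delta:", round(glass_delta(old_model, new_model), 4))
```

The two measures differ only when the groups have different variances, as they do in this toy example.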

Calculating sample size for a given effect size:

Deciding on the effect size can be challenging. In some cases, you can estimate the expected effect size from your experience with the data or from training and validating the new model. In other cases, you can use the costs of the training and model replacement process as the minimal effect size (here, you demand a financial justification for replacing the old model). In yet other cases, management dictates the minimal effect size you need to achieve (for example, a new model needs to perform at least 5% better than the old one).

If you provide the required statistical power, significance level, and effect size, then the power analysis tool can calculate the smallest sample size needed for each of the candidate A/B tests. Ideally, we prefer to do A/B tests that require smaller sample sizes.

After you have the outputs for all the candidates for A/B testing, you can:

(1) Rank models by their minimal sample size — the best candidate has the smallest sample size.

(2) Eliminate candidates with an impossible-to-achieve sample size.
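The two steps above can be sketched as follows, again assuming a two-sided two-sample z-test on a standardized effect with the normal approximation. The candidate names, expected effect sizes, and available sample sizes are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Smallest sample size per arm needed to detect the given effect."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum / effect_size) ** 2)


# Hypothetical candidates: expected effect and orders available per arm.
candidates = {
    "model_a": {"effect_size": 0.10, "available_per_group": 5_000},
    "model_b": {"effect_size": 0.02, "available_per_group": 20_000},
    "model_c": {"effect_size": 0.05, "available_per_group": 10_000},
}

required = {
    name: required_n_per_group(c["effect_size"]) for name, c in candidates.items()
}

# (2) Eliminate candidates whose required sample size cannot be reached.
feasible = [
    name for name, c in candidates.items()
    if required[name] <= c["available_per_group"]
]

# (1) Rank the remaining candidates: smallest required sample size first.
ranking = sorted(feasible, key=lambda name: required[name])

for name in ranking:
    print(f"{name}: needs {required[name]} orders per group")
```

In this toy setup the candidate with the smallest expected effect is dropped, because the sample it would need exceeds what is available.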

A note on sample size:

Many times, the interesting value is the duration of the A/B test rather than the sample size. To report the shortest duration, we need to transform the sample size to time using the volume. As we don’t know the volume during the A/B test in advance, we need to estimate it from historical data.

When using historical data, one needs to consider the following: Is the volume constant over time? If so, then we can easily extrapolate past values into the future. However, usually, this is not the case. We need to consider changes over time (e.g., volume increases with time) and seasonality (e.g., higher volume during specific periods of the year, like the holiday season).
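A minimal sketch of turning a required sample size into a test duration, assuming a simple forecast built from a baseline daily volume, a constant daily growth rate, and an optional seasonal multiplier. All of these inputs, and the function `days_to_reach` itself, are hypothetical; in practice the forecast would come from your own historical data.

```python
def days_to_reach(required_total, base_daily_volume, daily_growth=0.0,
                  seasonal_factor=None, max_days=365):
    """Number of days until the forecast volume covers the required sample.

    seasonal_factor: optional function mapping a day number (1-based) to a
    multiplier, e.g. 2.0 during a holiday peak. Returns None if the sample
    is not reached within max_days.
    """
    collected = 0.0
    for day in range(1, max_days + 1):
        factor = seasonal_factor(day) if seasonal_factor else 1.0
        collected += base_daily_volume * (1 + daily_growth) ** (day - 1) * factor
        if collected >= required_total:
            return day
    return None


if __name__ == "__main__":
    # Constant volume: 10,000 orders at 1,000 per day.
    print(days_to_reach(10_000, 1_000))
    # Same requirement, with volume doubling during a holiday period.
    print(days_to_reach(10_000, 1_000, seasonal_factor=lambda day: 2.0))
    # Volume too low to finish within the allowed window.
    print(days_to_reach(10_000, 10, max_days=30))
```

Returning `None` for unreachable samples gives a natural way to flag candidates whose A/B test would not fit in the allowed time window.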

Some final words

As a final remark, I want to discuss a point that the data science team must convey to management to avoid false expectations: a statistically significant result is not guaranteed, even if you apply power analysis and choose the model replacement using a rigorous methodology.

First, you can find an effect only if it exists. If the new model is not better than the old model, the only way to have a significant statistical test is by making a type 1 error.

Second, even with high statistical power (e.g., 80%), there is still a non-zero probability of missing the effect: a power of 80% implies a 20% chance (1 in 5 cases) that the statistical test will not be significant even though the new model is better than the old one.
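Both points can be checked with a short simulation. The sketch below assumes a two-sided two-sample z-test with unit variance and draws the observed difference in means directly from its sampling distribution, rather than simulating individual orders. The effect size of 0.2 and the 393 orders per group are illustrative values chosen to give roughly 80% power under this approximation.

```python
import random
from math import sqrt
from statistics import NormalDist


def significant_share(true_effect, n_per_group, alpha=0.05,
                      trials=10_000, seed=0):
    """Share of simulated A/B tests that come out statistically significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # standard error of the difference in means
    hits = 0
    for _ in range(trials):
        observed_diff = rng.gauss(true_effect, se)
        if abs(observed_diff / se) > z_crit:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    with_effect = significant_share(true_effect=0.2, n_per_group=393)
    no_effect = significant_share(true_effect=0.0, n_per_group=393)
    print(f"new model better: significant in {with_effect:.1%} of tests")
    print(f"no real difference: significant in {no_effect:.1%} of tests")
```

When the effect exists, about one test in five should still fail to reach significance; when it does not exist, the few significant results are type 1 errors occurring at roughly the rate α.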