Assessing the Influence of Outliers in A/B Experiments: Quantile Functions and Sensitivity Analysis

Dror Cohen
Published in SEEK blog · May 25, 2022



At SEEK’s Artificial Intelligence Platform Services (AIPS) we routinely use A/B experiments to measure the impact of introducing new, or modifying existing, products and services.

Most A/B experiments, including our own, are based around comparing the sample mean between the A and B groups. There are good reasons for focusing on the sample mean rather than other statistics, such as the sample median.

The mean is simple and easy to compute. Plus, a comparison of group means is well understood, with vast literature and corresponding software packages readily available to assist with analysis.

One downside of the mean, however, is that it is highly sensitive to extreme observations. Such extreme observations are commonly known as “outliers”. In the presence of outliers, inferences based on a comparison of means can be unreliable and potentially lead to incorrect business decisions.

We’ve learnt from our experience at AIPS that these concerns aren’t trivial either, since outliers are common in high-traffic systems such as ours. As such, robust approaches for dealing with outliers are necessary for reliable A/B experimentation.

In this post we describe an approach that we use to assess the effects of outliers on our results. It is based on visualising the quantile function — to detect outliers — combined with sensitivity analysis to assess how the results change in the presence or absence of outliers.

While this approach involves a bit more work than a simple analysis that ignores the influence of outliers, we have found that it provides a richer understanding of the results and ultimately affords better decision making.

TL;DR

  • Online A/B experiments are often analysed using a comparison of group means
  • Comparing means between groups can be misleading in the presence of extreme observations, known as “outliers”
  • Visualising the quantile function in each group (and its difference) can help assess the potential impact of outliers
  • Progressively re-analysing the data with more outliers removed — and plotting the results — can help assess the sensitivity of the findings.

The sample mean can be misleading in the presence of outliers

As we mentioned above, the mean is highly sensitive to outliers. While this fact is well known, we feel that it may be underappreciated within the context of online A/B experiments.

Consider the following two data sets:

A = [1,2,3,4,5,6,7,8,9,1000]

B = [10,11,12,13,14,15,16,17,18,19]

From this data we note that:

  • Any data point in B is greater than all but one of the data points in A
  • The median for B (14.5) is 2.6x larger than that of A (5.5)
  • Yet the mean of B (also 14.5) is ~7.2x smaller than the mean of A (104.5)!

In this example, the data point of 1000 in group A is an outlier — and has a large influence on the sample mean in group A. While this example may seem far-fetched, we have come across such extreme scenarios in real experiments.
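
If you want to check the arithmetic yourself, the snippet below (a minimal sketch using NumPy) reproduces the numbers above and shows how a single extreme value drags the mean of A while leaving its median largely untouched.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
B = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# The single value of 1000 dominates the mean of A but barely moves its median.
print(np.mean(A), np.median(A))  # 104.5, 5.5
print(np.mean(B), np.median(B))  # 14.5, 14.5
```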

Causes of extreme outliers in online experiments

In the world of online experimentation, two main causes of such extreme outliers are “bots” and “bugs”.

Bots are non-human visitors to a website and can lead to a very large number of requests or actions being recorded.

Bugs are programming errors that creep into production code and can, in some cases, produce anomalies in visitor behaviour that result in outliers.

We will return to the issue of outliers and their cause at the end of the post. Whatever these causes may be, outliers seem to be the rule rather than the exception in complex, high-traffic online systems. So, robust procedures for dealing with outliers are necessary for reliable A/B experimentation.

Relying on a single outlier removal strategy is not enough

The simplest way to avoid the effects of outliers is to identify and remove them from the dataset prior to the analysis.

At AIPS, we have a library of various outlier removal strategies, each suited to a different situation. Examples of these strategies include removing values above a given threshold, or removing the uppermost percentile(s) of a distribution. Many other methods are available, and we discuss a few at the end of the post.
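
To make those two strategies concrete, here is a minimal sketch; the helper names and default values are made up for illustration and are not our production library.

```python
import numpy as np

def remove_above_threshold(x, threshold):
    """Drop observations above a fixed threshold."""
    x = np.asarray(x)
    return x[x <= threshold]

def remove_top_percentiles(x, pct=1.0):
    """Drop the uppermost pct% of the distribution."""
    x = np.asarray(x)
    cutoff = np.percentile(x, 100 - pct)
    return x[x <= cutoff]
```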

While these strategies are effective, there is no guarantee that they will lead to reliable conclusions for a given experiment.

As our example above demonstrated, just a handful of outliers can compromise the inferences drawn from an experiment, so it can be risky to trust the results from an analysis where a single outlier removal strategy has been used.

This may be further complicated when the intervention interacts with the outlier generating mechanisms (for example if the experiment involves changes to the data tracking system) or interacts with the outlier removal strategy itself (for example if the removal is based on activity per visitor and the experiment involves a change to how a visitor is tracked or defined).

Because a single outlier removal strategy (or lack of it) cannot ensure the reliability of the experimental results, we carry out additional analyses to help us understand if and how outliers are affecting the results. One such analysis involves visualising the quantile functions and their differences, as we describe below. We have found that this analysis provides a richer understanding of the results and ultimately affords more reliable decision making.

Assessing the influence of outliers: a worked example

Comparing means using a single outlier strategy

To show how the quantile function can help detect and assess the effects of outliers we will use an example motivated by a recent A/B experiment conducted at AIPS.

The bar graph in Figure 1 below depicts the means of the A (Champion) and B (Challenger) groups. Based on the point estimate, the plot suggests that the B group is inferior, in this case by ~8.5% in relative terms. However, the confidence intervals are wide, partly due to the relatively small number of samples in this example.

Note that this comparison of means was carried out after our standard outlier removal procedure. Had we relied on this outlier removal strategy, we would have inferred that the challenger (B) was inferior and thus we should retain the current implementation (A).
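
For readers who want to reproduce this kind of comparison, the sketch below shows one way to estimate the relative difference in means together with a percentile-bootstrap 95% confidence interval. It is not our internal tooling; the function names and bootstrap settings are illustrative only.

```python
import numpy as np

def relative_diff_in_means(a, b):
    """Relative difference of B versus A: (mean(B) - mean(A)) / mean(A)."""
    return (np.mean(b) - np.mean(a)) / np.mean(a)

def bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=None):
    """Percentile-bootstrap confidence interval for the relative difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a), np.asarray(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        a_s = rng.choice(a, size=a.size, replace=True)
        b_s = rng.choice(b, size=b.size, replace=True)
        diffs[i] = relative_diff_in_means(a_s, b_s)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
```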

However, as we will show below, going beyond the comparison of means and using the quantile functions to compare the distributions will demonstrate that the comparison of means is unreliable. We first briefly review the quantile function.

Figure 1. Means of the A (Champion) and B (Challenger) groups after our standard outlier removal strategy. Error bars represent 95% CIs. The relative difference and its associated 95% CI are shown in the figure title.

Defining the quantile function

Our approach uses the quantile function to visualise the distribution in each group. Put simply, the quantile function Q takes as input a probability tau (between 0 and 1) and returns the threshold value Q(tau) such that the probability of randomly sampling an observation below this threshold equals tau.

That’s quite a mouthful, but the idea is simple. For example, for tau = 0.25, 0.5 and 0.75 the quantiles Q(0.25), Q(0.5) and Q(0.75) are just the lower quartile, median and upper quartile, respectively. If you are not familiar with quantiles and the quantile function the wiki page and this post are good places to start.
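
In code, computing empirical quantiles is a one-liner. For example, with NumPy:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Q(0.25), Q(0.5) and Q(0.75): lower quartile, median and upper quartile.
print(np.quantile(x, [0.25, 0.5, 0.75]))  # [3.25, 5.5, 7.75]
```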

Visualising the quantile function

The quantile functions for each of the two groups in our example are shown in the top panel of Figure 2 below. We can see that the two functions are generally closely matched (the two lines largely overlap). The B group is higher in certain parts, for example for the median and upper quartile (shown by dashed vertical lines).

Notably, there is an uptick in the quantile near the right end of the functions for both groups (enclosed by the red square). Note that the y axis is on the log scale, meaning that the uptick is very large — values around tau = 0.99 (the 99th percentile) are more than 10 times greater than values around tau = 0.75 (the 75th percentile)! This indicates that there is a long upper tail in the distribution for each group, with some very large values in the dataset — a red flag.

Visualising differences in the quantile function

Because the A/B experiments we run at AIPS often have effects in the range of a few percentage points for the relative mean difference, the quantile functions themselves are often very similar between the two groups and this makes it hard to visually pick up differences. We can see this in the figure where the two lines largely overlap.

To deal with this, it is useful to also visualise the difference in quantiles (defined as B (Challenger) − A (Champion)), as shown in the bottom panel of Figure 2. In this plot we can easily see that the difference is mostly above the zero line, indicating that the B group has generally had a positive effect. However, at the very end of the distribution (enclosed by the red square) the difference is highly negative (note the log scale).

Thus, it seems that the original comparison of means — which found that the B group is inferior by ~8.5% in relative terms — could be driven by differences in the very large values in the tails (potential outliers), and may not reflect the effect of the intervention for most users.

Figure 2. Top: Quantile functions for the A (blue) and B (orange) groups. The very large increase at the end of the functions (highlighted by the red square) indicates a long upper tail. Bottom: Difference in the quantile functions from the top panel. The large difference at the end (highlighted by the red square) indicates a large difference in the tail, potentially due to outliers.
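
A figure along the lines of Figure 2 can be put together with a few lines of Matplotlib. The sketch below illustrates the idea and is not our exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_quantile_functions(a, b, taus=np.linspace(0.01, 0.99, 99)):
    """Plot the quantile function of each group (top) and their difference B - A (bottom)."""
    qa, qb = np.quantile(a, taus), np.quantile(b, taus)

    fig, (top, bottom) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
    top.plot(taus, qa, label="A (Champion)")
    top.plot(taus, qb, label="B (Challenger)")
    top.set_yscale("log")  # a long upper tail shows up as a sharp uptick on the right
    top.set_ylabel("Q(tau)")
    top.legend()

    bottom.plot(taus, qb - qa)
    bottom.axhline(0, color="grey", linestyle="--")
    bottom.set_yscale("symlog")  # log-like scale that also handles negative differences
    bottom.set_xlabel("tau")
    bottom.set_ylabel("Q_B(tau) - Q_A(tau)")
    return fig
```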

Assessing sensitivity to the outlier removal threshold

The previous plots suggest that outliers may be present in the data and that these could have affected our original comparison of means (Figure 1). But it would be good to be able to quantify the impact these outliers may have had on our results.

To more directly assess this, we can visualise how the difference in means changes as we progressively remove more of these extreme data points.

We start by keeping all the data and conducting the mean comparison, same as the initial bar graph in Figure 1. Then we remove the top 0.5% of observations and repeat the mean comparison. Then we remove an additional 0.5% (same as removing the top 1% from the original data set) and repeat the comparison again, and so on.
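
In code, this sensitivity analysis is a simple loop over trimming thresholds. A minimal sketch follows (the step size and helper names are ours, for illustration):

```python
import numpy as np

def trim_top(x, pct):
    """Remove the top pct% of observations."""
    x = np.asarray(x)
    if pct == 0:
        return x
    cutoff = np.percentile(x, 100 - pct)
    return x[x <= cutoff]

def trimming_sensitivity(a, b, steps=np.arange(0, 5.5, 0.5)):
    """Relative difference in means after progressively trimming the upper tail of each group."""
    results = []
    for pct in steps:
        a_t, b_t = trim_top(a, pct), trim_top(b, pct)
        rel_diff = (np.mean(b_t) - np.mean(a_t)) / np.mean(a_t)
        results.append((pct, rel_diff))
    return results
```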

The results of this sensitivity analysis are shown in Figure 3 below. The y-axis reflects the relative difference in the means and the shaded area represents the associated 95% confidence intervals. We can see that the initial ~8.5% reduction (marked by the black dashed line) switches to a ~4% uplift after removing just the top 0.5% of observations!

The additional removal of more data has, in comparison, relatively little effect, with the estimates remaining in the positive 1–4% range. Note that in this example the confidence intervals remain large (partly due to the relatively small sample size) and this would need to be taken into consideration in any “roll out” decisions.

Together, the visualisation of the quantile function and the sensitivity analysis help us determine whether our result may be affected by outliers (i.e. a large difference in the tails) and help us quantify what effect those outliers may have had (i.e. how the result changes as we progressively remove more of the extreme data points).

This approach is visual, simple and informative. We have generally found that using it affords more reliable decision making. However, there are some important limitations and open questions to keep in mind, as we discuss below.

Figure 3. Sensitivity analysis assessing how the relative difference in means (y axis) changes as we progressively remove more of the extreme data points (x axis). The original relative difference of ~-8.5% changes to ~+4% after removing just the top 0.5% of observations.

Limitations and open questions

Though the procedures above are straightforward and easy to implement, there are some important things to keep in mind when conducting this type of analysis.

Outliers… or just unusual users?

In our sensitivity analysis above we showed that removing just 0.5% of observations completely changed the experiment result.

In some cases, it may be possible to cross-check and confirm that this 0.5% indeed reflects non-human activity — be it a bot or a bug. But in most cases there is no way to be sure that these extreme observations do not correspond to genuine user activity.

Even if these extreme observations are the activity of human users, they may not reflect the majority of customers and, therefore, we may wish to exclude them from our analyses anyway. From a “roll out” decision perspective, it may be acceptable to exclude some genuine users from the analysis, since we generally aim to improve the experience for the majority of users.

On the other hand, “heavy” users with high activity may be central to the decision making in some experiments. In this case it may be useful to consider data imputation (e.g. Winsorisation) to reduce the influence of those observations, rather than omitting the data altogether.
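
As a concrete example, Winsorisation caps extreme values at a chosen percentile instead of dropping them. SciPy ships an implementation; here is a small sketch using the toy data from earlier.

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Cap (rather than remove) the top 10% of observations; here 1000 is replaced by 9.
x_wins = winsorize(x, limits=(0, 0.1))
print(x_wins)         # [1 2 3 4 5 6 7 8 9 9]
print(x_wins.mean())  # 5.4, compared with 104.5 before Winsorisation
```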

Removing data from one or both ends of the distribution

The sensitivity analysis was focused on removing values (also known as trimming or truncating) from the upper tail or “right” end of the distributions. Another possibility is to remove values from both ends.

Count and conversion metrics in online experiments have non-negative distributions that are often highly skewed, which can make removal from both ends trickier. For example, consider data from a Poisson distribution with a rate parameter of 1. Removing values greater than 4 removes the top ~0.5% of observations (P(X > 4) ≈ 0.0037). However, at the other end of the distribution we can only go as far as removing zeros, which would remove ~37% (P(X = 0) ≈ 0.37) of the data.
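
These tail probabilities are easy to verify with SciPy:

```python
from scipy.stats import poisson

lam = 1  # rate parameter of the Poisson distribution

print(poisson.sf(4, lam))   # P(X > 4) ≈ 0.0037
print(poisson.pmf(0, lam))  # P(X = 0) ≈ 0.37
```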

Workarounds are possible but come with their own complications. Where possible, it may be prudent to inspect the effect of removing data from one as well as both ends of the distribution.

Don’t cherry pick results

The visualisations we shared offer much more information than the original bar graph. One concern is that this additional information could lead to “cherry picking” results.

For example, some may be tempted to only focus on the quantiles or thresholds for which “positive” outcomes are observed and disregard the bigger picture.

From a methodological perspective, this may be alleviated by the construction of confidence intervals of the quantile function together with appropriate correction for multiple comparisons, as discussed in this post by Netflix. For scaling considerations see this post by Spotify.

More generally, at AIPS we try to mitigate the misuse of results by making experimentation data scientists responsible for ensuring that the nuances and limitations of this (and any other) approach are considered and clearly communicated to all stakeholders.

Summary

Most A/B experiments are analysed using a comparison of group means. Since the group mean is highly sensitive to extreme observations, reliable findings can only be obtained when appropriate methods are used for dealing with outliers.

Here we showed that visualising the quantile function can be useful for detecting outliers and assessing their potential effects on the estimated mean difference. Progressively reanalysing the data with more outliers removed and plotting the results can help quantify the sensitivity of the results to the effects of those outliers.

A major advantage of our approach is that it is visual. Plus, quantiles and percentiles are well understood by members of most data-driven organisations, which also helps with communication.

Of course, our approach is not the only way to interrogate the data beyond a simple comparison of means. It can be supplemented by other approaches, such as comparing other statistics or transforming the data. For example, rank-transformed data may be less affected by outliers and can offer increased statistical power in some cases.
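
As one concrete illustration (not part of the approach described above), a rank-based comparison such as the Mann–Whitney U test depends only on the ordering of the observations, so a single extreme value carries no extra weight:

```python
import numpy as np
from scipy.stats import mannwhitneyu

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])
B = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

# The test uses only the ranks, so the outlier of 1000 is treated like any other largest value.
stat, p_value = mannwhitneyu(A, B, alternative="two-sided")
print(stat, p_value)
```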

Nonetheless, our approach is visual, simple, and often quite informative — making it a useful tool when analysing and communicating the results of online A/B experiments.
