Relevant and non-relevant users in A/B-testing

Lars Hirsch
5 min read · Jun 2, 2024


A common mistake among practitioners of A/B-testing, even at very large and sophisticated companies, is failing to identify the relevant set of participants in the experiment.

Here is a common way to execute an A/B-test: (i) For a given user/page request, use a hash function (or similar) to assign users to the treatment or control group. (ii) Continue serving the page, providing the appropriate treatment as necessary. (iii) Analyze results based on all users’ group assignments.
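Step (i) is often done with a salted hash so that assignment is deterministic and stateless. A minimal sketch (the experiment name and 50/50 split are illustrative assumptions):

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing the user id together with an experiment-specific salt gives a
    stable, roughly uniform bucket in [0, 1) without storing assignments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15  # uniform float in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Because the hash is keyed on the experiment name, the same user can land in different groups across experiments, which keeps concurrent tests independent.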

This can serve the purpose of some experiments, but for many types of tests it is suboptimal because the design leads to a significant loss of statistical power. This is not due to insufficient traffic; it results from a failure to identify which users (participants) are relevant to the experiment. By mixing relevant and irrelevant users, we introduce noise into the data set, and higher variance in the underlying data leads to greater loss of statistical power.
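The cost of this dilution can be made concrete with the standard two-proportion sample-size approximation: if only a fraction of included users can be affected, the measurable lift shrinks by that fraction while the variance stays roughly the same, so the required sample size blows up roughly as 1/fraction². The base rate and lift below are hypothetical numbers, not from the article:

```python
import math

def required_n_per_arm(p_control: float, lift: float, relevant_share: float,
                       z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Approximate per-arm sample size for a two-proportion z-test
    (alpha = 0.05 two-sided, 80% power by default).

    If only `relevant_share` of included users can be affected, the
    observable lift is diluted to lift * relevant_share.
    """
    p_treat = p_control + lift * relevant_share  # diluted treatment rate
    diluted_lift = p_treat - p_control
    p_bar = (p_control + p_treat) / 2
    n = ((z_alpha + z_power) ** 2 * 2 * p_bar * (1 - p_bar)) / diluted_lift ** 2
    return math.ceil(n)

# Hypothetical example: 5% CTR, +0.5pp true lift among relevant users.
n_all_users = required_n_per_arm(0.05, 0.005, relevant_share=0.1)
n_relevant_only = required_n_per_arm(0.05, 0.005, relevant_share=1.0)
```

With only 10% of users relevant, analyzing everyone requires on the order of 100x more users per arm than analyzing the relevant users alone.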

I will define “relevant users” as users who were eligible to be impacted by the experiment. By eligible I mean that they would have been impacted had they been in a treatment group. But wouldn’t they by definition be impacted if they are in a treatment group? Not necessarily.

Let’s highlight this using an example. Say we want to analyze the impact on ads quality (which I discussed here) of NARM, a new ad retrieval method we invented to identify and retrieve relevant ads, and we want to make sure this method retrieves ads of similar or better quality than the ads we are already showing. We will use NARM in addition to our other seven retrieval methods, not as a replacement. These methods each provide 100 ads, and out of the 700 or so retrieved ads we use a ranking algorithm to select the one ad we will print on the page. We will use click-through rate (CTR, the fraction of impressions that result in a click) as a proxy for quality. Importantly, most sessions will not be impacted by NARM since ads from other retrieval methods will win the ranking most of the time.

Now we could consider all sessions for all users, and compare CTR in treatment and control. But given that only a minority of ad impressions will be NARM ads, this is intuitively not ideal. We really want to compare the CTR of NARM ads with that of the ads they replaced. Are the NARM ads of similar or better relevance than the ads they replaced? The difference between the NARM ads and the ads they replaced may be drowned out by the CTR variance across all other ads.

We will divide users into a treatment group (where we show NARM ads) and a control group (where we don’t). The key now is to make sure we can identify both the users in the treatment group who actually saw an ad retrieved by NARM, and the users in the control group who would have seen an ad retrieved by NARM had they been in the treatment group. The former can be achieved through logs (assuming the retrieval method of the printed ad is logged, which is a good idea); the latter will likely require special care to enable. So we must take care to save a flag with each user indicating whether the user was eligible to be impacted by the treatment (for simplicity of analysis, it’s a good idea to do this for both treatment and control).

This means we need to enable NARM for the control group as well, perform ranking with the NARM ads included, and then check whether the winning ad was retrieved by NARM. If it was, log that the user was eligible, then rerun ranking without the NARM ads (or select the highest-ranked non-NARM ad if that’s faster and would yield the same result).
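The counterfactual check described above can be sketched as follows. The ad fields, method names, and single-slot score-based ranking are illustrative assumptions, not the real serving system:

```python
def serve_ad(group: str, candidates_by_method: dict, log: list) -> dict:
    """Pick the winning ad and log whether the user was NARM-eligible.

    Ads are dicts carrying their retrieval 'method' and a ranking 'score'.
    Eligibility is evaluated with NARM included for BOTH groups, so the
    control group gets the same counterfactual flag as treatment.
    """
    all_ads = [ad for ads in candidates_by_method.values() for ad in ads]
    winner_with_narm = max(all_ads, key=lambda a: a["score"])
    log.append({"group": group,
                "narm_eligible": winner_with_narm["method"] == "NARM"})

    if group == "treatment":
        return winner_with_narm
    # Control must never see NARM ads: for a single slot, serving the best
    # non-NARM ad is equivalent to re-ranking with NARM candidates removed.
    non_narm = [a for a in all_ads if a["method"] != "NARM"]
    return max(non_narm, key=lambda a: a["score"])
```

The analysis then filters on `narm_eligible` in both arms, which is exactly the "relevant users" set.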

After the experiment we can estimate the impact of NARM on quality with much higher precision by only including relevant users in the analysis.

Now, this works for quality, but what about monetization? Disregarding second-order and long-term effects on advertisers (which I discussed here), and assuming we use second pricing (the winning ad is priced based on the bid, and potentially other properties, of the runner-up ad), we need to modify our approach to estimate the monetization impact.

Since pricing will be impacted by a NARM ad in the second position, we also need to consider sessions (or users) that were eligible to see a page with a NARM ad in second position. This ad would not be visible on the page, but it would impact the pricing of the winning ad. To calculate the impact of NARM on monetization, we would include all users who were eligible to see a page with a NARM ad either in the winning position or as a runner-up impacting pricing.
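The broadened eligibility rule is a small extension of the counterfactual check: rank with NARM included (in both arms) and flag the user if a NARM ad holds either of the top two slots. A sketch, again with hypothetical ad fields:

```python
def narm_monetization_eligible(ranked_ads: list) -> bool:
    """Return True if the user could be monetization-impacted by NARM.

    `ranked_ads` is the counterfactual ranking with NARM candidates
    included, sorted best-first. Under second pricing, the runner-up sets
    the winner's price, so a NARM ad in slot 0 changes what is shown and
    a NARM ad in slot 1 changes what the shown ad costs.
    """
    return any(ad["method"] == "NARM" for ad in ranked_ads[:2])
```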

In a pay-per-click model, we could be even stricter and include only users who either (i) were eligible to see a NARM ad, or (ii) clicked on an ad that was eligible to be priced based on a NARM ad. This would be better since monetization is only impacted if (i) we’re showing a different ad, or (ii) we charge a different amount for the same ad.
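As a per-session filter, the stricter pay-per-click rule might look like this. The session fields are assumed to come from the counterfactual logging described earlier (so they are populated the same way in treatment and control):

```python
def ppc_relevant(session: dict) -> bool:
    """Stricter pay-per-click relevance filter.

    Revenue can only change if (i) a different ad was shown, i.e. NARM won
    the counterfactual ranking, or (ii) the shown ad's price changed AND
    the price was actually charged, i.e. NARM was the counterfactual
    runner-up and the ad was clicked.
    """
    if session["winner_method"] == "NARM":
        return True
    return session["runner_up_method"] == "NARM" and session["clicked"]
```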

Importantly, these methods exclude all non-relevant users from the experiment analysis, meaning users in treatment who were not impacted by the experiment and users in control who would not have been impacted had they been in the treatment group. Of course, if an experiment would impact all, or almost all, users, you can ignore this because all or most users would be relevant anyway. But in many, perhaps most, experiments this is not the case. And in situations where the relevant users represent a small fraction of total users, we stand to gain a lot of statistical power by making sure to include only relevant users.


Lars Hirsch

I'm a seasoned tech leader passionate about helping ad tech businesses in the US and globally. Ex- Google, Amazon, Snap, Microsoft.