Privacy Sandbox Market Testing

A statistical model for measuring performance impacts

Caio Camilli
Criteo R&D Blog
16 min read · Dec 5, 2023


TL;DR

In this article, we introduce a statistical model to estimate the measurement capabilities of any DSP (Demand-Side Platform) during the upcoming Privacy Sandbox Market Testing and discuss some conditions and underlying assumptions that are crucial for its success. The main takeaways are:

  • To measure the performance impact, it is essential to keep average spend per user constant between the treatment and control populations;
  • With an 8-week test period and a perfect setup, Criteo can confidently measure impacts of -7% or stronger with a 0.75% treatment population (Mode B), and impacts as small as -2.5% with the maximum treatment population of 8.25% (Mode A + Mode B);
  • Most DSPs will likely be able to measure drops larger than 20% with Mode B alone and larger than 6% with the maximum treatment population

Check all of our articles about Privacy Sandbox here.

1. Introduction

1.1 Privacy Sandbox and Market Testing

Google plans to phase out third-party cookies (3PCs) in their Chrome browser by 2024 and has been working jointly with Criteo and other actors in the programmatic advertising industry since 2019 to find a replacement solution, in an initiative called Privacy Sandbox for the web.

Since 2021, the UK’s Competition & Markets Authority (CMA) has also been involved to ensure both user privacy and market competition are protected. In June 2023, it released a note on market testing advising stakeholders on a timeline and approach for assessing the Privacy Sandbox tools by the end of June 2024. The results will inform the CMA’s decision on whether to allow Chrome to remove third-party cookies in the UK, a decision Google will then apply worldwide.

As the world’s leading Commerce Media Platform, Criteo is taking part in the testing procedures and is in a privileged position to provide an informed assessment, thanks to its active involvement in the Privacy Sandbox since the start as well as its scale and technical capabilities.

1.2 Testing conditions

Three test populations will be used, as defined by the CMA:

  • Treatment group: ads served using the Privacy Sandbox APIs, without 3PCs;
  • Control group 1: ads served using 3PCs and without data related to new APIs (business as usual);
  • Control group 2: ads served removing data related to both 3PCs and the new APIs

These populations will be grouped based on two Modes:

  • Mode A: In this mode, 3PCs will still be available across groups but Chrome will assign a portion of traffic (up to 8.5%) to treatment and control group 1 and provide ad techs with labels telling them to which group each user belongs

The percentage of total users assigned to the treatment group under Mode A can go up to 7.5%

  • Mode B: Chrome will deprecate 3PCs for 1% of traffic globally (to form the treatment group). 25% of these will also have Privacy Sandbox relevance and measurement APIs disabled (to form control group 2)

The percentage of total users assigned to the treatment group under Mode B will thus be equal to 0.75%

Mode A begins in Q4 2023. It requires industry coordination, as DSPs need to simultaneously decide to ignore 3PCs and use the Privacy Sandbox APIs on the treatment group for the results to be representative.

Mode B begins in January 2024. As DSPs will not have the option of using 3PCs, its results are guaranteed to be as unbiased as possible (see the Practical considerations section for more details).

In summary, it will be possible to assess the Privacy Sandbox APIs on a percentage of traffic varying from 0.75% (Mode B only) to 8.25% (Mode A + Mode B, depending on market coordination).

Illustrating the testing populations (drawing not to scale)

1.3 Problem to be solved and KPI of interest

One important question for any participating DSP concerns its capacity to distinguish true impacts from noise, given the test conditions. Indeed, real data is subject to random fluctuations which disturb measurements and, without proper acknowledgement and treatment, may even change the conclusions.

In this article, we develop a statistical framework that can be used by a DSP to anticipate its own measurement capability during the Privacy Sandbox Market Testing. We then extend it to extrapolate what other DSPs might see as a function of their ad spend.

While the framework is general and could be adapted to several industry KPIs, in this article we focus on conversions per dollar.

The specific question we will address is: what is the minimum drop in conversions per dollar that a DSP can confidently measure in eight weeks, depending on its ad spend and the size of the treatment group?

1.4 Article structure

We begin by detailing the statistical framework and the modelling choices made, as well as some hypotheses related to the competitive market and the behavior of business KPIs.

Then, we apply the framework using Criteo data to estimate what we and other DSPs might see, depending on their share of ad spend compared to Criteo’s.

Some technical details, formulas and secondary analyses supporting the methodology can be found in the Appendix, intended for readers with a stronger background in Statistics.

1.5 Acknowledgement

Many thanks to Victor LE MAISTRE who created the first version of this analysis and presented it to Google in September 2023.

2. Methodology

2.1 Statistical framework

2.1.1 Hypothesis testing

We place ourselves in the framework of statistical hypothesis testing, with the Privacy Sandbox Market Testing as the experiment (AB test). To be concise, we assume the reader has some knowledge about hypothesis tests, otherwise this is a great resource to get started.

We will test the following hypotheses:

  • H0 (null): “replacing 3PCs with the Privacy Sandbox APIs leads to no change in conversions per dollar“;
  • H1 (alternative): “replacing 3PCs with the Privacy Sandbox APIs leads to a drop in conversions per dollar“

The test statistic will be the relative conversions per dollar (CPD) ratio:
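In symbols (writing C for the total number of conversions and S for the ad spend in each group, consistently with the definitions in Sections 2.2.2 and 2.3.3):

$$ \text{relative CPD ratio} \;=\; \frac{\mathrm{CPD}_{\text{treatment}}}{\mathrm{CPD}_{\text{control}}} \;=\; \frac{C_{\text{treatment}}/S_{\text{treatment}}}{C_{\text{control}}/S_{\text{control}}} $$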

We reject the null hypothesis if the observed relative CPD ratio is smaller than a critical value.

To compute the critical value, we will:

  • Assume the test statistic follows a given distribution under H0 (as we will see later, the Beta prime distribution);
  • Fit this distribution using empirical data by simulating 1000 AA tests;
  • Take its 5th percentile to ensure 95% statistical significance

Illustration of how to find the critical value for the hypothesis test
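The three steps above can be sketched in a few lines of Python. The AA-test values here are randomly generated placeholders standing in for the real measurements, and the Beta prime distribution is the one introduced in Section 2.2.2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Placeholder for the 1000 relative CPD ratios measured on AA tests.
# In the real analysis these come from random splits of actual traffic.
aa_ratios = stats.betaprime.rvs(8_000, 8_000, size=1000, random_state=rng)

# Steps 1-2: fit a Beta prime distribution to the empirical AA values
# (location fixed at 0, so only the shape and scale parameters are estimated).
a, b, loc, scale = stats.betaprime.fit(aa_ratios, floc=0)
null_dist = stats.betaprime(a, b, loc=loc, scale=scale)

# Step 3: the critical value is the 5th percentile of the fitted null
# distribution; observed ratios below it are declared significant at 95%.
critical_value = null_dist.ppf(0.05)
print(f"critical relative CPD ratio: {critical_value:.4f}")
```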

2.1.2 Type I and II errors

The errors we can make while applying a hypothesis test can be divided into two categories:

  • Type I error: reject the null hypothesis when it is true. The probability of making a type I error is usually denoted α;
  • Type II error: accept the null hypothesis when it is false. The probability of making a type II error is usually denoted β

We call (1-α) the significance level of a test, and (1-β) its power.

There is a tradeoff between significance and power: if we reduce the size of the critical region, it becomes less likely to reject the null hypothesis, which reduces the probability of a type I error but increases the probability of a type II error.

In practice, α is easy to compute because the null hypothesis is well-defined, while in order to compute β it is necessary to detail the alternative hypothesis (for example, assume the true effect follows a precise distribution and then compute the probability of it falling outside the critical region, see below).

Significance and power of a hypothesis test

2.1.3 Introducing the Minimum Detectable Downlift

Now that we have given an overview of hypothesis testing and Type I and II errors, we are ready to define the main metric of interest of this analysis, the Minimum Detectable Downlift (MDD):

The MDD is the smallest true decrease in conversions per dollar that can be detected with 95% significance and at least 80% power by a given DSP

The MDD is not the measurement obtained during the AB test; it is a limit value: if the true impact of replacing 3PCs with the Privacy Sandbox is at least as large as the MDD, there is at least an 80% chance that a classical hypothesis test will be able to distinguish it from the noise and reject the null hypothesis.

Note that this value is DSP-specific, as the amount of noise its measurements will be subject to depends on many endogenous factors.

For a DSP, computing its MDD is important to estimate whether it will be able to detect an effect at all given the test conditions, or whether it should instead ask for a longer test duration or a bigger treatment group.

We set 95% and 80% for significance and power respectively because those are standard values used in the scientific literature.

Computation of the Minimum Detectable Downlift

2.2 Modelling choices

2.2.1 Total conversions

From a DSP perspective, the total number of conversions is a random variable as it depends on several factors such as auction results and individual user behavior which are not possible to perfectly predict.

Starting from a simple user model where the number of conversions per group of M users follows a Poisson distribution with parameter λᵢ, where λᵢ is drawn from an Exponential distribution of scale θ > 1, it is possible to show (see Appendix) that the number of sales in a population with k groups of M users can be approximated with a Gamma(k, θ) distribution.

This is very useful for our simulations because under this scenario, k is proportional to the population size and θ is proportional to the expected number of conversions per user. If a population following this law is split randomly (as will be the case in the Privacy Sandbox AB test), each subpopulation follows a Gamma(k*p_group, θ) distribution: θ stays the same and k is scaled down to match the subpopulation size relative to the whole!

Some examples of probability density functions for a Gamma random variable
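A minimal simulation sketch of this user model (the parameter values are illustrative, not Criteo figures) confirms the Gamma approximation numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

k, theta = 5_000, 20.0   # number of user groups, Exponential scale (illustrative)
n_sims = 1_000           # number of simulated campaign totals

# Hierarchical model: lambda_i ~ Exponential(theta), conversions_i ~ Poisson(lambda_i)
lambdas = rng.exponential(scale=theta, size=(n_sims, k))
totals = rng.poisson(lam=lambdas).sum(axis=1)

# Direct Gamma(k, theta) samples for comparison
gamma_totals = rng.gamma(shape=k, scale=theta, size=n_sims)

print("hierarchical model mean/std:", totals.mean(), totals.std())
print("Gamma(k, theta)    mean/std:", gamma_totals.mean(), gamma_totals.std())
# Both means are close to k*theta; the Poisson layer only adds a variance of
# k*theta, negligible next to the Gamma variance k*theta**2 when theta >> 1.
```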

2.2.2 Relative conversions uplift

We have seen that we can model conversions with a Gamma distribution, and that our uplift can be expressed as the ratio of the number of conversions between the control and treatment groups, rescaled by the ratio of their sizes.

We can show (see Appendix) that if the total number of sales follows a Gamma(k, θ) distribution, then under the null hypothesis the CPD ratio follows a Beta prime distribution depending only on k and the treatment group size r.

The null hypothesis setup is what we call an AA test: the population is split randomly in two groups without any special treatment. In practice there will still be measurement differences between them, but we know they are 100% attributable to statistical noise.

Below we plot some examples of this distribution with k = 10000, for several treatment group sizes. As one might expect intuitively, the closer the split is to 50%/50%, the less noisy this distribution will be (lower variance).

Influence of the treatment group size
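The null distribution can also be checked by simulation; the sketch below (with illustrative parameters) compares simulated ratios of Gamma variables with the corresponding rescaled Beta prime law for a few treatment group sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k = 10_000      # total shape parameter (illustrative)
theta = 5.0     # Gamma scale; it cancels out of the ratio

for r in (0.0075, 0.05, 0.5):                 # treatment group sizes
    treat = rng.gamma(shape=k * r, scale=theta, size=50_000)
    ctrl = rng.gamma(shape=k * (1 - r), scale=theta, size=50_000)
    ratio = (treat / r) / (ctrl / (1 - r))    # relative CPD ratio under H0

    # Closed-form counterpart: Beta prime(k*r, k*(1-r)) rescaled by (1-r)/r
    null_dist = stats.betaprime(k * r, k * (1 - r), scale=(1 - r) / r)
    print(f"r = {r:>6}: simulated std = {ratio.std():.4f}, "
          f"Beta prime std = {null_dist.std():.4f}")
```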

2.2.3 Modelling the impact of 3PC deprecation

Here we assume that the impact of deprecating third-party cookies in terms of capturing sales will be equivalent to multiplying the conversions random variable in the treatment group by a factor of M, with 0 < M < 1.

Using this, we can show (see Appendix) that the AB test metric will follow a Beta prime distribution that depends on k, M and the ratio of population sizes.

Influence of the uplift metric, defined as (M-1)
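Under these modelling assumptions, the power associated with a given multiplier M, and hence the MDD defined in Section 2.1.3, can be computed in closed form. The sketch below uses illustrative values for k (not the fitted Criteo parameter), and the helper name is ours:

```python
from scipy import stats

def mdd(k: float, r: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum Detectable Downlift under the Beta prime model.

    Under H0 the CPD ratio is BetaPrime(k*r, k*(1-r)) scaled by (1-r)/r;
    under H1 the same law is additionally scaled by the multiplier M, so the
    condition P(ratio < critical value) = 0.80 can be inverted exactly.
    """
    base = stats.betaprime(k * r, k * (1 - r))
    q_alpha = base.ppf(alpha)   # sets the critical value (up to the (1-r)/r scale)
    q_power = base.ppf(power)   # quantile that M must push below the critical value
    m_star = q_alpha / q_power  # largest multiplier still detected with 80% power
    return m_star - 1.0         # expressed as a (negative) relative downlift

# Illustrative values only -- not the fitted Criteo parameters.
for r in (0.0075, 0.0825):
    print(f"treatment share {r:.2%}: MDD = {mdd(k=100_000, r=r):+.2%}")
```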

2.2.4 Projecting impacts to other DSPs

We assume that clients’ expectations in terms of conversions per dollar will be the same for all DSPs. This means that the total number of driven sales for a DSP will be directly proportional to its ad spend.

As before we model sales as a Gamma(k_DSP, θ_DSP) random variable. We are able to measure the parameters for Criteo and we need a model to extrapolate to other DSPs.

We will assume the total number of driven sales will be equal to the expectation of the Gamma law, which is given by k_DSP * θ_DSP. Thus we have:
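In symbols (writing s for the ratio between the DSP’s ad spend and Criteo’s):

$$ k_{\mathrm{DSP}}\,\theta_{\mathrm{DSP}} \;=\; s \cdot k_{\mathrm{Criteo}}\,\theta_{\mathrm{Criteo}}, \qquad s \;=\; \frac{\text{spend}_{\mathrm{DSP}}}{\text{spend}_{\mathrm{Criteo}}} $$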

Intuitively, driving fewer sales comes from a combination of two factors:

  • Reaching fewer users (this is governed by k);
  • Driving fewer sales per user (this is governed by θ)

We introduce a new parameter A (0 <= A <= 1) that controls how this impact is split between the two factors. We then have:
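One consistent way to write this split (the power-function form is the one assumed in the Limitations section, and matches the interpretation of A given just below):

$$ k_{\mathrm{DSP}} \;=\; s^{A}\, k_{\mathrm{Criteo}}, \qquad \theta_{\mathrm{DSP}} \;=\; s^{\,1-A}\, \theta_{\mathrm{Criteo}} $$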

The parameter A is the elasticity of user reach to the ad spend and can be interpreted (roughly) as: increasing/decreasing spend by 10% will lead to reaching (10 * A)% more/fewer users.

Note that if A = 0, the size of a DSP has no impact on the noise levels it experiences. The higher A is, the noisier the distribution will be for a smaller DSP.

Influence of the parameter A

2.3 Practical considerations

2.3.1 Test results vs. impact of a full roll-out

With market testing we hope to estimate the true impact of replacing 3PCs with Privacy Sandbox APIs on 100% of Chrome traffic. The results obtained will not be a good proxy for it if:

  • The Privacy Sandbox APIs made available change significantly between testing and full roll-out;
  • The DSPs and SSPs that participate do not add up to a significant market share of the industry;
  • Market coordination to use the labels provided with Mode A is flawed

We trust the test stakeholders will be attentive to these and other points related to making sure the Privacy Sandbox Market Testing gives a reliable and accurate assessment of what will happen if Google goes forward with 3PCs deprecation.

2.3.2 Tradeoff between ROI and volume

As stated before, the main KPI of interest for each group is conversions per dollar (CPD). In the advertising industry, there is a known tradeoff between volume and ROI (diminishing returns): for instance if a DSP reduces its total ad spend by 50%, it is very likely that it will be able to improve its ROI, at least marginally. Note that for clients interested in driving onsite conversions, the ROI is the reciprocal of conversions per dollar.

Thus, to really assess the performance impact of the Privacy Sandbox, average ad spend per user should be kept constant between the test and control groups. This also means that a drop in conversions per dollar will be the same as a drop in total conversions (renormalized by group size).

2.3.3 Measuring the test statistic

To estimate the drop in conversions per dollar brought by the deprecation of third-party cookies, we will measure the total number of conversions for users inside the treatment group (no 3PCs, Privacy Sandbox APIs available) versus users inside the control group (3PCs available as usual), making sure that the total ad spend on each group is proportional to the share of users belonging to it.

For instance, if the treatment group is 1% of Chrome traffic and the regular control group contains 99% of it, we will aim to spend 99x more in the control group than in the treatment group. Then, we measure the drop in conversions per dollar as:
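In symbols, with the same notation as in Section 2.1.1:

$$ \widehat{\text{drop}} \;=\; \frac{C_{\text{treatment}}/S_{\text{treatment}}}{C_{\text{control}}/S_{\text{control}}} \;-\; 1 $$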

2.4 Limitations

As with any model, some strong hypotheses were made to simplify a reality which is impossible to describe with a couple of parameters. We believe the biggest generalization gaps lie in:

  • Approximating the relationship between ad spend and reached users with a power function: this works well for small changes but is unlikely to hold for drastic variations of spend;
  • Approximating the impact of the Privacy Sandbox as a multiplier to the number of conversions random variable: this is a simplification equivalent to saying that the number of reached users (k) remains the same but the probability of them converting (θ) decreases;
  • The AA test was performed using data from 2023 with 3PCs; it won’t necessarily reflect user behavior and market competition conditions in Q1 2024;
  • The Beta prime distribution is a good approximation to the AA test distribution but it is not perfect (more details in the Appendix — Goodness of Fit section);
  • The chosen values for significance and power, although standard in the statistical literature, are arbitrary and influence the final values for MDD

3. Results

3.1 AA tests: noise measurements from Criteo perspective

To estimate the parameters of our AA test distribution, we simulated 1000 random population splits over Chrome browser users. Since there was no test ongoing, we know that in theory there should be no difference between the two populations, so anything we measure is due to pure noise.

The result was 1000 values for our drop metric, following its distribution under the null hypothesis, since this is what we would observe if the Privacy Sandbox had no impact on conversions per dollar. We fit a Beta prime distribution over these values to get the analytical distribution parameters.

We do this for several proportions simulating distinct treatment group sizes (0.75%, 1%, 2%, 5% and 10%) and test horizons (2 weeks, 1 month, 2 months and 4 months). Below we plot the critical values for each combination of size and horizon.

Remembering the definition of the critical value, we need to measure impacts at least as strong to be sure (at the 95% confidence level) that we are not merely detecting noise.

We see above that, naturally, the greater the treatment group size and the time horizon, the smaller the critical value (in absolute terms) will be, as the signal-to-noise ratio increases.

3.2 Minimum Detectable Downlift (MDD) projections for Criteo

Once we have our critical values and the null hypothesis distribution, we can input any projected Privacy Sandbox impact to get the AB test distribution, and with it compute the statistical power.

Remembering the definition of Minimum Detectable Downlift, it is the projected impact that gives a statistical power of 80% for a 95% significant critical value. We plot it below, again for each treatment group size and time period.

We see overall very similar relationships for MDDs as we saw for critical values, but notice that MDDs are strictly larger in absolute terms (they would be equal to the critical value if the statistical power target were 50% instead of 80%).

With Mode B setup (0.75% treatment population) and two months, we anticipate that Criteo will be able to confidently detect drops stronger than -7%.

With Mode A fully deployed and Mode B (8.25% treatment population), it can detect drops as small as -2.5%.

3.3 MDD projections for other DSPs (upper bound with A = 0.5)

Using the impact extrapolation hypothesis with A = 0.5, we modify the AA test and AB test distributions to simulate what other DSPs might see as a function of their ad spend (relative to Criteo) and finally compute their projected Minimum Detectable Downlift, for a time horizon of 2 months.
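A sketch of this extrapolation, reusing the closed-form MDD helper from the sketch in Section 2.2.3 and the power-function scaling of k with the ad spend ratio (all values illustrative):

```python
from scipy import stats

def mdd(k: float, r: float, alpha: float = 0.05, power: float = 0.80) -> float:
    base = stats.betaprime(k * r, k * (1 - r))
    return base.ppf(alpha) / base.ppf(power) - 1.0

A = 0.5                # spend-elasticity of reach (upper-bound scenario)
k_criteo = 100_000     # illustrative, not the fitted value

for spend_ratio in (1.0, 0.1, 0.01):        # DSP ad spend relative to Criteo
    k_dsp = k_criteo * spend_ratio ** A     # only k changes the noise level here
    for r in (0.0075, 0.0825):
        print(f"spend ratio {spend_ratio:>5}, treatment {r:.2%}: "
              f"MDD = {mdd(k_dsp, r):+.1%}")
```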

Naturally, the higher the ad spend, the smaller the MDD, as the signal-to-noise ratio increases with volume. The parameter A is particularly important in governing this relationship; setting it to 0.5 is a bit pessimistic as it likely overestimates the spend-elasticity of reach (increasing noise levels for smaller DSPs), but it may be seen as an upper bound.

In any case, under this model, we see that even a DSP 100x smaller than Criteo (blue curve) would be able to detect impacts stronger than -20% in a period of two months using only Mode B (0.75% of total traffic). With the maximum treatment population of 8.25%, we expect it to be able to detect impacts stronger than -6%.

We see that Mode B allows most (if not all) DSPs to spot a catastrophic scenario in time (e.g. a -50% drop in conversions per dollar, see for instance this literature review by Dr. Garrett Johnson) and give feedback to Google.

4. Conclusions

In this article, we introduced a statistical model to estimate the measurement capabilities of any DSP during the upcoming Privacy Sandbox Market Testing and discussed some conditions and underlying assumptions that are crucial for its success.

Despite the inherent limitations of statistical modeling, we believe that this framework is well grounded and can serve as a reference for other DSPs to perform their own analyses (which we encourage them to do). With this article, we hope to help the Privacy Sandbox Market Testing stakeholders by giving our data-centric estimations of noise and measurement capabilities.

The main takeaways are:

  • To measure the performance impact, it is essential to keep average spend per user constant between the treatment and control populations;
  • With an 8-week test period and a perfect setup, Criteo can confidently measure impacts of -7% or stronger with a 0.75% treatment population (Mode B), and impacts as small as -2.5% with the maximum treatment population of 8.25% (Mode A + Mode B);
  • Most DSPs will likely be able to measure drops larger than 20% with Mode B alone and larger than 6% with the maximum treatment population

A. Appendix

A.1 Total number of sales as a Gamma random variable

A.2 Test statistic distribution under H0: Introducing the Beta prime distribution

A.3 Test statistic distribution under H1 and MDD computation

A.4 Extending the Beta prime to other DSPs

A.5 QQ plots and other goodness-of-fit indicators

A.6 Sensitivity analysis of A parameter
