As Don Draper would say, advertising can be many things. It is a dream, it is an experience, it is a promise. But more prosaically, advertising is first and foremost an investment. And like any investor, advertisers expect value for their money.
At Criteo, we are committed to proving and measuring the added value brought by our solution. Compared to more traditional channels, programmatic display advertising offers the opportunity to measure the actual value you get for a specific amount of money spent.
Attribution has long been used as a standard to compare the different marketing channels one can use. These predetermined offline rules are easy to read, easy to share, and easy to steer by. But as advertisers grow more and more sophisticated, they rightfully start to challenge these rules, defined 20 years ago, which are not adapted to today’s shopper journeys and habits. To overcome these rules, one can decide to ignore them and simply observe what happens on one’s website (how much traffic, engagement, conversion, revenue, etc. one loses) while shutting down a particular channel. This observed difference is what is colloquially called “incremental impact”.
But beware: this simple, down-to-earth concept actually encompasses very subtle questions that need to be addressed if one wants to estimate incrementality properly. In this blog post, the first of a series on this exciting topic, we will go over the details of how Criteo measures incrementality through AB testing, present the different measurement protocols, and discuss the pros and cons of each.
AB-test methodology at Criteo
In this section, we will recap how Criteo splits its users into several populations when an AB-test (or an incremental AB-test) is running.
Broadly speaking, at Criteo a user is defined by a “user_id”, which is simply a GUID (cf. https://fr.wikipedia.org/wiki/Globally_Unique_Identifier). This GUID is unique per user and stays the same across time unless the user chooses to opt out of Criteo.
If we want to launch an AB-test, we start by defining the percentage of users we want to expose to the “test” group. For the remaining users, Criteo’s behavior will not be modified; we call this group the “control” group.
For simplicity, let’s say the split between the control and test groups is 50%-50%.
When an AB-test is launched, a new identifier is generated for it. For each user, we then compute the hash of the pair (user id, AB-test id) and take it modulo 2. This value (either 0 or 1) tells us whether the user is in the “control” or “test” group, which allows us to choose what has to change for that user’s population (for instance, different banners for users in the test group).
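This hashing scheme can be sketched in a few lines of Python. Note that this is only an illustration: the blog does not specify which hash function or key encoding Criteo actually uses, so MD5 and the string concatenation here are assumptions.

```python
import hashlib

def assign_group(user_id: str, ab_test_id: str) -> str:
    """Deterministically assign a user to "control" or "test" by hashing
    the (user id, AB-test id) pair and taking the result modulo 2.

    MD5 and the ":"-joined key are illustrative choices, not Criteo's
    actual implementation."""
    digest = hashlib.md5(f"{user_id}:{ab_test_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 else "control"
```

Because the AB-test id is part of the hash input, the same user gets an independent assignment for each test, which is what makes overlapping AB-tests analyzable independently.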
For an incremental AB-test, the change of behavior is simple: when a user is in the “control” group, we simply don’t show them an ad (or, most likely, we show them an ad for the next-best campaign not running an incremental AB-test).
This methodology also allows us to run several overlapping AB-tests at the same time for different components: a user can be in the “control” group for a given test and in the “test” group for another, unrelated test. Both AB-tests can be analyzed independently with no bias.
Given the simplicity of this method, we can ask whether it introduces any bias and how robust it is to noise.
On the question of bias: as the user id is randomly and uniformly generated for a given user, and thanks to the properties of the hash function, each user indeed has a 50–50 chance of being in either the “control” or “test” group.
On the question of robustness, we can ask the following: knowing that some rare users exhibit different behavior, what is the likelihood that these users are unevenly split across the control and test groups?
To answer this, we can run a simulation:
- simulate n users overall;
- assign each user to control or test with 50% probability;
- let a small fraction p of all users exhibit the special behavior;
- compute r, the ratio of the number of special users in control to the number of special users in test.
We then analyze the distribution of r according to n and p. The expected value of r is 1, but we are interested in how much it deviates from its mean.
At the scale at which Criteo operates, we can consider that n = 1M for a moderately sized campaign and n = 10M for a big campaign. Let’s use p = 0.1 for simulation purposes.
Let’s call an experiment “fair”, or “evenly split”, when r is between 0.995 and 1.005.
As the simulation shows, for 10M exposed users, we have more than 99% probability that the generated split is “fair”.
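A minimal version of this simulation can be sketched in Python. For speed, the binomial split of the special users is drawn from its normal approximation; that shortcut is mine, not necessarily how the original simulation was run.

```python
import math
import random

def fair_split_probability(n, p, lo=0.995, hi=1.005, n_trials=2000, seed=42):
    """Monte-Carlo estimate of the probability that r, the ratio of special
    users in control to special users in test, falls within [lo, hi].

    Only the special users matter for r. Their count is ~n*p, and the number
    landing in control follows Binomial(n*p, 0.5), approximated here by a
    normal distribution (accurate at these population sizes)."""
    rng = random.Random(seed)
    n_special = round(n * p)  # expected number of special users
    fair = 0
    for _ in range(n_trials):
        in_control = rng.gauss(n_special / 2, math.sqrt(n_special / 4))
        r = in_control / (n_special - in_control)
        fair += lo <= r <= hi
    return fair / n_trials
```

With n = 10M and p = 0.1, this estimate comes out close to 1, while at n = 1M the same tolerance is met far less often: larger campaigns give tighter splits.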
When the split is not 50–50, we cannot directly compare the two populations; we need to reweight what is observed. There are plenty of ways to do so. For example, if the split is 80–20, it is really tempting to duplicate the second population 4 times, but this is wrong. Why? The total population should stay the same, otherwise statistical tests would be biased: if we duplicate the second population 4 times, the total population becomes 80 + (20 × 4) = 160 instead of 100. As the thresholds of statistical tests depend on the total population, this change would distort the estimated significance of the tests. Instead, each population should be multiplied by 0.5/p, where p is the probability of being in that population.
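The reweighting rule is short enough to spell out (a minimal sketch; the function name is mine):

```python
def reweight(count_a, count_b, split_a, split_b):
    """Reweight raw counts from an uneven split: each population is
    multiplied by 0.5 / p, where p is its assignment probability, so the
    reweighted total matches the original total population."""
    return count_a * (0.5 / split_a), count_b * (0.5 / split_b)
```

For an 80–20 split of 100 users, `reweight(80, 20, 0.8, 0.2)` gives (50, 50), keeping the total at 100, whereas duplicating the smaller population 4 times would inflate it to 160.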
Measuring incrementality: ITT vs. Ghost Bids methodology
There are many variations in the process of measuring incrementality. Criteo typically uses two, which are detailed below: intent-to-treat (aka ITT) and the Ghost Bids methodology.
Intent-to-treat considers all users who visited the website and/or app during the period of the test. Indeed, all users who have expressed intent by visiting at least one page of the client’s website or app are eligible for retargeting. Each eligible user is assigned to either population A (exposable) or B (non-exposable). We thus consider all sales happening on-site.
Users in A will be retargeted by Criteo as usual whereas users in B won’t be retargeted at all.
Among all the users who visit the advertiser’s website or app, many won’t be retargeted or won’t see any display. The causal effect we have on users happens through exposure to displays: if we don’t actually expose a user, we know that our impact is zero.
Thus, a method to measure our incrementality is to focus on users who were actually exposed (prime populations in the diagram below).
However, by nature, no users are exposed in the non-exposable population. Thus, there is nothing to compare the exposed users in A’ to.
We thus use the closest proxy that can be fairly estimated in both the exposable and non-exposable populations: we consider a user part of the Ghost Bids population (A’’ or B’’) if and only if they have won at least one internal auction.
Campaign selection process
Here is a quick explanation of how Criteo chooses which campaign to show to a user, a concept required to understand the Ghost Bids methodology.
When a user browses a publisher’s website or app, an auction for an ad is run by an ad exchange: the exchange asks all participants (including Criteo and its competitors) how much they are willing to bid to display their ad. It then selects the highest bidder and bills the winner a value related to its bid and to the bids of the other competitors. Internally, Criteo has to decide how much it is willing to bid for this display opportunity for this user, and for which ad campaign. To do this, we list all campaigns the user is eligible for and pick the best one. This campaign is the only one for which we will try to make a display, and so we will place a bid for this campaign in the ad exchange. At this point, however, we don’t know yet whether we will win the external auction, and so we don’t know yet whether we will actually make a display.

In the case of an incremental AB-test for a campaign, if the user is part of the non-exposed population for the picked campaign, we won’t actually place an external bid for this campaign; we will pick the second-best campaign instead and place the external bid for that one. As such, it is never possible to know whether we would have won the auction for the first pick, and so it is impossible to know whether we would have shown an ad for this campaign to this user. This means that, while we can define the set of users exposed to this campaign in the test group, we cannot define the set of “would-be exposed” users in the control group. This explains why, if we want a more precise method of measurement than ITT, we need to use the Ghost Bids method.
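The fallback logic described above can be sketched as follows. This is a deliberately simplified illustration with hypothetical field names; Criteo’s actual bidder is far more involved.

```python
def select_campaign(user, eligible_campaigns, assign_group):
    """Pick the campaign to bid on externally, recording "ghost bids".

    Campaigns are tried best-first. The winner of the internal auction is
    recorded as a ghost bid when it runs an incremental AB-test (the user
    enters A'' or B'' for it); if the user is in that campaign's control
    group, the external bid is suppressed and the next-best campaign is
    picked instead. Field names ("score", "incrementality_test", ...) are
    illustrative."""
    ghost_bids = []
    for campaign in sorted(eligible_campaigns,
                           key=lambda c: c["score"], reverse=True):
        if campaign.get("incrementality_test"):
            ghost_bids.append(campaign["id"])  # user enters A''/B'' for it
            if assign_group(user, campaign["ab_test_id"]) == "control":
                continue  # suppress the external bid, fall back
        return campaign["id"], ghost_bids  # bid externally for this campaign
    return None, ghost_bids  # no campaign left to bid for
```

Note that the ghost bid is recorded for both the test and control groups: winning the internal auction is the common, fairly comparable event, while the external bid only happens for the test group.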
By definition, all conversions happen in the ITT population (as all users who have converted have necessarily visited the webpage or app of the partner). The prime (A’B’) and double-prime (A”B”) populations are not strictly included in the ITT populations (AB) over a limited period of observation.
As an example, the following timeline is that of a user who is in the Ghost Bids population (A’’/B’’) but not in the ITT one.
In this case, the user had several on-site events before the beginning of the test; the user is thus eligible for retargeting during the test period. During the test period, they win multiple internal auctions and are therefore part of the Ghost Bids population. However, they never click on any potential display and never go back to the advertiser’s website/app. They thus never enter the ITT population.
Based on past incrementality AB-tests, over a test period of four weeks, the intersection between the populations is as follows:
Ghost Bids details
The definition of the populations using the Ghost Bids methodology entails that:
- We don’t consider all on-site sales anymore but keep only those from users who won at least one internal auction. This is a basic form of attribution where we tie sales to internal auctions. We filter the sales in the same manner in both populations, thus making sure that the comparison remains fair.
- The population in focus is defined by the bidder: contrary to ITT, the populations in Ghost Bids depend on Criteo’s internal bidder. Indeed, a user only enters the A” or B” population if they won at least one internal auction (the bidder being the same for the two incremental AB-test populations, the comparison remains fair). This is a major difference we should be careful about in our analyses. It has strong implications, particularly when it comes to crossing this incremental AB-test with another technical AB-test (see diagram below) in order to decide which version is more incremental.
Our bidder takes many variables into account in order to set the right bid at each moment for each user, including the number of past displays. The fact that we suppress displays for users in population B will thus impact the bidding for these users compared to population A, and eventually impact the probability of winning the auction (which is the ticket to be part of the Ghost Bids population). We will cover this in detail in a future blog post.
Comparison of the two protocols
What is an uplift?
Both ITT and Ghost Bid aim at measuring the uplift, i.e., the number of buyers who made a purchase because of the ads. The problem is that among the unexposed population, some users were not even reachable by ads, for example because they were not active before the test. Thus, we want to filter out users we couldn’t have exposed to ads. The causal impact of our action can only be observed by comparing users we exposed to users we could have exposed.
However, we have to be careful, as figures like the uplift percentage, i.e., the proportion of buyers who are incremental, can be artificially inflated. Indeed, filtering out users reduces the noise but also the total population, thus increasing the ratio. We can illustrate this with an example. Assume we want to compare two ad policies, A and B, on a website that has 200 buyers out of 10,000 users without ads. Policy A showed ads to all users and produced 20 more buyers, an increase of 10% in buyers. Policy B displayed ads to only 10% of users (and buyers); it produced 10 more buyers, an increase of 5% if we consider the total number of buyers but 50% if we consider only the buyers reached by ads. The magnitude of the effect can be made to look better by changing the definition and shrinking the total used as a basis.
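The arithmetic of this example, spelled out, shows how shrinking the base inflates the percentage:

```python
# Baseline without ads: 200 buyers out of 10,000 users.
baseline_buyers = 200

# Policy A: ads shown to all users, 20 additional buyers.
uplift_a = 20 / baseline_buyers             # 10% more buyers overall

# Policy B: ads shown to 10% of users (and buyers), 10 additional buyers.
reached_baseline = baseline_buyers // 10    # 20 baseline buyers were reached
uplift_b_overall = 10 / baseline_buyers     # 5% vs. all buyers
uplift_b_reached = 10 / reached_baseline    # 50% vs. reached buyers only
```

Same 10 extra buyers, but the reported uplift jumps from 5% to 50% depending on which denominator is used.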
We compare both protocols to determine if one is more efficient at detecting if a policy is better than another. For simplicity, we call A the exposed population and B the control population and assume that the population split is 50–50. We want to know if there are significantly more buyers on A or B.
As the total number of buyers is also a random variable, we focus here on the proportion of buyers that are in population B. The null hypothesis, H0, is that being a buyer is independent of the population, so the proportion of buyers that are in population B should be around 0.5.
We generated 500 synthetic datasets of 1,000,000 users, with the same probability of being a buyer in A and B, and plotted the histogram of this ratio distribution for both protocols. The vertical lines show the p-value = 0.1 threshold on both sides of the green histogram. If a sample is on the left (resp. right) side of the line, it means that A (resp. B) is significantly better than B (resp. A). A p-value of 0.1 means that, under the hypothesis H0, 10% of the samples are considered significant for A over B and another 10% for B over A, which represents a 20% false-positive rate (cases where we conclude there is a difference between A and B when there is none). This is higher than the traditional 0.05, but we need to make a trade-off between the false detection rate and the true detection rate (i.e., the probability of detecting an effect when there is indeed one). In practice, the signal is often low and hard to detect with smaller p-values.
We tried different magnitudes of buyer uplift between A and B (more buyers in A than in B). For each, we generated 500 synthetic datasets of 1,000,000 users, with a different probability of being a buyer in A and in B, and plotted the same histograms for both protocols. Samples on the left of the red line are true positives (the datasets are generated with a distribution that, in expectation, has more buyers in A). For each magnitude of uplift, we plotted the histograms of the two protocols (blue) together with the histograms under H0 (green, when A and B have the same statistics). Below are the histograms of both protocols for the same uplift magnitude.
For both protocols, we plotted the number of true positives (part of the blue histogram on the left of the red line) according to the uplift magnitude. The dotted orange line shows the uplift magnitude used for the histograms above.
Ghost Bid shows better significance: it consistently discovers more true positives when the false-positive rate is set to 10% (green histogram on the left of the red line) for both protocols.
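A toy version of this comparison can be sketched as follows. The buyer rates, the share of organic buyers kept by the Ghost Bid filter, and the normal approximation of binomial draws are all illustrative assumptions, not Criteo figures.

```python
import math
import random

def binom(rng, n, p):
    """Normal approximation to Binomial(n, p) (fast and accurate here)."""
    return max(0, round(rng.gauss(n * p, math.sqrt(n * p * (1 - p)))))

def detects_uplift(rng, n_users, q_org, q_inc, ghost_frac, protocol):
    """One synthetic 50-50 AB-test; True if A is found significantly better
    than B by a one-sided z-test at the 10% level. ITT keeps every buyer;
    Ghost Bid keeps all incremental buyers but only a fraction ghost_frac
    of the organic ones (those who won an internal auction)."""
    half = n_users // 2
    org_a, org_b = binom(rng, half, q_org), binom(rng, half, q_org)
    inc_a = binom(rng, half, q_inc)  # incremental buyers exist only in A
    if protocol == "GhostBid":
        org_a = binom(rng, org_a, ghost_frac)
        org_b = binom(rng, org_b, ghost_frac)
    buyers_a, buyers_b = org_a + inc_a, org_b
    total = buyers_a + buyers_b
    if total == 0:
        return False
    # Under H0 the buyers split 50-50 between the two populations.
    z = (buyers_a - total / 2) / math.sqrt(total / 4)
    return z > 1.2816  # one-sided 90% quantile of the standard normal

def true_positive_rate(protocol, n_trials=400, seed=7):
    rng = random.Random(seed)
    hits = sum(detects_uplift(rng, 1_000_000, 0.001, 0.00005, 0.2, protocol)
               for _ in range(n_trials))
    return hits / n_trials
```

In this toy setting the Ghost Bid filter removes most of the organic noise while keeping the whole signal, so its true-positive rate at the same 10% false-positive level comes out markedly higher than ITT’s, matching the qualitative conclusion above.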
Pros and cons of each method
Both methods make it possible to compare two huge populations (in theory, two halves of the world) by filtering out users. However, they both have the unavoidable drawback of including more and more users as the test goes on: the filters must start with the test, and once users are selected by a filter, they are never removed.
ITT: we focus only on users who made a visit during the test.
Pros:
- We consider all buyers, even organic buyers (users who buy even without seeing an ad).
- We can compare two different bidding policies directly.
- We can compute the percentage of additional buyers or sales.
- Easy to implement.
Cons:
- Visits are influenced by ads, so the proportion of users in each population won’t match the theoretical proportion.
- Noise from organic buyers is not removed.
- We can only look at on-site events (sales or visits); it is impossible to compare clicks or costs.
- We cannot use ratios.
Ghost Bid: we focus only on users for whom at least one campaign under AB-test was selected during the test.
Pros:
- We get rid of part of the organic buyers.
- Significance is better.
- As long as the internal bid is unchanged between populations, the observed split matches the theoretical split.
Cons:
- More complex to implement (small bugs or delays can introduce biases).
- We cannot compute the percentage of additional buyers or sales (because we don’t look at all sales).
- Winning the internal auction depends on the bidding policy, so we have to be very careful if we want to test changes to the bidder. It is impossible to compare two different bidding policies directly with this protocol.
- Given that the bidding algorithm is the same on both populations, users of the two populations are indistinguishable before the internal selection. Once a tested campaign is selected as the best campaign, the two populations start to differ: users in the exposable population might see an ad for this campaign whereas users in the non-exposable population will not. It is crucial to track selected campaigns from the beginning of the test; otherwise, the populations won’t be comparable.
How to use these protocols
Both protocols keep the signal: incremental buyers are in ITT (they made a visit) and in Ghost Bid (they saw an ad), so in theory the uplift can be measured with both methods. In general, Ghost Bid is more significant as it filters out organic buyers, so it is less sensitive to noise. However, Ghost Bid is more difficult to use. In particular, if the two populations do not have the same bidding policy, we cannot compare the number of sales, because the internal selection (A”/B”) will behave differently under different bidding policies. Using both seems safer: ITT can be used to compare the number of sales in both populations, while Ghost Bid can handle the number of sales as well as other quantities such as cost.
It is not recommended to use metrics such as “number of buyers per 1,000 users” in either protocol, as the number of users and buyers depends on the protocol used. However, the difference in buyers between the two populations and the proportion of buyers in a given population are meaningful metrics.
As basic and essential as the incrementality question seems to be, answering it earnestly is another matter entirely. At Criteo, we remain committed to our goal of measurability and performance. This is why we continually work on understanding and improving our measurement capabilities to propose the best option for our clients based on their use cases.
Our R&D team explores the yet-untouched wilderness of incrementality measurement and optimization. We will report on our discoveries in future blog posts: we will dig deeper into what these various methodologies entail, discuss the relevance of additional pre-filtering, geographical splits, and much more. Stay tuned!