A/B Test Decisions: Reducing Type 1 Errors and Using Elasticity
This post shows how Criteo makes A/B Test decisions that converge to statistical significance in a two-week period while optimising many different business cases and using standard economic models.
There’s literature about performing A/B test experiments and using confidence intervals to qualify the results as statistically significant. Here is an example.
I’ll concentrate instead on how Criteo unifies many business cases into a single metric (Advertiser Value) to keep down the likelihood of Type 1 errors in our decisions and how we use this measure to fit an elasticity curve to foresee different economic possibilities.
All the examples on this page, including publishers, advertisers, products, money figures, and margins, are fictitious. They are used only to make this post easier to read by giving it a real-life context the reader can relate to.
To understand the Advertiser Value metric, you need a basic understanding of what Criteo’s engine does and some terminology.
Criteo helps advertisers show ads on publishers’ sites. A retailer like MyFashionCo creates a campaign in Criteo to show ads of its new season dresses linking back to its online retail page. When internet users browse publisher websites, e.g. MyLocalNews, these sites ask many advertisers for ad bids. Our engine gets such a request, selects one campaign among many, and bids an amount. The winning bidder gets to show its ad to the internet user, who hopefully is interested in the advertiser’s product, clicks on the ad, and buys a product on the advertiser’s website.
Our engine optimizes tens of thousands of advertiser campaigns that seek different actions from internet users and have different budgets and ways to spend them. Some want to sell dresses at 100$ a piece, some want the user to install their application on their phone, some want to build their brand with videos, some have a fixed budget to spend, some want to optimize sales but pay Criteo per click, etc.
We constantly improve our engine, and we launch A/B test experiments to prove it, testing the hypothesis that the new engine performs differently from the current one.
Reducing Type 1 Errors
We could measure the effect of a new engine separately for each campaign; however, it would take too long to collect enough data for each campaign to reach statistical significance, and the number of false positives (Type 1 errors) would be huge. Even if each per-campaign measurement used a significance level of 0.01%, the likelihood of at least one Type 1 error across tens of thousands of independent measurements would be high: about 63% for 10,000 measurements and over 99% for 50,000.
Grouping campaigns by type (sale optimizing, click optimizing, …) would reduce the duration of the A/B test; however, it would not eliminate the Type 1 error problem. For example, let’s assume that each measure is done with a 90% confidence interval (10% significance level) and that there are at least ten groups. Ten groups mean ten decisions, and a 10% significance level means we’d wrongly accept the alternative hypothesis one time out of 10 when there is no real effect (Type 1 error). Hence, we’d expect on average one Type 1 error among the decision metrics of each A/B test, and roughly a 65% chance of at least one if the groups are independent.
We could reduce the chance of Type 1 errors in our hypothesis test by using techniques like the Bonferroni correction, but our measurements would need more samples and time to achieve statistical significance.
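The arithmetic behind multiple comparisons can be sketched in a few lines. This is a minimal illustration that assumes the tests are independent; the function names are hypothetical and not part of any Criteo tooling.

```python
def family_wise_error_rate(significance: float, num_tests: int) -> float:
    """Chance of at least one false positive across independent tests
    when none of the tests has a real effect."""
    return 1.0 - (1.0 - significance) ** num_tests


def bonferroni_significance(overall_significance: float, num_tests: int) -> float:
    """Per-test significance level that keeps the overall rate at or below the target."""
    return overall_significance / num_tests


# Ten campaign groups, each tested at 10% significance.
uncorrected = family_wise_error_rate(0.10, 10)

# Same ten groups, each tested at the Bonferroni-corrected level (1%).
per_test = bonferroni_significance(0.10, 10)
corrected = family_wise_error_rate(per_test, 10)

print(f"uncorrected: {uncorrected:.1%}")
print(f"per-test level after correction: {per_test:.2%}")
print(f"corrected: {corrected:.1%}")
```

The stricter per-test level is what costs extra samples and time: each individual measurement needs a narrower confidence interval before it can be called significant.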
Our solution is to have a single metric (advertiser value) that we can apply to any campaign to have a single measurement per A/B Test.
Advertiser Value = Single Metric
In essence: for each action we get a user to do, we sum how much that action is worth for the campaign’s advertiser.
Going back to the dress example and introducing a bit of business terminology; let’s assume that the advertiser wants to sell each dress at 100$ a piece (Sales Amount) and it’s willing to pay up to 10$ in advertisement (Ads Cost) for each dress sale (AKA order) because it keeps 20$ in profit out of each sold dress (Advertiser Margin).
Note that Advertiser Value is different from Advertiser Margin because some campaigns might value each action differently to the margin (e.g., Brand promoting campaigns working at a loss). The only thing we know is how much advertisers are willing to pay for each action; hence, we assume that the Advertiser Value is a multiple of the Ads Cost (α * Ads Cost).
When the dress advertisement campaign runs to sell 5 dresses, we can calculate the total advertiser value for the campaign as follows: Advertiser Value = α * Ads Cost per order * number of orders = α * 10$ * 5 = α * 50$.
To get a single measure for all the campaigns: we classify campaigns depending on the action the advertiser wants to maximize (clicks, orders, sales amount, application installs, completed video views, etc); we calculate a Value Per Action for each campaign from historical data; we multiply the number of actions generated for each campaign by the campaign’s Value Per Action; and we sum the Advertiser Value of all the campaigns.
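The aggregation steps above can be sketched as follows. The campaign names, action counts, and Value Per Action figures are made up for illustration, and the `Campaign` class is a hypothetical structure, not Criteo's data model.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    name: str
    action_type: str         # what the advertiser wants to maximize
    value_per_action: float  # assumed to come from historical data
    actions: int             # actions generated during the measurement window


def campaign_advertiser_value(campaign: Campaign) -> float:
    """Number of actions multiplied by the campaign's Value Per Action."""
    return campaign.value_per_action * campaign.actions


def total_advertiser_value(campaigns: list[Campaign]) -> float:
    """Single metric: sum of the Advertiser Value of all campaigns."""
    return sum(campaign_advertiser_value(c) for c in campaigns)


# Fictitious campaigns optimizing different actions.
campaigns = [
    Campaign("MyFashionCo dresses", "orders", value_per_action=10.0, actions=5),
    Campaign("MyGameApp", "application installs", value_per_action=2.0, actions=30),
]

print(total_advertiser_value(campaigns))
```

Because every campaign is reduced to the same unit (value), campaigns with very different goals can be added together into one number per population.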
How we calculate the Value Per Action of each campaign using historical data is enough material for another post.
When we do an A/B Test, we compare the Advertiser Value for the test (B) and reference (A) engines and provide a single metric to prove or disprove the hypothesis that the new engine is different from the previous one. This way, the likelihood of a Type 1 error is just the significance level of this single measurement (e.g., 10% for a 90% confidence interval), and we converge to statistically significant results in about 2 weeks because we’re using the data of tens of thousands of campaigns.
Note that it’s not a problem that we track a proxy Advertiser Value if we assume the same alpha-factor applies to all campaigns. The alpha factor cancels out in the metric calculation. This assumption has proven good enough so far for us in real life.
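A short sketch of why the factor cancels, assuming the metric is the relative increment of test over reference and that one shared factor multiplies both sides. The totals and function names are hypothetical.

```python
def relative_uplift(value_ref: float, value_test: float) -> float:
    """Relative increment of the test population over the reference."""
    return value_test / value_ref - 1.0


def proxy_advertiser_value(alpha_factor: float, willingness_to_pay: float) -> float:
    """Advertiser Value modelled as a multiple of what advertisers pay per action."""
    return alpha_factor * willingness_to_pay


# Fictitious totals of "amount advertisers are willing to pay" per population.
ref_total, test_total = 100.0, 105.0

for alpha_factor in (1.0, 2.5, 10.0):
    uplift = relative_uplift(
        proxy_advertiser_value(alpha_factor, ref_total),
        proxy_advertiser_value(alpha_factor, test_total),
    )
    # The uplift does not depend on the shared alpha factor.
    print(alpha_factor, round(uplift, 6))
```

The cancellation only holds when the same factor applies to both populations; a factor that varied by campaign type would not drop out of the ratio.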
When we do A/B Tests that risk affecting atypical campaign types, we need to ensure that an impact on them is not hidden by the fact they have a limited share of voice in the final measurement. E.g., a -90% Advertiser Value on 10% of the campaigns could be hidden by a +1% effect on the rest of the campaigns when those campaigns carry only a small share of the total Advertiser Value.
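Whether such a drop is visible in the overall figure depends on how much of the total Advertiser Value the affected campaigns carry. The sketch below uses hypothetical value shares to show both situations; the function name is illustrative.

```python
def overall_relative_change(segments: list[tuple[float, float]]) -> float:
    """Overall relative change given (share_of_reference_value, relative_change)
    pairs whose shares sum to 1."""
    return sum(share * change for share, change in segments)


# Atypical campaigns carry 1% of the value: the -90% drop is masked.
masked = overall_relative_change([(0.01, -0.90), (0.99, 0.01)])

# Atypical campaigns carry 10% of the value: the drop dominates the result.
visible = overall_relative_change([(0.10, -0.90), (0.90, 0.01)])

print(f"masked: {masked:+.2%}, visible: {visible:+.2%}")
```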
In that case, we can decide to: take two measures, one overall and another one for the scope we know we risk breaking; or restrict our A/B Test to a sub-scope of interest. In the former case, we tighten the significance level of our tests so that the overall likelihood of a Type 1 error stays under control. E.g., if we calculate the two increments (one overall and one for the scope at risk) with 10% significance each, we get up to a 20% chance that at least one of them is a false positive; we can lower the significance level of each measure to 5% to keep the overall chance of a Type 1 error at about 10%. In the latter case, we specialize the engine to only work on those types of campaigns, as we don’t know what its effect is on the other scope.
To make the above measurement more meaningful, we need to consider that advertisers operate under some constraints (budget, ROI = return on investment = 1/Cost of Sale) and will adapt how much they spend on advertising when we release a new engine to meet those constraints.
For our dress example: up until now, the advertiser sold five dresses for 50$ spent on advertising, at a 10% Cost of Sale (or 10x ROI). Let’s assume our new engine sells a dress with fewer but more effective ads, at 7$ a piece instead of 10$. Given the new conditions, the advertiser might react as follows:
- Iso-Sales (Lowest Advertiser Value): dresses cannot be produced faster; I need to cut advertisement costs from 50$ to 35$ so I don’t get too many orders.
- Iso-Cost: I can meet the demand, but I don’t have more budget for ads anyway; I’ll get to sell 6–7 dresses for 50$
- Iso-ROI (Highest Advertiser Value): I can meet the demand and I want to get as much return on investment as before; I’ll increase my advertising spend until each dress sold costs me 10$ in ads again.
The combined Advertiser Value of all campaigns will be some point between the lowest and highest advertiser value options.
Elasticity Curve Of The Market
Before calculating the numbers for our sample advertiser, we need a bit of context. Our advertiser knows that, even if 35$ sold five dresses with our new engine, 70$ will not sell ten dresses; it will sell fewer. This has to do with the market in which the advertiser operates. Below is a typically used economic model for a market showing that producing value gets more expensive as we keep investing.
It shows that companies and customers invest in products that have a high margin: they invest first in ways to get value at low cost and move to more expensive ways as the cheaper options are depleted. E.g. In the above plot, the first 500$ investment gives back 1000$; however, we get less and less value back each time we invest 500$ more until eventually the curve flattens and we cannot produce more money anymore.
Companies control where they are on the curve by controlling their cost. Some cap their cost; some aim for a certain return on investment (ROI = 1 / Cost of Sale). A typical formula to model the above curve is the iso-elastic model, which, applied to our advertiser value, gives us the following
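One common constant-elasticity form, assumed here purely for illustration and not necessarily the exact equation used at Criteo, is a power law: Advertiser Value = alpha * Cost^eta with 0 < eta < 1. The sketch below shows that this assumed form has diminishing returns and a constant elasticity; the function names are hypothetical.

```python
def advertiser_value(cost: float, alpha: float, eta: float) -> float:
    """Assumed iso-elastic (constant-elasticity) curve: value = alpha * cost ** eta."""
    return alpha * cost ** eta


def roi(cost: float, alpha: float, eta: float) -> float:
    """Return on investment at a given cost under the assumed curve."""
    return advertiser_value(cost, alpha, eta) / cost


# Hypothetical parameters, chosen only to show the shape of the curve.
alpha, eta = 1.0, 0.5

first_step = advertiser_value(500.0, alpha, eta) - advertiser_value(0.0, alpha, eta)
second_step = advertiser_value(1000.0, alpha, eta) - advertiser_value(500.0, alpha, eta)

# Each extra 500 of cost buys less value than the previous 500.
print(first_step > second_step)

# Constant elasticity: doubling the cost always multiplies value by 2 ** eta.
print(advertiser_value(200.0, alpha, eta) / advertiser_value(100.0, alpha, eta))
```

Under this assumed form the ROI falls as cost rises, which is what lets an advertiser pick a position on the curve by choosing either a budget or a target ROI.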
A/B Test — Calculations
We use the above equation to model the combined Advertiser Value produced by all the campaigns in our engine. We don’t know the reality of the advertisers, but we model the bits we do know and make the below assumptions which have worked for us to make educated decisions until now:
- The eta (η) of the above equation fully models the market in which the Advertiser is operating, and it’s not affected by our Engine.
- The alpha (α) of the above equation fully models the advertising engine (model A), and it’s not affected by the market.
- All the advertisers operate in a similar enough market so we can use a single eta for all of them
In an A/B Test experiment, we get an Advertiser Value measure for the current engine A (point 1 in the below plot) and another for the new engine B (point 2a if the new engine spends less than the old one overall, or 2b if it spends more)
As per above, we need to adjust for the advertiser’s reactions when a new model is released (orange line in plot). In practice, we assume the final equilibrium point will range between advertisers keeping their cost constant (iso-Cost above — point 3a in the plot) to keeping their ROI constant but spending more (iso-ROI above —point 3b in the plot). I.e. We don’t calculate the iso-Sales option mentioned in the ‘Market equilibrium’ introduction.
We have enough data to plot both engine curves (models A and B) after an A/B Test because:
- We constantly track the eta of the market with our production engine (model A). We assumed the eta in the models is a factor of the market, not the engines; hence, it’s the same in both equations.
- We have a sample point (cost, advertiser value) for each engine: the Advertiser Cost and Advertiser Value of the reference and test populations from our A/B Test (points 1 and 2a or 2b in the plot).
- Knowing eta and a sample point, we can calculate alpha for each model.
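As a hedged sketch of that last step, assume the power-law form Advertiser Value = alpha * Cost^eta (an assumption, since the exact model equation may differ). Alpha then follows directly from one measured point per engine. All numbers are fictitious.

```python
def fit_alpha(cost: float, value: float, eta: float) -> float:
    """Solve value = alpha * cost ** eta for alpha, given one sample point."""
    return value / cost ** eta


# Market elasticity, assumed known from tracking the production engine.
eta = 0.5

# Fictitious (cost, advertiser value) points from an A/B test.
cost_a, value_a = 100.0, 1000.0   # reference engine A (point 1)
cost_b, value_b = 120.0, 1260.0   # test engine B (point 2b, spends more)

alpha_a = fit_alpha(cost_a, value_a, eta)
alpha_b = fit_alpha(cost_b, value_b, eta)

print(alpha_a, alpha_b)
```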
Note that we linearly scale the (cost, advertiser value) point originally measured in the A/B Test to 100% of the population, because we assume the rest of the population would behave identically if exposed to the same model. E.g., if we measured 25K euros in cost for the test population but the test population represents only 25% of the total population, the value we use for Cost in the equations is 100K. This is OK because we do random assignments for our A/B Tests.
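The scaling is a simple division by the population share. A minimal sketch, with a hypothetical function name:

```python
def scale_to_full_population(measured: float, population_share: float) -> float:
    """Linearly extrapolate a measured total to 100% of the population."""
    if not 0.0 < population_share <= 1.0:
        raise ValueError("population_share must be in (0, 1]")
    return measured / population_share


# 25K euros measured on a test population that is 25% of the total.
print(scale_to_full_population(25_000.0, 0.25))
```

The same scaling applies to both coordinates of the sample point, cost and advertiser value.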
Note also that we do use different eta values when working on A/B Tests whose populations are specific scopes that we know have different eta values.
Having alpha and eta in both models, we can calculate our equilibrium points
Iso-Cost (Point 3a)
The advertiser doesn’t increase the budget. We use the new model curve (alpha_B) and we find the advertiser value at the same investment as before (cost_A)
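A sketch of the iso-Cost point, again assuming the power-law form Advertiser Value = alpha * Cost^eta, which may differ from the exact model equation. The sample points and eta are fictitious, and the function names are hypothetical.

```python
def fit_alpha(cost: float, value: float, eta: float) -> float:
    """Solve value = alpha * cost ** eta for alpha, given one sample point."""
    return value / cost ** eta


def iso_cost_value(cost_a: float, alpha_b: float, eta: float) -> float:
    """Advertiser value on the new curve (alpha_B) at the old cost (cost_A)."""
    return alpha_b * cost_a ** eta


eta = 0.5
cost_a, value_a = 100.0, 1000.0   # reference engine A
cost_b, value_b = 120.0, 1260.0   # test engine B

alpha_b = fit_alpha(cost_b, value_b, eta)
value_3a = iso_cost_value(cost_a, alpha_b, eta)
uplift_3a = value_3a / value_a - 1.0

print(f"iso-Cost value: {value_3a:.1f}, relative increment: {uplift_3a:+.1%}")
```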
Iso-ROI (point 3b)
The advertiser increases the budget to achieve the same ROI as before. We need to find out the cost_B’ for which the ROI is as before. We can do so because:
- We have the ROI from the old model. The sample point (cost_A, advertiser_value_A) from the A/B Test.
- We can describe the ROI on the B model in terms of alpha_B and cost_B’
- We know that both ROIs need to be equal, leaving cost_B’ as the only unknown in the equation
We use cost_B’ in the new model equation to find our new advertiser value and compute the relative increment.
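The steps above can be sketched under the same assumed power-law form, Advertiser Value = alpha * Cost^eta (an assumption, not necessarily the exact model). With that form, ROI_B = alpha_B * cost^(eta - 1), so setting it equal to ROI_A gives cost_B’ = (alpha_B / ROI_A)^(1 / (1 - eta)). The numbers are fictitious.

```python
def fit_alpha(cost: float, value: float, eta: float) -> float:
    """Solve value = alpha * cost ** eta for alpha, given one sample point."""
    return value / cost ** eta


def iso_roi_point(cost_a: float, value_a: float, alpha_b: float, eta: float) -> tuple[float, float]:
    """Return (cost_B', value_B') on the new curve where ROI equals the old ROI."""
    if not 0.0 < eta < 1.0:
        raise ValueError("this sketch assumes 0 < eta < 1")
    roi_a = value_a / cost_a
    # alpha_b * cost ** (eta - 1) == roi_a, solved for cost.
    cost_b_prime = (alpha_b / roi_a) ** (1.0 / (1.0 - eta))
    value_b_prime = alpha_b * cost_b_prime ** eta
    return cost_b_prime, value_b_prime


eta = 0.5
cost_a, value_a = 100.0, 1000.0   # reference engine A
cost_b, value_b = 120.0, 1260.0   # test engine B

alpha_b = fit_alpha(cost_b, value_b, eta)
cost_3b, value_3b = iso_roi_point(cost_a, value_a, alpha_b, eta)
uplift_3b = value_3b / value_a - 1.0

print(f"cost_B': {cost_3b:.1f}, value: {value_3b:.1f}, relative increment: {uplift_3b:+.1%}")
```

In this fictitious case the iso-ROI increment is larger than the iso-Cost one, because the advertiser spends more while keeping the same return per unit of cost.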
In our dress case, we get the following numbers
Note that the table only illustrates case 2b, where the new engine in the A/B Test spent more than the previous engine.
I showed you how we reduce Type 1 errors in A/B test analyses by combining many business objectives into a single metric (Advertiser Value) and how we adjust that metric based on an economic model (iso-elastic model) to better predict the reality once the new engine is deployed to production.
I would like to thank all Criteos who helped me understand the above mentioned metrics and methods when I arrived at Criteo, as well as all the colleagues who helped me review this post.