Incrementality Beyond The Obvious

Why “Transparency” is critical to evaluating the true “Performance” of an Incremental Test

Puneet Gupta
Thinking Programmatic
7 min read · May 16, 2019


Over the last few weeks, I have had multiple interactions with marketers about their programmatic in-housing strategy, the benefits of doing so, and the common pitfalls they should be aware of.

During these discussions, one topic that surfaced repeatedly was incrementality. While most people are aware of the basic nuances of incrementality, I felt there was still a gap in understanding the different methodologies for running incremental tests, the pros and cons of one approach over another, and the various biases that can set in if the tests are not run properly.

While the major driver for running an incremental test is often to measure the “Performance” of campaigns and their real impact, marketers should also be aware of why “Transparency” is critical when measuring and evaluating incremental lift results.

What is Incremental Lift?

Simply put, it is a measure of how much “incremental” increase in an “event” happened because of a campaign that the marketer ran. Many marketers run install and retargeting campaigns and measure metrics like CPI, CPA, RoAS etc. to evaluate the performance of such campaigns. While these are important metrics, they don’t answer the key question of incrementality. Did the user install the app because she saw an ad that enhanced her interest, or was she going to install it anyway, with the campaign simply taking a slice of the organic install base?

To run an incremental lift test, users are divided into two groups — Target group (TG) and Control group (CG) — and the performance of these groups is measured against the KPI event, say the install. The target group users are shown the advertiser’s ad, while the control group is either not shown an ad or shown a placebo ad (more on that below). If the target group shows a statistically significant uptick in the KPI metric (installs in this example), then it can be inferred that the campaign had an incremental impact in driving more installs.
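As a concrete illustration, here is a minimal Python sketch of how lift and its statistical significance might be computed from aggregate group results using a two-proportion z-test. The function name and the numbers are illustrative placeholders, not output from any real platform or campaign.

```python
from math import sqrt
from statistics import NormalDist

def incremental_lift(tg_users, tg_installs, cg_users, cg_installs):
    """Compare install rates of the Target group (TG) and Control group (CG)
    using a simple two-proportion z-test (illustrative sketch only)."""
    p_tg = tg_installs / tg_users          # install rate with ads shown
    p_cg = cg_installs / cg_users          # baseline install rate
    lift = (p_tg - p_cg) / p_cg            # relative incremental lift

    # Pooled standard error for the difference in proportions
    p_pool = (tg_installs + cg_installs) / (tg_users + cg_users)
    se = sqrt(p_pool * (1 - p_pool) * (1 / tg_users + 1 / cg_users))
    z = (p_tg - p_cg) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return lift, z, p_value

# Placeholder numbers for an 80:20 (TG:CG) split
lift, z, p = incremental_lift(tg_users=800_000, tg_installs=9_600,
                              cg_users=200_000, cg_installs=2_100)
print(f"lift={lift:.1%}, z={z:.2f}, p={p:.4f}")
```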

Key nuances to watch out for

Almost all platforms that support incremental lift tests follow the same process as described above. However, as always, the devil is in the details. As a marketer, it’s essential that you are aware of these nuances, since they can impact the final results.

Target and Control Group — In an ideal lift test, the users should be divided into Target group (TG) and Control group (CG) randomly. The random split can either be an A/B-test-style 50:50 split, or it can be a specific ratio, say 80:20 (TG:CG). By splitting users into a specific ratio, you can control and minimize the negative scale impact on your campaign while the lift test is being run. If the users are divided randomly, splitting them in a ratio like 80:20 should deliver results similar to an equal 50:50 split while having minimal impact on campaign scale.
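One common way to implement such a split (an assumption on my part, not a description of any specific vendor) is to hash a stable user identifier together with an experiment salt, so the assignment is random with respect to user traits but reproducible across requests. A minimal Python sketch:

```python
import hashlib

def assign_group(user_id: str, salt: str = "lift-test-q2", tg_share: float = 0.8) -> str:
    """Deterministically assign a user to TG or CG at the chosen ratio.

    Hashing (salt + user_id) gives a pseudo-random but stable bucket,
    so the same user always lands in the same group for this test."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
    return "TG" if bucket < tg_share else "CG"

print(assign_group("user-123"))   # e.g. 'TG' for an 80:20 split
```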

While the split ratio is important, an even more critical aspect is how the users are being split into Target and Control group. There are a lot of ways a vendor can skew Target and Control groups.

For instance, let’s say you have a gaming app whose core users are in the 20–30 year old demographic. A blackbox vendor can add more users who are >35 years old into the control group bucket. This induces a bias even before the test has started. Because users aged >35 have less inclination to play your game, the control group performance will always be lower than the target group’s, and the final incremental lift you measure will be biased and overstated. Similar biases can be introduced using other variables like app placements, creatives etc. The only way to be confident in the lift results is to ensure you have full visibility into the tests being run. So the next time someone says they have a “fully automated way” of running incremental tests, we’d recommend you drill down into these specific nuances.
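A practical safeguard is to request the composition of both groups on attributes that matter (age, placement, creative) and run a simple balance check before trusting the lift number. The sketch below uses made-up numbers and a hypothetical “users older than 35” attribute to flag a split whose groups differ significantly on that attribute.

```python
from math import sqrt
from statistics import NormalDist

def balance_check(tg_segment, tg_total, cg_segment, cg_total, alpha=0.05):
    """Return True if the segment share (e.g. users >35 years old) differs
    significantly between TG and CG, i.e. the split looks skewed."""
    p1, p2 = tg_segment / tg_total, cg_segment / cg_total
    p = (tg_segment + cg_segment) / (tg_total + cg_total)
    se = sqrt(p * (1 - p) * (1 / tg_total + 1 / cg_total))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

# Made-up composition data: 5% of TG vs 8% of CG is >35 years old
print(balance_check(tg_segment=40_000, tg_total=800_000,
                    cg_segment=16_000, cg_total=200_000))   # True => skewed split
```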

Handling Control Group Users — There are different methodologies for handling Control group users.

The three most common methodologies are:

A) Placebo Ad — In this methodology, the control group users are shown an unrelated PSA ad, like a charity ad. Since the ad shown has nothing to do with the advertiser’s brand, it doesn’t influence the control group users to take a follow-up action like installing the app.

The issue with the placebo ad is that the advertiser has to spend money (media cost) to show ads to the control group users even though the ad has nothing to do with the advertiser’s brand. However, by controlling the Target group:Control group ratio as described above, this impact can be minimized and should not be a major concern for advanced marketers who are looking for true incremental uplift results.

Another potential issue with the placebo ad: let’s say you wanted to evaluate lift results for a gaming app. Users who are likely to click on a gaming ad may be different from users who will click on a charity ad, so you may end up comparing results from dissimilar sets of users. However, if the users are truly randomly distributed within a given segment (like the 20–30 year old demographic), the impact of this specific issue can be minimized.

B) Ghost Ad — In this methodology, the platform records when a user would have been shown an ad but stops short of actually showing it, in order not to influence the Control group users. This methodology has the cost benefit that the advertiser doesn’t end up spending money on placebo ads. However, it has its cons as well, which can influence the results.

The approach doesn’t take into account the “winnability” of the ad impression. A blackbox platform can add all the non-performing users and placements to the Control group bucket. The system will say that it “would have” bid on these users, but it might have bid at such a low price that it would never have won the auction and hence would never have shown the ad anyway. In this case, the Target group results can look inflated and skew the incremental uplift results.

Another issue with the Ghost Ad methodology is the lack of a feedback loop. Because no impression is shown, there is no click and hence no attributed install, which means no feedback to the model. If the test is run on a CPC/CPI model, then because of this missing feedback the model will bid less and less on the Control group users. This induces a negative bias: the model bids ever lower, selects ever lower-performing inventory for the Control group, and thereby again boosts the comparative performance of the Target group.
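One way to make ghost-ad logging auditable against the “winnability” problem is to count a control-side “ghost impression” only when the withheld bid would actually have cleared the auction. The sketch below illustrates that idea; the field names and prices are assumptions for illustration, not any platform’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class AuctionLog:
    user_id: str
    group: str                  # "TG" or "CG"
    counterfactual_bid: float   # what the platform claims it would have bid
    clearing_price: float       # price the auction actually cleared at

def ghost_exposed(log: AuctionLog) -> bool:
    """Count a CG user as 'ghost exposed' only if the withheld bid would
    plausibly have won the auction; otherwise the control group fills up
    with users the platform was never going to reach."""
    return log.group == "CG" and log.counterfactual_bid >= log.clearing_price

logs = [
    AuctionLog("u1", "CG", counterfactual_bid=2.10, clearing_price=1.80),  # would have won
    AuctionLog("u2", "CG", counterfactual_bid=0.20, clearing_price=1.50),  # would have lost
]
print([log.user_id for log in logs if ghost_exposed(log)])   # ['u1']
```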

C) Intent-To-Treat — In the Intent-To-Treat methodology, ads are shown only to target group users. However, the results are compared between all users of the target group and the control group, rather than only those target group users who actually saw the ad. This introduces bias in the results, since we are now adding the “noise” of users who never saw the ad or would never even have been shown the ad.

Continuing with the same example, let’s say you were running a test for a gaming app and were targeting users between 20 and 30 years old. Suppose that, during the lift test period, you were only able to reach 40% of these users. In the placebo methodology, you would compare data for those 40% reached users only. In the intent-to-treat methodology, however, you also end up including the data of the remaining 60% of users, which adds the extra “noise” of users who were never reached.
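The dilution can be made concrete with a small sketch that computes lift over all assigned users versus only the reached users. The reach rate and conversion rates below are illustrative placeholders, not measurements from any real test.

```python
def lift(p_test, p_control):
    return (p_test - p_control) / p_control

# Illustrative placeholder rates
p_exposed_tg = 0.012   # install rate among TG users who actually saw ads
p_unreached  = 0.010   # baseline rate of TG users who were never reached
p_cg         = 0.010   # control group install rate
reach        = 0.40    # only 40% of the TG could be reached

# Comparing only the reached TG users against the control group
print(f"reached-users lift: {lift(p_exposed_tg, p_cg):.1%}")   # 20.0%

# Intent-to-treat: blend reached and unreached TG users, then compare to CG
p_itt = reach * p_exposed_tg + (1 - reach) * p_unreached
print(f"intent-to-treat lift: {lift(p_itt, p_cg):.1%}")        # 8.0%
```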

Parting shots — While a lot of marketers may already be aware of the basic nuances of lift tests, we can clearly see that the results can easily be manipulated, whether by intent or simply by the choice of methodology. The scientific community continues to debate the different methodologies and their pros and cons. No single approach is perfect, but as a marketer it is essential that you are aware of these nuances and ask your partner detailed questions about how it intends to run the lift test.

Lift tests are a great way to understand incremental performance, but without the necessary transparency and control you can end up seeing biased results.
