Holdout Testing and Conspiracy Theories

Dr. Jason Davis
Nov 12, 2015 · 5 min read

Holdout testing should be a fundamental tool in every engineer's, product manager's, and marketer's experimentation toolkit, yet it isn't.

Meanwhile, pretty much everyone and their mother runs A/B tests these days:

Making a copy change to your checkout funnel? Test the old version vs the new with Optimizely.

Sending a holiday promotional email out to your customers? Make sure to A/B test subject lines with Mailchimp to optimize open and click rates.

Buying search keywords? Run a 9-way split test optimizing across your favorite ad copy.

A/B and split tests are conceptually simple: bin your customers into two or more discrete groups, show each group a different version of your content, measure results, then declare a winner.

Yet there’s another type of test that’s conceptually very similar to an A/B test, but grossly underrepresented in today’s testing-driven world: the holdout test. Where an A/B test compares two different messages, a holdout test sends a single message to a subgroup of users while maintaining a control group that receives no message at all.
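To make the mechanics concrete, here's a minimal sketch of how one might assign users to a treatment or holdout group. All names and the salt are illustrative; the key idea is hashing the user ID rather than drawing a random number, so assignment is stable across repeated sends and a holdout user is never accidentally messaged later.

```python
import hashlib

def assign_group(user_id: str, holdout_pct: float = 0.1, salt: str = "promo-q4") -> str:
    """Deterministically bucket a user into 'treatment' or 'holdout'.

    Hashing (instead of random.random()) makes the assignment stable:
    the same user always lands in the same group for this experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform in [0, 1)
    return "holdout" if bucket < holdout_pct else "treatment"

# Example: bucket a few users for this campaign
groups = {uid: assign_group(uid) for uid in ("u1", "u2", "u3")}
```

Changing the salt per experiment re-randomizes the split, so one user's holdout membership in one campaign doesn't carry over to the next.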

But why would one want to run such a test? Is it really that different from an A/B test, and furthermore, what’s the point of “testing” a control group that receives nothing?

While technically speaking, a holdout test is a special case of an A/B test (wherein the “B” is “receives nothing”), the state of tooling on the internet suggests otherwise.


Let’s start by diving into some important holdout testing use cases to better understand what’s going on here:

You’re buying keywords for your Adwords campaign to drive traffic and conversions to your site. You’ve A/B tested ad copy and maybe even landing pages, and you’ve found that the highest converting keywords are those most similar to the name of your brand.

Google calls these queries navigational queries. Some folks may type “etsy.com” into their browser and visit the site directly, yet others may type “etsy” into Google and click through via search results. If you’re buying brand keywords for this search, your results may look something like this:

[Screenshot: Google search results for “etsy”, showing a paid ad above a nearly identical organic listing]
To a user, the first two sets of results are effectively identical. The titles and anchor text differ slightly, but they have similar sub-links that all pretty much go to the same place.

Of course, from the advertiser’s point of view (in this case, Etsy’s), they’ll get charged if the user clicks on the first set of links (the ad), but they won’t pay if the user clicks on the second set of links.

As such, the fundamental question at play here is this: how many of those who clicked on the ad above would have clicked on the organic (free) link had Etsy not been paying for these navigational keywords?

To answer this, you’d need to run a holdout test wherein some customers would be exposed to the page with ads and organic results, and others (the control group) would see only organic results.

Yet, unfortunately, Google’s Adwords manager doesn’t support holdout testing. They’ll enable you to infinitely optimize ad copy, yet they won’t help you answer the question, “What happens if I don’t spend money on Adwords?”

Internet retargeting is pervasive these days. Add something to your cart from your favorite retailer and, if you don’t complete the purchase, this product will “follow” you around the web for several days until you do.

Retargeting providers most commonly measure performance via last-click attribution: add a pair of skis to your shopping cart, spend 20 minutes surfing the web, click back through a retargeted ad to the ski shop, and purchase your $800 pair of skis. If that click cost just $0.50, then BOOM: that’s a 1600x return on spend ($800 ÷ $0.50).

Or is it?

The question to ask, of course, is whether or not you would have purchased those skis had you not seen that retargeted ad while reading your morning dose of TechCrunch.
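Here's why the distinction matters, as back-of-the-envelope arithmetic. The 95% baseline figure below is purely hypothetical; in practice it's exactly the number a holdout test would estimate for you.

```python
revenue = 800.00   # skis purchased after the last click
ad_cost = 0.50     # cost of that single retargeting click

# Last-click attribution credits the ad with the full sale:
last_click_roas = revenue / ad_cost  # 1600x — looks spectacular

# But suppose (hypothetically) 95% of those buyers would have
# purchased anyway. Only the *incremental* revenue is the ad's doing:
baseline_purchase_rate = 0.95
incremental_revenue = revenue * (1 - baseline_purchase_rate)
incremental_roas = incremental_revenue / ad_cost  # ~80x
```

Same ad, same click, same sale: the return shrinks by a factor of twenty once you subtract the purchases that would have happened regardless.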

To really answer this question, you’d need to run a holdout test. Retarget half your customers; don’t retarget the other half. Wait, then compare conversion rates, resulting revenue, and advertising costs across the two groups.
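Comparing the two groups is a standard two-proportion comparison. The sketch below uses a two-sided z-test from the Python standard library; the counts are made-up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Hypothetical results: retargeted group vs holdout group
p_t, p_h, z, p = two_proportion_z(conv_a=520, n_a=10_000, conv_b=480, n_b=10_000)
```

With these made-up numbers, a 5.2% vs 4.8% conversion difference isn't statistically significant at conventional thresholds, which is precisely the kind of sobering result last-click attribution never surfaces.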

On an average day, how many promotional emails do you receive? When’s the last time you got fed up and went through 10 or so emails and unsubscribed from those that seemed irrelevant to you?

Sending too much email can result in high unsubscribe rates. Similarly, sending the wrong message to the wrong customer can lead to equally detrimental results.

For example, Birchbox’s core business is their subscription box service: $10 per month for new beauty samples in each delivery. While subscriptions are core to their business, they’re also aggressively marketing their e-commerce offerings.

If Birchbox emails their longtime box subscribers with a special offer to buy from their e-commerce shop, will these promotions have a cannibalizing effect on subscription rates? Will their customers cancel their subscription and buy e-commerce instead, or will they continue to subscribe and additionally buy a la carte?

As with the examples above, you’d have a very hard time answering this without running a holdout test, yet virtually no email provider supports this crucial function.

Holdout Conundrums

Since holdouts appear to answer important, fundamental questions, why are they so minimally supported today?

Google, Criteo, Adroll, and pretty much every other shop selling you internet ads is very much incentivized to sell you more ads. So their baseline starts with you running ads, and ends with you running better ads.

Why would Google help you determine whether you should be buying navigational queries or not?

Measuring holdout effectiveness often requires careful customer segmentation and bookkeeping: who added to their cart, who received a retargeted ad, who purchased, and what it all cost. Since ad providers don’t generally offer holdout testing facilities, careful measurement requires building substantial infrastructure to accurately track these events.
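The bookkeeping itself isn't exotic; it amounts to joining your own event log and rolling up revenue, cost, and purchases per group. A toy sketch with a hypothetical CSV event log:

```python
import csv
import io

# Hypothetical event log: the ad provider won't track your holdout
# for you, so you record exposures, purchases, and costs yourself.
events = """user,group,added_to_cart,saw_retargeting_ad,purchased,revenue,ad_cost
u1,treatment,1,1,1,800.00,0.50
u2,treatment,1,1,0,0,0.50
u3,holdout,1,0,1,800.00,0
u4,holdout,1,0,0,0,0
"""

totals: dict[str, dict[str, float]] = {}
for row in csv.DictReader(io.StringIO(events)):
    t = totals.setdefault(row["group"], {"revenue": 0.0, "ad_cost": 0.0, "purchases": 0})
    t["revenue"] += float(row["revenue"])
    t["ad_cost"] += float(row["ad_cost"])
    t["purchases"] += int(row["purchased"])
```

In production this would be a warehouse query over real tables rather than an in-memory loop, but the shape of the analysis is the same: per-group revenue minus per-group ad cost.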

Similarly, modern email providers have no facilities for explicitly not sending email (i.e. directly facilitating a control group), so careful bookkeeping is needed.
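In practice that bookkeeping can be as simple as maintaining your own suppression list and subtracting it from every send. The addresses and campaign name below are made up:

```python
# Since the email provider can't "send nothing," freeze the holdout
# yourself, store it durably, and suppress it from every campaign send.
all_subscribers = {"a@example.com", "b@example.com", "c@example.com", "d@example.com"}
holdout = {"b@example.com"}  # frozen once at experiment start

to_send = sorted(all_subscribers - holdout)  # the list you upload to the ESP
audit_log = {"campaign": "ecomm-promo", "held_out": sorted(holdout)}
```

The audit log is the part people forget: without a durable record of who was held out, you can't later compute conversion rates for the control group.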

A recurring theme here is the lack of sufficient tooling and support for running these sorts of tests, which won’t change until general awareness of the benefits of holdout testing improves substantially.

When you’re optimizing processes, particularly those that involve proactive outreach and/or new functionality, a holdout test is a great place to start.

Simon Systems

Simon Data's Engineering Blog
