Using Holdout Groups to Quantify Marketing Campaign Lift

Sunny Shapir
Bluecore Engineering
Mar 2, 2021 · 8 min read

Bluecore is an e-commerce marketing platform that increases campaign revenue through personalization, but how is that increase measured? Lift is a key metric for determining the efficacy of marketing messages: it measures the increase (or decrease) in revenue and/or conversion that results from an audience receiving a marketing campaign. Running a holdout group test is a useful way to measure the incremental lift of a marketing campaign.
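
To make the metric concrete, lift is typically computed as the relative difference between the test and holdout groups on a given outcome, such as revenue per customer. The sketch below shows one way this could be calculated in BigQuery standard SQL; the campaign_results table and its columns are hypothetical, not part of Bluecore's actual schema.

```sql
-- Minimal sketch of a lift calculation, assuming a hypothetical
-- `campaign_results` table with one row per customer, a `holdout_group`
-- flag, and the revenue attributed to that customer during the test window.
WITH per_group AS (
  SELECT
    holdout_group,
    COUNT(*) AS customers,
    SUM(revenue) AS total_revenue
  FROM `project.dataset.campaign_results`
  GROUP BY holdout_group
)
SELECT
  test.total_revenue / test.customers AS revenue_per_test_customer,
  ctrl.total_revenue / ctrl.customers AS revenue_per_holdout_customer,
  -- Lift: relative difference in revenue per customer between the two groups.
  SAFE_DIVIDE(
    test.total_revenue / test.customers - ctrl.total_revenue / ctrl.customers,
    ctrl.total_revenue / ctrl.customers) AS lift
FROM per_group AS test
CROSS JOIN per_group AS ctrl
WHERE test.holdout_group = FALSE
  AND ctrl.holdout_group = TRUE;
```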

How do holdout tests work?

A holdout test begins with a randomized split of a campaign audience into test and control (or holdout) groups. Only customers in the test group are sent a campaign message, while the holdout group is intentionally held out.

Image made with lucid.app

The behaviors of the two groups are then compared to draw meaningful conclusions about how much of those behaviors can be attributed to the marketing campaign.

Why are holdout tests useful?

Holdout tests measure the long-term impact of a marketing campaign to avoid drawing premature conclusions about a customer’s shopping behavior patterns. They help answer important questions: did the holdout group make as many purchases, buy similar products, or spend a similar amount of money as the test group even though its members were explicitly excluded from a marketing campaign? The answers give marketers valuable insight into customer behavior and into whether a specific campaign, or category of campaigns, was the actual cause of an increase in revenue. If the campaign proves to generate considerable lift, the test can be ended, allowing the campaign to be deployed to the entire original audience of both the test and control groups.

Implementation Challenges

  • Complex feature offerings

Implementing and integrating another audience-level filter on top of existing Bluecore features that optimize send time across customers and minimize the time between individual campaign sends was not trivial, especially considering these features can also be incorporated into multi-touch workflows. These are some of Bluecore’s core features, and the new filter added an extra layer of complexity to an already intricate architecture.

  • Rapidly evolving system

In the last year, Bluecore migrated over 90% of personalized marketing campaign sends from Google App Engine to Google Kubernetes Engine. This project was not the first large migration and undeniably will not be the last as our platform continues to rapidly scale. Designing for a system that is largely microservices-oriented, and becoming even more so, forced key design decisions that could make or break our implementation.

Image made with lucid.app

Communicating the critical message of “this customer is held out, please do not trigger a campaign send” to every part of the system was essential for a correct test implementation.

  • Test accuracy

To allow Bluecore customers to conduct holdout tests properly, we had to ensure that the same customers are held out for the entire duration of a test while also preventing any correlation between different campaigns’ holdout tests. There are multiple features that can prevent a customer from being included in a campaign audience, so being able to confidently attribute a customer’s exposure, or lack thereof, to the holdout test rather than to another filter was crucial for producing accurate analytics. Lastly, as obvious as it may seem, a great deal of testing was required to ensure that held-out customers were not accidentally exposed to campaign messaging at any point over the life of the campaign.

  • Past attempts

Bluecore customers found holdout tests so valuable that they were willing to simulate them with complex audience configurations before the feature was built.

The absence of the feature was a major pain point for customers, resulting in complex audience configurations that mimicked test functionality to achieve a similar result. — Image by author

While these audience configurations relied on previously built infrastructure, they were designed to forcefully halt a campaign send after resources had already been used to handcraft a personalized message for each customer in the “holdout” group. At scale, this process was extremely inefficient, requiring manual labor at the expense of system resources. It’s also worth noting that performance should always be top of mind, especially for anything that lives in the critical campaign launch workflow.

Bluecore’s Implementation

Enabling a campaign holdout test with just one click

What did we do? Giving customers the ability to enable the test during campaign configuration made it simple to get started. More importantly, the size of the holdout group was designed to be entirely configurable and available before campaign execution. This was made possible by the seamless integration of holdout groups into the Bluecore product.

Image by author

How did it solve an implementation challenge? The option to include a holdout test was added to every campaign configuration interface, eliminating the need for special audience configuration to support the test. The holdout groups feature was also clearly differentiated from other features while remaining compatible with them for customers looking to use multiple audience-related features simultaneously.

Applying test versus control (or holdout) group filtering

What did we do? Since subsystems of Bluecore’s audience query engine leverage Google BigQuery, we extended its functionality to support holdout group filtering. Knowing that many audience-related features were optimized by folding filtering directly into audience generation from the get-go, however, we opted to apply our holdout group filtering with a simple BigQuery standard SQL query after those features are applied.

How did it solve an implementation challenge? By accurately applying the holdout group percentage to what would have been the final campaign audience, we ensured the accuracy discussed earlier, offering a concrete yes-or-no answer to the question, “when did specific customers qualify for a marketing campaign, and were they in the holdout group?”

Isolating the holdout test feature into its own query offered concrete metrics for measuring efficiency as well as the added benefit of applying more straightforward filtering to the generated audience. With performance tracking woven into the feature design, bottlenecks surfaced during the testing phase instead of after release. Furthermore, using dynamically generated SQL instead of programmatically filtering the audience guaranteed compatibility with existing features: those features live almost entirely in BigQuery themselves, and injecting the same filtering mechanism into their executions resulted in minimal added processing.
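
As a rough illustration of that isolation, the holdout filter can be expressed as a standalone query over the already-generated audience. The sketch below uses hypothetical table, column, and variable names, and the salted-hash predicate in the WHERE clause is explained in the next section.

```sql
-- Minimal sketch (hypothetical names throughout): the holdout filter runs as
-- its own query over the audience that the existing features have already
-- generated, keeping only the test group.
DECLARE campaign_salt STRING DEFAULT 'k3x9f2';  -- illustrative campaign-specific salt
DECLARE holdout_pct INT64 DEFAULT 10;           -- illustrative 10% holdout

SELECT email
FROM `project.dataset.campaign_audience`        -- hypothetical output of the audience features
WHERE ABS(MOD(FARM_FINGERPRINT(CONCAT(email, campaign_salt)), 100)) >= holdout_pct;
```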

Audience splitting

What did we do? Use email hashing to split the audience for a simple yet effective mechanism that does not require any additional storage of the actual hashed value. To enforce uniqueness of the output, we salted each customer email that appeared in the audience. A salt is a random value that is added to the input of a hashing function (customer identifiers like email addresses in our case) in order to create a unique hash even if the input itself is not unique.

We then segmented audience members using the following process:

  1. Assign a randomly generated alphanumeric salt per campaign running a holdout test; the salt does not change throughout the life of the test and/or campaign unless the feature is disabled or the percentage is reconfigured.
  2. Append the salt to each customer email in the audience.
  3. Hash the email and salt combination using the FARM_FINGERPRINT function (which uses the Fingerprint64 function of the open-source FarmHash library), generating an integer hash that will remain consistent on every hash of the same email and salt combination.
  4. Compare the generated integer hash to the holdout group percentage configured by the partner running the test to bucket a customer into one of the two groups.

Feature implementation from the point it is enabled through the final audience selection. — Image made with lucid.app

How did it solve an implementation challenge? By coupling the FARM_FINGERPRINT hash function with a campaign-specific salt, we ensured that one holdout test is uncorrelated with another while also enforcing the requirement that a customer remains in the same group throughout the duration of the experiment. This is because the same email hashes differently under different campaigns’ salts, even when audience members coincidentally overlap.
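
Below is a minimal sketch of what this bucketing could look like in BigQuery standard SQL. The salt values, sample emails, and function name are purely illustrative stand-ins, not Bluecore’s actual implementation.

```sql
-- Sketch illustrating why campaign-specific salts keep holdout tests
-- uncorrelated: the same email maps to a stable bucket under one campaign's
-- salt, but to an unrelated bucket under another campaign's salt.
CREATE TEMP FUNCTION holdout_bucket(email STRING, salt STRING) AS (
  ABS(MOD(FARM_FINGERPRINT(CONCAT(email, salt)), 100))
);

SELECT
  email,
  holdout_bucket(email, 'campaign_a_salt') AS bucket_campaign_a,
  holdout_bucket(email, 'campaign_b_salt') AS bucket_campaign_b
FROM UNNEST(['a@example.com', 'b@example.com', 'c@example.com']) AS email;
-- A customer is held out of a campaign when their bucket for that campaign's
-- salt falls below the configured holdout percentage.
```

Because the salt is fixed for a campaign, rehashing the same email for that campaign always yields the same bucket, while the same email under another campaign’s salt lands in an unrelated bucket.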

Introduction of an intermediary “pre-filter” table

What did we do? Use a new table to hold what would have been the final campaign audience before applying any holdout group processing, while storing the new audience (consisting of just the test group) in a final results table.

Image made with lucid.app

How did it solve an implementation challenge? Other pipelines and services access only the relevant portion of the audience, exactly where they expect to find it, so they never need to be made aware that a test is running, eliminating the need for even a single code change in other systems.

The ease with which we could accurately regenerate either the test or holdout group extended further into aggregating test metadata. This continues to offer the ability to check a single table with concrete flags indicating whether a customer was held out of a specific campaign run. Not only does such a table enhance analytics report generation by eliminating the need to aggregate results later, it is also an extremely valuable debugging tool that can be leveraged to validate the feature implementation.
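
As a rough sketch of how such tables could be laid out (all table names, columns, and types below are hypothetical, not Bluecore’s actual schema):

```sql
-- Pre-filter table: the full qualified audience for a campaign run, with a flag
-- indicating whether each customer was held out. Useful for analytics, debugging,
-- and regenerating either group.
CREATE TABLE IF NOT EXISTS `project.dataset.campaign_audience_prefilter` (
  campaign_id   STRING,
  run_id        STRING,
  email         STRING,
  holdout_group BOOL,       -- TRUE if the customer was held out of this run
  qualified_at  TIMESTAMP   -- when the customer qualified for the campaign
);

-- Final results table: the test group only. Downstream send pipelines read this
-- table and never need to know a holdout test is running.
CREATE TABLE IF NOT EXISTS `project.dataset.campaign_audience_final` (
  campaign_id STRING,
  run_id      STRING,
  email       STRING
);

-- Populate the final table by dropping held-out customers from the pre-filter table.
INSERT INTO `project.dataset.campaign_audience_final` (campaign_id, run_id, email)
SELECT campaign_id, run_id, email
FROM `project.dataset.campaign_audience_prefilter`
WHERE NOT holdout_group;
```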

Running a holdout test on a marketing campaign is an insightful way to measure the long-term effectiveness of a campaign in terms of lift: a subset of the audience is explicitly held out from receiving campaign messaging, and the behavior of held-out customers is compared to that of customers who were exposed to it. These metrics inform future marketing campaign decisions and allow customers to truly maximize their customers’ lifetime value wherever possible. Bluecore’s holdout group implementation enhances this effort by intelligently running holdout tests with the click of a button and empowering marketers to offer their customers a truly personalized shopping experience.

Taking the holdout test feature from design to development to release in just four months was an extremely rewarding experience, especially since it had been highly requested by our customers for years before it was built. Collaborating with just about every team across engineering offered the added benefit of learning the intricacies of some of Bluecore’s major services from our very best engineers along the way. Assessing tradeoffs, prioritizing best practices, and designing for customer success made for a truly exciting development process. The fact that many of Bluecore’s enterprise retailers leverage the holdout group feature is a testament to the value it adds through its elegant incorporation into the product and the important customer insights it uncovers.

If you’re interested in designing and building impactful features like the holdout group test, check out our careers page!
