Universal Holdout Groups at Disney Streaming
At Disney Streaming, we strive to make quality decisions about which features to ship based on the results of rigorous A/B experiments, or online randomized control trials.
A/B experiments are well-suited for driving individual product feature decisions. We typically design A/B experiments with the following things in mind:
- The change is assumed to be risky. We design experiments to sample only as many users as needed to reach statistical significance, which limits the risk of exposing users to a potentially negative experience.
- We want to make a decision as soon as possible. We typically only run experiments for 1–3 weeks to shorten the time-to-decision.
This strategy is effective for making decisions quickly and safely. However, at Disney Streaming we often want to use experimentation to not only make go/no-go decisions, but also to provide reasonable estimates on the expected impact if we were to roll out the feature to 100% of the population. These dual expectations for experimentation — making decisions while limiting risk and providing precise impact estimates — lead to a few problems:
1. Experiments do not enroll enough users to provide precise impact estimates.
Because we run experiments on as few users as possible to reach a specified level of statistical significance, experiments may end up being statistically significant while providing an imprecise estimate. For example, it is not uncommon to see an experiment in which a metric’s confidence interval sits entirely above zero but is wide, with its lower bound close to zero.
With a result like this, we clearly reach a positive, statistically significant result on our metric of interest, which gives us conviction in rolling out the treatment to production. However, because the confidence interval is so wide, and so close to zero, it isn’t at all certain what the expected impact from this test will be. In other words, while it’s unlikely that this change is negative for our experience, it’s unclear whether this is a minor or major improvement.
2. Experiments do not run for long enough to estimate the long-term impact of a feature change.
Because we only run experiments for a short period of time (typically 1–3 weeks), the short-term impact from a product feature may not equal its longer-term impact. For example, a new personalization algorithm may succeed because of a novelty effect; one algorithm’s performance advantage over another may fade over time. In addition, certain metrics are lagging indicators where we cannot observe any effects unless we extend the time horizon.
3. Individual experiment results cannot be leveraged to estimate the cumulative impact of a series of changes.
Individual experiments typically run in isolation, but our stakeholders are usually interested in estimating the combined impact from multiple product changes (that is, how much value did the product roadmap in its entirety unlock?). Because product changes are not necessarily independent of one another, the combined impact of multiple product features may not equal the impact naively summed across experiments.
For example, let’s say we ran two experiments around the same time, each of which increased our metric of interest by 1%. If we release both changes into production, the total impact may be exactly 2% (completely independent changes), less than 2% (some cannibalization), or more than 2% (compounding effects). The experimental data alone cannot tell us which of these outcomes is more or less likely.
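To make the first problem above concrete, here is a minimal Python sketch of a normal-approximation confidence interval for a difference in means. The function name and every input (means, standard deviation, sample sizes) are hypothetical values chosen for illustration; they are not results from any Disney Streaming experiment.

```python
import math
from statistics import NormalDist


def diff_in_means_ci(mean_c, mean_t, sd_c, sd_t, n_c, n_t, confidence=0.95):
    """Normal-approximation CI for (treatment mean - control mean)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(sd_c ** 2 / n_c + sd_t ** 2 / n_t)
    diff = mean_t - mean_c
    return diff - z * se, diff + z * se


# Hypothetical experiment: 10.0 vs 10.2 hours watched, 20,000 users per arm.
small = diff_in_means_ci(10.0, 10.2, 8.0, 8.0, 20_000, 20_000)
# Same observed effect with four times as many users per arm.
large = diff_in_means_ci(10.0, 10.2, 8.0, 8.0, 80_000, 80_000)

print("small sample CI:", small)
print("large sample CI:", large)
```

Under these assumed inputs the smaller experiment excludes zero, yet its interval spans roughly 0.4% to 3.6% of the baseline. Quadrupling the sample halves the interval width, which is why a test sized only for significance says little about the magnitude of the effect.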
Introducing the Universal Holdout Group
To overcome the limitations of traditional A/B testing, we borrowed a tool which is common in marketing campaigns: the universal holdout group. The idea of a universal holdout group is to hold back a randomly-sampled small percentage of users from all product changes for a period of time. This allows us to then compare metrics between the users who receive the production experience and those users who are held back from any product changes. As a result, we can accurately determine the cumulative, long-term impact of product changes.
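One common way to hold back a stable random slice of users is deterministic hash bucketing. The sketch below is an illustration of that general technique, not a description of Hulu's actual assignment service; the function name, the salt string, and the 5% holdout share are all assumptions.

```python
import hashlib


def in_universal_holdout(user_id: str, holdout_salt: str, holdout_pct: float) -> bool:
    """Return True if the user falls into the holdout for this salt.

    The salt is a hypothetical per-holdout identifier: changing it draws a
    fresh, effectively independent sample of users.
    """
    key = f"{holdout_salt}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < holdout_pct


# Illustrative usage with made-up user IDs and an assumed 5% holdout.
users = [f"user-{i}" for i in range(20_000)]
held_out = [u for u in users if in_universal_holdout(u, "holdout-q1", 0.05)]
print(len(held_out) / len(users))
```

Because the assignment depends only on the user ID and the salt, every service in the stack can recompute it and reach the same answer without a shared lookup table.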
With a universal holdout group, we hoped to achieve the following goals:
- Determine accurate lifts of cumulative product change efforts.
- Verify whether changes have a lasting impact or are ephemeral.
- Observe potential long-term changes in metrics that the core product typically can’t influence with a single change, such as retention.
- Innovate faster by running more experiments simultaneously on fewer users; leave it to the universal holdout to evaluate the lift of the winners.
Over the past seven months, we have been piloting Disney Streaming’s first universal holdout groups on the Hulu subscriber experience, enabling us to accurately assess the impact of product investments on our KPIs. Our experience has yielded some surprising results and insights.
Designing A Universal Holdout Strategy for Hulu
When we designed our Universal Holdout strategy, we considered a few parameters.
How long should we run each holdout for? On the one hand, if we hold out users for too long, there are increased engineering costs of maintaining two separate experiences for every experimental change. Additionally, a very long-running holdout will negatively impact the subscribers who are held out, since they are not exposed to the latest features. On the other hand, if we hold out users for too short of a time period, we cannot assess the long-term impact of product changes. Balancing these needs, we settled on three months, or one quarter, as the time period for each universal holdout group. Every three months, we “reset” the universal holdout group, sampling a distinct set of subscribers to be held back.
What happens if we release a feature at the end of the quarter? Because one of our goals was to assess the longer-term impact of product changes, if we released a feature via the universal holdout mechanism at the end of a quarter, this would mean that the period for which the change was held out in the universal holdout would be extremely short. The easy solution would be to simply wait until the following quarter to begin to release the product change, but this would slow velocity towards the end of every quarter. As a result, we designed our holdouts with an additional one-month “evaluation period.” For the first three months — the “enrollment period” — we actively sample a fixed percentage of visiting users into the experiment. For the fourth month — the “evaluation period” — we stop sampling users into the experiment, and assess the impact of changes over the course of one month. This enables us to ensure that if a feature is launched towards the end of the quarter, we are collecting at least one month of data to assess its engagement efficacy.
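The enrollment and evaluation periods described above can be sketched as a small date calculation. This is a simplified illustration: the function names and phase labels are hypothetical, and the sketch assumes each holdout starts on the first day of a month.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, snapping to the first of the month."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def holdout_phase(start: date, today: date,
                  enrollment_months: int = 3,
                  evaluation_months: int = 1) -> str:
    """Classify a date relative to one holdout's lifecycle."""
    enrollment_end = add_months(start, enrollment_months)
    evaluation_end = add_months(enrollment_end, evaluation_months)
    if today < start:
        return "not_started"
    if today < enrollment_end:
        return "enrollment"   # still sampling visiting users into the holdout
    if today < evaluation_end:
        return "evaluation"   # no new users; keep measuring existing ones
    return "closed"


start = date(2024, 1, 1)  # hypothetical quarter start
for day in (date(2024, 2, 15), date(2024, 3, 31), date(2024, 4, 10), date(2024, 5, 1)):
    print(day, holdout_phase(start, day))
```

With these defaults, a feature released on the last day of the enrollment period is still held out for the full evaluation month before the holdout closes.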
How many users should we sample? We performed a traditional power analysis to determine the number of users for the universal holdout. We enrolled a sufficient number of users into the holdout to detect a 1% change in our key product metric, hours watched per subscriber. Why 1%? A 1% change is large enough to drive a financially meaningful ad revenue impact for Hulu.
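As a rough sketch of what such a power analysis can look like, the function below uses the standard two-sample normal approximation to size the holdout for a given relative lift. The baseline mean, standard deviation, significance level, power, and allocation ratio are all assumed values for illustration; the article does not state the inputs that were actually used.

```python
import math
from statistics import NormalDist


def holdout_sample_size(baseline_mean, std_dev, relative_lift=0.01,
                        alpha=0.05, power=0.8, allocation_ratio=1.0):
    """Users needed in the holdout to detect a relative lift (two-sided test).

    allocation_ratio is the size of the comparison (production) group divided
    by the size of the holdout; 1.0 means two equally sized groups.
    """
    delta = baseline_mean * relative_lift           # absolute effect to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value of the test
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    n = (1 + 1 / allocation_ratio) * (std_dev * (z_alpha + z_beta) / delta) ** 2
    return math.ceil(n)


# Hypothetical metric: 10 hours watched per subscriber, standard deviation 8.
equal_split = holdout_sample_size(10.0, 8.0)
# Holdout compared against a production group 19 times larger (5% vs 95%).
small_holdout = holdout_sample_size(10.0, 8.0, allocation_ratio=19.0)
print(equal_split, small_holdout)
```

Because the production group is usually far larger than the holdout, the `(1 + 1 / allocation_ratio)` term approaches 1, so the required holdout size approaches half of what an equal split would need per group.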
Universal Holdout In Practice: Two-stage Experimentation Process
At Hulu, we designed a two-stage experimentation process to align our teams to our new universal holdout strategy. Each proposed product change moves through two stages:
Stage one: Run a standard A/B experiment for 1–3 weeks.
Stage two: Release the product change to all users who are not in the universal holdout group (if the results from stage one are positive).
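The two stages can be read as a simple exposure rule. The sketch below is one possible encoding, with hypothetical names; it assumes that users in the universal holdout are also excluded from stage-one experiments, which is consistent with holding them back from all product changes but is not spelled out in the text.

```python
from enum import Enum


class Stage(Enum):
    AB_TEST = 1      # stage one: standard A/B experiment
    ROLLED_OUT = 2   # stage two: released to everyone outside the holdout


def sees_new_feature(stage: Stage, in_holdout: bool, in_ab_treatment: bool) -> bool:
    """Decide whether a user gets the new experience for one product change."""
    if in_holdout:
        # Assumption: holdout users never see the change, in either stage.
        return False
    if stage is Stage.AB_TEST:
        return in_ab_treatment
    return True


print(sees_new_feature(Stage.AB_TEST, in_holdout=False, in_ab_treatment=True))
print(sees_new_feature(Stage.ROLLED_OUT, in_holdout=True, in_ab_treatment=False))
```

Every change that reaches stage two adds one more branch of this kind to the code base for as long as the holdout stays open.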
Lessons Learned from Hulu’s Universal Holdout
We learned that in practice, implementing a Universal Holdout is harder than it looks.
First, we learned that for certain parts of our tech stack, it was prohibitively expensive to maintain two separate versions of an experience for a four-month period of time. For example, a particular machine learning model improvement we successfully tested in an A/B experiment could not advance to the universal holdout stage because it would have required too much computation to run both models simultaneously for 3–4 months in a cost-effective way.
Second, we learned it’s easy to make mistakes with universal holdouts. As successful experiments accumulate within the universal holdout mechanism, the number of places where the code branches within the stack increases. As a result, some held-out experiences did not match what we intended to test, and these discrepancies persisted until product managers, analysts, or QA teams noticed them.
Takeaways: Why Your Organization Needs a Universal Holdout
Despite these setbacks, we were still able to capture immense value from using universal holdout groups.
Before we implemented our universal holdout groups, our Product and Tech teams messaged the value delivered for subscribers by combining the impacts of the individual experiments conducted over the course of a quarter. For example, 10 changes that each made a 1% impact on hours streamed would be reported as a (1.01)¹⁰ − 1 ≈ 10.46% overall increase in hours streamed. However, our universal holdout showed that the cumulative effect of our product features was much smaller than expected.
During our six-month pilot with universal holdouts on Hulu, we discovered that the engagement-driving impacts of one feature often partially cannibalize the impacts of another feature, which means the cumulative impact ends up being less than if the impacts of each feature were summed independently.
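The gap between the two ways of measuring can be sketched as follows. The per-experiment lifts mirror the ten 1% changes in the example above; the holdout and production means are invented placeholders, since the article does not publish the measured figures.

```python
import math


def naive_compounded_lift(lifts):
    """Combine per-experiment relative lifts as if they were independent."""
    return math.prod(1 + lift for lift in lifts) - 1


def holdout_lift(holdout_mean, production_mean):
    """Relative lift of the production experience over the held-back one."""
    return production_mean / holdout_mean - 1


naive = naive_compounded_lift([0.01] * 10)

# Hypothetical hours watched per subscriber in each group.
measured = holdout_lift(holdout_mean=10.0, production_mean=10.4)

print(f"naive estimate:   {naive:.2%}")
print(f"holdout estimate: {measured:.2%}")
print(f"share realized:   {measured / naive:.0%}")
```

When features cannibalize one another, the holdout estimate falls below the naive one, and the ratio of the two gives a rough sense of how much of the reported value was actually realized.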
In addition, we learned that several features which performed well in A/B testing saw decreasing or inconsistent impact over time. This was especially true of the algorithms we released on Hulu, which may have performed particularly well during the first-stage A/B testing period either because of seasonal factors or because of a novelty effect.
Overall, we now view universal holdout groups as a key instrument for any organization serious about assessing impacts with data. As a next step, we’ll be extending our methodology across the Disney Streaming portfolio of products.