A simple explanation of holdout tests for product managers

Konstantin Grabar · Published in ProductDo · Aug 21, 2024

It is hard to imagine developing a major modern product without running A/B tests. But have you ever thought about the real impact your experiments have had on your product? Rolling out a bunch of successful A/B tests to production does not, by itself, guarantee success.

What are holdouts?

Let’s start with an example. You’ve conducted a series of successful A/B tests and expect an X% increase in your target metric. You roll them out to 100% of your audience and… don’t get the expected result. There could be various scenarios here: there is an increase, but it is too small, or there is no increase at all.

Why could this happen? Firstly, among successful A/B tests there is always a sizable share of false-positive results. This follows directly from how A/B tests are run: at a fixed significance level (usually 90% or 95%), some fraction of “winners” will be pure statistical noise.
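To get a feel for how large that share can be, here is a back-of-the-envelope sketch in Python. The significance level, power, and the assumed share of truly useful changes are illustrative assumptions, not figures from this article.

```python
# Back-of-the-envelope: what share of "winning" A/B tests are false positives?
# All three inputs below are illustrative assumptions.

alpha = 0.05        # significance level (5%)
power = 0.8         # probability of detecting a real effect when it exists
p_real = 0.2        # assumed share of tested changes that truly work

p_win = alpha * (1 - p_real) + power * p_real   # P(test is declared a winner)
p_false_win = alpha * (1 - p_real) / p_win      # P(no real effect | declared winner)

print(f"Share of winners that are false positives: {p_false_win:.0%}")
# With these assumptions, roughly one in five "successful" tests is pure noise.
```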

Secondly, even if your A/B test is truly successful, no one really knows how strong the effect will be when rolled out to 100% of the product’s audience.

Thirdly, the success shown by an A/B test in the short term does not guarantee that this success will be maintained in the long term.

However, there is a way to measure the effect of rolled-out A/B tests: the holdout. It is quite simple to implement (at first glance). We keep a portion of users who do not see our changes for a long time, and then, periodically (for example, every quarter), we measure the difference between the holdout group and everyone else.
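A minimal sketch of how such an assignment might look, assuming a deterministic hash-based split; the function name, salt scheme, and percentage are hypothetical, not a specific product’s implementation:

```python
import hashlib

def in_holdout(user_id: str, salt: str = "holdout-2024-q3", pct: float = 0.05) -> bool:
    """Deterministically assign roughly `pct` of users to the holdout group.

    The salt pins the split to a specific holdout "generation"; changing it
    (for example, once a quarter) rotates which users are held out.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return bucket < pct

# Wherever a rolled-out change is applied, the holdout keeps the old experience:
#   if in_holdout(user.id): show_old_experience()
#   else:                   show_new_experience()
```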

This method is extremely useful and effective; among the methods known to me, I haven’t encountered anything more practical. Here are a couple of interesting cases from me and my colleagues:

Case 1. A fintech company was developing a complex activation and loyalty program, within which it tested a whole set of measures to increase activity and retention: a discount on the pricing plan, a bonus for additional activity, and an offer from partners. 5% of users were completely excluded from the program so that, after a year, it would be possible to objectively assess whether the project was worth continuing. As far as I can judge, the result was positive: the program is still active to this day, and the holdout made it possible to show a positive impact on important long-term metrics.

Case 2. Recently, Andrew Mende mentioned in a blog post that “smart discounts” are a very interesting mechanism from a financial perspective. Instead of giving discounts to everyone, they are given only to those for whom a discount might change the decision from “won’t buy” to “will buy,” selected by “smart algorithms.” The team working on this will endlessly tune its models and boast that they are becoming more effective day by day. However, a local holdout (a group that never sees discounts) shows that, while the program overall may make sense, investing heavily in salaries and cloud computing to improve its efficiency is economically unjustified. It is better to leave it alone and have the team focus on something with a greater impact.

What types of holdouts are there?

There are usually two types of holdouts: global and local. Global means that the specified group of users is isolated at the level of the entire product and all teams. Naturally, this is the most accurate way to measure long-term effects and determine the real impact of rolled-out A/B tests. Local means that the holdout group works only at the level of your team or department (usually the latter). Global effects cannot be checked this way, but effects in a specific part of the product, or of specific A/B tests, can be observed.

What problems might arise with this? Firstly, users in your holdout group will probably not be happy about not receiving new updates, so it is recommended to rotate users in holdouts periodically (for example, once a quarter or every six months).

But the main problem lies in organizing the process. At the team or department level there are usually no big difficulties, but at the global level organizing this becomes almost impossible if there are many teams and departments. Essentially, every change you make to the product must take the holdout into account. And if holdouts are rotated quarterly (or there are multiple holdouts, such as quarterly and annual), it becomes even more complicated. It only takes one team ignoring this rule once for all the effort to be in vain. All releases, bug fixes, and so on must be planned with the isolated user groups and their rotations in mind, which is usually extremely difficult to organize.

So the mechanism is effective and quite simple to understand, but organizing it in practice usually only works at the local level. Paradoxically, the organization of product releases around isolated user groups becomes the main obstacle, and the value of the knowledge gained often does not justify the costs of implementation and maintenance. Despite this, use of the method at the local level is quite common and usually does not cause such problems.

I have also heard that instead of a global holdout, some teams use the following heuristic: the combined impact of several A/B tests is reasonably well estimated by summing the lower bounds of the confidence intervals of the effects. For example, if you have ten experiments with a positive effect of +1% ± 0.7%, their cumulative effect will be much closer to (1% − 0.7%) × 10 = 0.3% × 10 = 3% than to 1% × 10 = 10% (naturally, in practice you should sum absolute, not relative, improvements, i.e., not percentages but money, users, or conversions).
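Here is a small sketch of this heuristic in absolute terms; the baseline and the list of experiments are hypothetical numbers for illustration only.

```python
# Rough cumulative-effect estimate from lower confidence bounds, in absolute terms.
# The baseline and experiment list below are purely hypothetical.

baseline_weekly_conversions = 10_000

# (relative lift, half-width of its confidence interval) for each rolled-out test
experiments = [(0.010, 0.007)] * 10   # ten tests, each "+1% ± 0.7%"

naive = sum(lift for lift, _ in experiments) * baseline_weekly_conversions
conservative = sum(max(lift - ci, 0) for lift, ci in experiments) * baseline_weekly_conversions

print(f"Naive estimate:       +{naive:.0f} conversions/week")         # +1000
print(f"Lower-bound estimate: +{conservative:.0f} conversions/week")  # +300
```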

What other problems do holdouts solve?

Let’s discuss in more detail why, at the time an A/B test is run, we may see a very different result from what we see later. This can happen due to various short-term and long-term effects. They come in different forms, but all boil down to one idea: the metric change (or lack of it) that we observe during the experiment may be only a temporary effect. Such effects are hard to see in the short experiment window, so they are better observed and measured over a long period.

One kind of novelty effect is the halo effect. Users happily use a new feature not because they need it, but simply because they like the product in general or some other feature, and they transfer that positive emotion to the new feature in your experiment. Over time, this positive perception wanes, not to mention that less loyal users may be unhappy.

The opposite is the learning effect. You built a great feature that solves a user problem… but the metrics are not growing, or are growing slowly. This can happen when users need time to adjust to the new format, or when it takes time and effort before the benefit becomes visible. Even if a change is progressive and useful, the audience usually needs time to accept it.

All these effects (and others) are caught by long-term isolated groups. Moreover, you can run the process the other way around by organizing something like a quarantine for dubious tests. For example, if you see signs of the effects above, or simply don’t like the metrics in a test, it is safer to first gather several such tests together and release them to a limited number of users. After some time, you can measure the result and see what happened to the metrics. When working on A/B tests, always keep long-term and short-term effects in mind.

What criticisms of holdouts exist?

On the internet you can also find an alternative opinion criticizing holdouts. This is actually very good, because it lets you study not only positive but also negative experiences in the industry. In short, the criticism mainly boils down to the fact that with a large group imbalance (for example, when the holdout is only 1% of your traffic), the power of such a “test” drops significantly. Power is the probability of seeing a difference between groups when it really exists.

But this applies to any test, not just holdouts, and it does not mean that your holdout will not work or should not be used. It just means that, to get the most credible result, in most cases you need to put more people into the holdout groups and keep them there longer, or run holdouts on a 5/95 or 10/90 split to reduce the imbalance. Otherwise the holdout will not always show a truthful result, and some things may be missed.

But is this such a critical problem? I think not. Firstly, holdouts are usually kept for three months or more, which usually allows for reliable results. Secondly, several tests are usually put into a holdout at once, and their effects usually add up at least minimally (otherwise, how would your metrics grow from quarter to quarter?). Accordingly, we need a smaller audience to see the result.
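To make the imbalance argument concrete, here is a rough power sketch using a normal approximation for two proportions. The traffic volume, baseline conversion rate, and assumed cumulative lift are illustrative assumptions, not real data.

```python
# Rough power check for an imbalanced holdout split (normal approximation).
# All input numbers below are illustrative assumptions.

from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p2 - p1) / se - z_alpha)

users_per_quarter = 1_000_000
p_holdout, p_rest = 0.050, 0.052   # assumed +4% relative lift from rolled-out tests

for holdout_share in (0.01, 0.05, 0.10):
    n_holdout = int(users_per_quarter * holdout_share)
    n_rest = users_per_quarter - n_holdout
    pw = power_two_proportions(p_holdout, p_rest, n_holdout, n_rest)
    print(f"holdout {holdout_share:.0%}: power ~ {pw:.0%}")
# With these assumptions, a 1% holdout has far less power than a 5-10% one
# for the same cumulative lift.
```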

Conclusion

In conclusion, the most important thing: as a product manager, you don’t need to understand everything that an analyst or data scientist should understand. You just need to know that such a mechanism exists, that it is used in large companies, and that it has its areas of application and trade-offs. This, by the way, is true for most tools in statistics, so don’t be afraid of it. And if you want to sharpen your A/B testing skills, you can take the online course, where I try to explain complex tools with simple language and practical examples.

May false positives and false negatives pass you by!
