Accelerating Ranking Experimentation at Thumbtack with Interleaving

Derek Zhao
Thumbtack Engineering
9 min read · Jul 26, 2023

By: Derek Zhao & Dhananjay Sathe.

While A/B tests are rightly considered the gold standard for causal inference, they can also be costly. A typical ranking experiment takes many weeks to complete. This wouldn’t be a big problem if we only had a handful of ideas to try, but Thumbtack’s rankers are powered by ML models that could be improved through any combination of new features, model architectures, and training techniques. In other words, there’s a vast space of potential improvements to evaluate, and with A/B testing alone, we don’t have the time to systematically explore it. That’s why we have turned to an experimentation technique called interleaving that lets our tests reach statistical power up to 100X faster than A/B testing. Interleaving is specifically designed to accelerate experiments involving ranked lists by quickly identifying the better of two possible rankings, allowing us to evaluate many more ranking ideas over a much shorter period of time.

How interleaving works

To evaluate the impact of a new ranker in an A/B test, we randomly split Thumbtack’s consumers into a control group that sees ordered search results generated by our existing production ranker and a treatment group that sees ordered search results generated by the new ranker. We then perform a hypothesis test to check if there is a statistically significant difference in engagement (e.g. click rate) between the control and treatment groups. In contrast, an interleaving test does not split the consumers into separate groups but rather serves a combined list of search results from the control ranker and the treatment ranker interleaved together.
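
Before turning to how the interleaved list is built, it may help to see the kind of comparison an A/B ranking test performs. Below is a minimal sketch of a two-sample z-test of proportions on click rates; the function name, the pooled standard error, and the counts are illustrative assumptions, not Thumbtack’s actual analysis code.

```python
import numpy as np
from scipy.stats import norm

def ab_click_rate_ztest(clicks_control, n_control, clicks_treatment, n_treatment):
    """Two-sample z-test of proportions for an A/B ranking test (illustrative sketch)."""
    p_c = clicks_control / n_control
    p_t = clicks_treatment / n_treatment
    # Pooled click rate under the null hypothesis of no difference.
    p_pool = (clicks_control + clicks_treatment) / (n_control + n_treatment)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return p_t - p_c, z, p_value

# Hypothetical counts, for illustration only.
print(ab_click_rate_ztest(clicks_control=900, n_control=10_000,
                          clicks_treatment=960, n_treatment=10_000))
```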

An interleaved list is built over a series of rounds: in each round, the control ranker and the treatment ranker each submit their highest-ranked professional that isn’t already in the interleaved list [1]. If both rankers choose the same professional in a round, then one of them simply moves down to its next-highest-ranked professional. A coin flip at the start of each round determines whether the control or treatment ranker goes first in submitting its professional to the interleaved list. This bit of randomization mitigates position bias, the tendency of consumers to contact pros ranked higher in a list, which would otherwise favor whichever ranker always picked first in each round.
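
The construction above is essentially the team-draft interleaving scheme from Chapelle et al. [1]. Below is a minimal Python sketch of it; the function and variable names are our own illustrative choices, and a production implementation differs in many details.

```python
import random

def team_draft_interleave(control_ranking, treatment_ranking, length):
    """Minimal sketch of team-draft interleaving (not production code).

    Each round, a coin flip decides which ranker submits first; each ranker then
    contributes its highest-ranked professional not already in the list. We also
    record which ranker "owns" each slot so that contacts can later be attributed
    when computing the preference signal.
    """
    interleaved, attribution, seen = [], [], set()
    while len(interleaved) < length:
        order = [("treatment", treatment_ranking), ("control", control_ranking)]
        if random.random() < 0.5:  # coin flip: control goes first this round
            order.reverse()
        made_pick = False
        for team, ranking in order:
            if len(interleaved) >= length:
                break
            # Highest-ranked professional from this ranker not yet in the list.
            pick = next((p for p in ranking if p not in seen), None)
            if pick is None:
                continue
            seen.add(pick)
            interleaved.append(pick)
            attribution.append(team)
            made_pick = True
        if not made_pick:
            break  # both rankings are exhausted
    return interleaved, attribution
```

The attribution list is what lets contacts be credited to one ranker or the other when computing the preference signal described next.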

Source: Chapelle et al.

Rather than calculating the difference in engagement between control and treatment groups (such groups do not exist in an interleaving test), we instead examine what proportion of contacted professionals were interleaved from the treatment ranker. This proportion is called the preference signal; a preference signal greater than 0.5 indicates that professionals recommended by the treatment ranker were preferred over those recommended by the control ranker. To determine whether a preference signal is statistically significantly different from 0.5, we simply perform a one-sample Z-test of proportions.
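
As a rough sketch, that one-sample Z-test can be computed as follows, with 0.5 as the null hypothesis that neither ranker is preferred. The function name and structure are illustrative, not Thumbtack’s actual analysis code.

```python
import numpy as np
from scipy.stats import norm

def preference_signal_ztest(treatment_contacts, total_contacts):
    """One-sample z-test of the preference signal against 0.5 (illustrative sketch)."""
    p_hat = treatment_contacts / total_contacts  # the preference signal
    se = np.sqrt(0.5 * 0.5 / total_contacts)     # standard error under the null p = 0.5
    z = (p_hat - 0.5) / se
    p_value = 2 * norm.sf(abs(z))                # two-sided p-value
    return p_hat, z, p_value

# Hypothetical contact counts, for illustration only.
print(preference_signal_ztest(treatment_contacts=560, total_contacts=1_000))
```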

Why interleaving works

The above describes how interleaving works mechanically, but it doesn’t provide much intuition about why this process can reach statistical power so much faster than A/B testing. Three reasons help build that intuition.

Reason 1: 1-sample tests are faster than 2-sample tests

In an A/B test, evaluating the statistical significance of the difference in engagement between control and treatment groups requires a 2-sample test: the control group forms one sample from which we estimate the distribution of an engagement metric like click rate, and the treatment group forms a second sample from which we do the same. Both the control and treatment estimates come with variance, so their difference also has variance; in fact, its variance is the sum of the variances of the two. In contrast, interleaving requires only a 1-sample test for the preference signal, eliminating an entire source of variability.
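
In rough terms, treating the control and treatment click rates as independently estimated proportions, the quantity an A/B test must estimate carries the variance of both groups:

$$
\mathrm{Var}(\hat{p}_T - \hat{p}_C) = \mathrm{Var}(\hat{p}_T) + \mathrm{Var}(\hat{p}_C)
$$

Interleaving instead tests a single estimated proportion (the preference signal), so only one such variance term appears.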

Reason 2: Double the sample to estimate one distribution

The second reason is related to the first: because an A/B test requires estimating two distributions, the available experiment sample has to be split in half, with each half used to estimate one experiment group’s distribution. This means that not only does an A/B test contain two sources of variance, but each source has higher variance due to the reduced sample size used to observe engagement. Interleaving, on the other hand, can use the entire available experiment sample to observe the preference signal, yielding lower variance. Put differently, part of the reason interleaving reaches statistical power so quickly is that it only requires estimating one lower-variance distribution, whereas A/B testing requires estimating two higher-variance distributions.
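
As a rough sketch under simplifying assumptions (N available consumers, click outcomes treated as independent Bernoulli trials), splitting the sample in half gives the A/B estimate a variance of roughly

$$
\mathrm{Var}(\hat{p}_T - \hat{p}_C) \approx \frac{p_T(1 - p_T)}{N/2} + \frac{p_C(1 - p_C)}{N/2},
$$

while interleaving estimates a single proportion from one pooled sample, with variance of the form $p(1 - p)/n$ for a single, un-halved sample size. Real contact behavior is not literally i.i.d. Bernoulli, but the comparison captures why one lower-variance estimate is cheaper to power than the difference of two higher-variance ones.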

Reason 3: Stronger signals from forcing a choice

Finally, the interleaving preference signal is often more sensitive to the treatment effect than the difference in engagement in an A/B test because it extracts additional signal about consumer preferences. This additional signal is gained by forcing a consumer to indicate a preference on an interleaved list even if the consumer would have contacted a professional from either A/B test group. One way to unpack this is to think of our consumer population as consisting of four mutually exclusive segments of users:

  1. Consumers who would have contacted a professional if they were shown a list from the control ranker, but not the treatment ranker. We assume these consumers would show overwhelming preference for the control ranker when shown an interleaved list.
  2. Consumers who would have contacted a professional if they were shown a list from the treatment ranker, but not the control ranker. We assume these consumers would show overwhelming preference for the treatment ranker when shown an interleaved list.
  3. Consumers who would not have contacted a professional from either list. We assume these consumers would make no contacts on an interleaved list and thus make no impact on the interleaving preference signal.
  4. Consumers who would have contacted a professional from either list. We assume these consumers would also make contacts on an interleaved list.

One of the key assumptions behind interleaving is that users in this fourth segment exist and that their preferences are ultimately predictive of which ranker will perform better in an A/B test. When this assumption holds, interleaving extracts additional signal about user preferences that an A/B test cannot, and stronger signals (i.e. larger effect sizes) require a smaller sample to reach statistical power.
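
The toy simulation below illustrates this argument under made-up assumptions; the segment proportions and the segment-4 preference rate are invented for illustration and are not Thumbtack data. Segment-4 consumers click under either ranker and so contribute nothing to the A/B click-rate difference, but on an interleaved list their contacts reveal a preference, adding signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical segment mix and preference rate, for illustration only:
# "control_only" and "treatment_only" consumers contact under exactly one ranker,
# "neither" consumers never contact, and "either" consumers contact under both.
seg_names = ["control_only", "treatment_only", "neither", "either"]
seg_probs = [0.02, 0.03, 0.90, 0.05]
p_prefer_treatment = 0.60  # assumed preference of "either" consumers on interleaved lists

def simulate(n_consumers):
    segments = rng.choice(seg_names, size=n_consumers, p=seg_probs)

    # A/B test: split consumers in half; a consumer clicks iff their segment
    # contacts under the ranker their half was shown.
    half = n_consumers // 2
    control_rate = np.isin(segments[:half], ["control_only", "either"]).mean()
    treatment_rate = np.isin(segments[half:], ["treatment_only", "either"]).mean()

    # Interleaving: every contacting consumer reveals a preference.
    contacts = np.isin(segments, ["control_only", "treatment_only", "either"])
    prefers_treatment = np.where(
        segments == "treatment_only", True,
        np.where(segments == "control_only", False,
                 rng.random(n_consumers) < p_prefer_treatment))
    preference_signal = prefers_treatment[contacts].mean()
    return treatment_rate - control_rate, preference_signal

print(simulate(100_000))
```

With these made-up numbers, the A/B lift is roughly one percentage point of click rate, while the preference signal sits around 0.6, a comparatively large effect relative to its null value of 0.5.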

Validation & results

Although interleaving has many favorable properties and rests on reasonable assumptions, the validity of those assumptions is not guaranteed; it varies from domain to domain. Thus, it was crucial for us to test how effectively interleaving would perform with our consumers and rankers. We first ran interleaving tests that compared our current production rankers against past production rankers that we knew had lower performance. After that, we ran corresponding A/B tests to:

1. Validate whether the interleaving results agree with the A/B test results.

2. Evaluate how much more sensitive interleaving is compared to our A/B tests.

As part of this validation plan, we also conducted an A/A interleaving test in which we interleave a control ranker against itself to ensure that our interleaving framework and downstream analysis were correctly implemented and that we were not introducing unintended biases. The validation results are summarized below and show that interleaving yields directionally consistent results to those of our A/B tests. Recall that preference signal values greater than 0.5 indicate preference for the treatment ranker, while preference signal values less than 0.5 indicate preference for the control ranker.

Next, we conducted bootstrap analyses to evaluate how much of a sensitivity gain interleaving provides. We ran 1,000 simulated experiments at each bootstrap sample size, and for each set of 1,000 simulations we calculated the proportion of simulated results that correctly identified the stronger ranker (i.e. the agreement rate). The results below are from a bootstrap analysis where “past ranker 2” was the treatment. They show that for this particular experiment, interleaving requires only ~400 samples to reach a 90% agreement rate, while the A/B test requires ~40,000 samples to achieve the same: a 100X improvement in sample efficiency.
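
A rough sketch of the interleaving side of this bootstrap analysis is below; the resampling unit, function names, and decision rule are illustrative assumptions rather than our exact production analysis. The A/B side is analogous, resampling consumers within each arm and comparing click rates.

```python
import numpy as np

rng = np.random.default_rng(42)

def agreement_rate(contact_is_treatment, sample_size, n_simulations=1000):
    """Fraction of bootstrap-simulated interleaving experiments of a given size
    whose preference signal correctly identifies the stronger (treatment) ranker.

    `contact_is_treatment` is a boolean (or 0/1) array over observed contacts
    indicating whether each contacted professional came from the treatment ranker.
    """
    contact_is_treatment = np.asarray(contact_is_treatment)
    wins = 0
    for _ in range(n_simulations):
        resample = rng.choice(contact_is_treatment, size=sample_size, replace=True)
        if resample.mean() > 0.5:  # preference signal favors the treatment ranker
            wins += 1
    return wins / n_simulations

# Hypothetical observed contacts in which 55% favored the treatment ranker.
observed = rng.random(20_000) < 0.55
for n in [100, 400, 1_600]:
    print(n, agreement_rate(observed, n))
```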

While we were primarily interested in the preference signals of the interleaving tests, interleaved lists themselves also have engagement metrics like click rates. We noticed that the engagement of interleaved lists skews much closer to that of the stronger ranker in the test. This is a happy surprise: if the treatment ranker is stronger, we reap its rewards during the test, and if the treatment ranker is weaker, its negative impact is limited while the test runs in production on live traffic.

Conclusion

Our validation tests convinced us that interleaving is an effective method for quickly evaluating the promise of new rankers. We now use interleaving as part of a two-stage experimentation process. In the first stage, we perform interleaving tests on a variety of newly developed experimental rankers to identify the most promising candidates. Because interleaving tests reach statistical power so much more efficiently, we can complete multiple simultaneous tests in a matter of days. Then, in the second stage, we perform A/B tests with the top candidate rankers identified from the interleaving tests.

It’s important to note that while interleaving can be up to 100X faster than A/B testing at identifying the stronger ranker, it is not an outright replacement for A/B testing: the preference signal measured via interleaving is not a substitute for the business metrics and company KPIs that A/B tests measure directly. Nonetheless, interleaving allows us to test many more experimental rankers in production than we otherwise could, and helps us identify the most promising candidates to promote to a lengthy A/B test.

Next steps

As we leverage interleaving in our two-stage experimentation process, we will continue to monitor the relationship between interleaving and A/B test results for consistency and additional insights. Looking ahead, we believe the benefits of interleaving will scale well to other product use cases, like keyword autosuggestions in the search bar and category recommendations to consumers. We have also barely scratched the surface of how we can modify the interleaving preference signal so that it is relevant to a wider array of A/B tests and KPIs. If any of this past or future work sounds interesting to you, please consider joining our team!

Acknowledgement

We are immensely grateful to the many wonderful colleagues who made this work possible. We would like to thank Navneet Rao & Wade Fuller for advocating for and prioritizing this work, as well as Richard Demsyn-Jones for his expertise and thoughtful guidance on how to perform power analysis for interleaving experiments. This work would also not have been possible without Bharadwaj Ramachandran, who led the effort to redesign adjacent systems to accommodate interleaving, Eric Ortiz, who made important improvements to experiment configuration setup and data tracking, and Sam Oshin & Tim Huang, who provided extensive code reviews and critical engineering advice.

References

[1] Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30, 1, Article 6 (February 2012)

Additional Reading

We’ve greatly benefited from the insights shared by teams at other companies about their experiences applying interleaving to their experimentation processes. We recommend giving these informative articles a read.
