Trustworthiness is vital to A/B testing

Our efforts to ensure a reliable A/B testing platform

Qike (Max) Li
Wish Engineering And Data Science
3 min read · May 24, 2022

Contributors: Max Li, Chao Qi, Eric Jia

At Wish, we deeply value the reliability of our experimentation platform. We published ‘Measure A/B Testing Platform Health with Simulated A/A and A/B Tests’ and ‘A/A Testing Establishes Trust in Experimentation Platform’, two publications that demonstrate our efforts to ensure a reliable A/B testing platform. This post summarizes those studies.

A/B testing plays a critical role in decision-making at data-driven companies; it typically provides the final go/no-go call for a product launch. Inaccuracies in A/B testing can therefore degrade every business decision derived from A/B tests. At Wish, we continuously evaluate the reliability of our A/B testing platform. In the published articles, we share our efforts to ensure a trustworthy A/B testing platform by controlling type I errors (false positives) and type II errors (false negatives) through A/A tests, pre-assignment tests, and simulated A/A and A/B tests.

Type I and type II error (Image by Ming Gong from Wish)

We run various A/A tests for different scenarios, such as different steps in the conversion funnel (e.g., impression, product click, adding to the shopping cart, and product purchase), client-side and server-side experiments, logged-out and logged-in experiments, etc. Since the experiment buckets (e.g., control and treatment) in A/A tests are identical, any statistically significant result returned from an A/A test is a false positive.
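To make this concrete, here is a minimal sketch (not Wish's production code) of how repeated A/A comparisons expose the false positive rate: both buckets are drawn from the same distribution, so roughly 5% of tests should reach significance at alpha = 0.05. The metric, baseline rate, and sample sizes are illustrative assumptions.

# Minimal A/A simulation sketch: both buckets are identical by construction,
# so any significant result is a false positive. Expect the observed rate
# to sit close to alpha; a materially higher rate would flag a problem in
# randomization or metric computation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
alpha, n_tests, n_users = 0.05, 1000, 5000

false_positives = 0
for _ in range(n_tests):
    # Hypothetical per-user conversion metric; control and treatment share
    # the same underlying rate.
    control = rng.binomial(1, 0.03, size=n_users)
    treatment = rng.binomial(1, 0.03, size=n_users)
    _, p_value = stats.ttest_ind(control, treatment)
    false_positives += p_value < alpha

print(f"Observed false positive rate: {false_positives / n_tests:.3f}")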

We run a pre-assignment test for each experiment. A pre-assignment test is a retrospective A/A test that uses data from the X (e.g., 60) days before the start date of an experiment. A statistically significant result from the pre-assignment test indicates a biased A/B test.
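Below is a hedged sketch of what such a retrospective check might look like: the experiment's bucket assignment is applied to each user's pre-period metric, and a difference between buckets before the experiment even starts signals bias. The column names (bucket, pre_period_metric) and the use of Welch's t-test are illustrative assumptions, not Wish's actual schema or statistic.

# Sketch of a pre-assignment (retrospective A/A) check on pre-period data.
import pandas as pd
from scipy import stats

def pre_assignment_test(df: pd.DataFrame, alpha: float = 0.05) -> bool:
    """Return True if the pre-period metric differs significantly between
    buckets, which would indicate a biased assignment."""
    control = df.loc[df["bucket"] == "control", "pre_period_metric"]
    treatment = df.loc[df["bucket"] == "treatment", "pre_period_metric"]
    _, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    return p_value < alpha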

We also run hundreds of simulated A/A tests and A/B tests every day to measure the false positive rate and false negative rate, respectively. The simulated A/A and A/B tests reproduce the metric calculations of real A/A and A/B tests but with simulated offline randomizations. Further, in the simulated A/B tests, we simulate various scenarios of feature impact and evaluate the power (1 − false negative rate) of the A/B tests in those scenarios.
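As an illustration, the sketch below simulates an A/B test with a known injected lift and estimates power as the fraction of simulations that detect it. The baseline conversion rate, relative lift, and sample size are assumed values for demonstration only, not figures from our platform.

# Simulated A/B test sketch: inject a known lift into treatment and measure
# how often the test detects it (power = 1 - false negative rate).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=11)
alpha, n_sims, n_users = 0.05, 1000, 5000
baseline_rate, relative_lift = 0.03, 0.10  # assumed 10% relative lift

detections = 0
for _ in range(n_sims):
    control = rng.binomial(1, baseline_rate, size=n_users)
    treatment = rng.binomial(1, baseline_rate * (1 + relative_lift), size=n_users)
    _, p_value = stats.ttest_ind(control, treatment)
    detections += p_value < alpha

print(f"Estimated power at a {relative_lift:.0%} lift: {detections / n_sims:.2f}")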

The more we improve our A/B testing platform, the more we realize that the devil is in the details. A systematic evaluation of the health of the A/B testing platform is paramount. Stay tuned for more studies in this area.

Thanks to Chengxi Shi, Song Wei, Todd Hodes, and Rob Resma for creating the A/A tests, and to Ming Gong for making the image. We are also grateful to Pai Liu for his support, and to Pavel Kochetkov, Lance Deng, and Delia Mitchell for their feedback. Data scientists at Wish are passionate about building a trustworthy experimentation platform. If you are interested in solving challenging problems in this space, we are hiring for the data science team.
