A Product Manager’s notes on A/B testing
“There are three kinds of lies: lies, damned lies and statistics.”
— Popularised by Mark Twain
In tech companies, this criticism is reserved almost exclusively for the results of A/B tests and other ‘scientific’ experiments. Product Managers, who usually present these results, need to know how these tests work, lest they be mistaken for liars (or damned liars).
This article is my attempt at digging into the details of A/B tests and explaining why some of these details should not be taken for granted. Over the last couple of years, I have run and reviewed dozens of A/B tests and I have been trying to understand why some of these experiments add a lot of value, while others don’t.
Note: This is not exactly an introductory article on A/B tests. I have attached some resources at the end if you need to learn or revise some of the main concepts associated with A/B tests.
Defining the ideal A/B test
“What’s the perfect A/B test?” My boss asked me this question while I was trying to define a central process for A/B testing at the startup I used to work for. The simplest answer is this:
The perfect A/B test produces conclusive results that allow us to correctly validate or discard a hypothesis.
Like most things in life, this is easier said than done. If you’ve run actual experiments, you’ll have faced issues with pretty much every part of the process. Everything will go wrong.
Before we get into the details of why they do go wrong, you’ll need to think about how important a truly scientific result is to you.
“So what if the results aren’t clean?” is NOT an outrageous question. In the interest of time, resources and expertise, you might choose to overlook some of the minutiae associated with these experiments.
After a lot of existential musing about these tests, I arrived at a conclusion: understand whether properly scientific results are key to your decision making. If directional results or pre-post analyses are enough, don’t go through the stress of a proper A/B test. But if even small variations in the experiment results might affect your decisions, don’t spare a single detail.
And now on to those details…
Issues with experiments and trying to deal with them
For the rest of this article, we’ll be going through a semi-fictional story. Any resemblance to real persons or real-life entities is mostly (at a 95% significance level) coincidental.
It’s a lovely Friday. You had worked tirelessly with your designer and your engineers to redesign the checkout page of your e-commerce app. You were all sold on how this would improve conversion. But since you had to validate your claims, you had started an A/B test two months ago.
Today, your analyst has sent you a report on how the redesign has improved conversion (# of orders placed / # of checkout sessions) by 5%! It’s a huge win for the team. You quickly type out a long mail bragging about the impact and thanking everyone for their relentless efforts.
You then decide to launch your feature for all users and head back home with a really smug look on your face.
After chilling over the whole weekend, you check your mail on Monday morning to find a reply to your ‘victory speech’:
I thought you said conversion would improve by 5%? It has barely improved by 3% after the launch!
— From the Head of Marketing, whose costs haven’t reduced as much as you had promised
By now, you have realised that experiment results don’t translate directly to real-world impact. The gap between the two numbers comes from how different your experiment setup is from the real world, or, in statistical terms, how different your sample is from the population.
It’s not trivial to make the experiment setup mimic the real world. But when your experiment setup is not even close to the real world, your results will not have any meaning. A few tips on avoiding this situation:
Choosing the right Test and Control groups
I like to focus on two aspects of these choices:
- Distributions of important attributes across Test, Control and the actual population: What is the split between New and Existing users in the Test and Control groups? How different is this from your active user base? New vs Existing matters because their conversion rates are drastically different. Similarly, if you’re working with a mobile app, the distribution across device types and operating systems will matter. Stratified sampling is one technique that might help you overcome these disparities (see the sketch after this list).
- Pre-existing biases: Did your Test group convert better than your Control group even before the experiment began? This might seem unlikely but I have been surprised by the number of times this ended up being true. Try to change your Test and Control groups or account for this bias in your analysis once the test concludes.
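To make these two checks concrete, here’s a minimal sketch in Python (pandas + scikit-learn). The file and column names (users.csv, user_type, pre_converted) are made up for illustration; substitute whatever your own data looks like.

```python
# A minimal sketch: stratified Test/Control assignment plus a pre-experiment
# bias check. The file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

users = pd.read_csv("users.csv")  # one row per user

# Stratify the 50/50 split on user_type (add device or OS if those matter)
# so that Test and Control mirror each other on the attributes you care about.
control, test = train_test_split(
    users, test_size=0.5, stratify=users["user_type"], random_state=42
)

# Check 1: the New vs Existing mix should look the same in both groups,
# and roughly match your active user base.
print(test["user_type"].value_counts(normalize=True))
print(control["user_type"].value_counts(normalize=True))

# Check 2: conversion *before* the experiment should be roughly equal.
# A noticeable gap means the groups were biased before you changed anything,
# so reshuffle them or account for the gap in the final analysis.
print(test["pre_converted"].mean(), control["pre_converted"].mean())
```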
Underplaying your results
Despite your best efforts, it’s still nearly impossible to guarantee the exact impact in the real world. There will always be factors you haven’t controlled for, and analysing all of them would be extremely tedious. So take the easy way out: don’t over-promise. If the experiment results indicate a 5% increase in conversion, set the expectation that the actual increase will be in the ballpark of that number (let’s face it, it’ll usually be less) and that the whole impact will not be realised.
Now that you have found the root cause of the issue and explained it to your Head of Marketing, you want to relax. But of course, that’s bound to be ruined by another reply on the same thread:
Why did the experiment not conclude on time? What took you so long?
— From your boss, who has suddenly noticed that you had promised to wrap up the experiment within two weeks
Timing is everything.
You don’t want to run experiments for any longer than it takes to reach a statistically significant conclusion. Don’t rush and declare results prematurely.
More importantly: don’t drag out the experiment just because you’re not getting the results that favour you.
Here are some reasons why we might be getting the timing wrong:
Overestimating possible impact
The larger the improvement you expect, the shorter the experiment needs to be to detect it. The catch: if you overestimate the impact, you end up with a test that is underpowered for the smaller effect you actually get. So do NOT assume your feature will work really well; plan for a modest effect and buy yourself enough time to run the experiment.
You can play around with this significance calculator to better understand how expected change affects the experiment duration.
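If you’d rather see the relationship directly, the textbook two-proportion sample-size formula makes the point. This is only a rough sketch; the baseline conversion and the expected lifts below are made-up numbers.

```python
# Rough sample size per group needed to detect a lift in a conversion rate,
# using the standard two-proportion formula. All numbers are made up.
from scipy.stats import norm

def sample_size_per_group(p_baseline, relative_lift, alpha=0.05, power=0.8):
    p_new = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired statistical power
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    return (z_alpha + z_beta) ** 2 * variance / (p_baseline - p_new) ** 2

baseline = 0.10  # say, a 10% checkout conversion
for lift in (0.10, 0.05, 0.02):
    n = sample_size_per_group(baseline, lift)
    print(f"Expected lift of {lift:.0%}: ~{n:,.0f} sessions per group")
```

Halving the lift you expect roughly quadruples the sample you need, which is exactly why overestimating the impact leaves you with a test that ends too early, or drags on while you wait for significance that never comes.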
Not cutting your losses in time
If you’re able to statistically conclude that your feature isn’t working, STOP. Remember that significance is a mathematical concept, not a state of mind.
It’s okay if your feature doesn’t work. It’s not okay if you don’t realise this in time to cut your losses. Stop praying for miracles. Stop funding the Concorde.
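If it helps to make “statistically conclude” concrete, a two-proportion z-test on the data collected so far is one simple way to call it. The counts below are hypothetical.

```python
# One way to call a losing variant: a two-proportion z-test on the results
# collected so far. The counts are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

conversions = [450, 550]     # orders placed: [test, control]
sessions = [10_000, 10_000]  # checkout sessions: [test, control]

z_stat, p_value = proportions_ztest(count=conversions, nobs=sessions)
lift = conversions[0] / sessions[0] - conversions[1] / sessions[1]

if p_value < 0.05 and lift <= 0:
    print(f"Significant and not positive (p = {p_value:.3f}): stop the test.")
else:
    print(f"Not conclusive yet (p = {p_value:.3f}): keep collecting data.")
```

One caveat: if you check the results every day and stop the moment the p-value dips below 0.05, the test no longer holds the significance level it promises. Either commit to a sample size upfront or use a sequential testing approach; Evan Miller’s blog (linked in the resources) covers this trap well.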
After explaining why the experiment took too long, you’re almost waiting for someone else to complain about your results. This time, it’s a peer. What a great start to the week!
You have improved conversion by 5% but the cancellation rate has increased by 2%. I hate you.
— From a fellow PM, whose graph you’ve screwed over
Remember that the experiment is not just about your metric.
It’s possible to increase conversion by promising free Netflix subscriptions on the homepage even though you have no way of giving them out. This would hurt every other metric, from retention to the number of calls received by your support team. It’s a silly example, but the lesson is real: make sure every affected metric is tracked in your experiment, not just the one you set out to move.
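One habit that helps is reporting the guardrail metrics right next to the primary one in the same analysis, so a regression can’t hide. A minimal pandas sketch, with made-up file and column names:

```python
# Report guardrail metrics alongside the primary metric for each group.
# The file and column names are made up.
import pandas as pd

sessions = pd.read_csv("checkout_sessions.csv")  # one row per checkout session

report = sessions.groupby("experiment_group").apply(
    lambda g: pd.Series({
        "conversion": g["order_placed"].mean(),
        # cancellation rate among the orders that were actually placed
        "cancellation_rate": g.loc[g["order_placed"] == 1, "cancelled"].mean(),
        "support_contact_rate": g["contacted_support"].mean(),
    })
)
print(report)
```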
After promising to dive deeper into whether your changes actually affected the cancellation rate, you stumble upon a new mystery:
Conversion for Bengaluru users has not changed at all. Do you hate building for South Indians?
— From your Business Lead, who religiously tracks metrics for a single city
This is a classic misunderstanding of significance, not favouritism for certain geographies. Some more details:
Significance at a granular level
You concluded your experiment at a certain significance level (usually 95%) for users across all cities. This does not mean that your results are significant at a city level. To obtain significance at more granular levels, you’ll need to run the experiment for longer. Before declaring your results, understand which parts are significant and which ones aren’t.
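In practice this just means running the same significance test per segment, and accepting that the smaller samples often won’t clear the bar. A sketch, with the same hypothetical columns as before:

```python
# Run the same two-proportion z-test city by city. Smaller samples mean wider
# uncertainty, so a city can look "flat" without contradicting the overall win.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

sessions = pd.read_csv("checkout_sessions.csv")  # hypothetical columns

for city, grp in sessions.groupby("city"):
    counts = grp.groupby("experiment_group")["order_placed"].agg(["sum", "count"])
    _, p_value = proportions_ztest(count=counts["sum"], nobs=counts["count"])
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{city}: p = {p_value:.3f} ({verdict})")
```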
Biases at a granular level
While your overall Test and Control groups might be unbiased, it’s entirely possible that, within your experiment, Bengaluru users skewed heavily towards Existing users. You can’t verify these biases for every smaller group, but you can at least check the segments that behave differently from the rest.
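Checking one of these suspicious segments is usually a one-liner: compare the composition of Test and Control within that segment alone. Column names are illustrative again.

```python
# Compare the New vs Existing mix of Test and Control within one city.
# Column names are illustrative.
import pandas as pd

sessions = pd.read_csv("checkout_sessions.csv")
blr = sessions[sessions["city"] == "Bengaluru"]

# Rows are experiment groups, columns are user types, values are row-wise shares.
print(pd.crosstab(blr["experiment_group"], blr["user_type"], normalize="index"))
```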
And that’s the end of this story. After having responded to these mails, you may go on to live happily ever after until you can think of another hypothesis that is worth testing.
In conclusion
I’m 99% sure (enough to conclude) that there are more issues to be found with the A/B tests being run every day, even through third-party tools. These are only some of the mistakes I have made and seen others make.
If there’s one thing you should take away from this article, it is this: do NOT take any details for granted. Question everything about the experiments you conduct and the ones you review.
While this will be tedious, it will take you closer to the truth. Or at least away from damned lies. Happy testing!
Resources
Comparison with multivariate testing
Evan Miller’s take on A/B testing mistakes. You should check out his entire blog, which has a lot of really cool resources on A/B testing.
Significance calculators: Conversion specific, Generic, For averages
How to lie with statistics: A classic book that has remained relevant for over six decades!