Don’t trust A/B testing
First, I should make something clear: I am a big fan of A/B testing. It is a great way of iteratively improving your website or application and increasing your revenue.
However, I think we tend to misuse it. We believe A/B testing can solve all our business and design issues, and we rely on its results without question.
A/B testing is not perfect, and neither are the tools running it.
The yo-yo effect is a well-known example. Imagine you have just finished an A/B test. A challenger experience won with a 7% lift in conversion rate. “Statistically significant”, your testing tool claims. You stop the test and implement the challenger experience. A week later you open your web analytics tool and just can’t see the promised 7% lift. The conversion rate is flat or, even worse, decreasing. And you’re wondering why.
Does it sound familiar to you?
In this post I’d like to cover the key challenges of A/B testing that we often ignore or underestimate. These challenges often cause bad decisions which eventually lead to worse business performance.
After reading the post I’d like you to be more thoughtful about how you use A/B testing: to understand its imperfections, to use the method correctly, and to analyze results better.
Let me start with one of the most common pitfalls leading to incorrect decisions: your effort to reach statistically significant results in your A/B tests.
Statistical Significance
When you’re considering the statistical significance of your test results, don’t trust the A/B testing tool itself. From my experience, these tools tend to declare a statistically significant winner too early.
Rather, double-check your results in a good A/B test calculator. My favourite is the one from AB Testguide, one of the few that show results as they truly are: as intervals. Most tools simplify things and display conversion rates as a single number (e.g. 2.7%). But in reality it is an interval, e.g. 2.7% ± 0.8%. Both the mean value and the size of the interval matter. You can be confident that a challenger experience is better when its interval doesn’t overlap with the interval of the default experience.
The example below shows that there is only a small chance the B experience with a 2.7% conversion rate (mean value) would be worse than the A experience. The two intervals barely overlap and there is only a 1% chance that B is worse than A.

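If you want to sanity-check numbers like these yourself, here is a minimal Python sketch using a normal approximation. The visitor and conversion counts are hypothetical, chosen only to roughly match the example above:

```python
from math import sqrt, erf

def conversion_interval(conversions, visitors, z=1.96):
    """Mean conversion rate and half-width of its ~95% interval (normal approximation)."""
    rate = conversions / visitors
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate, margin

def prob_b_beats_a(conv_a, vis_a, conv_b, vis_b):
    """Approximate probability that B's true conversion rate is higher than A's."""
    rate_a, rate_b = conv_a / vis_a, conv_b / vis_b
    se = sqrt(rate_a * (1 - rate_a) / vis_a + rate_b * (1 - rate_b) / vis_b)
    z = (rate_b - rate_a) / se
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF

# Hypothetical counts: A converts ~2.0%, B converts ~2.7%
rate_a, margin_a = conversion_interval(100, 5000)
rate_b, margin_b = conversion_interval(135, 5000)
print(f"A: {rate_a:.1%} ± {margin_a:.1%}")
print(f"B: {rate_b:.1%} ± {margin_b:.1%}")
print(f"P(B beats A) ≈ {prob_b_beats_a(100, 5000, 135, 5000):.1%}")
```

With these made-up numbers the two intervals barely overlap and the probability that B beats A comes out around 99%, which is the kind of picture a good calculator will show you.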
Statistical significance is only one of the key aspects of a correctly finished test. Beyond that, you should collect enough conversions (at least 200 per experience), run the test for one or two business cycles, and see stable results.
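How many conversions and visitors is “enough” depends on your baseline conversion rate and the smallest lift you care about. A rough sketch of the standard two-proportion sample-size estimate (95% confidence, 80% power); the baseline rate, target lift and weekly traffic below are made up for illustration:

```python
from math import sqrt, ceil

def visitors_per_variant(baseline_rate, relative_lift,
                         z_alpha=1.96, z_beta=0.84):
    """Rough visitors needed per experience to detect the given relative lift
    (two-sided 95% confidence, 80% power, normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical: 2% baseline conversion rate, hoping to detect a 10% relative lift
n = visitors_per_variant(0.02, 0.10)
print(f"~{n} visitors per experience")
print(f"~{ceil(n / 5000)} weeks at 5,000 visitors per experience per week")
```

For a 2% baseline and a 10% relative lift this comes out at roughly 80,000 visitors per experience, which is exactly why small lifts on low-traffic pages take so long to prove.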
Seasonality & Traffic Mix
Your A/B test results are based on the traffic that entered the test. Don’t forget it. If you have a seasonal business, you most likely have different types of audiences in different time periods. A common example: your winter A/B test results are not necessarily applicable during the summer.
The same applies to your traffic mix. Aim to A/B test in periods with the broadest traffic mix. Avoid testing during Christmas, Black Friday, sudden economic slumps, and so on. People behave differently then, and the results you get might not be valid for the rest of the year.
Rather, re-test for different segments and in different time periods so you can be sure you made the right decision.
Cookies
Most A/B testing tools are based on cookies, and that means a lot of potential issues. A prospect who sees a challenger’s copy might really like it, and it does persuade her to buy your product. But she is in the office and doesn’t have time to place an order now. When she finally makes it home, she sits at her computer and places the order. But at home she has a different computer with a different browser, and therefore possibly sees a different experience (e.g. the default experience). So the conversion is assigned to the default experience, even though it was the challenger experience that persuaded her.

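To see why a different browser can mean a different experience, here is a simplified sketch of how many tools bucket visitors: the variant is derived from a random ID stored in the cookie, so a new device means a new cookie, a new ID, and possibly the other variant. This is a generic illustration, not any particular tool’s implementation:

```python
import hashlib
import uuid

def assign_variant(visitor_id: str, experiment: str = "homepage_copy") -> str:
    """Deterministically bucket a visitor into A or B based on the ID stored
    in their cookie. Same cookie -> same variant; new cookie -> coin flip again."""
    digest = hashlib.md5(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

office_cookie = str(uuid.uuid4())  # cookie set on the office computer
home_cookie = str(uuid.uuid4())    # a different cookie on the home computer

print(assign_variant(office_cookie), assign_variant(home_cookie))  # may differ
```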
Nowadays this is even more pressing with mobile. Most of us use multiple devices: reading about a product on an iPad, checking its price and stock availability on a phone, and finally placing the order on a PC or Mac. Unfortunately, the current tools on the market are not capable of keeping a persistent testing experience across all of a customer’s devices. I believe this will improve in the future as the tools find a way to display the same experience on every device.
But today A/B testing is very challenging in a multi-device world. You can isolate your tests to a specific device, but that doesn’t entirely solve the issue. You can collect a massive amount of data to minimize the risks. But how big is massive? Is a test running that long still valid? Different seasons and expired cookies come into play.
ROPO Effect
Do you still have a lot of customers who research online and purchase offline? Then you have another challenge to fight when it comes to A/B testing.
It is similar to the multi-device example. Your bricks-and-mortar store is another “device” which customers can use at different stages of their buying decision process.
If this issue is relevant for you, I recommend using different discount coupons in each testing experience so you can track which experience drives customers to the physical store.
Next to that, you can benefit from showing a different phone number in each testing experience so you can again recognise which experience led to the call.
Long Buying Decision Process
This issue happens often, though it is ignored in many cases. Find out (either by using analytics or by talking to your customers) how long your customers’ buying decision process is. It makes a big difference whether it is one week or four months. I’ll explain why.
The shorter the decision-making process, the easier your A/B testing will be.
Imagine you’re selling mortgages online. That usually involves quite a long decision-making process. A mortgage is a serious thing, so it often takes several months to choose the right one. When you’re testing your main landing page, you must know that you will get traffic from customers at very different stages. Some might be just starting their research and some might be almost decided.
If you A/B test a landing page for two weeks, you mostly capture the already decided customers. So your test might not tell you anything about the specific element you’re testing; people who were persuaded by it will convert in two months or later.
Be aware of it and set the right expectations and goals. If your business requires a long time for customers to decide, test at the level of micro-conversions (downloading a white paper, filling out a lead form, etc.).
Optimising for Customer Lifetime Value
If you’re running an e-commerce business (and not only then), you don’t simply want to get more customers. You actually want to get more excellent customers: those who will be loyal, make repeat purchases, and recommend your product to their friends.
Imagine that your latest A/B test shows that a challenger experience drives 15% more customers. That’s not bad. But what if those extra 15% were bad customers who don’t buy again and don’t recommend you further? What if the default experience drives less but higher quality customers? Wouldn’t it be better to stay with the default experience?
If you have long-term goals in mind, it certainly would. But you’re probably asking now how you can tell an excellent customer from a bad customer in an A/B test.
It is a challenge. But what you can do today is include a “recommendation likelihood” survey in your post-purchase email and then measure it per testing experience. A high recommendation ratio doesn’t automatically mean better customers, but it is a signal.
Over a 3-month period, check whether customers from the challenger experience and from the default experience made repeat purchases. If so, what did they buy and for how much? Don’t be afraid to revise the conclusions of a 3-month-old A/B test if you can see the default experience was driving more valuable customers!
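A sketch of what that follow-up analysis could look like in Python, assuming you can export orders together with the test variant each customer was bucketed into; the file and column names here are made up:

```python
import pandas as pd

# Hypothetical export: one row per order, with the test variant the customer
# saw ("A" = default, "B" = challenger). Assumed columns:
# customer_id, variant, order_date, revenue
orders = pd.read_csv("orders_last_3_months.csv", parse_dates=["order_date"])

# One row per customer: how many orders they placed and total revenue
per_customer = (orders
                .groupby(["variant", "customer_id"])
                .agg(order_count=("order_date", "count"),
                     revenue=("revenue", "sum")))

# Compare the two experiences on long-term signals, not just conversions
summary = per_customer.groupby("variant").agg(
    customers=("order_count", "size"),
    repeat_purchase_rate=("order_count", lambda c: (c > 1).mean()),
    avg_revenue_per_customer=("revenue", "mean"),
)
print(summary)
```

If the default experience shows a clearly higher repeat-purchase rate or revenue per customer, that is a strong hint the challenger’s extra conversions were lower-quality customers.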
Make better decisions
The goal of this post wasn’t to discourage you from A/B testing. The point is to make you more aware of the lesser-known pitfalls of A/B testing. So next time, think carefully about your tests and their results, and make better decisions.
That is essentially the main purpose of A/B testing: to make better decisions. Go beyond tools and basic reports. Add the specific details of your business, your experience and knowledge. Before launching a test, make sure you did your best in the setup so you can make the right decision afterwards.
If you’ve enjoyed reading this post, please hit the “Recommend” button or share it so more people become more thoughtful about how they use A/B testing. Thank you!
PS I: Thanks to Craig Taylor for valuable comments during the writing process.
PS II: Thanks to Martin Snizek for a coffee chat which sparked the idea of this article.