11 Worst A/B Testing Mistakes According to Experts

A/B testing is the bread and butter of many UX designers and conversion rate optimization experts.

They vouch for its effectiveness, claiming that A/B tests are the best way to improve a website.

Because of that, many website and business owners jump on the bandwagon and start running tests themselves, hoping to end up with a perfect site.

The problem is, A/B testing needs to follow certain principles in order to be effective. Ignore them, and the results can be disastrous.

We contacted several UX and conversion rate optimization experts, including Jeff Sauro, Keith Hagen, Scott Belsky, Andy Budd, Brian Massey, Peep Laja and Paul Olyslager, and asked them one question: what are the worst A/B testing mistakes?

Are you guilty of any?

Not using A/B testing

A/B testing is the best way to determine whether a design proposal actually benefits your business goals. After all, we can’t just redesign pages on a whim, and even when hard data backs up our decisions, we can still make mistakes when turning that data into a change.

A/B testing means sending a portion of your traffic through a controlled experiment. We are all subjects of such tests all the time, after all (Facebook, wink wink).

Jeff Sauro from MeasuringU, the author of Customer Analytics for Dummies, explains why not testing is the greatest mistake you could make:

“Biggest issue with A/B testing is NOT A/B testing. Despite the tools and press, it still takes a change in thinking and process to get those experiments running.
Most of the problems are minor compared to the mistake of not using what is essentially a randomized controlled trial to improve the user experience. It’s the most solid tool researchers in ANY discipline have to establish causation, and we can implement it essentially for free.”

Missing a hypothesis

Testing is good, but knowing what to test is even better. The best-case scenario is knowing why you want to test.

Establishing a hypothesis based on data gathered through analytics, be it statistical or visual, can help you avoid a major disaster. Changing designs on a whim is basically a sin.

Brian Massey from Conversion Scientist shared a very tasty take on this matter (warning: do not read before lunch):

“It is sad to see burgeoning conversion optimization efforts die an early death. They die of malnutrition, wasting away until support for them is all dried up.
What nourishes a CRO program? Hypotheses. Like the food we eat, hypotheses can be nourishing and life-giving. They can also be fast food, filling an empty stomach but leaving the body malnourished.
What are the vegetables of hypotheses? What does a balanced diet look like? How can you avoid the junk food of conversion ideas?
Nutritional hypotheses are not invented by your boss or other company executives. They will put a soda machine in your proverbial elementary school. They are not the first thing that comes to your mind. You are not some binge eater bored with a commercial break from their favorite afternoon soap opera.
Nutritional hypotheses take hard work to select and prepare. They not only give you a lift in conversion rate, but tell you something about your visitors, something secret and important, something your competitors don’t know.
You don’t live in a hypothesis desert. You have the tools for selecting hypotheses: analytics, online testing tools, click tracking tools, session recording tools. Use them to find meaningful hypotheses that will nourish your conversion optimization program and your business as well.”

Paweł Ogonowski from Conversion.pl provides a quick how-to:

“Before you start a test, form a great hypothesis. To do that, conduct a solid analysis using your quantitative data and visual analytics, add some insights from customers through usability testing and surveys, and top it off with a heuristic analysis. Only then can you come up with a great hypothesis for your next test!”

Testing things that are not worth it

The knee-jerk reaction of every fresh A/B tester is to test absolutely everything.

After all, even a small change can skyrocket the stats; we don’t know until we try, right? The most frequently quoted example is HubSpot, which changed a button from green to red. All it took was a new color and, voilà, 21% more conversions.

Firstly, the “red is better than green” myth has already been debunked in one of my articles about how contrast rules perception. Secondly, you need to understand which elements actually influence your visitors, and whether for better or worse. Sometimes a design is not worth A/B testing at all; it’s better to scrap it altogether.

Andy Budd, a user experience designer, makes a good point:

“While most tools are prone to overuse or even abuse, few have received more criticism than A/B testing. Often used to outsource important design decisions, companies lacking strong design leadership will try to A/B test their way to good design.
What you end up with is something called a “local maximum,” where you make a bad design the best it can be, while missing the opportunity to make something significantly better. As such, A/B testing needs to be used as part of a balanced research diet, rather than a crutch for nervous product managers and inexperienced designers.”

Establishing wrong goals

How do you judge whether a test was successful? Because the result appeals to your taste, or because your manager thinks it’s better now?

You need to establish a proper benchmark for your test, i.e. a metric that determines whether the change was effective.

Paul Olyslager, a UX designer at Nu3:

“It is easy to take conversion rate as your ultimate goal when running an eCommerce website, because that is the ultimate goal.
A/B testing is about progressively testing hypotheses. Try to select a more precise goal that matches your hypothesis.
If you believe that a certain change in the interface would make it easier to add a product to the cart, your goal is the average basket size or the number of clicks on the “Add to Cart” button.
Of course, you should add conversion rate as a goal as well but use it as an indicator instead of the deciding factor.”
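To make Paul’s advice concrete, here is a minimal Python sketch of reporting a hypothesis-specific goal alongside conversion rate. The `Session` record and its fields are hypothetical stand-ins for whatever your analytics export provides; the point is only that the primary metric matches the hypothesis (easier adding to cart), while conversion rate is kept as an indicator.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    """Hypothetical per-session record; adapt the fields to your own analytics export."""
    variant: str             # "A" (control) or "B" (new add-to-cart design)
    add_to_cart_clicks: int
    basket_items: int
    converted: bool


def summarize(sessions: list[Session], variant: str) -> dict[str, float]:
    rows = [s for s in sessions if s.variant == variant]
    baskets = [s.basket_items for s in rows if s.basket_items > 0]
    return {
        # Primary goal: share of sessions with at least one "Add to Cart" click.
        "add_to_cart_rate": sum(s.add_to_cart_clicks > 0 for s in rows) / len(rows),
        # Secondary goal: average size of non-empty baskets.
        "avg_basket_size": mean(baskets) if baskets else 0.0,
        # Reported as an indicator, not as the deciding factor.
        "conversion_rate": sum(s.converted for s in rows) / len(rows),
    }
```

Calling `summarize(sessions, "A")` and `summarize(sessions, "B")` gives you the two variants side by side, with the hypothesis-specific metric listed first.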

Paweł Ogonowski from Conversion.pl:

“Not every test reaches the minimum sample size and statistical significance. Even though people do run tests, they do not check the results properly, which leads to wrong conclusions.”
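The minimum sample size is something you can estimate before launch instead of discovering it afterwards. The sketch below uses the standard normal-approximation formula for a two-sided, two-proportion z-test; the default 5% significance level and 80% power are common conventions, not values prescribed by the experts quoted here.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, min_detectable_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed in EACH variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)  # lift is relative, e.g. 0.20 = +20%
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

For example, with a 3% baseline conversion rate and a 20% relative lift you want to detect, `sample_size_per_variant(0.03, 0.20)` comes out at roughly 14,000 visitors per variant. Halving the lift you want to detect roughly quadruples that number.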

Match your goals to the hypothesis: choose metrics that respond directly to the change you are testing.

Waiting too long to end a test

We are scientists and A/B tests are science.

To choose a winner, we wait long enough to reach sufficient confidence. That’s something we all know and do.

However, many people obsess over the idea that you must wait for that magic 95% confidence before declaring a test successful.

True, we cannot call tests too early, but we shouldn’t wait for them too long either.

Jeff Sauro from MeasuringU shared this insight:

“Trying to get the confidence too high — when the consequences of being wrong are the user gets the same design, it’s usually not necessary to insist on >95% confidence.
If after a week you have 87% confidence, go with the winner rather than waiting for an arbitrarily high level of confidence (unless of course you have a good reason to insist on high confidence)”
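As a rough illustration of how a figure like “87% confidence” can be computed and then used as a decision threshold, here is a sketch based on a pooled two-proportion z-test. Treating confidence as 1 minus the two-sided p-value is an assumption (testing tools define it in different ways), and the 85% default in `pick_variant` is a hypothetical example of a lower bar you might accept when the cost of being wrong is small.

```python
import math
from statistics import NormalDist
from typing import Optional


def confidence_level(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """1 minus the two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value


def pick_variant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                 required_confidence: float = 0.85) -> Optional[str]:
    """Return the better variant once the chosen confidence bar is met, else None."""
    if confidence_level(conv_a, n_a, conv_b, n_b) < required_confidence:
        return None  # not enough evidence yet: keep running or keep the control
    return "B" if conv_b / n_b > conv_a / n_a else "A"
```

With 300 conversions out of 10,000 visitors for A and 345 out of 10,000 for B, confidence comes out at about 93%: enough to pick B under an 85% bar, but not under 95%.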

If your test has been performing well enough over a reasonable period (not three days, but at least a week!), waiting even longer is just a waste of time.

Paul Olyslager follows the same principle:

“Personally, I try to run the A/B test until its results are statistically significant. If the experiment takes too much time to become statistically significant, we stop it immediately.
I know that some big companies I have come into contact with over the years use this as an indicator and rarely wait for 95%.”

The takeaway: A/B tests are science and you need hard data to make decisions about them, but don’t take it overboard. The magic 95% might not arrive for a long time, and waiting for it wastes your focus.

Focusing on statistical significance only

In the same vein as the previous point, statistical significance is just one of the factors you need to consider when testing. The other is practical significance:

“Focusing on statistical significance instead of practical significance. If you have a large enough sample (and A/B tests usually get them), most of your studies will reach statistical significance.
But statistically significant might mean just a .01% increase in click-through rates. On sites like Wikipedia that may matter when the traffic is huge, but for many other sites, the difference is imperceptible on the bottom line.
Even worse, you’ve wasted time on one low yield study instead of running a more fruitful one.”
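One way to put practical significance into numbers is to compare the confidence interval for the difference in conversion rates with the smallest difference that would actually matter to your business. In this sketch, `min_worthwhile_diff` is a hypothetical threshold you would set yourself, and the interval uses the unpooled normal (Wald) approximation, which is one of several reasonable choices.

```python
import math
from statistics import NormalDist


def diff_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                  confidence: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for the absolute difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def is_statistically_significant(conv_a: int, n_a: int, conv_b: int, n_b: int) -> bool:
    low, high = diff_interval(conv_a, n_a, conv_b, n_b)
    return low > 0 or high < 0  # the interval excludes zero


def is_practically_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                               min_worthwhile_diff: float) -> bool:
    low, _ = diff_interval(conv_a, n_a, conv_b, n_b)
    return low >= min_worthwhile_diff  # even the pessimistic end is worth having
```

With a million visitors per variant and conversion rates of 5.00% versus 5.08%, the interval excludes zero, so the result is statistically significant. If only a lift of at least half a percentage point (`min_worthwhile_diff=0.005`) would justify the change, it is not practically significant.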

Not running full cycles

Another nail in the coffin is not running your tests in full cycles, be they weeks, months or, in extreme cases, quarters.

Peep Laja from ConversionXL explains why you should stick to full cycles:

“Did you start the test on Monday? Then you need to end it on a Monday as well. Why? Because your conversion rate can vary greatly depending on the day of the week.
If we [don’t] test for full weeks, the results [can] be inaccurate. So this is what you must always do: run tests for 7 days at a time. If confidence is not achieved within the first 7 days, run it another 7 days. If it’s not achieved within 14 days, run it another 7 days.”
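Peep’s rule translates into a simple scheduling step: estimate how many days you need to reach your sample size, then round up to whole weeks so the test ends on the same weekday it started. The traffic figures are hypothetical inputs, and the returned date is exclusive (stop the test when that day begins).

```python
import math
from datetime import date, timedelta


def planned_end_date(start: date, visitors_per_variant: int, daily_visitors: int,
                     variants: int = 2, min_weeks: int = 1) -> date:
    """Round the test duration up to whole weeks so it covers full weekly cycles."""
    days_needed = math.ceil(visitors_per_variant * variants / daily_visitors)
    weeks = max(min_weeks, math.ceil(days_needed / 7))
    return start + timedelta(weeks=weeks)


def extend_by_week(current_end: date) -> date:
    """If confidence isn't reached, extend the test by another full 7 days."""
    return current_end + timedelta(days=7)
```

A test starting on Monday, 30 November 2015 that needs 14,000 visitors per variant at 2,000 visitors a day requires 14 days, so `planned_end_date(date(2015, 11, 30), 14_000, 2_000)` returns Monday, 14 December 2015.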

Everyone’s calendar is different, which means your Monday can look very different from mine. The same goes for the other days of the week, thanks to holidays, national events, world tragedies or team meetings.

Paul Olyslager:

“Run your test in weekly cycles. You smooth out short-term peaks and dips.”

Including returning visitors in a test

Just as the day of the week influences how people experience your website, so do returning visits.

Keith Hagen, Co-Founder of ConversionIQ, explains why:

“The worst mistake people commit when A/B testing is including returning visitors in a test.
This inclusion essentially changes a visitor’s experience and skews the results, especially at the beginning of a test.
This mistake will compound if the test accepts early results as valid. It’s critical to only include new site visitors in a test. Otherwise, you are testing wrong.”
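A minimal sketch of Keith’s advice, assuming you record when each visitor was first seen (for example in a first-party cookie or your user database; the `first_seen` field is hypothetical). Only visitors whose first visit falls on or after the test start are enrolled, so people who already knew the old design are left out, while new visitors who come back during the test remain eligible.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Visitor:
    visitor_id: str
    first_seen: datetime  # hypothetical: timestamp of the visitor's very first visit


def is_eligible(visitor: Visitor, test_start: datetime) -> bool:
    """Enroll only visitors whose first visit happened after the test started."""
    return visitor.first_seen >= test_start


def enrolled(visitors: list[Visitor], test_start: datetime) -> list[str]:
    return [v.visitor_id for v in visitors if is_eligible(v, test_start)]
```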

Running too many tests at once

Scott Belsky explains what this means:

“The worst mistake people make when A/B testing is testing more than one change at once, and then falsely attributing performance to the wrong change.”

The upside of running several tests at once is that it saves time.

The downside is that it can ruin the results of every one of them. A/B tests should be treated separately: if a visitor sees one test on the front page and then another on the pricing page, both shape their browsing experience.

Peep Laja from ConversionXL points out that running tests simultaneously is mainly a problem when you expect them to interact strongly with each other; even so, it’s better not to run too many tests at once.
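If you do need several tests live at the same time and worry about them contaminating each other, one common option is to make them mutually exclusive, so each visitor takes part in only one. The sketch below does this by hashing a visitor ID; the experiment names and the salt string are hypothetical, and SHA-256 is simply one convenient stable hash.

```python
import hashlib

# Hypothetical experiments that must never overlap for the same visitor.
EXCLUSIVE_EXPERIMENTS = ["homepage_hero", "pricing_table"]


def bucket(visitor_id: str, salt: str, buckets: int) -> int:
    """Map a visitor ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def experiment_for(visitor_id: str) -> str:
    """Place each visitor in exactly one experiment from the group."""
    index = bucket(visitor_id, "exclusive-group-1", len(EXCLUSIVE_EXPERIMENTS))
    return EXCLUSIVE_EXPERIMENTS[index]
```

The trade-off is traffic: each test only sees its share of visitors, so reaching the required sample size takes proportionally longer.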

Relying only on client-side testing tools

Paul Olyslager raised a very interesting point when discussing the effectiveness of the tools used for A/B testing:

“A lot of A/B testing tools on the market work client side. With the help of a WYSIWYG, CSS and JavaScript editor, you can put your A/B test together. Although there isn’t anything wrong with such tools, my personal experience is that they never live up to expectations.
You always see a ‘glitch’. The website loads, after which the JavaScript and CSS are injected. You can minimize the glitch with a few tricks, but it is always there. Bigger A/B tests require a lot of CSS and JS changes. This doesn’t only cause performance problems; it can also become very complex to handle and therefore prone to error. If you add, for example, responsive design into the equation, it soon becomes a complete mess, no matter how you structure your code. If your website uses a server-side scripting language such as PHP, most of your JavaScript changes become useless, since they need to be rewritten in PHP.
Server-side A/B testing has a lot of advantages. For starters, you don’t have the glitch I talked about earlier, because you are sending your visitors to two different test pages. You don’t have to rewrite a few thousand lines of code. Once the test results are in, you can deploy the winning variation and its code in a matter of minutes.
Think about the workload you would save.”
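On the server, the split Paul describes can be as simple as assigning each visitor to a variant deterministically and rendering a different template for each. This is a framework-agnostic sketch: the template file names and the experiment name are hypothetical, and hashing the visitor ID is one common way to keep the assignment stable across visits without storing extra state.

```python
import hashlib

# Hypothetical templates for the two test pages.
VARIANT_TEMPLATES = {"A": "checkout_a.html", "B": "checkout_b.html"}


def assign_variant(visitor_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Stable assignment: the same visitor always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:6], "big") / 2**48  # uniform in [0, 1)
    return "B" if fraction < share_b else "A"


def template_for(visitor_id: str, experiment: str = "checkout-redesign") -> str:
    return VARIANT_TEMPLATES[assign_variant(visitor_id, experiment)]
```

Because the page is chosen before anything is rendered, there is no flash of the original design; once a winner is picked, rolling it out amounts to serving that template to everyone.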

Glitches are common, that is true. If you want to test complicated changes, make sure the implementation is properly optimized so your test doesn’t end in disaster. If you want to test something small, however, client-side tools are generally safe.

Not turning data into decisions

This is the final and most important step: analyzing your data. Once your A/B test is finished, you must be one hundred percent sure you know why the variation won or lost.

Averages across your overall traffic will not tell you why. You need to segment your traffic according to your target audience, for example by sending your data to Google Analytics and setting up custom variables.
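Whatever tool you send the data to, the underlying calculation is the same: conversion rate per variant within each segment, instead of one overall average. The sketch below assumes you can export rows of (segment, variant, converted); the segment labels are hypothetical.

```python
from collections import defaultdict
from collections.abc import Iterable


def conversion_by_segment(
    rows: Iterable[tuple[str, str, bool]],
) -> dict[tuple[str, str], float]:
    """Return the conversion rate for every (segment, variant) pair."""
    visits: dict[tuple[str, str], int] = defaultdict(int)
    conversions: dict[tuple[str, str], int] = defaultdict(int)
    for segment, variant, converted in rows:
        visits[(segment, variant)] += 1
        conversions[(segment, variant)] += int(converted)
    return {key: conversions[key] / visits[key] for key in visits}
```

A variant that wins overall can still lose in an important segment such as mobile visitors, which is exactly the kind of “why” an overall average hides.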

Bart Mozyrko from UsabilityTools explains how to enhance that data even further:

“The biggest mistake is launching and finishing A/B tests without basing your decisions on solid data about user interactions with your website or product.
Your A/B test is only as good as your hypothesis.
When you test, you need to ask yourself several questions:
How do users browse from page to page? What behaviors are tied to conversions or customer struggle?
How do you empathize with your users to see what’s working and what’s not, and most importantly, why?
Sometimes I’m blown away by my colleagues’ stories. But in most cases, your first tests don’t bring the desired outcome. Launching new ones without understanding what went wrong in the first place is a waste of time and precious traffic. That’s why we created UsabilityTools.”

Final word

As you can see, there are plenty of ways in which you can mess up your A/B testing.

It ain’t easy being a researcher like that. You need a full understanding of what happens and why before declaring a winner in your test. Hopefully these points will help you test better and avoid these mishaps in the future.

Now go and test: it’s the only way to improve.


Originally published at blog.usabilitytools.com on December 2, 2015.
