It’s rare to find a product manager who doesn’t talk about A/B testing in some shape or form. It can be a polarising subject: either seen as the gold standard for validating product decisions, or a source of so much confusion about how to interpret the results and what they are telling us that it paralyses the team. There is a wealth of tools on the market now, and with the introduction of Google Optimise it almost feels like A/B testing should be part of the mainstream toolkit for product development.
The reality, however, is that it is very rare to hear someone talk about using A/B testing well, especially within the product world. Successful examples are the exception rather than the rule, and I’ve met more people who have been scared off by the complexity of this statistically heavy world than people who have embraced it. There are specialists out there, but they normally work within Conversion Rate Optimisation teams or as a subset of an analytics team.
It doesn’t have to be this way, though. With a little bit of knowledge, A/B testing can be an incredibly useful tool; you just need to understand a few basics. I want to share a few things that I have found key to implementing a successful A/B test.
Types of Test
It helps to first understand the different sorts of tests that you can run. I’ve seen these terms used in a lot of different ways, especially among newcomers to the field, so to start with, focus on the following three definitions:
A/B test
This is a test of one experience against another: directly comparing two different user experiences, changing a set number of elements on each page consistently. Think of this as your classic Coke vs. Pepsi high street challenge. The customer isn’t aware that they are seeing a different variant. Normally we would use this sort of test to understand how behaviour changes in one experience against the status quo.
A/B…n test
The same as the above test, but instead of two variants we have multiple variants. One famously huge version of this was Google’s testing of which blue to use on search results pages; unable to decide, they tested 40 different shades! Other uses include different arrangements of buttons, form layouts, or settling a choice between two competing designs. Beware, however: while testing multiple designs feels like a time saving, it often just means you need a lot more traffic (see the next section).
Multivariate test
I’ve seen a lot of people confuse this with an A/B…n test. Multivariate means simply that we are testing multiple variables within one test; the aim is to work out which parts of the test are driving the change. For example, if you had three designs for a menu bar and three different layouts of a homepage, you could test them all at the same time to work out which combinations work best together. You might also discover that the menu bar makes little difference to the user compared with the homepage layout. This sounds great, but in my experience there are two main problems:
- You are actually running a lot more variations. In the above example you would have 9 different combinations to test (3 menu bars × 3 layouts), and that gets complicated quickly. You also need a lot more traffic to be able to differentiate between the alternatives.
- The results are really hard to understand and interpret. It’s all very well to be able to pull the results from a testing tool onto a pretty slide, but at some point someone is going to ask what they mean!
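To make the combinatorics concrete, here is a minimal sketch (the variant names are hypothetical, just standing in for the menu bars and layouts above) showing how quickly the variations multiply:

```python
from itertools import product

# Hypothetical variants: three menu bar designs and three homepage layouts
menu_bars = ["menu_A", "menu_B", "menu_C"]
layouts = ["layout_1", "layout_2", "layout_3"]

# A multivariate test has to serve every combination of the variables
combinations = list(product(menu_bars, layouts))
print(len(combinations))  # 9 distinct experiences to split your traffic across
```

Add a third variable with three options of its own and you are suddenly splitting traffic 27 ways, which is why multivariate tests are so traffic-hungry.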
My advice would be to avoid multivariate tests until you’ve had a bit of experience with simpler A/B tests and you’ve got good support and buy-in from your organisation.
Getting results — the importance of traffic
I briefly mentioned traffic. One of the first things to understand about using tests to ‘prove’ a hypothesis is that you need traffic. Lots of it. We’ll go over the reasons shortly, but the short version is that you need a lot of traffic because that’s how statistics works.
When you are using an A/B testing tool you are actually using statistics to show that something you have changed within your product has changed the behaviour of your users and the outcomes that they reach. Often this is a metric like conversion or click-through rate.
The problem is that the things we change often lead to only small shifts in those outcomes. Changing the colour of a button might increase click-through rate, but probably only in the order of a 5–10% change. Statistics measures the probability that the difference you observe was caused by your change rather than by random variation. If the change is large, you might not need many users to show that the result is significant, because there is little overlap between the test and control groups, but far more often the opposite is the case.
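Under the hood, most testing tools are running something like the calculation below. This is a minimal sketch of a standard two-sided, two-proportion z-test (a textbook formula, not any particular vendor’s implementation), using only the Python standard library:

```python
import math
from statistics import NormalDist

def two_proportion_p_value(clicks_a, visitors_a, clicks_b, visitors_b):
    """p-value: the probability of seeing a difference at least this large
    between the two groups if the change had no real effect."""
    p_a = clicks_a / visitors_a
    p_b = clicks_b / visitors_b
    # Pooled rate under the assumption both groups behave identically
    p_pool = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same 5% -> 5.5% lift in click-through rate, at two traffic levels:
small_sample = two_proportion_p_value(50, 1_000, 55, 1_000)
large_sample = two_proportion_p_value(2_500, 50_000, 2_750, 50_000)
```

With 1,000 visitors per variant the p-value sits far above the usual 0.05 threshold, so the lift could easily be noise; with 50,000 visitors per variant the identical rates become clearly significant. That is exactly why small expected lifts demand large samples.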
Let’s take a couple of examples to illustrate this. Say we decide to test a new route to work: instead of taking a bus for the last mile, we are going to cycle. To test this we take two colleagues who both make this journey and set them off at the same time every morning for a week. The bus takes 30 minutes, with the longest journey 5 minutes longer and the shortest 5 minutes shorter. Cycling takes 10 minutes, with just a one-minute variance either way. In this case it is obvious that cycling is quicker, because it was consistently quicker than the bus.
But consider now an alternative set of results. Say on one day the cyclist got a flat tyre and his journey took 40 minutes. Can you still be certain that it is quicker to cycle? How frequently should you expect to get a flat tyre? What if the bus’s shortest journey is 5 minutes? Can you still be confident that the result is significant?
By sampling a larger set of data we can estimate the impact of the random variation within our experiments. It helps us filter out the random events and flat tyres of everyday life and get to an estimate of the real impact. In other words, we can make sure our results aren’t skewed by the one-in-a-hundred event.
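The commute example can be simulated in a few lines of Python (the journey times and flat-tyre odds are made up for illustration). Averaging over more days pulls the estimate back towards the true expected journey time, flat tyres included:

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def bus_journey():
    # 30-minute bus ride, up to 5 minutes quicker or slower
    return random.uniform(25, 35)

def cycle_journey():
    # 10-minute ride, a minute either way, plus a 1-in-100
    # chance of a flat tyre adding 30 minutes
    minutes = random.uniform(9, 11)
    if random.random() < 0.01:
        minutes += 30
    return minutes

def average_time(journey, days):
    return sum(journey() for _ in range(days)) / days

# One unlucky week can make cycling look slow; a year of data cannot
week_estimate = average_time(cycle_journey, 5)
year_estimate = average_time(cycle_journey, 365)
```

Run this a few times with different seeds and you’ll see the five-day estimate jump around while the 365-day estimate settles close to the true expected time of about 10.3 minutes (10 minutes plus the occasional flat tyre).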
A useful tool for exploring these ideas is Optimizely’s sample size calculator. It helps estimate the amount of traffic you need to get a significant result: you input the baseline metric and the minimum level of change you want to be able to detect, and it gives you an estimate of how much traffic to plan for. Remember that this is a per-variation estimate, so if you are running two experiences you need to double the number, and three experiences means tripling it. I use this tool very early in the planning process to decide whether I should even think about A/B testing. If I’m looking to detect a 1–2% change to the baseline metric, the answer is generally no, because the traffic required is too high!
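If you want a rough feel for the numbers a calculator like this produces, here is a sketch using the generic textbook formula for a two-proportion test (not Optimizely’s exact method, which uses its own statistical engine):

```python
import math
from statistics import NormalDist

def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift
    in a conversion-style metric with a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 5% baseline needs roughly
# 31,000 visitors per variant; halve the detectable lift and the
# requirement roughly quadruples
n_10pct_lift = visitors_per_variant(0.05, 0.10)
n_5pct_lift = visitors_per_variant(0.05, 0.05)
```

Because the required traffic grows with the square of the shrinking effect size, chasing a 1–2% lift quickly becomes impractical for anything but very high-traffic sites.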
What to measure
Finding the right things to measure is as important as deciding on what to test, and the key to an effective measurement plan is to have a robust hypothesis. As a starting point I would suggest using a template to frame your hypothesis that looks something like:
If we … (What we are going to change)
Then … (Explain what you want to happen as a result of the change)
Because … (What it changes for the user)
The ‘then’ statement should capture the primary metric that you are looking to change, as well as the level of change you would expect to see. This becomes the defining point of a successful test. (Strictly speaking, the statistical test works by trying to reject the null hypothesis: the assumption that your change makes no difference at all.)
What do I mean by primary metric? Often this will be a measure of the number of people who complete the next step in a process. For example, testing changes on a product page of an e-commerce site would probably change the number of people clicking the add-to-basket button. Ideally it should be directly related to the change you are making to the product. So in the above example we wouldn’t use overall site conversion as our primary metric, because there are too many other variables in play.
As another example, if you are looking to improve the impact of a homepage or landing page, you might look at reducing the bounce rate rather than the overall revenue per visitor. The primary metric should show the direct causal result of the thing you are trying to do.
Secondary metrics are also important, but they are exactly that: secondary. The further away they are from the thing you are testing, the harder it is to prove they are connected to the test you are running. I’ve also generally found that they take a lot more traffic and time to reach a significant answer. They are great for completing the picture, and may help you create other hypotheses for future testing, but they shouldn’t be used to define whether a test variant is a winner or a loser.
Let me give you an example. I once tested two different executions of search result page layouts. My hypothesis was something like this:
If we change the layout of the search results to show more product attributes then we will see an improvement in conversion rate because users will find it easier to identify the product that they want to purchase.
I ran the test and saw no impact on conversion rate. I did, however, see a significant reduction in the number of visits that included a product page. That set off alarm bells, so we dug into the results and noticed that when users did reach a product page they were more likely to add to basket than in the control group. The overall conversion rate showed no change, probably because the net impact was too small to detect. The result? Probably best described as inconclusive.
If I were to run the test again I would change the hypothesis to more closely relate to the customer experience. Something like this:
If we change the layout of the search result page then we will increase the proportion of customers finding the right product first time, because users will be able to select appropriate products earlier.
In this example I would measure the proportion of customers who visited a product page and added that product to their basket. I’d also keep an eye on the overall impact of the change by looking at the total site add-to-basket rate. Why? Perhaps I am displaying the wrong information on the results page: I might be listing colour when size or another specification is more important. By not showing it at all, and encouraging users to visit the full product page, I might convince some users to purchase the product through the copy, additional pictures and perhaps reviews.
A losing test isn’t a bad result
In the search results example I got an insignificant result, but that isn’t necessarily a bad thing. Remember: you are running a test as an experiment, trying to understand why something is happening. More often than not your tests will come back insignificant. That isn’t failure; it is just telling you that the change didn’t make a measurable difference. You can then go back and try to understand what happened by looking at the data in more detail, including the secondary metrics.
Sometimes an insignificant result can also be seen as a successful result! For example, if you implement stronger security rules on a password field and it shows no drop-off of users completing the registration process then you can say that you have increased security without impacting conversion rate!
Likewise, a negative result isn’t necessarily a bad thing either: all it is saying is that the experience didn’t make the change you expected. You need to go back to the drawing board, but think of it from a business point of view. Through a simple test you have avoided building functionality that would have had a negative effect on the business!
Remember, more often than not your tests will be inconclusive, especially as you start using A/B tests for the first time. There are many factors that can influence a customer’s decisions; you are testing because you want to put some data behind those decisions. My one piece of overriding advice is to ask yourself whether the thing you are testing actually matters to users. Google may have the luxury (and the traffic) to test 40 shades of blue, but is it really the thing that will make the biggest difference to your users?
Like anything in the product world, prioritisation is key. Don’t fall into the trap of testing everything first: if the feature or change is an absolute no-brainer, and you’ve validated it directly with users, there is little value in using an A/B test to further prove the point. It is in the more marginal decisions that you really get value from testing. Maybe there is a split decision in your user research, or two HiPPOs (highest-paid person’s opinion) who favour different solutions; that is where these tools can really help resolve issues.
Thinking about these three areas will give you a head start in getting an A/B test up and running. Divorcing yourself from the outcome, running the right type of test and having the right amount of traffic all make for a great starting point. I’ve found some of the posts on the Conversion Sciences blog very helpful in getting my head around the more complicated parts. The big A/B testing companies also have plenty of blog articles and videos to help you get started; just remember that they are probably going to try and sell you their products!