The Art of the A/B Test: Statistics 101 for UX Designers & UX Researchers
It seems like there is an increased desire for “T-shaped” people when hiring for designer roles. Companies are increasingly looking for candidates who have a strong expertise in at least part of the design process, and a broad understanding of everything.
So you’ve honed your craft, tried out a couple of new frameworks, played with the latest prototyping software, and doubled down on your double diamond process. Then you get to an interview and you get a question that stumps you: “How do you know when to call an A/B test?”
Forgive me, dear reader, if you are in fact a split-testing prodigy. In that case, this post is likely not for you. However, I bring up this scenario because I have been on the other side of the hiring table when a candidate becomes flustered trying to answer this exact question. My company deeply believes in balancing qualitative research with quantitative research, and we want designers and researchers who are able to work in both areas. From my own experience hiring, we see a lot of people who know their stuff on the qualitative side: they know how to run a non-leading and empathetic user interview or conduct a lean and effective guerrilla usability testing session; but when it comes to the quantitative side, their knowledge can be a little thin. To any of you who haven’t had sufficient experience with A/B testing before, and the statistical methods used to evaluate split tests, this post is for you! Ready? Let’s dive in.
Note: Analysis of an A/B test is only one of many quantitative analysis tools available to researchers and designers, but it is the only one this article will discuss.
What is an A/B test?
Let’s start simply and define what an A/B test is and why we might want to use one. An A/B test (or split test) is a test in which we show users two or more variations of a design and measure some metric to test a hypothesis. It is called an A/B test because the simplest version involves two variants: Variant A and Variant B.
Let’s pretend you are a UX designer hired by Ollivander’s Wands for Witches and Wizards. They have hired you to optimize their new mobile e-commerce site, and want you to start by trying to get more customers adding things to their cart. You create a user journey map and identify the product page as a great place to start. After several interviews with users who match the persona you are designing for, you create the hypothesis that adding testimonials to the product page will help shoppers alleviate their concerns about whether a particular wand will work for them. Here are two designs you would like to test against each other:
Let’s establish some basic nomenclature we’ll use from here on out. The design on the left is our control variant (or just control), the design on the right is our test variant (or just test).
Setting up your first test
Great! Now it is time to set up the test. Usually, the tool used to manage A/B tests will already be established in your organization, but if you’re pushing split testing as a new pursuit within your company (you brave designer, you), then some common options are Optimizely and Google Analytics.
Ok, so what are we trying to learn? Determining the right metrics to monitor in advance is key; otherwise, we will find ourselves swayed by potentially irrelevant data when we try to call the test later on. For this test, we want to measure the number of people who click the “Buy Now” button, since that is the user action that we believe can be influenced by showing a testimonial. We could also choose the number of people who make a purchase as the metric we measure, but that would force us to run the test longer (more on that later).

This is where I’m going to introduce my first ugly statistics term, which will likely give you painful flashbacks to tenth-grade math class: in statistics, when running an A/B test, we are aiming to disprove our null hypothesis. What the heck is that!? It sounds like a cheesy ’80s sci-fi flick. A null hypothesis is simply a hypothesis that our test variant is no better than our control variant. With the split test we are setting up, we aim to disprove the null hypothesis by showing that the test variant is a statistically significant improvement over the control. In fact, your real-world hypothesis is always the opposite of the null hypothesis, so it is sometimes called the alternative hypothesis. So in this example, our null hypothesis is that no more people will click on the call to action in the test than in the control, and our alternative hypothesis is that more people will click on the variant with the testimonial than without. Do we need to formally state our null and alternative hypotheses like this every time we set up a test? Not really. But if you’re ever speaking to an optimization manager or a data scientist, they may throw these terms into the conversation, so they are worth knowing (plus you’ll get bonus marks in your next interview if you use them correctly in a sentence).
So how should we split up the traffic? 50/50? 80/20? 99/1? Well, it depends. There are two competing priorities here: time to completion of the test and impact on your business while the test is running. If you give each variant 50% of traffic, you are going to minimize the amount of time it takes to find out whether your test is a winner or not, but you’re also exposing your business to the maximum amount of risk: if the new variant performs worse (and most of them will), you’re going to average down your site’s performance for the duration of the test¹. Let’s be really conservative then! Show the variant to 10% or 1% of traffic and we’ll minimize the impact, right? Sure, but it could take you years to prove or disprove your hypothesis. There’s no hard and fast rule for this one; you have to weigh your testing priorities against your business priorities and find out what works for you. That said, unless you’re working on a site with a ton of traffic, most designers will find they need to run tests on at least 25% of traffic to have a test complete in a reasonable amount of time. Note: there are testing strategies like the multi-armed bandit which try to minimize the negative impact and testing time at once, but they are beyond the scope of this article.
Two more thoughts on splitting your traffic: you need to make sure the split is random, and you need to make sure the variant shown to any given user is consistent. Showing one treatment to users on Chrome and the other treatment to users on Safari won’t tell you anything useful. The two groups you are comparing have to be randomly selected; otherwise, biases in their behavior can skew your results. And if a user visits the Ollivander’s Wands website from work, sees the version with the testimonial, then waits until they are home to actually complete the order and sees the version without the testimonial, it will confuse the user and muddle your data. Commercial tools like Optimizely and Google Analytics will handle both of these considerations for you, so you only need to worry about them if you’re building your own split testing system.
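If you do end up building your own system, one common way to get a split that is both random-looking and consistent is to hash a stable user identifier. The snippet below is only a minimal sketch of that idea: the function name `assign_variant`, the experiment label, and the user IDs are all made up for illustration, and it assumes you have an identifier that follows the user across devices (such as an account ID).

```python
import hashlib


def assign_variant(user_id: str, experiment: str, test_share: float = 0.5) -> str:
    """Deterministically bucket a user into 'control' or 'test'.

    The same (experiment, user_id) pair always lands in the same bucket,
    so a returning visitor keeps seeing the same variant.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex digits into a number in the range [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_share else "control"


# Hypothetical usage: the same user gets the same answer on every visit.
print(assign_variant("user-42", "testimonial-test"))
print(assign_variant("user-42", "testimonial-test"))
```

Including the experiment name in the hash means a user who lands in the test group for one experiment is not automatically in the test group for the next one. If all you have is a per-browser cookie, the work-versus-home scenario above can still show the same person two different variants.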
Reading the Results
Let’s say you’ve set up your test to be split 50/50 between your test and control variants. You’ve tested both variants to make sure everything works properly (you wouldn’t believe how many tests have to be restarted because the test variant is broken in some way and gives you unusable data), and you set the test live. Congrats! This is now the fun part. Every day you get to come in and check the progress of your test while sipping on a delicious cappuccino. If you’re using a tool like Optimizely or Google Analytics to run your test, they will usually tell you when your test has reached significance: when your data has proven statistically that your null hypothesis is false, that your test variant is better or worse than the control. We’re going to discuss what is going on under the hood of the popular split testing tools, so you will:
- Know better how to watch out for issues;
- Be better able to explain the results to stakeholders or teammates;
- Be able to call a test if your company is using an internal tool that doesn’t automatically calculate whether a test is significant.
Time for a quick stroll down memory lane. Remember this funny looking graph? The one that your math teacher Mrs. Harris insisted would be extremely important when you graduated, but you couldn’t quite bring yourself to believe her? Well, if you’ve read this far, I guess she’s had the last laugh because she was right. This is called a Bell curve, and it represents something called the normal distribution, or Gaussian distribution (the two names mean the same thing). Let’s say you’re measuring the height of everyone in a city (we’re really tickling something deep in your memory now, aren’t we?). The average, or mean, height is represented by the vertical line that splits the bell curve in half, and the distribution of every other height measured falls under the curve. What does the shape of the curve tell you? It means that the farther you get from the mean height, the fewer individuals there are with that height. This makes intuitive sense: you see lots of men on your daily commute in the range of 5' 7" to 6' 1", but not that many who are under 5' or over 7' (the average height of American males is about 5' 9"). It turns out many different measurements of the natural world follow a distribution curve like this one, including visitors to your site completing a particular action! When we highlight a particular area under the curve, like the darker region in the figure above, we are calling out a particular percentage of the population, in this case, 95%. The remaining 5% is split evenly between the lighter blue areas in the left and right arms of the curve.
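If you would like to see that 95% for yourself, the quick sketch below uses Python’s built-in `statistics.NormalDist` to measure the area under a standard bell curve. The cutoff of 1.96 standard deviations on either side of the mean is the usual textbook value for a central 95% region; it is not a number taken from the figure.

```python
from statistics import NormalDist

# A standard bell curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the curve within 1.96 standard deviations of the mean.
central_area = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

# Whatever is left over is split evenly between the two tails.
tail_area = (1 - central_area) / 2

print(f"central area: {central_area:.4f}")
print(f"each tail:    {tail_area:.4f}")
```

The central area comes out at roughly 95%, with about 2.5% left in each arm of the curve.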
I know what you’re thinking: you’re starting to sound like Mrs. Harris, what does this have to do with anything and why is it useful?
Let’s jump back to the Ollivander’s Wands example. You’ve checked your data after two days of testing and see the following results:
Sweet! You’re done! You’ve managed to roll out a design change that results in an absolute boost of 0.86 percentage points! Great success!
Hold on there amigo, not so fast. How do we know the effect we’re seeing is representative of the whole population — i.e. all of your traffic? How do we know that the results you’re seeing today aren’t just a random lucky streak for the test variant? Let’s jump back to bell curves.
Let’s say the results from the control variant are represented in the graphic above by the blue Bell curve, and the test variant is represented by the red Bell curve. The lines down the middle represent our mean click rates for each group, 10.79%, and 11.65%, respectively. See the area where the two Bell curves overlap? This represents the probability that the improvement we are seeing is just chance, and that the test is not an improvement over the control. We want to minimize this area so that we can be sure that the effect we’re seeing is real, and not just due to chance.
So how small should that area of overlap be? It turns out there is no single right answer. The area of uncertainty will never equal 0% unless we run the test forever, and who has time for that? So we have to pick a percentage we are comfortable with. I most often shoot for 5% or less, but I’ve seen others use 10%, or even 1% in industries where the cost of being wrong is very high, like healthcare. Ok, so if we choose a 5% or less chance that we are wrong, that means we are 95% sure the test is better than the control.² This 95% figure is, strictly speaking, your confidence level, though you will often hear it called your confidence interval, which is the term I’ll use here. It is much more common to talk about your confidence interval than to talk about the chance you are wrong, so we’ll use that number from here on out. So what is our confidence interval for the numbers in the table above after two days of testing? I use my favorite A/B testing calculator to find out.
Plugging in the numbers from the table (protip: your control numbers go in the Visitors A and Conversions A inputs), and choosing a confidence interval of 95%, the calculator shows me right away that the test is not significant. But how close am I to 95% confidence? The answer to this is hidden under a new term: p-value. For this test, it is showing a value of .1462, or 14.62%. This number corresponds to the area of overlap of the two Bell curves we talked about above, or the chance that the effect you’re seeing is from pure chance. So our confidence interval at the moment is 100%-14.62% = 85.38%. Not bad, but I’d prefer more confidence before rolling this change out to all my traffic.
Notice how we left the hypothesis type in the calculator settings as one-sided? What is that all about? Well in most split tests we do for the web, we generally are interested to see if our test is better than the control, and if it isn’t we don’t really care if it is worse or just the same — we know the direction we’re interested in proving, and we will probably discard the test variant unless it is significantly better. In some situations, you might be interested to see whether a test is better or worse, and you don’t know which direction it will go. Let’s say as an example we are swapping all our PNG icons for vectors, and we want to make sure that there is no effect on our conversion rates, we might prefer to run a two-sided test to detect a change in either direction. The drawback is that we have to split our uncertainty (overlapping areas of the curve) in two directions, so we will have to run a test longer than if we only care about one direction. Most of the time, you can just stick with a one-sided test, as long as you’re just trying to show that the test variant is better.
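To make the one-sided versus two-sided distinction concrete, here is a rough sketch of the kind of calculation an A/B test calculator might run: a two-proportion z-test. The function name `p_values`, the choice of an unpooled standard error, and the visitor counts are all assumptions made for illustration. They are not the numbers from the Ollivander’s test, and your calculator may use a slightly different formula.

```python
from math import sqrt
from statistics import NormalDist


def p_values(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return (one_sided, two_sided) p-values for 'B converts better than A'."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Standard error of the difference between the two rates (unpooled).
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_a
                     + rate_b * (1 - rate_b) / visitors_b)
    z = (rate_b - rate_a) / std_error
    # One-sided: only the chance of a lucky streak in B's favor counts.
    one_sided = 1 - NormalDist().cdf(z)
    # Two-sided: a surprise in either direction counts.
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_sided, two_sided


# Made-up counts: 1,000 visitors per variant, 100 vs 120 conversions.
one_sided, two_sided = p_values(1000, 100, 1000, 120)
print(f"one-sided p-value: {one_sided:.4f}")
print(f"two-sided p-value: {two_sided:.4f}")
```

When the test variant is ahead, the two-sided p-value is exactly double the one-sided one. That is why a two-sided test needs more data to clear the same 5% bar.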
A note about Sample Size
Two days later, we have updated results, shown in the table below:
The test click rate has improved a bit more relative to the control click rate, but have we hit 95% significance?
BOOM! The calculator shows that the test is significant, and we see a p-value of < 0.05, so we now have 95% confidence that our test is better than the control, right? Time to party? Well, almost. Turns out there is one more thing to check. Try the following exercise yourself using the A/B test calculator: what does it tell you if you put in 10 visitors for each of variants A and B, with 1 conversion for variant A and 5 for variant B? Go do it right now, I’ll wait…
It shows it as significant with a p-value of less than 2%, right? Does that make intuitive sense to you? If the Ollivander’s Wands website is receiving thousands of witchy and wizardly visitors a day, and the test is showing significance after the first 20 visitors, would you feel confident moving 100% of the traffic to the winner? Hopefully not: we haven’t shown the test to enough people yet.
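If you don’t have the calculator handy, the exercise can be approximated with a few lines of Python. This is a sketch under two assumptions of mine: that the calculator runs a one-sided z-test with an unpooled standard error (which is consistent with the under-2% figure above), and that a common textbook heuristic of at least 10 conversions and 10 non-conversions per group is a fair sanity check for the bell-curve approximation. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(visitors_a, conversions_a, visitors_b, conversions_b):
    """One-sided p-value for 'B converts better than A' (unpooled z-test)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    std_error = sqrt(rate_a * (1 - rate_a) / visitors_a
                     + rate_b * (1 - rate_b) / visitors_b)
    z = (rate_b - rate_a) / std_error
    return 1 - NormalDist().cdf(z)


def enough_data_for_bell_curve(visitors, conversions, minimum=10):
    """Textbook heuristic: enough conversions AND enough non-conversions."""
    return conversions >= minimum and (visitors - conversions) >= minimum


# The exercise from the article: 10 visitors each, 1 vs 5 conversions.
p_small = one_sided_p_value(10, 1, 10, 5)
print(f"p-value: {p_small:.4f}")
print("variant A has enough data:", enough_data_for_bell_curve(10, 1))
print("variant B has enough data:", enough_data_for_bell_curve(10, 5))
```

The p-value clears the 5% bar, but both variants fail the sanity check, which is the code’s way of agreeing with your gut.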
There’s one more piece you need to check in order to be sure a test is significant, and that is your minimum sample size. Remember how we talked about those Bell curves being normally distributed? Well, in statistics there is a minimum population size we need to expose our test to in order to be sure that the control group and test group are both distributed normally, and that our whole methodology is even valid. It makes intuitive sense in the contrived example above with only 20 visitors but is easy to forget when you are talking about thousands of visitors… How many is enough? Luckily, there is a handy tool for this as well. I like the one offered by Optimizely.
Plugging our results into this calculator, we have a control variant conversion rate of 10.75%, and we are trying to measure a relative impact of 9.21%, which is the difference between our test variant click rate and the control click rate, expressed as a percentage of the control click rate. Note that we want relative impact, which can be found as an output on the A/B test calculator, or calculated by using the following equation:

relative impact = (test click rate − control click rate) / control click rate × 100%
And finally we are looking for 95% confidence. The calculator tells us we need 15,000 visitors per variation to truly be 95% confident that the results we are seeing are not just chance, so we would need to run this test another 5 or 6 days. Try playing with the relative difference between conversion rates (labeled minimum detectable effect). If we had a difference of 15% between control and test click rates, we’d only need 5,000 visitors per variant and the test would already be done! This means that the bigger the effect you see between your test and control, the fewer users you need to be sure the effect is real, which makes sense. Also, remember when we chose to make our tracked metric the number of clicks on the “Buy Now” button, and not the total number of purchases? The minimum sample size is the reason why. Imagine our site-wide conversion rate is 2%. If we put that in the sample size calculator instead of 10.75%, we would need a whopping 110,000 visitors per variant to detect a 9.21% boost! The reason is that far fewer people actually complete a purchase than click the “Buy Now” button, so it is harder to detect a real signal in the data. As a final note, it is worth considering whether the sample you have exposed your test to is representative of your entire population. If only 5% of your visitors are goblins, but they represent your most lucrative customers, you should check to make sure that around 5% of your test population are goblins.
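To get a feel for why the required sample size moves the way it does, here is a rough sketch using a classic textbook sample size formula for comparing two rates. It is not the method Optimizely’s calculator uses, so the exact numbers will differ from the ones quoted above. The function name `visitors_per_variant`, the one-sided test, and the 80% power setting are all assumptions I’m making for the example.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_effect,
                         confidence=0.95, power=0.80):
    """Approximate visitors needed per variant for a one-sided test.

    baseline_rate:   control conversion rate, e.g. 0.1075 for 10.75%
    relative_effect: minimum detectable effect, e.g. 0.0921 for 9.21%
    power:           assumed chance of spotting a real effect of that size
    """
    test_rate = baseline_rate * (1 + relative_effect)
    z_alpha = NormalDist().inv_cdf(confidence)  # one-sided cutoff
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + test_rate * (1 - test_rate))
    difference = test_rate - baseline_rate
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


n_observed = visitors_per_variant(0.1075, 0.0921)     # our click rate test
n_bigger_effect = visitors_per_variant(0.1075, 0.15)  # a 15% relative lift
n_purchases = visitors_per_variant(0.02, 0.0921)      # 2% purchase rate

print("10.75% baseline,  9.21% lift:", n_observed)
print("10.75% baseline, 15.00% lift:", n_bigger_effect)
print(" 2.00% baseline,  9.21% lift:", n_purchases)
```

The pattern matches the one described above: a bigger effect needs far fewer visitors, and a lower baseline rate needs far more.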
That’s it! That is the whole process you go through when doing a split test. We covered the happy path, which is when our test wins, but what about if the test isn’t looking good? How do we know when to call it? You could calculate the numbers as a two-sided test and see if you have significance in the losing direction. But usually, if you have a losing variant, you don’t want to leave it running for too long, since it’s hurting your overall conversion rate. In this case, I tend to kill tests early. If I have a couple thousand visitors in each variant, and my test conversion rate is lower than my control, it is unlikely the test will come back to be a significant winner. Statistically, there is a chance that you are throwing away a test that might eventually win, but I’m ok with that small risk to mitigate a negative effect on my site’s conversion rate and to get a new test out the door. And what if there is no appreciable difference between the test and control? It is a good idea to pick, before starting the test, the minimum effect that you would consider a success. For example, if Ollivander only wants to implement changes that yield at least a 5% improvement, and our test is showing that the improvement is likely only to be ~2% after reaching significance, we can kill the test and keep the control.
Thanks for sticking with me, it is a long subject to explain in detail, but one that will be invaluable to you as a designer.
¹ Some savvy readers will note that the risk to your business will likely be the same whether you run a losing variant on 50% of traffic for 1 day, or 10% of traffic for 5 days. However, this ignores the real business implications of cash flow and the human limitation on how often you will realistically check on a test. If the test variant performs 50% worse than the control, it is likely you will cut the test the next day when you check in on your data. At 50% of traffic, this could have a big impact for the day (a 25% reduction in sales), whereas at 10% it would have had a reduced impact on sales (5% of sales), and you would have learned just as quickly that it was a losing variant. There is also the chance that the losing variant is so horrible that it has long-term impacts that will affect the repeat business of your customers.
² An apology to any statisticians or data scientists reading this and groaning. I realize that this is not technically accurate, but for the audience addressed in this article, I believe this is a useful way to envision confidence intervals, and is practically correct.