“But it’s only 5 users, it doesn’t mean a thing” — are AB tests better than user testing?
“Should we do AB tests or user tests?”
“Why even bother with user testing when we can do AB tests and be sure?”
“It’s only 5 users, that doesn’t mean anything. It’s not statistically significant.”
Have you ever heard any of the above? I hadn’t heard them for a while, but recently they popped up again. It’s really sad to hear this at work, because both methods are tools that designers, product owners, and data analysts should know, and yet some people choose to use only one of them to answer all their questions.
For those of you who haven’t heard of AB testing: it is a method where you release two variants of a website or app into the real world. Different users get different versions, they don’t know about the other one, and you see which one performs better. It’s similar to clinical trials, where some people get a placebo and others the new drug. In the end, you get neat results summarized like “By redesigning the comment section, conversion to buying bananas has increased significantly (+23%)”. It’s clean, it’s simple to understand, there is no hesitation. Who could resist information like that?
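If you are curious what that looks like under the hood, here is a minimal sketch of how users typically get split between the two variants (the experiment name, user ids, and 50/50 split are made-up examples, not any particular tool’s setup):

```python
# A minimal sketch of deterministic variant assignment: each user is hashed to
# a stable bucket, so they always see the same variant and never know the
# other one exists. Names and the 50/50 split are illustrative assumptions.
import hashlib

def assign_variant(user_id: str, experiment: str = "banana-comments") -> str:
    """Deterministically map a user to variant 'A' or 'B' for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in 0..99
    return "A" if bucket < 50 else "B"      # 50/50 traffic split

print(assign_variant("user-123"))  # the same user always gets the same answer
print(assign_variant("user-456"))
```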
On the other side, there is user testing: qualitative research where people are asked to perform a realistic scenario on your product or prototype while explaining out loud what they do and why they do it. The rule of thumb is to test with 5–10 users. As an outcome, you get a list of problems and recommendations. In comparison to AB testing, where the number of users can be in the tens of thousands, it can seem weak and irrelevant. It seems like AB testing is just the more advanced method, doesn’t it? Well, it’s not.
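As a side note on why 5–10 users is the rule of thumb, here is a quick back-of-the-envelope calculation, assuming the classic problem-discovery model often credited to Nielsen and Landauer, in which each user has roughly a 31% chance of running into any given usability problem:

```python
# A rough sketch, assuming the classic problem-discovery model: the share of
# usability problems you expect to observe at least once with n test users.
def share_of_problems_found(n_users: int, p_detect: float = 0.31) -> float:
    """Probability that a given problem is seen by at least one of n users.

    p_detect is the assumed chance that a single user hits the problem;
    0.31 is the commonly cited average, not a universal constant.
    """
    return 1 - (1 - p_detect) ** n_users

for n in (1, 3, 5, 10):
    print(f"{n:>2} users -> ~{share_of_problems_found(n):.0%} of problems surfaced")
# With 5 users you already surface about 84% of the problems they can hit.
```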
Neither user testing nor AB testing is inferior. AB testing is about quantity, user testing is about quality. AB testing answers questions about which version performs better and how big the impact is. User testing answers questions such as whether users understand the solution and what they think about it. Each method has its benefits and limitations.
User testing
User testing is a method of observing users while they use your system. The biggest benefit is that users share their thoughts out loud, so you understand not only what they do but also why they do it. Therefore user testing works best for:
- Catching confusing places. You hear what users say while they try to finish the tasks you prepared for them. You will see them frown, bite their lips, maybe hear the occasional “f*ck” when they get frustrated, or catch a smile when they discover something pleasing. You will see them click on things that are not clickable, trying to find answers to questions you have not thought of. AB tests won’t tell you that, because it’s hard to measure something you were not aware of.
- Seeing and understanding the product the way users do. Users don’t read everything you write; they scan and assume. If something looks like an apple, they probably assume it is an apple, and you should know that if you are selling bananas. When you create a product, you know what actions are possible and what’s out of scope, and you know the technical and business boundaries. You may spell out everything to your users, but they still may not understand. They are just people, with mindsets and knowledge different from yours.
- Understanding how and when they would use your product. Even though people can’t be trusted to predict their actions, your users can have better or simply more concrete ideas about when and how they would use your product. It can help you a lot with advertising it or with understanding use cases better, e.g. users mentioning that they use bananas as an ingredient in homemade cosmetics when you had thought about them in a culinary context only.
Limitations:
- Purchase intent. User testing is based on creating something similar to a real-life situation, but it’s still pretending. People are not good at predicting their behavior. So even if 5 out of 5 users declare they will pay, you should treat this information with a pinch of salt, unless they want to give you money straight away.
- Small details. You design complicated systems, and each element may play a crucial role. With a sample of 5 users you won’t catch all the issues. You may not see whether, for example, changing the color of a button will increase conversion. You may spot that the button isn’t visible at all because of its color or placement, but if you want to know whether green or blue will work better, you are probably better off with AB testing.
- Users won’t be able to explain everything they do. Even though you may learn a lot about users’ mindsets, people perform some actions automatically, and when asked for an explanation they will make something up. So always put more focus on what they do than on what they say.
AB testing
AB testing is great for:
- Measuring conversion. Because the test runs in the real world, all the user motivations and other factors that may influence behavior are real. Imagine you are asked to choose between two types of cookies. The first option is sugar-free wholemeal cookies packed with raisins and nuts; the second is good old chocolate chip cookies. How different would your answer be depending on the circumstances? If a researcher in a lab asks you to choose, you may go with the healthy option just to please them, or maybe because the circumstances were right and you had just had a big breakfast. Even if they ask how often you choose the healthy option in a store, you may answer that you go with chocolate chip cookies only once a month and that usually you are a healthy-cookie person. But if you got a chance to measure your behavior, you might notice you actually pick the chocolate chip ones 5 times a month. It’s not that you were lying on purpose; we are just not very good at recalling our actions or predicting future ones.
- Optimizing details. Small things can matter a lot, but we are not necessarily aware of how and why they influence a decision. Did you know that flooring can influence our shopping decisions? That’s not something you would be aware of or would tell a researcher. For that reason, testing small changes makes sense. You may want to change the background color, the font size, or the size of the search bar; all of that can have an impact on your business, even though your users wouldn’t be able to point it out.
- Testing alternative patterns. Often there are two alternative patterns that competitors use equally. For example, you are not sure whether plans and the payment form should be split into two steps or kept in one. Should registration display all fields upfront or reveal them gradually? In user testing, you may learn that both approaches work equally well for users, especially if they are both quite common; in real life, one of the options may convert better.
- Measuring the impact of changes. Imagine you are adding music to your cookie shop. From a user test, you will get a general idea of whether your customers find it appealing and how they react, but before you invest in a music license, you may want to check whether they really will spend more time in the shop and buy more. Maybe they won’t, and buying the license will only be a cost that generates no profit for you (see the sketch after this list for how such a difference is usually checked).
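For the curious, here is a minimal sketch of how a “significant (+23%)” claim like the one at the beginning is usually verified: a two-proportion z-test on the conversion rates of the two variants. The traffic numbers below are invented for illustration.

```python
# A minimal sketch of a two-proportion z-test on AB test results.
# The conversion counts and the 10,000-users-per-variant split are invented.
from math import sqrt, erfc

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (lift, z, two-sided p-value) given conversions and users per variant."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                  # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))    # standard error
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))                          # two-sided p-value
    lift = (p_b - p_a) / p_a
    return lift, z, p_value

lift, z, p = two_proportion_z_test(conv_a=400, n_a=10_000, conv_b=492, n_b=10_000)
print(f"lift: {lift:+.0%}, z = {z:.2f}, p = {p:.4f}")  # ~ +23% lift, p < 0.05
```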
Limitations:
- No explanation for user actions. If your design is confusing, the results may be misleading. Your conversion rate may be going through the roof, but it doesn’t necessarily mean users love your product. It may be that you’re hiding the back button below the fold or wrote the important information about plans in a tiny grey font. Or, as mentioned above, maybe people thought they were buying apples instead of bananas; they will get bananas, and they will never come back to your shop again.
- It doesn’t show how hard it is to perform actions. AB testing won’t show you where your users make mistakes. If you have a good data analyst who sets up tracking, there is a chance you will see some hints that something is not working (maybe users hit the back button more often than they register), but it’s not the best method for finding confusion, as it doesn’t show the why and how behind users’ actions.
- No sign of things that weren’t your measurement targets. In AB testing you measure the performance of the elements you expect to change, but what about the things you haven’t thought of? Imagine you are testing a new door, so of course you measure how often users push the handle and how often they pass through the door. But what if they get stuck and, instead of going through the door, start going in through the window? What are the chances of you catching that behavior?
Both methods are equally good. They are different, and they support each other. Use both; just pick the right one to answer your questions.
The products we design are complex; they have many layers. The Internet world is constantly changing, new apps are popping up, and people’s expectations and mindsets are changing too. If we want to solve problems, we need to understand them well. If each experiment is one question you try to answer, you wouldn’t expect to find all the answers after a single question, would you? The only right solution is combining multiple methods. You should look at your problems from different angles to understand what you are really dealing with. Each method should support or contradict your hypothesis. If your experiments start producing similar findings, it means your research and understanding are on point. You don’t have to choose one method. You can choose multiple ones; actually, you should. It’s like a toolbox. Nobody would argue that a hammer is better than a screwdriver. You just use them on different occasions, and I’m sure you would love to have both at hand.