Testing Chronicles. Chapter II

Vanilla Thunder · Published in Bootcamp · 25 min read · Jan 8, 2024

This is the second part of our easy-to-follow guide on user testing and validating your ideas. In this section, we dive into the various types of tests, offer practical tips for conducting these tests effectively, and guide you through analyzing the results. For the basics, including an introduction to user testing, idea testing, and the initial preparation, be sure to refer to Chapter I.

User test types

Today, let’s explore the available testing methods or refresh your knowledge about them. To better grasp each method, I’ll provide a concise overview. I’ll start by explaining the purpose of each method, delve into its specific preparation and execution, describe the expected outcomes, and finally, touch on the potential challenges you might encounter.

Please note that when I discuss why a method is needed, it represents the most common use case. However, as I’ve mentioned earlier, these labels are conditional and advisory. Nothing stops you from exploring alternative applications, so feel free to experiment.

To offer a clearer understanding of what each test entails, I’ve curated live examples using our internal tool, Display. It’s currently entirely free, and I’d greatly appreciate any feedback you’re willing to provide. We’re all ears and continuously striving to improve :)

Questionnaires and Follow-ups

Admittedly, the first method isn’t quite on point: surveys belong more to qualitative and quantitative research than to testing, but testing without them is hard to imagine.

Why they’re needed

Even when conducting usability testing, which involves a clear script, after its completion there’s always a desire to ask something like, “So, how did it go?” These are called “follow-up questions”: questions asked after the test itself. They can enrich your data and provide a different perspective on the results. For instance, a usability test might be error-free, yet the interaction rating remains low. This suggests that users completed the scenario but rated the whole process poorly, so the problem lies in some other aspect of the UX umbrella.

How it works

In essence, there’s nothing complicated here. You pose a question, and the person responds. The key is to formulate the question correctly. As I’ve mentioned multiple times before, you first need to understand the hypothesis you’re testing and the data needed for that. If you can formulate a question directly — great, but it’s often helpful to break down hypothesis testing into multiple questions to better understand the reasons for the answers.

Let’s consider our case with the removal of the order confirmation page in the online store. After a scenario test, you want to gauge user satisfaction.

A plain “So, did you like it better?” is too general. Instead, ask the participant to rate the interaction on a five-point scale and explain their choice. Then you can ask what could be improved to earn a higher rating.

Moreover, you might need additional questions to sort participants into different categories, making it easier to identify correlations between parameters. For example, you can learn the participant’s role and then compare groups.

After formulating your questions, you need to choose the right way to answer them. For open questions, it’s straightforward: users will have to write something anyway. For this, you can use standard input controls or text areas, depending on the expected input. But with other types, you’ll need to think a bit more.

If you want users to choose from proposed options, take some time to consider them. There shouldn’t be too many, or you’ll run into Barry Schwartz’s paradox of choice, but not too few either, so diverse opinions are accounted for. Also, pay attention to the range of responses: align it with reality by excluding impossible or improbable options and providing an “Other” field. If it’s about NPS or CSAT (although they’re of limited use in testing), choose an appropriate scale (numbers, stars, or emojis) and decide how many options are available for selection. It all depends on how you plan to process the data you receive.

I won’t delve too deep into this topic to avoid turning this text into an article on conducting surveys. Nevertheless, I hope these tips will prove useful to you.

Tips

  • Whether it’s a moderated dialogue or a remote survey, don’t ask users about their future intentions, like “Would you buy such a product?” People tend to model their future experience and, like everyone else, aren’t particularly good at it. Instead, ask about their factual past experience.
  • Formulate the question as clearly as possible, avoiding terms and slang known only to one deity.
  • Avoid negations; if you’re talking about costliness, don’t say “not inexpensive,” say “expensive.”

Card sorting

Why it’s needed

Originally, this method was intended for validating hypotheses related to information architecture. However, it has also proven effective in other areas, such as prioritizing features.

How it works

At its core, this test employs a strictly defined set of entities. Most often, these are section names of a service or other elements of information architecture, referred to as cards.

The essence of the method lies in giving respondents this set and asking them to sort the cards into category boxes. Sometimes the “boxes” are predefined, and the user simply places the cards accordingly. Other times, respondents create their own categories based on their mental model of organization and label them themselves.

The fixed-categories approach proves beneficial when we cannot alter an established hierarchy or want to use the test for prioritization. In such cases, the cards could represent both product features and user needs we aim to fulfill. They’re then allocated across four buckets, in the spirit of MoSCoW prioritization (using product features as an example):

  • Must — a vital product feature without which the product loses its essence,
  • Should — competitive advantages that hold significant weight,
  • Could — nice-to-have features, but not indispensable,
  • Wouldn’t — features that should definitely be avoided.

Conclusion

The output is a qualitative model of one respondent’s perception of how the proposed elements should be grouped. However, aggregated over a sufficiently large sample, the data can be generalized and take on a quantitative nature.
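To make that aggregation tangible, here’s a minimal sketch of one common approach: building a similarity matrix that shows how often each pair of cards was grouped together. The card names and sorts are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Each respondent's open sort: a list of groups, each group a set of cards.
# Card names are invented for illustration.
sorts = [
    [{"Shipping", "Returns"}, {"Speakers", "Headphones"}],
    [{"Shipping", "Returns", "Headphones"}, {"Speakers"}],
    [{"Shipping", "Returns"}, {"Speakers", "Headphones"}],
]

# Count how often each pair of cards ends up in the same group.
pair_counts = defaultdict(int)
for sort in sorts:
    for group in sort:
        for a, b in combinations(sorted(group), 2):
            pair_counts[(a, b)] += 1

# Similarity = share of respondents who grouped the pair together.
for pair, count in sorted(pair_counts.items()):
    print(pair, f"grouped together by {count / len(sorts):.0%}")
```

Pairs that cluster near 100% are strong candidates to live in the same section of your architecture.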

Tips

  • If there are too many cards, don’t attempt to fit them all into one test. Limit the number of cards per respondent to keep their cognitive load manageable, and combine the results collected from the different subsets afterward.
  • By default, it’s advisable to let users create their own categories; otherwise, you’re pushing the respondent toward an expected answer.
  • If opting for card sorting for prioritization, provide clear category definitions; otherwise, you leave room for the respondent’s imagination.

Tree test

Why it’s needed

A close sibling to card sorting, it’s also known as reverse card sorting. Its purpose is identical — to assess the suitability of the information architecture you’ve created.

How it works

It’s the complete opposite of card sorting. Now, users are presented with a pre-built hierarchical structure, and their task is to find a specific item. There’s no sorting involved, only searching and selecting the desired item.

Conclusion

Through this test, we obtain a qualitative understanding of whether our information architecture works. Effectiveness shows up as short search times and few misclicks or incorrectly opened categories.

Tips

  • Clearly explain to the respondent what will happen during the test and what you expect from them. Otherwise, they might simply start opening everything they see.
  • Avoid hiding sought-after items too deeply; doing so significantly increases the test’s complexity.

Preference test

Why it’s needed

This test stands out as one of the simplest and helps identify which of the presented options respondents prefer. It’s suitable for comparing almost anything, except perhaps dynamic aspects like interactions.

How it works

There’s not much to say here either: users are provided with options to choose from, and they make their selection. Usually, after the test, there are follow-up questions aimed at clarifying the reasons for their choice or assessing the degree of differences.

Conclusion

The output is the distribution of respondents across the proposed options. What to do with this distribution will be discussed further on.

Tips

  • This test alone isn’t very informative; it provides a quantitative measure of preferences but doesn’t explain why they turned out that way.
  • Avoid offering more than 4 options, as more choices scatter respondents’ attention and dilute the resulting distribution.
  • Explore the tool in which you plan to conduct this type of testing and ensure you test how users compare and choose, as this could significantly impact the result.

5-sec test

Why it’s needed

This type of test is used to assess users’ attention distribution, for instance, how well-structured a layout is or which elements carry more weight. Similar to a preference test, this type of test doesn’t hold much meaning on its own and is evaluated in conjunction with subsequent questions.

How it works

As the name suggests, within 5 seconds, the respondent studies the provided image, after which the test concludes, and questioning can begin. If it was a page layout, questions can revolve around what element caught their eye first, what they remembered, or more specific inquiries about the arrangement of elements. Here, it’s not just about the test result but the clarifications that follow.

Conclusion

Nothing. All information is gathered through subsequent questions.

Tips

Don’t be deceived by its apparent simplicity; such a test can quickly reveal issues with poor visual hierarchy or an incorrect initial impression. So, don’t write off this test too quickly.

First click test

Why it’s needed

To determine how clearly users perceive controls and visual hierarchy.

How it works

Essentially, it’s a five-second test on steroids: there’s no time limit; instead, the respondent is given a task — to find specific information or perform a particular action — and make a corresponding click. Just one click. Hence, the name of this test.

The tasks are formulated to rule out accidental clicks: respondents are warned up front that they will have only one click.

Conclusion

Typically, you’ll receive a heat map of click distribution, from which you can calculate what share of users clicked where you expected them to.
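The arithmetic behind that share is simple. Here’s a toy sketch with invented click coordinates and an assumed target area, just to show the idea:

```python
# Invented click coordinates (x, y) in pixels from a first-click test.
clicks = [(120, 80), (130, 85), (410, 300), (125, 90), (118, 78)]

# Bounding box of the element we expected users to click (assumed values).
target = {"x": 100, "y": 60, "w": 60, "h": 40}

def hit(click, box):
    """True if the click landed inside the target box."""
    x, y = click
    return box["x"] <= x <= box["x"] + box["w"] and box["y"] <= y <= box["y"] + box["h"]

success = sum(hit(c, target) for c in clicks) / len(clicks)
print(f"First-click success: {success:.0%}")  # 80% with this toy data
```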

Tips

For mobile devices, ensure the platform accurately registers taps and scrolls; otherwise, you’ll get a distorted picture.

Alpha & Beta tests

Why it’s needed

To check the alignment of the final product with user expectations, identify bugs and deficiencies, and build a loyal user base to facilitate an easier official release.

How it works

Alpha and Beta tests are not among the most common tests I’ve described, because these methods require the product to be almost complete for proper execution. By the way, these aren’t standalone methods but rather phases of the same approach. The essence lies in giving a specific group of users access behind a closed perimeter (a closed server) so they can familiarize themselves with an early version of the product, provide feedback, and test the functionality.

The difference between Alpha and Beta is that the former is more bug-prone and resides on an internal server, tested by internal teams (developers, BAs, sales), while Beta invites real users to participate.

Conclusion

Following Alpha or Beta, you’ll receive a wealth of user feedback and uncover no small number of bugs. Additionally, you can use this feedback when creating the so-called rollout strategy, i.e., the production release scenario. By limiting the user segment and effectively handling all the feedback, you’ll garner an army of advocates who will promote your product for free and generate the necessary buzz.

Tips

  • Ensure proper selection of the test group. Diversify the group to cover as many scenarios as possible.
  • Use regression testing to ensure you’re not breaking anything while fixing past errors.

Multivariate Tests (A/B/n)

Why it’s needed

This test serves to compare the effectiveness of two or more design variations for the same task. It’s also known as Split-testing.

How it works

In essence, conducting such a test requires preparing several design variations, be it in a prototype or in production. Then, users are randomly allocated between these variations to assess which solution yields the best results. However, to ensure smooth execution, several key aspects need consideration.

Firstly, determining the benchmark for comparison is essential. It’s not always feasible to obtain an absolute victory for one variant since each may have its own strengths and weaknesses. Hence, it’s vital to forecast expected changes and prioritize metrics to avoid getting stuck when declaring a winner.

Secondly, define what you’ll be testing. Seems obvious? Not quite. Imagine facing the task: “Determine what’s better for showcasing products to increase conversion: a list or cards.” To obtain clear results, use the same dataset and context. Otherwise, you won’t confidently pinpoint what impacted your metrics. If you’re testing a format, stick to testing that format and don’t introduce new hypotheses, or else correlating factors won’t be easily identifiable.

Thirdly, ensure uniform distribution and adequate user diversification, making the distribution not entirely random (if you have the ability to influence it). This is necessary to mitigate risks associated with unevenly distributing irrelevant users within one group, which can distort results.

When this method is used in production, a portion of the main traffic is allocated to the candidate variant, and the results are compared with the current version. Such an approach has its drawbacks, such as the learning curve or the novelty effect. Users might reject the new approach in favor of the existing one, or the other way around, simply because it’s “new.” For instance, in the famous assembly-line experiments (the Hawthorne effect), any change to working conditions, whether an improvement or not, was perceived positively. Therefore, for precision, consider dedicating separate streams for each variant and include a control group where nothing changes.
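If you end up implementing the split yourself, one widespread technique is deterministic hash-based bucketing: the same user always lands in the same variant, and users spread roughly evenly across variants. A minimal sketch, where the variant names and experiment salt are placeholders rather than anything a specific tool prescribes:

```python
import hashlib

VARIANTS = ["control", "list", "cards"]  # hypothetical variants for the example
SALT = "catalog-layout-2024"             # per-experiment salt (placeholder)

def assign_variant(user_id: str) -> str:
    """Deterministically map a user to a variant: same user, same variant."""
    digest = hashlib.sha256(f"{SALT}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

for uid in ["u-1001", "u-1002", "u-1003"]:
    print(uid, "->", assign_variant(uid))
```

Changing the salt reshuffles everyone, which is also how you keep separate experiments from correlating with each other.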

Conclusion

After A/B/n testing, we obtain quantitative data on the interactions of each user group. Of course, we can supplement the test with subsequent questions and attempt to compare the responses, but this approach suits only a controlled environment.

Tips

  • Before testing anything, contemplate how you’ll choose the winner.
  • Focus on one variable at a time.
  • Comparison should occur in the same, natural environment to minimize influence.
  • Don’t grant immediate access to multiple variants; this can confuse users and introduce personal biases into the results.
  • To draw accurate conclusions, the variant sample size must be sufficiently large to speak about statistical reliability (more on this later).

Guerrilla testing

Why it’s needed

To quickly, roughly, but effectively test your ideas without a budget or significant effort.

How it works

There are many variations here, ranging from forcefully surveying nearby colleagues to nearly comprehensive user testing. Let’s consider a more serious variant.

Firstly, define the area where you need testing support. As we know, simply asking, “How do you like it?” isn’t too effective. It’s best to express all your thoughts and questions in the form of hypotheses and create a testing scenario, as we discussed earlier. Even if it seems like a waste of time, the session becomes more structured. Remember not to go beyond 15–20 minutes; otherwise, you’ll just exhaust your counterpart, so keep it concise.

After scripting the scenario, or even a protocol, we can proceed to find respondents. This is where guerrilla methods differ from comprehensive user testing because the respondents aren’t always your end users. Test with whoever you can find. Work with what’s available.

You might object: “But won’t we get unreliable data?” Yes, but they’ll only be unreliable for parts that require specialized knowledge. If we want to verify general perception patterns, they don’t vary much from person to person; after all, we’re all human.

Nevertheless, avoid testing on experts like yourselves. With experience, some complex matters start seeming ordinary to us, and we tend to assume everyone sees it the same way. But they don’t. We’re on the cutting edge of the domain, so to speak, representing maybe 10% of people possessing sacred design knowledge, whereas 90% of the population struggles with email. For a clearer picture, it’s best to step outside the design department.

So, once a suitable candidate appears, seek their consent and conduct the session as naturally as possible. We’ll discuss how to do this correctly in the next test. But I’ll note, most likely, things won’t go as planned, so maintain flexibility and adapt to how the test unfolds. Just don’t cross personal boundaries or be too intrusive. It’s entirely normal if you can’t check everything you wanted.

Finally, nobody expects reports from you, so all insights you gain will remain in your head. Sure, it’d be great to document all these insights to avoid reinventing the wheel in the future, but no one demands this.

Conclusion

After guerrilla testing, you’ll obtain raw qualitative data that lacks scientific reliability but can safeguard you from early mistakes.

Tips

  • Even for impromptu testing, try to define the area you need or, better yet, write scenarios.
  • Test only general patterns; save specialized knowledge for comprehensive research.
  • Avoid testing designers; they know too much.
  • Make notes during tests to refresh your memory with facts rather than inventing your own narrative.

Usability tests

Why it’s needed

To verify a user’s ability to complete their scenario without a qualitative evaluation. Many confuse User Testing and Usability Testing, thinking they are the same. However, terminology distinguishes Usability Testing as examining only one aspect of interaction — the actual scenario completion — while User Testing considers interaction from various perspectives.

How it works

As you read further, remember this discusses Usability Testing. Given that testing only checks the ability to perform specific actions, start by outlining the sequence of these actions. Open your Jobs-to-be-Done (JTBD) or personas and select the goals and tasks users should accomplish through your designed interface. If necessary and if they seem too extensive, segment the goal achievement flow into scenario fragments. However, avoid having too many — 4–5 is optimal for a test.

Begin unwinding the scenario, describing each action the user needs to take. Yes, it might sound a bit wild, not the usual way we’re accustomed to. Is it for simpletons? Yes, exactly. Let’s try it together using an example.

Goal: Place an order on an online store.

Actions:

  1. Navigate to the product catalog.
  2. Go to the “Killer Speakers” section.
  3. Find the “Marshall Kilburn II” speakers.
  4. Add the speakers to the cart.
  5. Proceed to the checkout.
  6. Fill in the form and click “Buy.”

The steps can be even more detailed based on your level of madness.

After creating this… script, you can allocate time for your introduction, where you’ll welcome the respondent, reassure them, and engage in a brief conversation. Remember that if you’re testing a prototype, it’s crucial to explain that some parts of the interaction might not work or work incorrectly.

Additionally, it’s good practice to assure the individual that we’re testing the interface, not them, and there are no right or wrong answers. The respondent can greatly assist by pointing out interface areas that raise doubts, irritation, or simply confusion. This is crucial.

In the early 2010s, when people weren’t completely online yet, such tests were often conducted in person and frequently under a moderator’s guidance. In the classic setup, the moderator sat beside the user, maintaining a calm demeanor while observing the user’s struggles to find the “Add to Cart” button, and checked off each scenario step as the respondent worked through it. A struggle like that meant the step had failed and that part of the flow needed review. As you’d expect, few follow the canonical setup now because it’s rigid and costly. Often, user testing is preferable.

Conclusion

Upon completing this “veteran,” you’ll gain a comprehensive understanding of a user’s ability to complete the scenario and pinpoint where issues arise.

Tips

Due to its inflexibility, this test might seem outdated and unnecessary, but give it a chance. I believe there are still scenarios where it can prove useful. For instance, when interactions are so complex that you want to test them in multiple stages.

I’ve combined the usability test and the user test into a single example in the following section, to make the difference easier to feel.

User testing

Why it’s needed

The apex of human thought. Just kidding, of course, but at the moment, it’s the most prevalent and high-quality method for testing user interaction. It’s needed to obtain a complete picture: from color perception to the origins of usability issues rooted in the respondent’s childhood.

How it works

This testing greatly resembles Usability testing, but here, you don’t confine the user with a strict set of actions; you allow some freedom in achieving the goal. Nevertheless, scenarios still need to be written.

As always, start with hypotheses. Once you’ve compiled your list (hopefully always at your fingertips), sort them by the time of occurrence to avoid jumping around different topics and awkwardly saying, “Oh, do you remember where you…?” Each hypothesis needs its place.

After gaining a clear understanding of where, when, and what you need to know, group hypotheses according to logical structures and assign a verification method to each, following our familiar scheme. Sometimes, we need to simply observe behavior, ask clarifying questions elsewhere, or even the user’s inactivity could be sufficient. But most of the time, we’ll ask the user to perform a task. However, you don’t need to detail every step as with Usability testing. It’s enough to express succinctly and clearly what you expect from the user.

Personally, I prefer combining all tasks into one story, where each subsequent step complements the previous one. Adding introductory context creates almost total immersion, rather than just a survey. Storytelling suits both moderated and unmoderated testing. The only thing you can’t really run unmoderated is a thinking-aloud session. Although, with effort, almost anything is possible.

Thinking aloud is a user testing method where the respondent hardly ever remains silent, providing a detailed commentary on each action, their current thoughts, expectations, and reasoning while solving the task. This stream of consciousness is a treasure trove of invaluable insights into the user’s mental model. However, maintaining such a mode can be a real challenge.

On the internet today, everyone wants to express what they think. However, during user tests, for some reason, people prefer to stay silent. To overcome this awkwardness barrier, consider the following steps:

1. Before the session, explain the rules of the game. Explain that you need the respondent’s help in improving your product by asking them to perform a few tasks while addressing any clarifying questions. Assure them that you’re evaluating the solution, not their intellectual abilities, and that any input will be incredibly helpful.

2. Explain that you want the person to reason aloud, voicing their thoughts, even if it might feel slightly unusual. Show them what you expect by providing an example yourself. Verbally reason aloud using the first web page that pops up. This helps people relax, not fear seeming foolish, and gives them approximate boundaries for how much to disclose their thoughts.

3. Start with the first task and, early on, remind the user to think aloud with direct questions like “What are you thinking about right now?” or “What are you looking for on the page?” or “What emotions did this action evoke for you?”

4. Avoid nudging the respondent toward your desired conclusions, even though it can be very challenging. The key here is not to think for the user, no matter how much you might want to, but to wait for their thoughts. Afterward, you can reiterate what they said and ask, “Is that correct?” to avoid misinterpretation.

During unmoderated testing, of course, you won’t be able to ask questions while the task is being executed, but you can ask some things afterward. Even with such limitations, you entirely eliminate your own influence and that of an unfamiliar environment, since the person takes the test in their familiar setting.

RITE Modification

Usually, you’d want to schedule tests closely together to not drag things out and get answers as quickly as possible. Such a dense testing approach will give you clean data within a short timeframe. However, if your primary goal isn’t a beautifully scientifically sound report but adapting the solution to user needs, I’ve got something tailored for you.

You could theoretically arrive at this logic yourself, but since someone has already thought about it and given it a name (RITE stands for Rapid Iterative Testing and Evaluation), why not use it? Imagine that after the first test, it becomes crystal clear that certain parts of the flow are so broken that they’re beyond repair, and other respondents will also stumble upon these errors. Why conduct more testing sessions if you can fix your prototype and check a new version? Sounds logical. Of course, not everything users mention needs immediate fixing. Before getting carried away, it’s better to assess the popularity of the identified problem, adding it to your list of hypotheses and adapting the script. And who said the script can’t be changed?

In doing so, you’ll lose comparability across the entire sample, because you’ll be showing different solutions to different respondents, making it impossible to later summarize everything in a table with percentages. However, you’ll be able to correct your solution faster and with less money spent, saving people’s time as well.

When should you end this cycle? Let’s say after the correction, two more respondents passed the test with flying colors, but the third stumbled upon something else. Then the counter resets, and you need to start again until you hit a combo of 5 successful completions… or until the budget runs dry.
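That stopping rule is simple enough to put into a few lines. A toy sketch, assuming each session reduces to a pass/fail flag:

```python
def rite_should_stop(session_results, streak_needed=5):
    """Stop iterating once `streak_needed` consecutive sessions pass cleanly."""
    streak = 0
    for passed in session_results:
        streak = streak + 1 if passed else 0  # any failure resets the counter
        if streak >= streak_needed:
            return True
    return False

# Two clean runs, a stumble, then five clean runs in a row -> time to stop.
print(rite_should_stop([True, True, False, True, True, True, True, True]))  # True
```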

Conclusion

Depending on the chosen test methodology (and we mentioned that almost all test types can morph from moderated to unmoderated and vice versa), we can obtain both qualitative data about subjective perception with root-cause analysis and quantitative data indicating how widespread a problem is, reducing the risks of approximation (when we generalize the findings to the whole audience).

Tips

I’ve shared quite a lot during the test description, but a couple more closing thoughts:

  • Avoid making super-small tests for just one click; it’s better to replace this makeshift solution with a specialized first-click test.
  • Encourage the respondent throughout the test to prevent them from getting tired prematurely. Thank them afterward because even if your solution shattered into pieces, it’s not the user’s fault but rather their help in making you better.
  • Don’t postpone clarifying questions; each question has its time, and asking it too late will get you distorted recollections.


Working with the audience

So, let’s review the preliminary results and synchronize our clocks:

  • We have a compiled list of hypotheses.
  • A described roadmap for testing is in place.
  • Scenarios for users have been outlined.

It seems everything is ready, but it’s not. First and foremost, our test needs to be… tested. You heard that right: to test the interaction effectively, you first need to test the tests themselves. To avoid getting tangled in this recursion, a specific term was devised: pilot testing. It’s no different from regular testing, except that you’re testing not your hypotheses but the wording and sequence of the tasks.

For this purpose, not only representatives of the target audience but anyone capable of reading or listening might be suitable. If your tests are not intended for a narrow audience of experts, you can validate them with anyone.

Guerrilla methods, as mentioned earlier, can be helpful here. Invite colleagues, relatives, even people who aren’t highly experienced in the field, as long as they have basic computer skills. It’s essential to gather diverse opinions and feedback.

Once you’ve conducted several sessions with the pilot group, you’ll likely want to make changes to the tests: rephrase, rearrange, or reconstruct them. Don’t hold back; make the necessary adjustments.

However, soon after the pilot and revising scenarios, you’ll encounter another issue: how many people to involve and how to find them? These questions depend on the data you want to collect: qualitative or quantitative.

Let’s start with qualitative testing, as it’s seemingly simpler to organize. The most concise answer to the number of participants for qualitative tests is 5–7 individuals. But you probably want to know how this number was arrived at, right? No? Well, let me explain anyway. To simplify:

A sample of 5–7 individuals allows you to identify about 80% of the problems encountered by 80% of users.

Of course, qualitative tests won’t uncover all potential issues — it would take too much time. Besides, it’s not our task to pinpoint every single problem; it’s enough to enhance the interaction for the majority of our users. However, it still depends on the specific setup. For instance, if you’re testing a very small flow, then 7 individuals might be more than sufficient to uncover all problematic areas. Conversely, if your test encompasses 30 different scenarios, then 7 individuals will undoubtedly be insufficient.
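For the curious: the 5–7 figure usually traces back to the Nielsen/Landauer model, in which the share of problems found after n sessions is 1 − (1 − p)^n, with p ≈ 0.31 as the average chance that a single participant stumbles on any given problem. You can check the numbers yourself:

```python
# Expected share of usability problems found after n sessions,
# per the Nielsen/Landauer model: found(n) = 1 - (1 - p)^n.
p = 0.31  # average chance one participant hits a given problem

for n in (3, 5, 7, 15):
    print(f"{n} participants -> ~{1 - (1 - p) ** n:.0%} of problems found")
# Five participants already land around 85%; returns diminish quickly after that.
```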

There’s a “litmus test” I use to decide when to end testing: when insights start to repeat. Once you stop receiving new information, there’s little sense in continuing.

For quantitative research, the numbers will be entirely different. Here, we can’t rely on subjective cues to decide when to conclude our research; statistical significance comes into play. In simple terms, a result is statistically significant when it’s very unlikely to be a fluke of the particular sample; in other words, it’s highly improbable that repeating the study would give a substantially different picture.

An experienced reader might ask what “extremely low” means. Unfortunately, there’s no single numerical value: researchers set the threshold themselves. Usually, they aim for a reliability level of 80–85%, because pushing it further demands substantial resources.

Let’s consider an example. Suppose you want to test a new app for death metal enthusiasts, measuring the frequency of hair shaking. Here, it’s crucial to define your target population, i.e., the number of all hairy death metal fans potentially willing to use our app. If you already have a user base, it can serve as the target population, but it’s better to add an untapped market share for greater data accuracy and breadth.

You also need to determine the homogeneity of your users, how much each individual metalhead differs from another. In retail, for instance, homogeneity may be low, whereas in specialized services like Wolfram, it’s quite high. Of course, if you’re not conducting scientific research and don’t intend to write an article about your testing in a scientific journal, you can disregard homogeneity and assume that the difference won’t significantly affect the results.

But if the “sameness” of people can be compromised on, there’s one parameter you definitely need to pay attention to: representativeness. It determines whether we can generalize the research results to the entire target population. Let’s return to the metalheads. Say our app is used by 10,000 hair shakers: 10% are lead singers, 15% bassists, 25% guitarists, and 50% drummers. If this factor significantly influences the interaction, it should be reflected in the selected group, so you need to gather 10% lead singers, 15% bassists, 25% guitarists, and 50% drummers, or close to these values. Otherwise, you risk generalizing the results to the whole population and getting severely burnt. At the same time, if 75% of metal fans are brunettes and 25% are blondes, that likely won’t affect representativeness, and the factor can be overlooked.

Lastly, we come to the answer to the question “How many respondents to consider?” I don’t have the answer, but if you’ve already determined the target population, reliability (also known as statistical significance), and the level of error you’re willing to accept, you can calculate the volumes yourself.
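If you do want to run the numbers, here’s a minimal sketch using the textbook sample-size formula for estimating a proportion, with a finite-population correction, plus the quota split from our metalhead example. The z-value and margin of error are assumptions you’d pick yourself:

```python
import math

def sample_size(population, z=1.44, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, with finite-population
    correction. z=1.44 is roughly an 85% confidence level; p=0.5 is the
    most conservative (largest-sample) assumption."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

N = 10_000  # our hypothetical population of hair shakers
n = sample_size(N)
print(f"Respondents needed: {n}")  # ~204 with these assumptions

# Keep the sample representative: quotas by role, shares from the example.
roles = {"lead singers": 0.10, "bassists": 0.15, "guitarists": 0.25, "drummers": 0.50}
for role, share in roles.items():
    print(f"{role}: {round(n * share)}")
```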

And now, forget everything I wrote above, because in practice such testing is nearly impossible to run by the book. We generate ideas and create prototypes too quickly to conduct tests on 350 respondents (unless we’re talking about surveys). More often than not, these calculations are simplified down to the formula “the more, the better.” Nevertheless, ever since my master’s thesis, the number 60 has lodged itself in my mind as the point after which one can talk about observable dynamics, though I notice samples of 30 appearing more and more often now. The key is to avoid the error of a too-small sample, where we start calculating percentages from just 7 people.

So, dear reader, if you’re still with me after all the previous manipulations, I congratulate you. You’re on the final stretch. All we have left is to figure out where to get respondents.

Honestly, everything we’ve discussed so far isn’t a big problem because the analysis of the solution, creating hypotheses, scripting, test creation, and analyzing results depend solely on you. With the right level of determination and desire, these skills can be acquired very quickly. But finding respondents for unmoderated testing is an entirely different matter. One might even say that for smooth operation, you need preliminary preparation.

If you already have a product and users, you can turn to them for help. You just need to gather contacts of those willing to assist using a specialized tool or through survey distribution. This way, you’ll gain a better understanding of your audience and be able to select a test sample more easily.

You can also seek help from specialized recruitment agencies or local managers. The key is to provide a clear user profile to avoid working with someone who doesn’t fit your target audience.

Epilogue: Results analysis

Finally, our journey to Testingland is coming to an end. At the finale, I’d like to provide you with a few pieces of advice on what to do when all the fun is over, and it’s time to process the data and draw conclusions.

One thing you shouldn’t do as soon as you get your dataset is to jump to conclusions. The temptation to rush into calculating averages and medians is enormous. However, yielding to this temptation might lead to conclusions that are completely contrary to reality.

Firstly, the data needs cleaning: eliminate irrelevant results. Here’s a list of what to pay attention to (a small code sketch of the cleanup follows the list):

  • Zero results: remove individuals who didn’t even start the test.
  • Duplicates (where detectable): eliminate those who completed the test multiple times. If only part of the test is duplicated, prefer the first completion.
  • Fluctuations: clearly implausible results, like a user taking 3 hours to complete a 2-minute test. Such cases deserve separate examination but aren’t suitable for test analysis.
  • “Mischief-making”: users whose aim wasn’t to take the test but to mess around, click randomly, or just take it for a test drive.
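Here’s what that cleanup might look like in pandas, on a made-up export of sessions; the duration thresholds are assumptions you’d tune per test:

```python
import pandas as pd

# A made-up export of test sessions: one row per completion attempt.
sessions = pd.DataFrame({
    "respondent": ["r1", "r2", "r2", "r3", "r4"],
    "answers":    [12,   12,   12,   0,    12],
    "duration_s": [140,  190,  150,  0,    10800],
    "started_at": pd.to_datetime([
        "2024-01-08 10:00", "2024-01-08 10:05", "2024-01-08 11:40",
        "2024-01-08 10:10", "2024-01-08 10:20",
    ]),
})

clean = (
    sessions[sessions["answers"] > 0]             # zero results: never started
    .sort_values("started_at")
    .drop_duplicates("respondent", keep="first")  # duplicates: keep the first run
    .query("60 <= duration_s <= 3600")            # fluctuations: implausible times
)
# "Mischief-making" has no universal filter; it takes per-product heuristics
# (e.g., flagging random click patterns), so it's left out of this sketch.
print(clean)
```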

Once the data is cleaned, it might be inadequate in terms of the minimum sample size or representativeness. Therefore, I suggest having a reserve of respondents to fill in the missing percentage or consider this beforehand and continue testing until you have a sufficient amount of data.

Now that all the data is scrubbed clean and sparkling like a fox’s tail, it’s time to go back to the beginning and see which hypotheses we confirmed, which we refuted, and for which we received mixed results. This is where all those KPIs we defined earlier come into play: they now need computing (if your testing tools haven’t done it already) so you can compare them with the hypothesized values. If the deviation is significant, the hypothesis can’t be considered confirmed, and adjustments need to be made.

If your hypothesis is confirmed, congratulations! But even if it’s refuted, my congratulations still stand. There’s no reason for disappointment; you’ve gained new information to use in the product’s further development.

Seemingly, that’s all, isn’t it? Data processed, hypotheses tested — finito. But before turning conclusions into actions, perhaps you’d like to explore some peculiarities of our brains, like the influence of the status quo or the learning curve, which will give you an understanding of how events might unfold in the long term.

For instance, if users prefer doing work as they’ve always done, they’ll reluctantly embrace new testing methods initially, but over time, they won’t imagine life without them. Conversely, something new might evoke excitement during testing just because it’s new, but after the initial enthusiasm, it settles into a plateau.

And with that, our journey comes to an end. Naturally, this guide can’t cover all aspects of testing, so I rely on your help here. If you have suggestions for improvement, find inaccuracies, or simply want to share your experience, please leave a comment, and we’ll make the necessary changes.

Thank you so much for your attention, it took quite a bit to make it to the end. Nevertheless, I hope this work has helped you better understand the matter and take your testing to a new level.

Published in Bootcamp: From idea to product, one lesson at a time. Bootcamp is a collection of resources and opinion pieces about UX, UI, and Product. To submit your story: https://tinyurl.com/bootspub1

Written by Vanilla Thunder (Dmitry Vanitski, Principal UX Designer, Lithuania)