A tree test is a research method used to discover how easy it is for your users to find content in a site’s taxonomy. It can be done to benchmark your current site or validate one you’ve created. We decided to do both so we could compare the two sets of results and see if our newly proposed taxonomy was an improvement or not.
Throughout our time rejigging the Information Architecture of our site, we learned a lot that fed into how we set this test up. There are quite a few factors that need to be refined to get the best results possible: the number of participants, the number of tasks, the instructions given, and more. They all have an impact on the validity of the test, so it was important for us to do a bit of research, with regular rounds of guerrilla testing, to make it as airtight as possible.
What content to use
The first thing we wanted to know was what content we should use in the test. Participant fatigue was a serious consideration in this as it’s a fairly intensive study that requires the user to read and understand a task and then review a navigation structure to find an answer they believe is right. We knew that not all of the 36 items we used in our card sort would be able to make the cut for this test and so we had to find a way to choose the most relevant content and the right number of them.
To choose the most relevant content, we first listened to the business to understand which content had the biggest commercial impact, then looked at the data to see which pages users were most interested in, and merged those findings into a ranking system.
Using our own analytics software and Google Trends, we built a spreadsheet that ordered our content by popularity, measured through a mix of searches and page visits.
Unsurprisingly, this data matched up with the content the business wanted us to focus on. We then set an importance ranking against each item so we knew which were of high, medium, and low worth. This let us include a mix of this content in our test and weight it towards the items that mattered most to us.
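The ranking approach above can be sketched in a few lines. Everything here is illustrative: the item names, raw numbers, the 50/50 blend of visits and searches, and the even split into thirds are all assumptions, not the real figures or weights we used.

```python
def popularity_score(page_visits, searches, visit_weight=0.5):
    """Blend page visits and search volume into a single popularity score.

    The 50/50 weighting is a hypothetical default, not the real mix.
    """
    return visit_weight * page_visits + (1 - visit_weight) * searches


def importance_tier(rank, total):
    """Split a ranked list into high / medium / low thirds."""
    if rank < total / 3:
        return "high"
    if rank < 2 * total / 3:
        return "medium"
    return "low"


# Hypothetical content items: (name, monthly page visits, monthly searches)
items = [
    ("Pricing", 12000, 8000),
    ("Returns policy", 9500, 6100),
    ("Contact us", 7000, 2400),
    ("Careers", 1200, 900),
]

# Rank by blended popularity, most popular first, then tier the ranking.
ranked = sorted(items, key=lambda i: popularity_score(i[1], i[2]), reverse=True)
for rank, (name, visits, searches) in enumerate(ranked):
    print(f"{name}: {importance_tier(rank, len(ranked))}")
```

In practice the tier boundaries came from merging the data with the business's commercial priorities rather than a mechanical three-way split, but the shape of the calculation is the same.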
How many tasks to use
With this in hand, we set out to define which content we wanted to test. A set of 20 tasks was chosen as a good starting point, to find out whether that was too many or whether we could include more without tiring out our participants. We ended up with more medium-importance tasks because we were able to include all eight of our high-importance items.
How to write tasks
The next step was to write the tasks for the test. This study was primarily a test of our new structure, but it also tested the labelling of the content: we'd gone through several workshops with our Brand and Copy departments to reword our content in a more user-friendly way. This meant the tasks needed to be worded so the user knew what content they were looking for without being led in any way. We did this by creating scenarios for the participant and making sure not to include any words in a task that were similar to the page titles.
With the content chosen and tasks written, we had our first iteration of the test. To improve the setup, we ran several rounds of guerrilla tests in the office to find out how quickly the test could be completed, then interviewed the participants afterwards to see if they understood the tasks and whether there was anything else we could improve.
What we learned was that 20 tasks was the right number: we couldn't push any more into the test, but everyone felt comfortable completing that many. We also saw that in some tasks a few users went to the wrong piece of content because the wording of the task was confusing. We reworded those tasks to make them clearer and retested them to check that users now understood them.
To avoid including results from participants who were unsure about their answers, we brought in an SEQ to gauge the confidence the user had in each answer. After every task, the user was asked to rate how confident they were that they had found the correct answer on a scale of 1 to 7. This would later help us greatly in analysing the results.
Number of participants
Once we knew we had tested our study enough and fixed all the issues we'd found, we released it to 76 participants. This number was chosen so that we had enough statistical significance to represent each of our personas and the market share each individually holds. We used the same number for our card sort.
This was another stage in setting the study up where we knew participant fatigue could negatively affect the validity of our results. If the tasks were always shown in the same order, the last ones would suffer more than the first. To avoid that, we randomised them. We split the 20 tasks into two groups of 10 that roughly shared the same mix of task importance. As we wanted to benchmark our current taxonomy as well as validate our new one in the same study, each group of 10 was tested against a different taxonomy. The order in which the two groups were shown to participants was then randomised, as was the order of the tasks within each group.
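The counterbalancing scheme described above can be sketched as follows. The task names and the 8/8/4 importance mix are hypothetical, and the alternating-assignment split is one simple way to balance the groups, not necessarily the exact method we used.

```python
import random


def balanced_split(tasks):
    """Split (task, importance) pairs into two groups with a matching mix.

    Alternating tasks of each importance level between the groups keeps
    the high/medium/low balance roughly equal on both sides.
    """
    groups = ([], [])
    by_level = {}
    for task, level in tasks:
        by_level.setdefault(level, []).append(task)
    for level_tasks in by_level.values():
        for i, task in enumerate(level_tasks):
            groups[i % 2].append(task)
    return groups


def participant_order(groups, rng):
    """Randomise the group order, then the task order within each group."""
    ordered = list(groups)
    rng.shuffle(ordered)
    return [task for group in ordered for task in rng.sample(group, len(group))]


# 20 hypothetical tasks: 8 high, 8 medium, 4 low importance.
tasks = (
    [(f"task-{i}", "high") for i in range(8)]
    + [(f"task-{i}", "medium") for i in range(8, 16)]
    + [(f"task-{i}", "low") for i in range(16, 20)]
)

group_a, group_b = balanced_split(tasks)
order = participant_order((group_a, group_b), random.Random(42))
```

Each participant then saw one group of 10 against the current taxonomy and the other against the proposed one, in a shuffled order.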
The final (and most interesting) part of this study was analysing the results. Inspired by Jeff Sauro of MeasuringU, we created a graph of our results that compared the success of task completion against user confidence and broke it up into a matrix that was able to describe our findings.
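The matrix above can be sketched as a simple classifier over per-task results. The task names, rates, and cut-off values (70% success, confidence 5 out of 7) are all hypothetical, as are the quadrant labels; they illustrate the shape of the analysis rather than reproduce Sauro's exact method or our real thresholds.

```python
def quadrant(success_rate, mean_confidence,
             success_cut=0.7, confidence_cut=5.0):
    """Place a task into one cell of the success/confidence matrix.

    Cut-offs are illustrative assumptions, not the values we used.
    """
    if success_rate >= success_cut:
        return ("confident success" if mean_confidence >= confidence_cut
                else "hesitant success")
    return ("confident failure" if mean_confidence >= confidence_cut
            else "hesitant failure")


# Hypothetical per-task results: (task, success rate, mean 1-7 confidence)
results = [
    ("find returns policy", 0.85, 6.2),
    ("find delivery costs", 0.40, 6.0),  # the worrying cell: wrong but sure
    ("find size guide", 0.80, 4.1),
    ("find gift cards", 0.35, 3.8),
]

for task, success, confidence in results:
    print(f"{task}: {quadrant(success, confidence)}")
```

The "confident failure" cell is the one to watch: users who are wrong but sure of themselves won't recover on their own, which is exactly the pattern our benchmark surfaced.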
What we found was that our current taxonomy performed incredibly poorly: users thought they were going to the right places when in fact they weren't. This was a great discovery when contrasted with our proposed taxonomy, where users were both confident and correct in getting to the right content.
Having this data gave us confidence that what we were suggesting would actually work, and we could present our findings to the business to back up our ideas.