Web app remote usability testing on a budget with Fullstory & Google Forms.

Anton Lebed · Published in Prototypr · 7 min read · Jul 11, 2018


In preparation for a closed alpha release of our neuromarketing research web application, we ran a remote usability test with 25 participants in an attempt to catch some of the hidden “error gems” we may have missed during initial in-house validation tests.

Testing a prosumer product can be hard 😖. Especially when you need to find the right audience for the test. We got incredibly lucky with our participants — a university professor in France, teaching Marketing Research to first-year students, reached out to us willing to provide a class of 25 who could take the test for us.

Problems:

  • Setting up a testing environment on a bunch of college machines with locked-down user privileges and hoping that video capture and the necessary plugins would simply work was an unsettling thought.
  • Paying $50 per test on common user testing platforms was simply out of our budget as well. We wanted to get as much scale as possible: 1) it was a one-off opportunity — no second attempts; 2) even though few new issues would most likely be found after the first 5–6 participants took the test, it was worth the gamble given the opportunity.
  • No one on our team spoke French, and we were told that our participants had varying levels of spoken English, so the traditional approach of recording verbal feedback during the test could have been problematic.

So… we had to get creative 👨‍🎨 👩‍🎨 The right solution came from the tools we already had in our UX cupboard: Fullstory + Google Forms.

✏️ Note: given that this was a highly experimental technique, we ran 5 users from the group through one of the traditional user testing platforms — both as a backup and to compare the end results.

Working with Google Forms

Task scenarios and time to complete baseline

We kicked off with a classic routine — opened a text doc and wrote out the whole list of task scenarios, introductions and prerequisites required for the test. We did a couple of trial runs to get a feel for the overall flow and how our subjects would experience it. After a bunch of necessary edits and about 4–5 trial runs, we were happy with the outcome and started moving the test into Google Forms piece by piece, continuing trials as we went along.

💡TIP: Use sections to break up your instructions, tasks, and questionnaires into digestible chunks.

Before moving forward we did a couple of final test runs, but this time we timed how long it took to complete each task (as experienced users) to use as a comparison baseline when analysing the results later.

Questionnaires and usability scales

Once we completed our Google Forms task setup, we added a short questionnaire section to the start of the form to get some basic demographic details and capture assigned user IDs.

We closed off the end of the task list with another questionnaire section, which included the SUS (System Usability Scale) questionnaire, most/least liked features, overall impression and a few questions focused on aspects we were particularly interested in investigating.
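
SUS scoring is simple to automate when the analysis stage comes around (we did ours in a spreadsheet later on). Here's a minimal sketch of the standard calculation: odd-numbered items contribute (score − 1), even-numbered items contribute (5 − score), and the sum is multiplied by 2.5 to give a 0–100 score.

```js
// Standard SUS scoring for one participant's 10 answers (1–5 scale, in question order).
function susScore(responses) {
  const sum = responses.reduce((acc, score, i) => {
    const isOddQuestion = i % 2 === 0; // questions 1, 3, 5, 7, 9
    return acc + (isOddQuestion ? score - 1 : 5 - score);
  }, 0);
  return sum * 2.5; // 0–100
}

// Example with made-up answers from a fairly positive participant:
console.log(susScore([4, 2, 5, 1, 4, 2, 4, 2, 5, 1])); // 85
```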

Fullstory set up

Getting a build

We set up a clean testing environment and ensured it would only be used for this experiment, keeping QA and dev interactions out of it during the test process. We created a bunch of accounts with a sequential prefix (usertest_001) and simple passwords so we could easily distribute them to our participants.
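
If you'd rather script that credentials list than type it out, a throwaway helper does the job. This is purely an illustrative sketch — makeTestAccounts and the password pattern are made up, and the accounts themselves were created through our own app.

```js
// Hypothetical helper to generate a batch of sequential test credentials.
function makeTestAccounts(count, prefix = 'usertest') {
  return Array.from({ length: count }, (_, i) => {
    const id = String(i + 1).padStart(3, '0'); // 001, 002, ...
    return {
      username: `${prefix}_${id}`, // e.g. usertest_001
      password: `test-${id}`,      // simple and easy to hand out
    };
  });
}

console.table(makeTestAccounts(25));
```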

Integrating Fullstory

Fullstory is a really unique analytics tool.

Fullstory replays your customer’s journey — like a DVR for your website — so you can search, see, and understand your user experience.

More importantly, it allows you to easily segment those recordings to get a little more on the quantitative side, and play them back to find out what users are really doing in your app or on your website. Funnels are really top of the range here.

We set up a free account and got a whopping 1000 free recordings a month — more than enough to run a good few user tests.

We needed a little help from our dev team to integrate Fullstory into our latest build. It's a pretty simple copy-paste script integration (also possible via a tag manager). But with a little JS magic, you can also identify some parameters, e.g. the username — which came in handy later for matching our captured recordings to the questionnaire answers. After setup was completed, we did a couple of test runs to check that recordings were being captured correctly. And it all worked smoothly 😎
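
The identification part looks roughly like this — a minimal sketch using Fullstory's FS.identify call, where currentUser stands in for however your app exposes the logged-in user:

```js
// After the standard Fullstory snippet has loaded (it exposes window.FS),
// tie the session to the test account so recordings can later be matched
// to questionnaire answers by user ID.
if (window.FS && currentUser) {
  window.FS.identify(currentUser.id, {
    displayName: currentUser.username, // e.g. "usertest_001"
  });
}
```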

Testing

We contacted our test orchestrators in France and provided them with a list of login details, a test introduction and a troubleshooting guide. We also provided some basic instructions on how to use the Google Form during the test and suggested keeping it side by side with the application window if screen real estate allowed 🖥, or using a tablet or a phone to have the test right in front of them at all times. ⭐️️️️ ⭐️ ⭐️ ⭐️ ⭐️ for RWD on Google Forms.

💡TIP: Remind your participants that they need to click SUBMIT on the final page of the form — otherwise responses captured will not be saved.

Running the test

Sit back and relax 🍹 while waiting to collect all the responses and enjoy the live sessions view in Fullstory — a very cool feature.

Analysis

Once all the tests were completed it was time to have a look at the recordings. Fullstory offers a great playback experience: it clearly marks inactivity periods (with an option to skip them), captures each and every interaction, and comes with great filtering options.

We viewed each recording one by one and took notes, identifying issues and patterns as far as Fullstory's functionality allowed.

Grouping

Once we had reviewed and analyzed all of the recordings, we started grouping recurring patterns and issues to form a master list. As we went along, we noted how many users had encountered the same issue and ranked each issue using a severity scale. 🎉 We had a list of issues complete with frequency of occurrence and severity.
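
One simple way to turn such a master list into a ranked backlog is to multiply frequency by severity — sketched below with made-up placeholder issues and an assumed numeric severity scale; the exact weighting is a judgment call.

```js
// Rank issues by a simple priority score: frequency × severity.
// The issues below are illustrative placeholders, not our real findings.
const issues = [
  { name: 'Example issue A', frequency: 14, severity: 3 },
  { name: 'Example issue B', frequency: 9, severity: 4 },
  { name: 'Example issue C', frequency: 4, severity: 1 },
];

const ranked = issues
  .map(issue => ({ ...issue, priority: issue.frequency * issue.severity }))
  .sort((a, b) => b.priority - a.priority);

console.table(ranked);
```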

Loading the whole list into a spreadsheet and adding some color coding produced a beautiful and easy-to-read end result.

With some filtering and sorting, we were able to quickly create a prioritized list. We made additional spreadsheets to note average task completion time, average time on task and also the SUS scale calculations. All of these were used as a comparison baseline in Round 2 testing to see how the resolved issues had improved the above stats.

By comparing participants' completion times with our own, we were also able to quickly identify the most problematic tasks by isolating those that took significantly longer to complete than our baseline timings.
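
In spreadsheet terms this was just a ratio column, but the logic is simple enough to sketch — the numbers and the 2x "significantly longer" threshold below are placeholders, not our real data:

```js
// Flag tasks whose average participant completion time far exceeds
// our own baseline timings (all values are illustrative, in seconds).
const baselineSeconds = { task1: 40, task2: 75, task3: 30 };
const participantAvgSeconds = { task1: 55, task2: 210, task3: 38 };

const THRESHOLD = 2; // treat anything over 2x baseline as problematic

const problematicTasks = Object.keys(baselineSeconds).filter(
  task => participantAvgSeconds[task] > baselineSeconds[task] * THRESHOLD
);

console.log(problematicTasks); // ["task2"]
```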

Finding & discussing solutions

With all analysis prep work ready, we sat down with all the stakeholders and went through our findings to identify potential causes, solutions and any core priorities that may not have been reflected by severity and frequency scores.

Focusing on high-priority issues, we prototyped the solutions we had discussed and validated our assumptions in a series of small-scale tests using basic click-through prototypes. Once we were happy with the end result, we loaded all the solutions into JIRA and walked our dev team through them before letting them proceed with development.

Round 2 testing

A few weeks later we were all set for Round 2 testing. We ran it with a smaller group of participants to validate our solutions.

The end result was astonishing 😍. Tremendous improvements in average task completion time, an almost complete absence of the previously identified usability issues and a serious reduction in frustration and confusion among users all across the app.

Comparison

When we compared our experimental setup to the control test run on another platform (a recording plugin with task display), we were delighted and surprised:

  • 2 of the 5 scheduled tests didn’t run because of issues with limited admin rights
  • the absence of inactivity skipping and other cool playback features made it a little harder to analyze the material
  • as predicted, voice commentary was almost lost due to the noisy classroom environment, bad microphones (one machine had no microphone at all) and the varying spoken English ability of our subjects 😦
  • the number of issues found was 15% lower, and it was significantly harder to gauge issue frequency (when viewed in isolation from the other tests)

Conclusion

It worked, we loved it and we will do it again! This is probably not for everyone, but if you are running a remote test with a large group of participants on a low budget, dealing with restricted admin rights and poor audio quality, and are willing to try something new — give it a shot!

Thank you for reading!

If you liked this article please show some ❤️ and 👏

Editor’s note: this is the first article I have written in a long time, and it came out of a large stack of drafts I’ve been procrastinating on publishing for a while. Hopefully this is the start of the continuous writing I’ve been planning to kick off for a long time, and I would really appreciate some feedback.

Comment below, catch me on Twitter or drop me a mail at anton@lebed.works
