Back to {Basics|the Future}: Deploying High-Tech Solutions over Low-Tech Channels for Design Research

Conlon Novak
CMU MHCI Capstone 2020: Gov AI
10 min read · Jul 7, 2020

TL;DR With just three and a half weeks to go, we’re putting our near-final designs through the wringer to validate, challenge, and improve upon the interactions, verbiage, channels, and experiences that we’re striving to create.

Lights out, and away we go!

Over the past several weeks, we here at Team Gov AI have been ‘all hands on deck’ to get our most ambitious testing plan ever off the ground for the nearly final iterations of our design. Building on all of our prior work, what we’ve taken to calling our “v2 Usability Testing” is the culmination of our research, design, and development efforts over the past six (!!!) months. As we walk you through it, you might see elements of our previous research efforts, from our remote applicant recruitment via social media, to a virtual focus group with recipients and advocates, to our in-person background research with experts and stakeholders. We’re no strangers to a comprehensive research plan, but this one takes the cake. Let’s break it down!

What we’ve done so far

Last week saw us launch and conclude the first two rounds (v0 and v1) of our planned usability tests. First, v0 consisted of a series of internal pilot sessions with a focus on getting a grasp of the logistics of testing as well as fixing critical, progress-halting bugs. A couple of team meetings and quite a few tweaks later, we were ready for even more feedback.

The next round of testing saw these tests turned outward as we began to elicit external feedback and put our bug-fixes to the test in real interview conditions. These informal tests were done with 12 external, non-target (but still relevant) testers from various backgrounds, ranging from conversational user interface designers to students receiving unemployment insurance or financial aid to senior citizens. The goal here was to run preliminary testing with participants from a range of ages, gender identities, benefits experience, and technological familiarity — the same elements that we codified in our survey for v2 testing recruitment.

Simultaneously, we were spinning down our recruitment efforts on Facebook and Qualtrics (detailed below, in “Who we’re testing with”), finalizing our research plan for v2 (see “How we’re testing”) and building out additional iterations of the prototypes to be tested (the section below, o’ reader mine!).

Synthesizing the feedback we collected from v1 brings us to this post: the plan for v2 usability testing, with the ‘what,’ ‘who,’ ‘how,’ and goals defined for our largest research effort yet as a team.

What we’re testing

Last sprint, we walked you through our design process for the prototypes that we took into v1 testing. Since then, we’ve narrowed our focus to two core questions:

Iterating on Voice-First

How well does our voice-first experience support the people we’re building it for, and how can we improve it?

We made the decision as a team to focus our efforts on creating the most thorough and polished voice-only experience in order to show the level of fidelity that can be achieved in such an experience. To that end, we’ve focused our efforts on refining the language used by the voice agent, improving error handling, and being careful to include progress tracking and options for assistance where we’ve seen people get stuck in our initial testing. All of these goals have to be balanced against the fact that audio can only be processed serially by humans — meaning we can only listen to one thing at a time and generally in full before we respond — and that we want to keep the overall interaction brief such that it can be completed easily in one sitting.

It was important to us to carefully craft the initial conversation turns to sound human and allow people to feel comfortable opening up to what is in essence a scripted interaction. We know that the population we are trying to reach includes folks who are not comfortable with voice technology or have little experience with voice agents. For that reason, we wanted to make them feel that they could speak conversationally. That also meant, of course, that we must be prepared for the eventuality that potential applicants would say things that our voice agent could not handle, and we would have to provide a clear follow-up to help them shape their responses so that they could keep their interaction moving forward.
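
To make the shape of that error handling a little more concrete, here’s a minimal, framework-free Python sketch of a single conversational turn with a re-prompt and progress tracking. The question wording and helper names are hypothetical stand-ins, not the actual script or logic from our Voiceflow prototype.

```python
# Hypothetical sketch of one screening question with re-prompting and progress
# tracking; the wording and structure are illustrative, not our production script.

QUESTIONS = [
    ("household_size", "How many people live in your household, including you?"),
    ("monthly_income", "About how much does your household bring in each month?"),
]

def parse_number(utterance: str):
    """Pull the first whole number out of a spoken response, if any."""
    for token in utterance.replace(",", " ").replace("$", " ").split():
        if token.isdigit():
            return int(token)
    return None

def handle_turn(step: int, utterance: str) -> str:
    """Handle one turn: acknowledge progress, or gently re-prompt on unclear input."""
    _, question = QUESTIONS[step]
    value = parse_number(utterance)
    if value is None:
        # Error handling: restate the question and offer a path to assistance.
        return (f"Sorry, I didn't catch a number there. {question} "
                "You can also say 'help' to hear an example answer.")
    # Progress tracking: tell the applicant where they are in the flow.
    return f"Got it, {value}. That's question {step + 1} of {len(QUESTIONS)} done."

if __name__ == "__main__":
    print(handle_turn(0, "um, there are 4 of us"))  # recognized, progress update
    print(handle_turn(1, "I'm not really sure"))    # unrecognized, re-prompt with help
```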

Validating SMS

Do potential applicants demonstrate a want or need for an alternate, text-based SMS channel?

A mid-fidelity conceptual prototype of the SMS flow (left) translated into a mid/high-fidelity experiential Facebook Messenger prototype bot (center) built on Chatfuel (right).

If we find support for aspects of a text-based flow (something that was initially validated in v1 testing), this opens up a range of opportunities to build out and/or recommend a more expansive screening and application experience over text. If not, it allows us to determine how and where text might be best utilized — for example, perhaps as a feedback mechanism after an applicant has finished the voice-first flow or an input mechanism mid-flow for sensitive information while in a public space. We don’t want to jump the gun here, as this prototype isn’t nearly as far along as our voice-first work, but early indications are promising and present a rich and distinct design space that our team has been eager to dive into.
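
If we do end up recommending a fuller text-based channel, the mechanics are straightforward to sketch. Our prototype lives in Chatfuel on Facebook Messenger, but as an illustration of the same idea over plain SMS, here’s a rough Flask + Twilio webhook that walks a texter through a couple of screening questions. The endpoint, question wording, and the bare-bones state handling are all hypothetical.

```python
# Illustrative only: a bare-bones SMS screening flow using Flask and Twilio.
# Our actual prototype was built in Chatfuel on Facebook Messenger, and a real
# flow would validate each reply rather than simply advancing to the next question.
from flask import Flask, request
from twilio.twiml.messaging_response import MessagingResponse

app = Flask(__name__)

QUESTIONS = [
    "Hi! To see whether you might qualify for SNAP, how many people live in your household?",
    "Thanks. Roughly what is your household's total monthly income, before taxes?",
    "That's everything we need for now. We'll text you an eligibility estimate shortly.",
]

progress = {}  # maps a phone number to how many questions it has been asked

@app.route("/sms", methods=["POST"])
def sms_reply():
    sender = request.values.get("From", "")
    step = min(progress.get(sender, 0), len(QUESTIONS) - 1)
    progress[sender] = step + 1

    reply = MessagingResponse()
    reply.message(QUESTIONS[step])
    return str(reply)
```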

But Conlon, I can hear you say, whatever happened to the native mobile app we saw in your post two weeks ago?

Keen readers might notice that we’ve moved away from iterations on what we’ve been informally calling the “Voice + Visuals” channel, which would have likely been a native mobile application or mobile web app. There are several reasons for our decision, but the biggest was that our v1 testing challenged our hypothesis that visual feedback in the form of confirmation prompts would be helpful to applicants. It was actually seen as somewhat annoying, though multiple participants said they “understood why [they] might need it.” While we’re still working on solving issues inherent in voice error handling in other ways, interactions in that style aren’t a solution we’re currently pursuing (especially not in the form of a native mobile app, which would create a number of other problems in terms of discoverability, install base, and maintenance, among other factors). You will, however, see some of the better-received features and functionalities of this prototype (notably, its improved ability to view and gauge progress mid-flow) make their way into future iterations of our other channels.

Who we’re testing with

For all of this to work, we need to make sure that we’re talking to people who might actually use this tool in the future. To find that audience, we had to start with definitions from the PA SNAP Handbook and co-creation with our client to define who our “lead users” might be, all while factoring in how COVID and WFH have changed both our ability to recruit participants and, potentially, their home and work environments. From there, we narrowed our scope down to three primary groups of interest: single parents, senior citizens, and students. Additionally, we knew we wanted Pennsylvania residents (we’re designing for PA SNAP, and program implementation can vary slightly from state to state) who had minimal prior experience with the SNAP screening and application process (limiting ourselves to “potential applicants” and those “awaiting decisions” on their SNAP applications from DHS).

Considering that time spent online has spiked over the last several months, and given the availability of Facebook’s all-too-robust ad targeting tools, we shelled out some of our team budget for some artificial traction for our recruitment post on the social media titan…

Participant recruitment via Facebook Ads (left) targeted at Pennsylvania residents over the course of a week (right)

…and boy did it pay off. In just a week, we were able to screen hundreds of people to try to find those who would fall into our relatively narrow categories above. Parsing through our data (we primarily collected information on demographics, SNAP experience, and technological access, familiarity, and usage), we were able to identify almost thirty people across the three groups (students, seniors, and single parents) who live in PA and have minimal SNAP experience, with dozens more in tangential groups (either with more SNAP experience, living in adjacent states, or otherwise just outside of those narrow slices).

Some questions from our Qualtrics survey (left) and some general demographics information about who we were able to reach (right).
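
For the curious, the mechanical side of that parsing is nothing exotic. Here’s a hedged sketch of the kind of filtering we ran over the Qualtrics export using pandas; the column names and category labels below are hypothetical stand-ins for our actual survey fields.

```python
# Hypothetical sketch of filtering a Qualtrics export for v2 candidates.
# Column names and labels ("state", "snap_experience", "group") are illustrative.
import pandas as pd

responses = pd.read_csv("qualtrics_export.csv")

candidates = responses[
    (responses["state"] == "PA")
    & (responses["snap_experience"].isin(["never applied", "awaiting decision"]))
    & (responses["group"].isin(["student", "senior", "single parent"]))
]

print(f"{len(candidates)} candidates across our target groups")
print(candidates.groupby("group").size())
```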

Having reached out to those participants of interest directly to set up Zoom interviews over the course of this week, we sat down as a team (in an unusual meeting late Sunday night) to review our finalized interview protocol and understand exactly…

How we’re testing

The synthesized feedback from v1 usability testing was instrumental in making a few major issues clear to us:

  • VoiceFlow, while an excellent prototyping tool for voice experiences, came up short in its ‘Alexa Skill Preview’ mode (which was the easiest way we found to share our skill at that point). People repeatedly ran into trouble enabling microphones, correctly triggering its listening mode, and getting the skill to properly recognize utterances. We needed to export our skill to a more mature voice recognition platform, like…
  • Alexa devices offer excellent voice recognition, but there was no straightforward way to push an unreleased skill to participants’ own devices (at least not this early in development), assuming they had Alexa devices at all.
  • To solve the hardware accessibility problem, we looked to the Alexa Developer Console, which features an embedded, full-fledged virtual Alexa even for skills not yet released or in beta. It is, however, only accessible with our login credentials and a 2FA email sent to our team address, which made this theoretically perfect solution incredibly difficult and risky to share with participants in practice.

So we went back to the drawing board. In a non-COVID situation, what would be the obvious solution here? We’d simply deploy our development skill to our own devices, bring the participants to our physical location, and proceed with testing. As with so many other in-person interactions, we applied the same problem-solving framework here that we’ve seen countless others apply to their own problems over the past several months: how might we translate this testing protocol to Zoom?

Screenshot from a mock interview, where Tommy is able to invoke our Alexa skill on Conlon’s Amazon Echo, remotely over Zoom!

To be quite honest, we were pleasantly surprised at how well this seemingly bare-bones solution worked for our purposes. It eliminated any need for participant setup, offered excellent voice recognition in spaces with low to moderate amounts of ambient noise, and provided an experience closer to a real smart speaker than either of the software solutions above, since it doesn’t show a view of the conversation’s history (both Voiceflow’s Alexa Preview and the Alexa Developer Console do).

What we’re learning

Based on the first of the testing sessions scheduled for this week, where we had participants run through our Alexa skill over Zoom, we discovered some key initial takeaways:

  1. Participants found it hard to remember the specific prompts that trigger next steps, especially as sentences get longer
  2. The flow gives people no chance to tell the story behind their financial burden
  3. In a public setting, talking about expenses and income out loud is less preferable than typing responses
  4. Emails or notifications sent after screening need to give people clear, actionable next steps

So far, it seems that our skill was easy to use, but it was not an enjoyable experience because it dove so deep into the low-level details of one’s household configuration, expenses, and income to determine eligibility, rather than giving people the chance to tell their stories at the start about their financial burden or their troubles in seeking help. This may indicate the need to bring elements of human connection and empathy to the forefront of our work, with the same focus we’re putting into eligibility determination.

It was also interesting to note how sentiments around usage patterns changed when participants were asked how they might engage with the tool in public versus private settings, validating an anticipated need for privacy and a desire to be considerate of others in public. We want to carry these considerations forward and support them in our final design and future recommendations — perhaps enabling people to type in their responses to more private questions, such as income, on their phone while talking to an Echo device, or on an ATM-style virtual keypad attached to a smart speaker in a public setting.

When it came to setting reminders in this process, it was interesting to find that one participant noted that reminders may seem unnecessary since this interaction is short and meant to be done in one sitting — otherwise, they wouldn’t do it at all. Here, we may have an opportunity to refocus on a short, streamlined flow to be completed in one sitting, reworking or dropping features like creating accounts or setting reminders to save certain responses.

We’ll be able to dive deeper into potential design decisions like the above once we work our way through the rest of our usability testing and synthesize our findings — a topic you should look forward to reading more about on this page, especially if you’re left wondering…

What’s next?

As our summer semester comes to a close, so does our research, project, and (wonderful) experience with MHCI and Gov AI. With that said, there’s still plenty to talk about! Expect updates over the next several weeks on:

  • A recap of our findings from v2 Testing, once it wraps up (expected to be later this week!), and how we’re adjusting our design and development processes to iterate accordingly.
  • Our final presentation, happening at the end of the month, which will mean a standardized style guide tying our deliverables together, a messaging plan around our design steeped in our research, and collateral content to support our presentation in a variety of ways (from Zoom backgrounds to creative presentation agendas and beyond!)
  • …and likely other things, but saying “much more” at this point feels a touch ambitious. In the interest of only doing a few things, but doing them *really well,* perhaps expect just a handful of surprise inclusions in the next few updates that you (and possibly we) can’t anticipate at this point.

Thanks for reading, and remember: even if we have to tape a telegraph to a carrier pigeon, we will find a way to talk to you soon!

— Conlon, Laura, Tommy, Simran, and Judy
