The Sounds of Silence: Why Building a Sleep Helper Bot is Hard

GU Human Language Technology
6 min read · Feb 15, 2022


By: Victoria Jin, Aine McAlinden, Stephanie Chen, Yifu Mu

Illustration by Stephanie Chen

Our team of five is creating a sleep helper bot named Luna. As a group of full-time students, we know well how important — and elusive — a good night’s rest can be. So we set out with one simple goal: putting our users to sleep. But given the unique purpose of our bot, we soon realized that this would be a VUX project like no other. We quickly found that some aspects of the design process would demand special attention for Luna — aspects that, for other voicebots, may not be quite as vital.

Overall, the user’s entire experience with the bot suddenly came into hyperfocus. It’s uber important that the user experiences a pleasant interaction with Luna if we want them to drift off into that sweet, sweet oblivion.

If a bot’s sole purpose is information retrieval (like our previous voicebot, DMV Metro), the interaction with the user can afford to be a little rigid, even clunky, as long as the bot ultimately provides the information that the user is looking for. Luna, however, was envisioned to coax the insomnia-plagued user into a relaxed, dreamy state in the dead of night. This problem can no longer be solved by only retrieving a piece of information. Here, every aspect of the user experience becomes that much more important. We can’t afford to introduce new frustrations or trouble the user with cumbersome dialogue.

This article discusses some of the design challenges we encountered while building Luna! Though these pertain specifically to sleep helper bots, they could certainly apply to many other voice user experiences as well.

Finding a Voice Actor

This season on: The Voice.

One of the most important aspects of a VUX is, without a doubt, the actual voice of the bot. Saturating the user’s entire experience, the voice is one of the determining factors in how the interaction will go and whether or not the user will be pleased in the end.

For a bot that’s designed to help a user drift off into dreamland, a stiff, robotic voice simply wouldn’t cut it. Ideally, the voice of Luna would be calm and soothing, smooth and mellow. As linguists, however, we couldn’t help but ask ourselves: whose calm, soothing voice are we idealizing? Luna’s voice, like all voices both real and artificial, would be undeniably influenced by sociolinguistic variation. Factors such as age, gender, region of origin, ethnicity, and more all have an impact on how a person speaks — as well as how they sound to others. For that reason, users will enter this experience with preconceived notions of how Luna should sound — what sounds calm and soothing to one person may not be so for another. We decided that a sleep helper would be most effective if it could be relatable and accessible to a wide audience of users, so we attempted to avoid polarizing decisions like strictly male versus female for Luna’s persona (inspired by the genderless AI project, Q). In the end, there was no way to design a voice devoid of social associations, but keeping these considerations in mind kept us accountable throughout the design process.

Ultimately, we realized that the many features we wanted to include while designing Luna’s voice would simply not be feasible without the additional financial support we would need to hire a professional voice actor. If we did have those resources, we would heavily invest in getting the voice of Luna just right — after all, this could be the factor that makes or breaks the user’s overall experience.

Lack of Resources for Sonic Design

(Not referring to that live action travesty)

The quality of Luna’s voice while executing various functions is integral to the voicebot’s effectiveness. For example, one of Luna’s functions is breathing exercises, during which a voice helps the user pace their breathing by counting seconds. Crucially, the voice must count in a way that sounds natural, not choppy and robotic (like the default option). The delivery is key, which makes our lack of resources for sonic design the current bottleneck. So far, we have been doing the best with what we have, approximating the voice we hope to eventually develop. Our main avenue has been exploring the possibilities of SSML.

SSML stands for Speech Synthesis Markup Language. In the context of our project, it is a tool that lets developers make basic changes to Alexa’s voice. Like other markup languages (e.g., HTML, XML), SSML uses tags to encapsulate different attributes that can be tuned. SSML syntax can be incorporated into JavaScript strings in an Alexa Skill, and thus can be integrated relatively easily. For example, a snippet of SSML code can be seen below:

<prosody rate="x-slow" pitch="low" volume="x-soft"> 1, 2, 3, 4, 5, 6, 7. </prosody>

In this example, the goal is to change certain aspects of Luna’s prosody when she counts from 1 to 7. The opening tag consists of the tag name — "prosody" — and the attributes — "rate", "pitch", and "volume". Right after comes the speech text itself, followed by the closing tag </prosody>, so that the segment is closed properly. Given these instructions, Luna will modify the default voice accordingly — in this case, speaking extra slowly (rate="x-slow"), at a lower pitch (pitch="low"), and at extra low volume (volume="x-soft").

Too much jargon? Here’s the gist. In our developer’s experience, SSML is a decent system for modeling speech sounds, but far from robust. The transitions between different annotated segments can be awkward, and it is difficult to find a voice that suits this specific project. We’re currently looking for other voice resources that could potentially be used out of the box and integrated into an Alexa Skill project.
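To make this concrete, here is a minimal sketch of how an SSML string like the one above might be assembled in an Alexa Skill’s JavaScript. This is an illustration, not our production code: the helper name and the one-second pause between counts are our own assumptions.

```javascript
// Hypothetical helper: build an SSML string for a paced breathing count.
// The <break> tag inserts a pause after each number so the counting lands
// roughly once per interval instead of in one rushed burst.
function buildBreathingSsml(count, pauseMs) {
  const numbers = [];
  for (let i = 1; i <= count; i++) {
    numbers.push(`${i} <break time="${pauseMs}ms"/>`);
  }
  // Wrap the whole count in the same prosody settings shown above:
  // extra slow, low pitch, extra soft volume.
  return (
    '<prosody rate="x-slow" pitch="low" volume="x-soft">' +
    numbers.join(' ') +
    '</prosody>'
  );
}

console.log(buildBreathingSsml(7, 1000));
```

The resulting string can then be passed to the skill’s speech output just like any other response text.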

Why Ending the Interaction is Actually the Hardest Part, not the Easiest

It’s So Hard to Say Goodbye

Full disclosure: we want the user to be snoring at the end. But we also don’t want to leave Luna at attention for the entire night. Other bots have it easy — they perform their function by providing information, then ask the user if they would like anything else before signing off — and that is the extent of the closing interaction. This is also normally the easiest part of the VUX to design. Our job, however, is slightly more difficult for various reasons. The user will have asked Luna to tell them a story, to lead them through breathing exercises, to play soothing sounds, or to tell them fun facts. Luna will oblige. But when should the bot pick up the conversational thread again? Is it after ten minutes? Twenty? After the story finishes? And how should the bot proceed in a way that won’t disrupt the progress that’s been made so far? If the user is almost asleep, their brain emitting those lovely theta waves, we don’t want the sudden interjection of Luna’s “Would you like me to tell you another story?” to jerk them awake. And if the user is already asleep, we certainly don’t want to disturb them. But Luna won’t know when to sign off.

In light of these challenges, we’ve elected to go the timer-shut-off path: after completing the user’s request, the bot will wait for additional input from the user. If it detects that the user has fallen asleep, revealed by a certain period of silence, the bot will shut itself off. The user can always wake Luna again if they still need some help falling asleep. This seems like the best solution for us thus far in the design process. In the future, we envision linking the voicebot to wearable tech that can detect sleep (such as an Apple Watch).
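As a rough illustration of the timer-shut-off idea, the decision boils down to comparing elapsed silence against a threshold. This is a sketch under assumed numbers — the ten-minute window is our own placeholder, and a real Alexa Skill would work with its session lifecycle rather than raw timestamps.

```javascript
// Sketch of the timer-shut-off decision described above.
// Assumption: ten minutes of uninterrupted silence means the user is asleep.
const SILENCE_LIMIT_MS = 10 * 60 * 1000;

// Returns true once the silence since the user's last input exceeds the
// limit, signaling that Luna should quietly end the session.
function shouldShutOff(lastUserInputMs, nowMs, limitMs = SILENCE_LIMIT_MS) {
  return nowMs - lastUserInputMs >= limitMs;
}
```

If the user stirs and speaks again, the last-input timestamp resets and the countdown starts over, so Luna only signs off after genuinely uninterrupted silence.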

Conclusion

This bot is a different beast, and the result will certainly be rewarding. Considering all of the factors that come into play when choosing a voice for this voicebot has opened our eyes to the many decisions that all conversation designers and VUX developers should be thinking about when giving their bots a unique voice. In the same vein, we’ve realized that we don’t have enough resources for sonic design in order to fully realize our vision for Luna, but we’re relying on SSML for the time being. One last key decision we had to make involved our experience’s ending. Interruption in a breathing exercise/story/soundscape is especially jarring, so we designed Luna to listen for a certain period of silence after completing the user’s request before shutting off. When all is said and done, we want both Luna and our user to be silent — save for those dreamy Zs.

We hope our article did not put you to sleep, but you can bet our voice bot will! Keep an eye out for more articles addressing other areas of our design process!
