Voice Technology is an Opportunity to Make Weird Stuff
Why it’s time to experiment with voice, and some technical tips on how to do it
I’ve recently been working with friends at the Google Creative Lab on something I’ve been interested in for a while: how people and computers talk to each other. It’s a theme I’ve played with in previous projects, but this time we explored it in a new dimension: voice.
Using a natural language API called Dialogflow (previously named API.AI), I wanted to see if we could make an open-ended guessing game, where the computer could satisfyingly answer any question you asked it.
I learned a ton of things about programming and designing voice interactions, which I’ll get into below, but first I want to talk a bit about why it’s an interesting time to experiment with voice.
We are in the equivalent of the 1996 web design era of voice technology
Talking out loud to computers has always felt more science fiction than real life. But speech recognition technology has come a long way, and developers are now making lots of useful things with voice devices. These days, you can speak out loud and have your lights turn on, or your favorite music played, or the news read to you.
That’s all nice and good, but there’s something clearly missing: the weird stuff. We should make things for voice technology that aren’t just practical. We should make things that are way more creative and bizarre. Things that are more provocative and expressive, or whimsical and delightful.
We’re in what I’m going to call The 1996 Web Design Era of voice technology. The web was created for something practical (sharing information between scientists), but it didn’t take very long for people to come up with strange and creative things to do with it.
Today, voice technology as a medium feels similar to the web at that time — things are sort of broken and no one really knows what to do yet. The rules haven’t totally been written, and it’s kind of a mess. But this isn’t a bad thing. Like with all new technologies that came before, the messiness means that there is space for creative programmers and artists to come in and make interesting things with voice interaction.
Who’s going to be the JODI of voice? What is the zombo.com of talking to computers? When will we get the voice medium equivalent of @horse_ebooks? What are the rules of the voice genre, and how are we going to make fun of them, or break them?
I don’t know what the answer is, but I hope more people will take these tools and experiment with making things weird.
Why make art for voice interactions?
Speaking out loud is an intuitive and expressive way for people to communicate. It’s also more natural and human. When you’re doing a Google search, you’re likely to type something terse and efficient like “best tacos nyc.” But if you were asking a friend, you might say something like “Where do you wanna go for Mexican food tonight?”
Designing a creative interaction based on the way people normally speak out loud is an opportunity to engage with them on a more natural, expressive level. You also don’t have to teach people who aren’t familiar with gadgets and new technology how to use it—they can just talk.
Voice technology has a lot of potential, particularly for programmers and artists who are interested in things like generative text, computer poetry and Twitter bots. Voice is a nice extension of the text medium for generative, interactive work.
Okay, so how do I make this stuff?
Mystery Animal is a guessing game you play with your voice. You can try it out on a Google Home, or on the web. I’ll write a bit about how we made it to show you how you can start tinkering with these new open source tools as well.
Mystery Animal is just one example of a not-practical thing you can make around voice interactions with freely available tools. So I’m now going to dive into some of the technical details of how we made it, with the hope that it can help others make even more creative and strange things with the same technology.
The technical nitty gritty of Mystery Animal
Tools and Infrastructure
How it works
- The player starts Mystery Animal. They start asking yes-or-no questions to try to guess what kind of animal it is.
- We use Dialogflow to figure out what they’re saying, and pass the relevant data into a webhook hosted on Firebase Cloud Functions.
- If we successfully figure out what they are asking, and we have the right answer in our data, we look up the answer in our animal data JSON.
- We then generate the proper response to their question, with some randomness provided by Tracery.
- If we can’t answer their question, we look up keywords in what they said in Knowledge Graph so we can at least acknowledge what they’re talking about when we respond.
Understanding human language
By far, the tool we used that feels the most like magic is Dialogflow, which is incredibly useful for figuring out what a person is saying. For this part, you don’t even need to code.
For example, someone might be trying to figure out the Mystery Animal’s size. They might say something like “Are you a large animal?” or maybe “Would you be considered a small creature?”
To handle that, I made an “intent” in Dialogflow that tries to capture all of the ways someone might ask about the Mystery Animal’s size.
The key word of what they say is saved as a
I wrote in a few examples of how people might ask the same question, and Dialogflow uses machine learning in the background to figure out similar phrases that are asking the same thing.
You can use “entities” to define synonyms for certain words:
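Dialogflow handles this matching for you in its console, so you don't need to write any of it yourself. Still, to make the idea concrete, here is a rough sketch in plain JavaScript of what a synonym mapping boils down to. The entity name, the word lists, and both helper functions are made up for illustration, and this is not Dialogflow's own format or API.

```javascript
// Hypothetical "size" entity: each canonical value lists its synonyms.
// The word lists are invented for this example.
const sizeEntity = {
  big: ["big", "large", "huge", "giant", "enormous"],
  small: ["small", "little", "tiny", "petite"],
};

// Build a reverse lookup from every synonym to its canonical value.
function buildSynonymIndex(entity) {
  const index = new Map();
  for (const [canonical, synonyms] of Object.entries(entity)) {
    for (const word of synonyms) {
      index.set(word.toLowerCase(), canonical);
    }
  }
  return index;
}

// Return the canonical value for the first known synonym in the utterance,
// or null when nothing matches.
function resolveEntity(utterance, index) {
  const words = utterance.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    if (index.has(word)) return index.get(word);
  }
  return null;
}

const sizeIndex = buildSynonymIndex(sizeEntity);
console.log(resolveEntity("Are you a large animal?", sizeIndex)); // big
console.log(resolveEntity("Would you be considered a small creature?", sizeIndex)); // small
```

The point is that "large", "huge" and "big" all collapse to one value, so the code that looks up the answer only has to deal with a handful of canonical words.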
Returning the right response
Dialogflow allows you to write static responses without touching code, but in order for us to look up information and give it back to the player, we have to use a webhook on the backend. Ours is hosted on Firebase; you can read instructions on how to set that up here.
There are a few different ways to save and pass data via Dialogflow — the way that we did it in Mystery Animal was using the Node SDK’s built-in app.data object.
When a person triggers the “friendlysizemass” intent by asking about the animal’s size, Dialogflow passes a few parameters we set to our webhook. In my code, I set the “find_info” Action to trigger a function called findInfo(), which, well, looks up the info.
So, if the player says “Are you big?”, the “friendlysizemass” intent is called, and in my code, app.data.guess would equal
It then looks up what the correct answer is by checking the size of the current animal. If it’s, say, a giraffe, then indeed the animal is big, and hey, we know the answer is “yes.”
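To give a feel for that lookup step, here is a minimal sketch. It is not the project's actual findInfo() function: the data shape, the attribute names, and the checkGuess helper are all assumptions made up for this example.

```javascript
// Hypothetical animal data. The real project keeps its data in a JSON file
// whose exact shape isn't shown in this post.
const animals = {
  giraffe: { size: "big" },
  mouse: { size: "small" },
};

// Compare the guessed value with the stored attribute for the current animal.
// Returns true or false, or null when there is no data to answer with.
function checkGuess(animalName, attribute, guess) {
  const animal = animals[animalName];
  if (!animal || !(attribute in animal)) return null;
  return animal[attribute] === guess;
}

console.log(checkGuess("giraffe", "size", "big")); // true
console.log(checkGuess("giraffe", "size", "small")); // false
console.log(checkGuess("giraffe", "diet", "plants")); // null
```

Returning null separately from false is a deliberate choice here: "no" and "I don't know" should lead to different responses.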
But it would be pretty boring if the Mystery Animal only ever said “yes” or “no.” We found that the experience is a lot more engaging when the computer’s responses acknowledge the question the person asked and vary in sentence structure from time to time.
So we wrote out a few templates for how it would respond to a question given the right scenario. Tracery, a wonderful generative text library, made it simple to add variety. These are the response templates for when a person asks about the animal’s size:
"#no#, I am not #guess#.",
"#no#, I wouldn't call myself #guess#."
"#yes#, I am #guess#.",
"#yes#, I tend to lean toward the #guess# side."
So if the Mystery Animal was a giraffe and you asked “Are you large?”, it would use the response_true key, and it might say, “Indeed, I am big.” or maybe “Heck yeah, I tend to lean toward the big side.”
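If you're curious how templates like these turn into sentences, here is a small stand-in that mimics the #symbol# replacement idea. It is not Tracery itself and doesn't use its API. The response_false key, the "yes" and "no" word lists, and the helper names are assumptions for this sketch; only response_true and the template strings come from the project.

```javascript
// Grammar in the spirit of the templates above. Keys other than
// response_true are assumed names for illustration.
const grammar = {
  yes: ["Yes", "Indeed", "Heck yeah"],
  no: ["No", "Nope"],
  response_true: [
    "#yes#, I am #guess#.",
    "#yes#, I tend to lean toward the #guess# side.",
  ],
  response_false: [
    "#no#, I am not #guess#.",
    "#no#, I wouldn't call myself #guess#.",
  ],
};

// Pick one option; `random` is injectable so results can be made repeatable.
function pick(options, random) {
  return options[Math.floor(random() * options.length)];
}

// Replace every #symbol# with a randomly chosen expansion, recursively.
// Unknown symbols are left untouched.
function expand(rule, rules, random = Math.random) {
  return rule.replace(/#(\w+)#/g, (match, symbol) => {
    const options = rules[symbol];
    return options ? expand(pick(options, random), rules, random) : match;
  });
}

// Build a reply for a yes/no answer, filling in the player's guess.
function respond(isTrue, guess, random = Math.random) {
  const rules = { ...grammar, guess: [guess] };
  const start = isTrue ? "#response_true#" : "#response_false#";
  return expand(start, rules, random);
}

console.log(respond(true, "big"));
console.log(respond(false, "small"));
```

Because each symbol is picked independently, two templates and three ways of saying "yes" already give six different sentences for the same answer.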
Having a fallback
Of course, this isn’t magic, and sometimes the player will say something that Dialogflow can’t figure out. Having a clever way for the computer to say “I don’t know” goes a long way in making it a better experience for the player.
The way we decided to tackle this in Mystery Animal was to use Google’s Knowledge Graph Search API, which can give you some information about a wide range of things.
It’s pretty simple — if the player says something that we don’t have an intent for in Dialogflow, we look up a description of what they’re talking about and generate a response that gives a nod to what the subject is.
So if you ask Mystery Animal, “Are you a fan of Beyoncé?”, it will respond with something like “I believe you asked, ‘Are you a fan of Beyoncé’? You didn’t actually expect an animal to know about that American singer-songwriter, right?”
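Here is a rough sketch of how a fallback like that could be assembled. It assumes the Knowledge Graph Search API's entities:search endpoint and a response whose first itemListElement entry carries a result.description field; check the API reference before relying on either. The helper names, the placeholder API key, and the wording of the reply when nothing is found are all made up, and the network request itself is left out so the example stays self-contained.

```javascript
// Build a search URL for a keyword. Assumes the entities:search endpoint;
// apiKey is a placeholder you would supply yourself.
function buildSearchUrl(query, apiKey) {
  const params = new URLSearchParams({ query, key: apiKey, limit: "1" });
  return `https://kgsearch.googleapis.com/v1/entities:search?${params}`;
}

// Pull the short description out of a parsed search response,
// or return null if there isn't one.
function firstDescription(response) {
  const items = (response && response.itemListElement) || [];
  const result = items.length > 0 ? items[0].result : null;
  return result && result.description ? result.description : null;
}

// Repeat what we heard, then nod at the subject if we found a description.
function fallbackReply(heard, response) {
  const description = firstDescription(response);
  if (!description) {
    return `I believe you asked, '${heard}' I'm not sure how to answer that.`;
  }
  return (
    `I believe you asked, '${heard}' You didn't actually expect ` +
    `an animal to know about that ${description}, right?`
  );
}

// A hand-written response in the assumed shape, standing in for a real call.
const sampleResponse = {
  itemListElement: [
    { result: { name: "Beyoncé", description: "American singer-songwriter" } },
  ],
};

console.log(buildSearchUrl("Beyoncé", "YOUR_API_KEY"));
console.log(fallbackReply("Are you a fan of Beyoncé?", sampleResponse));
```

Keeping the URL building, the response parsing and the sentence writing in separate functions makes it easy to swap the last part for generated templates later.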
Repeat what they said
We found it extremely helpful to repeat what the person said back to them in some form when the computer responds. That’s why all of Mystery Animal’s responses are like this:
Player: Are you egg-laying?
Mystery Animal: Yep, I do lay eggs.
Player: Are you domesticated?
Mystery Animal: No, I am not tame, so please be careful around me!
It’s just much more satisfying to get a response when you know that the computer understood what you said. And if the computer hilariously mishears you, you’re more likely to forgive it if you know what it thinks you said.
Mystery Animal is open source so that people can learn how to do something similar. Feel free to take it and break it, and do way more outlandish things with voice. Now is the time to make weird stuff, so let’s make a creative community around voice technology together.
Thanks to Nick Jonas, Kelly Ann Lum, Prit Patel, Adam Katz, Julian Feller-Cohen, Patrick Irwin, Amit Pitaru, Alex Chen, and many others at Google Creative Lab who helped build Mystery Animal.