Thinking for Voice: Getting Started for Web Developers

Allen Firstenberg
Google Developer Experts
8 min read · Sep 27, 2018

“Thinking for Voice” is about more than learning how to program for Voice Agents like the Google Assistant. It is about making awesome Voice Agents by understanding how they work and relating them to things you already know. This article is meant to help developers currently working on the web understand how Actions for the Google Assistant work and how to approach writing your own.

When I first got a Google Home, I was curious what it was going to be like developing for it. I had been developing for the Web since pretty near the beginning, and I had watched how things had changed as mobile development became popular and more and more code was pushed to Javascript on web pages. What was going to be the path between what I knew and what I would need to know to build for the Assistant? Would developing Actions be like developing apps? Would they introduce something completely new and different?

It turns out they didn’t. Developing with the Actions on Google platform is remarkably like developing server-based webapps. While there are many differences between the two platforms, there are also some delightful similarities, so most web developers will find themselves in familiar surroundings.

Turning the “look and feel” into a Personality

The biggest challenge when thinking about voice isn’t a technical one — it is understanding how the design needs to change. Although Google provides a lot of suggestions for building good voice user interfaces (VUIs), there are a few things that we can already adapt from our time building good GUIs for the web.

On the surface, the two seem totally different. But there is a fundamental similarity: we want our website to have a distinct look and feel, so people will recognize it when they return, and in the same way we want our voice agent to have a distinct persona that people will associate with that Action. On websites, we use color, layout, and graphics to convey a feeling of “who we are”, but our palette is very different when it comes to voice. We rely more on the words and phrases we use than on how it “looks”, since how we say things is what makes our agent “sound different”.

In the same way that a website can look “professional” or “cartoonish”, the phrases we use, and how we deliver them, can make our agent sound “professional” or “silly”. Instead of colors, we’ll use vocal tones, sound effects, background music, and even the specific choice of phrases to help illustrate our points or call out important elements. Persona is vitally important when designing for voice, so make sure you clearly define it early in the design process, and then start building around it.

When we talk about designing for voice agents, we’ll often talk about scripting a conversation between the user and the Action. This is very similar to how our users interact with web pages. With both, we expect the user to initiate a conversation with us, and we reply. On the web, we reply with a page; with voice, we reply with a message. This repeats until the user has done what they want. Sometimes our conversations may go astray, and we need to guide the user back to what they want to do, but it is important to understand that our goal is the same with both: to help the user do what they want.

We often talk about error pages on the web, but in most cases they’re not really error pages; they’re just pages where we couldn’t quite do what the user wanted, so we need to get them back on track. Similarly, with voice, we don’t really have error responses in most cases; we just need to guide the user back to a happy path. We’ll talk about some other similarities in the conversation model shortly.

There are, however, two major differences between developing a GUI and a VUI.

The first is that GUIs can be information dense: they can show a lot of information on a screen and can collect a lot of information in one turn of the conversation. VUIs can’t be as dense. In general, we will ask the user for just one piece of information at a time, rather than a whole form, but we need to be prepared for them to give us more information than we asked for. For example, if we ask for an address, they may give us a full street address, or they may just give us the city. We need to handle both cases.
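For instance, a fulfillment function might check which pieces of the address it actually received and decide what to ask for next. Here is a minimal sketch in Node.js; the parameter names are hypothetical, not tied to any particular NLP tool:

```javascript
// Illustrative only: choose the next prompt based on which address parts
// the user actually provided. Parameter names here are hypothetical.
function nextAddressPrompt(params) {
  if (params.street && params.city) {
    return `Looking up ${params.street} in ${params.city}.`;
  }
  if (params.city) {
    return `Got it, ${params.city}. What street address should I use?`;
  }
  return 'What address are you interested in? A city is fine to start with.';
}

console.log(nextAddressPrompt({ city: 'Albany' }));                            // asks for the street
console.log(nextAddressPrompt({ city: 'Albany', street: '123 Main Street' })); // has everything it needs
```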

The second big difference is in what users expect and what they will find jarring. Changes in a GUI are disorienting to users. If they visit a website frequently, or even if they use different websites, they expect input elements such as menus and forms to be in roughly the same place every time. If something changes, they will find it difficult to use the site. But when using voice to speak to someone, they expect variation in the responses they may get. Hearing the same prompts every time feels artificial and people will tend to tune them out. When designing your VUI, you should think about different ways you can say the same thing to the user, and how to vary between them appropriately.
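One simple way to handle that is to keep several phrasings of each prompt and pick one at random on every turn. A minimal sketch, with made-up prompts:

```javascript
// Keep several ways of asking the same question and choose one at random,
// so returning users don't hear an identical prompt every time.
const cityPrompts = [
  'Which city are you interested in?',
  'What city should I look at?',
  "Tell me a city and I'll take a look.",
];

function pickPrompt(prompts) {
  return prompts[Math.floor(Math.random() * prompts.length)];
}

console.log(pickPrompt(cityPrompts));
```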

The structure of a conversation

A conversation with our Action can start in several ways, but one of the most common currently is for the user to ask for it by name with a request such as “Hey Google, talk to Shakespearean Insult”. No installation is required. The Google Assistant knows what “Shakespearean Insult” is, since Action names are globally unique.

Web developers will immediately wonder “Is that like a domain name?” Indeed, it is! So much so that some popular domain names aren’t allowed as Action names until you link your Action to the website using Google’s Search Console.

The parallels between the commands we issue to an Action and a URL don’t end there. During the conversation, each thing the user says is typically processed by a Natural Language Processing (NLP) tool such as Dialogflow. Dialogflow takes the phrase the user says, identifies the Intent that best matches the phrase, and extracts relevant parameters from it. It then sends this information, along with some additional useful info, to a webhook for fulfillment processing.
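Roughly speaking, the NLP step turns a free-form phrase into a structured result. The shape below is a simplification rather than Dialogflow’s exact format, and the intent and parameter names are hypothetical:

```javascript
// User says: "insult me like I'm an innkeeper"
// The NLP tool might turn that phrase into something like:
const nlpResult = {
  intent: 'request-insult',                 // the Intent that best matched the phrase
  parameters: { profession: 'innkeeper' },  // values extracted from the phrase
};
```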

That webhook, as the name suggests, is just a webapp listening at a well-known public URL behind an HTTPS server. That part should be easy for web developers, of course, but the similarity goes deeper. Typically, our webapps have paths or routes that we map to functions to be executed. Often those routes have parameters or, if it’s a POST operation, the parameters are part of a form. In the same way, the NLP system maps what the user said to a single named intent with parameters, which we handle in a function on our server.
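Here is a minimal sketch of that idea in Node.js with Express, assuming the Dialogflow v2 webhook request format; the intent and parameter names are hypothetical:

```javascript
const express = require('express');
const app = express();
app.use(express.json());

// Map intent names to handler functions, much like a route table maps paths.
const handlers = {
  'welcome': () => 'Welcome! Which city are you interested in?',
  'set-city': (params) => `Got it, ${params.city}. What would you like to know about it?`,
};

app.post('/fulfillment', (req, res) => {
  // Dialogflow tells us which intent matched and what parameters it extracted.
  const intentName = req.body.queryResult.intent.displayName;
  const params = req.body.queryResult.parameters;

  const handler = handlers[intentName];
  const speech = handler ? handler(params) : "Sorry, I'm not sure how to help with that.";

  // Reply with JSON that tells the Assistant what to say.
  res.json({ fulfillmentText: speech });
});

app.listen(process.env.PORT || 3000);
```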

What gets sent to our server over HTTPS? Why, a JSON representation of all this information. What do we need to reply with? JSON again. As web developers, we should be fairly familiar with the different ways we can work with JSON. And just like building webapps, webhooks can be written in any language and run on any platform. Our only restrictions are that the webhook must be publicly accessible, have a valid (not self-signed) SSL certificate, and accept and return JSON as defined by Actions on Google and Dialogflow.
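To give a sense of the shape, here is a heavily trimmed sketch of a Dialogflow v2 webhook request, with most fields omitted and the same hypothetical intent and parameter names as above:

```json
{
  "queryResult": {
    "queryText": "tell me about Albany",
    "intent": { "displayName": "set-city" },
    "parameters": { "city": "Albany" }
  }
}
```

And a correspondingly trimmed response, supplying the text for the Assistant to speak back:

```json
{
  "fulfillmentText": "Got it, Albany. What would you like to know about it?"
}
```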

Voice even has its own equivalent of HTML and CSS: SSML, the Speech Synthesis Markup Language, which describes how the response is spoken, including how to pronounce numbers, acronyms, and other words. It even includes ways to play short, layered sound effects and longer-form audio through an audio player.
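A short sketch of what SSML looks like; the audio URL is just a placeholder:

```xml
<speak>
  Your confirmation code is <say-as interpret-as="characters">A1B2</say-as>.
  <break time="500ms"/>
  <audio src="https://example.com/sounds/chime.ogg">a confirmation chime</audio>
  See you next time!
</speak>
```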

Browsing by voice

You may be wondering what the voice equivalent of the browser is, since you probably don’t remember installing Voice Chrome or anything along those lines. The Assistant itself acts as the browser for your users when they talk to your Action, managing the interaction during a session, allowing the user to create bookmark-like shortcuts, and even providing for other account-related activities. For example, your Action can store a limited amount of user-specific data that only that user and your Action can access, either for a session or until you or the user clears it. While this isn’t a duplicate of how cookies work, there are strong parallels, and it can be used similarly.
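With the actions-on-google Node.js client library, for example, per-conversation data and longer-lived user storage are exposed as plain objects. A minimal sketch, with a hypothetical intent and parameter name:

```javascript
const { dialogflow } = require('actions-on-google');
const app = dialogflow();

app.intent('remember-city', (conv, { city }) => {
  conv.data.lastCity = city;          // lasts for this conversation only, like session state
  conv.user.storage.homeCity = city;  // persists across conversations until cleared
  conv.ask(`Okay, I'll remember that you're in ${city}.`);
});
```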

Our users can even discover our Action much as they discover sites on the web. As we discussed above, if they know the name of the Action, they can access it directly, just like a URL. But what if they don’t? That’s where Google steps in and acts like… well… like Google, providing suggestions for which Actions might be able to handle what the user wants through Implicit Invocation. You can influence this, somewhat, by building an Action that can handle specific Built-in Intents, and you can even see how your Action is performing using analytics that are specifically tuned to examine how Actions behave.

The parallels between the Assistant and your browser aren’t perfect, of course. Very few things can run inside the Assistant itself, so don’t plan on running any Javascript, or much of anything else, client-side. On the upside, however, a user’s Assistant configuration works across whatever platforms they may be on, so they can move fairly seamlessly between their mobile device and a smart speaker with very little work on your part.

The Conclusion… for now

Hopefully you’ve seen how many parallels there are between developing for the web and developing for the emerging voice field with Actions on Google. Many of the skills you’ve already mastered will help you as you develop an Action. In the future, I’ll discuss other components of Actions on Google and how you can use them when you “think” for voice.

I encourage you to visit https://developers.google.com/actions/ to learn more and begin to explore how you can use voice to interact with your users. I look forward to hearing what you build!
