Building a Google Home bot: Let’s code!
assistant.ask(‘Show me your code’);
This article covers the technical part of a prototype built by coworkers from Stink Studios at Google New York during a three-day workshop/hackathon.
Disclaimer: This post is about beginning to explore the Actions on Google platform. It may contain errors or mistakes. Please feel free to comment if you see something that seems wrong, or if you want to be part of the conversation.
There are a few things you need to know before starting to build a bot for Google Home, starting with a few definitions:
- Google Assistant: The service/AI used by Google Home, which is also available on Android phones. Using voice or text commands, you can interact with Assistant to get information about basically anything. It’s the equivalent of Siri on iOS.
- Google Home: The actual device.
- Actions on Google: A set of tools and resources to build an Action for Google Assistant.
We are building a Conversation Action (the term Agent is also used), which would be accessible on the Google Assistant. At the end of April, Google released the Google Assistant SDK, allowing developers to build physical prototypes that interact with the Google Assistant, including your Actions.
At Google I/O 2017, Google announced that Actions are now available on any device that uses Google Assistant. This means that any Action that gets built, once in production, will be available to millions of users.
However, Actions can’t be directly integrated into Google Assistant. Users need an extra step to reach an Action: they have to call it by its invocation name. The invocation name determines the first interaction between users and an Action, and also sets the personality of the bot.
For example, if you’re building an Action to learn if the F train is running late again, you can’t just say, “Okay Google, give me the status of the F train”. Rather, you’d need to say, “Okay Google, I want to talk to the Train Assistant”. Google Assistant will then start the Train Assistant service, at which time you can ask for the status of the F train.
Once your Action is approved by Google, it will also be available to anyone who uses the Google Assistant. Right now, there is no way to create an Action only available through your personal Google Assistant, but it’s a popular ask from the developer community and Google is working on it.
Google recommends using api.ai, a website that allows users to create conversational experiences across platforms. There are three main steps to building a Conversation Action: designing the conversation, connecting it using webhooks to fulfill user requests, and launching it. Api.ai is just one toolkit available on the Internet, but I found it powerful and easy to work with.
Designing the conversation
Translating the conversation to api.ai
Once we feel confident enough about our conversation design, it’s time to start coding!
…Well, not exactly. First, we have to set up our conversation within api.ai.
Again, here are a few definitions to know:
- Agent: An Action in api.ai containing our intents, entities, and configurations. You can easily import and export agents.
- Intent: In api.ai, an intent is defined by a name (the name of an action intended by the user, like say_hello), a set of sentences that could trigger the associated action, an optional set of responses, and an optional context. Providing a solid set of trigger sentences allows api.ai machine learning to better understand user intent. The more you give, the better it becomes.
- Action: The name of the action that will be called if you use webhook fulfillments.
- Entity: A definition of a “type” for a word or a group of words. For instance, 12 is a number, but in a contextual sentence, it could be anything from degrees Celsius to an age. Api.ai has a list of pre-defined entities, but you can create your own. For Mood Mixer, we had to create 2 new entities: Mood and Genre. Api.ai has a genre entity already, but we needed to define our own to match Spotify’s.
- Fulfillment: The configuration of the webhook, if you’re using one.
When you describe sentences in your intent, you can highlight words and define them as a specific entity. These can later be retrieved as “parameters” of the actions in your webhook calls, which we’ll see later.
Setting up the whole conversation takes some time and you’ll probably have to go back and forth between the actual code and the Agent to make sure everything works correctly.
Before diving into the code of your webhook, you can test your intents in the Google Web Simulator. For the development phase, first you need to create a project in the Google Developer Console and provide a Google Project ID. Setting this up is simple. All you need to do is click on “create new project” or link it to an existing one and you’re all set!
Next, go to the Integrations section. Enable Actions on Google, give it an invocation name, and then click Authorize.
Then, you’ll be able to start talking to your bot in the Google Web Simulator. Important note: Preview is only available for 30 minutes, so you’ll have to re-click on the authorize button after that time expires.
What’s really cool is that, during these 30 minutes, your Action is available in your Google Assistant and in your Google Home. You can use the Web Simulator of course, but there’s nothing better than directly interacting with your Google Home in the context of a real conversation.
Is it time to code now?
Okay? Okay. Let’s do this.
In this section, we’ll be going over some basic concepts, but there’s more to cover, such as deep linking, direct actions, and login. You’ll find all of this information on the Actions on Google website, or, let’s be honest, on Stack Overflow.
I also find it useful to run Node JS 6.3.0 or later, to be able to use the --inspect flag and get a clean console.
Your server has to be accessible to your online Agent and use the HTTPS protocol. Use ngrok to easily expose your local server and put the generated URL in the Fulfillment configuration of your api.ai Agent.
First, code a simple answer for your first intent, usually called Default Welcome Intent in your agent. By default, its action is named input.welcome. The answer is what your bot will say to welcome the user to the experience after they invoke your Action.
It seems simple, but because this is the very first moment of the conversation, the answer could be very different depending on the context. Is it the first time a user interacts with your Action? If the user is logged in, should the welcome answer address them by name? It’s also possible to welcome the user in different ways, depending on other factors such as the time or the weather.
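To make this concrete, here is a minimal sketch of a context-dependent welcome. The `pickWelcome` helper and its `isFirstVisit`, `userName`, and `hour` inputs are hypothetical names for this illustration; they are not part of api.ai or the Actions on Google client, and how you obtain them depends on your own setup.

```javascript
// Hypothetical helper: picks a welcome sentence from the conversation context.
// isFirstVisit, userName and hour are assumed inputs supplied by your own code.
function pickWelcome({ isFirstVisit, userName, hour }) {
  const partOfDay = hour < 12 ? 'morning' : hour < 18 ? 'afternoon' : 'evening';
  // Only mention the name when we actually know it.
  const name = userName ? `, ${userName}` : '';
  if (isFirstVisit) {
    return `Good ${partOfDay}${name}! Welcome to Mood Mixer. What mood are you in?`;
  }
  return `Good ${partOfDay}${name}, welcome back. What mood are you in?`;
}
```

A returning, anonymous user at 8 pm would hear a shorter "welcome back" sentence, while a first-time user gets the introduction.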
Note: I checked “Use webhook” in the Fulfillment part. I also provided a sentence in the Text Response table that’s a fallback your agent can use if the webhook is not available.
First steps: Say Welcome
This is pretty straightforward. Usually, I try to structure my application the best I can. Here’s a simple sample in which I isolated each action in its own file instead of stacking everything up in one giant file.
There’s only one entry point, app.post('/', fn), where everything is handled. I split up the logic to avoid ending up with a giant file, separating the different intents I set up in api.ai and the possible answers.
When you create a new Assistant instance, you pass the current request and response objects, give it a simple Map() of your Intents/Callbacks, and let it handle the output. It will automatically call the correct callback if the user intent corresponds to one of the intents you set up in api.ai, assuming Google Assistant catches what you said.
In this case, it will call the welcome function, passing the assistant instance as a parameter. There are two methods you can use to give back an answer:
assistant.ask(answer): It’s not necessarily something to ask, but rather something that indicates your bot is expecting another interaction from the user.
assistant.tell(answer): In this case, that would be the last thing your bot will say. It closes the current conversation.
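The dispatch mechanism and the ask/tell difference can be sketched without any dependency. Everything here is a stand-in written for this illustration: `FakeAssistant` is not the Actions on Google client, `input.goodbye` is a made-up action name, and the `expectUserResponse` flag is only my way of showing whether the conversation stays open.

```javascript
// Minimal stand-in for the assistant object. It only records what would be
// sent back. This is NOT the actions-on-google client, just an illustration.
class FakeAssistant {
  constructor(action) {
    this.action = action; // action name resolved by api.ai, e.g. 'input.welcome'
    this.response = null;
  }

  // ask(): answer and keep the conversation open for the user's next sentence.
  ask(speech) {
    this.response = { speech, expectUserResponse: true };
  }

  // tell(): answer and close the conversation.
  tell(speech) {
    this.response = { speech, expectUserResponse: false };
  }

  // Look up the callback registered for the current action and run it.
  handleRequest(actionMap) {
    const handler = actionMap.get(this.action);
    if (!handler) {
      this.tell('Sorry, I cannot help with that yet.');
      return;
    }
    handler(this);
  }
}

const welcome = (assistant) => assistant.ask('Welcome to Mood Mixer! What mood are you in?');
const goodbye = (assistant) => assistant.tell('Enjoy your playlist. Bye!');

// One entry per action name configured in the agent.
const actionMap = new Map([
  ['input.welcome', welcome],
  ['input.goodbye', goodbye], // hypothetical action name
]);

// Simulates one webhook call for a given action and returns the recorded answer.
function respondTo(action) {
  const assistant = new FakeAssistant(action);
  assistant.handleRequest(actionMap);
  return assistant.response;
}
```

Keeping each callback as a small function that receives the assistant makes it easy to move them into separate files later.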
Note: there are two ways to provide an answer: plain text, or SSML (Speech Synthesis Markup Language). According to Google, “By using SSML, you can make your agent’s responses seem more life-like.” You can play an audio file, add breaks, or have the bot correctly spell out S-S-M-L instead of saying “smll”. For our prototype it was very useful, as we wanted to play a short preview of the songs you are getting for your playlist.
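As a rough sketch of what such answers can look like, the helpers below build SSML strings with the standard `<break>`, `<say-as>`, and `<audio>` elements. The function names and the fallback sentence are assumptions made for this example; the resulting string is what you would hand to ask() or tell() instead of plain text.

```javascript
// Escape characters that are special in XML so track names cannot break the markup.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: announce a track, pause briefly, then play its preview.
// The text inside <audio> is the fallback used when the file cannot be played.
function buildPreviewSsml(trackName, previewUrl) {
  return [
    '<speak>',
    `Here is ${escapeXml(trackName)}.`,
    '<break time="500ms"/>',
    `<audio src="${escapeXml(previewUrl)}">Preview unavailable.</audio>`,
    '</speak>',
  ].join('');
}

// Hypothetical helper: have an acronym read letter by letter.
function spellOut(acronym) {
  return `<speak><say-as interpret-as="characters">${escapeXml(acronym)}</say-as></speak>`;
}
```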
After completing all these steps, we now had a custom answer provided by our webhook. Let’s dig a little further and see how we handle the main request: getting a recommended playlist based on your mood, artists and/or genres you like.
The core of our Agent is creating a playlist based on your tastes. Conveniently, the Spotify API has the perfect endpoint: Get recommendations based on seeds. For the needs of the prototype, I used my own Spotify account and hardcoded a valid token and refresh token, but I plan to authenticate the user and store the personal token and refresh token for further use.
The Spotify API is really powerful. This endpoint returns a list of recommended songs based on seed genres, artists, and songs (up to 5 seeds in total across the three), plus a list of tunable parameters (acousticness, danceability, key, and loudness, among others). We found a pretty good Spotify Web API Node JS library to use.
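To give an idea of what a request looks like without the library, here is a small sketch that only builds the URL for the recommendations endpoint. The `buildRecommendationsUrl` helper is hypothetical; `seed_genres`, `seed_artists`, `seed_tracks`, and `limit` are the query parameter names as I understand them from the Spotify Web API, and the real call also needs an OAuth token in the Authorization header, which is not shown.

```javascript
const RECOMMENDATIONS_URL = 'https://api.spotify.com/v1/recommendations';

// Hypothetical helper: turns seeds and tunable parameters into a request URL.
// Seeds are sent as comma-separated lists; tunables are passed through as-is,
// e.g. { target_danceability: 0.8 }.
function buildRecommendationsUrl({ genres = [], artistIds = [], trackIds = [], tunables = {}, limit = 20 }) {
  const params = new URLSearchParams({ limit: String(limit) });
  if (genres.length) params.set('seed_genres', genres.join(','));
  if (artistIds.length) params.set('seed_artists', artistIds.join(','));
  if (trackIds.length) params.set('seed_tracks', trackIds.join(','));
  for (const [name, value] of Object.entries(tunables)) {
    params.set(name, String(value));
  }
  return `${RECOMMENDATIONS_URL}?${params.toString()}`;
}
```

In the prototype the Node JS library takes care of this step, so a helper like this is mostly useful for understanding what goes over the wire.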
So, how do the moods fit in? Well, they are artificial. We defined a list of moods, and for each mood, a set of parameters to use for the API call.
These parameters are totally random right now, but could be adjusted if we wanted to go live.
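A mood table along these lines could look like the sketch below. The mood names other than "calm", the parameter choices, and every number are placeholders invented for this example, not the values from our prototype.

```javascript
// Hypothetical mood table. The numbers are placeholders, not tuned values.
const MOODS = {
  calm: { target_energy: 0.2, target_acousticness: 0.8 },
  happy: { target_valence: 0.9, target_danceability: 0.7 },
  energetic: { target_energy: 0.9, target_danceability: 0.8 },
};

// Merge the parameters of every requested mood into one object.
// Unknown moods are ignored; when two moods set the same parameter,
// the mood listed last wins.
function tunablesForMoods(moods) {
  return moods
    .filter((mood) => Object.prototype.hasOwnProperty.call(MOODS, mood))
    .reduce((merged, mood) => Object.assign(merged, MOODS[mood]), {});
}
```

Letting the last mood win is an arbitrary choice; averaging conflicting values would be another reasonable option.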
Back to our Action. After the welcome intent, our bot asks the user, “What mood are you in?”
This is one of the most important parts of our concept. The user can make a request to Mood Mixer by describing their mood as well as the artists and genres they like.
Here’s what it looks like in api.ai:
A couple of things:
- We used a set of different sentences to express this intent with three entities: genre (custom), mood (custom) and sys.music-artist (built in). While we are highlighting the different entities in the sentences, we are also giving each entity a parameter name so we can access these entities as parameters later in the Node JS code.
- We named the action
And this is what it looks like on the Node JS side:
Yes I know, it seems like quite a lot, but it’s not really. Let’s break it down step by step:
Once the requestPlaylist callback has been called, you can access assistant.data. It allows you to find the parameters you are waiting for, in our case, the genres arrays. Actually, you are getting 2 versions of them:
- The (main) data: We defined mood and genre entities. For each entry, we defined a value and possible synonyms. For instance, “calm” is an entry we want to use to map the “calm” mood defined in mood.json, but you can say either “chill”, “calm” or “calming”. In our case, we are expecting “calm” as the main data, not any of the synonyms.
- The raw data: Exactly what the user said. This is useful if we want to give feedback to the user using their own words, even though we are using the main data for the main logic.
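The difference between the two versions can be illustrated with a tiny synonym lookup. This is only a model of the behaviour described above, not how api.ai implements it; `MOOD_ENTITY`, `resolveMood`, and `confirmMood` are names made up for the example, and the "happy" entry is invented.

```javascript
// Hypothetical entity definition: reference value -> accepted synonyms.
const MOOD_ENTITY = {
  calm: ['calm', 'chill', 'calming'],
  happy: ['happy', 'joyful', 'cheerful'],
};

// Returns both versions of a spoken word: the reference value used by the
// logic (main data) and the word exactly as the user said it (raw data).
function resolveMood(spoken) {
  const raw = spoken.trim();
  const needle = raw.toLowerCase();
  for (const [value, synonyms] of Object.entries(MOOD_ENTITY)) {
    if (synonyms.includes(needle)) {
      return { value, raw };
    }
  }
  return null; // not a known mood
}

// The answer echoes the user's own word, while the logic would use mood.value.
function confirmMood(spoken) {
  const mood = resolveMood(spoken);
  return mood ? `Okay, something ${mood.raw} it is.` : 'Sorry, I did not catch your mood.';
}
```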
Another interesting aspect of assistant.data is the need to figure out how to make data persistent across the requests. We haven’t tried cookie-parser yet. I’m pretty sure it would work and would be a more elegant and performant solution, but every request had a different header signature and we needed a way to keep track of data during the conversation.
In any case, you can use assistant.data as an arbitrary data payload to maintain states. In my case, I’m using it for:
- Keeping track of where the user is in the conversation (I should use contexts for this, though)
- Persisting the current data of the playlist. I used JSON.stringify() to carry them, then JSON.parse() to decode the data. This allowed me to re-create an instance of the Spotify class over and over again without having to carry the instance by itself.
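The serialize-then-rebuild pattern can be sketched as follows. `PlaylistState` is a hypothetical stand-in for our Spotify class, the plain `data` object stands in for assistant.data, and the `step` value is an invented label.

```javascript
// Hypothetical stand-in for the class holding the playlist being built.
class PlaylistState {
  constructor({ moods = [], genres = [], trackIds = [] } = {}) {
    this.moods = moods;
    this.genres = genres;
    this.trackIds = trackIds;
  }

  addTrack(id) {
    this.trackIds.push(id);
    return this;
  }

  // Only plain data is serialized; methods are restored by the constructor.
  serialize() {
    return JSON.stringify({ moods: this.moods, genres: this.genres, trackIds: this.trackIds });
  }

  static deserialize(json) {
    return new PlaylistState(json ? JSON.parse(json) : {});
  }
}

// Stand-in for assistant.data, which survives from one request to the next.
const data = {};

// Request 1: build the playlist and store it as a string.
const first = new PlaylistState({ moods: ['calm'], genres: ['jazz'] }).addTrack('track-1');
data.playlist = first.serialize();
data.step = 'playlist_created'; // where the user is in the conversation

// Request 2: re-create a working instance from the stored string.
const restored = PlaylistState.deserialize(data.playlist).addTrack('track-2');
```

Because only a string travels between requests, the instance itself never needs to be kept in memory on the server.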
The rest of the code is pretty straightforward. After getting and formatting the data, running some checks, and redirecting to other action callbacks if needed, we build an answer based on what the user said and the result from the Spotify API.
Before calling assistant.ask(), we save the Spotify data and reset the parameters.
SSML to the rescue
Because we are building a playlist maker, one of the main questions is whether it should play a whole song or only a short preview. Playing a full song would be really long and annoying for the user.
This is where the SSML format comes in handy. Using <audio>, we can play an audio file from any source.
We decided to play 5 seconds, but the Spotify API gives (when available) a 30-second preview for each song. Unfortunately, the SSML format doesn’t allow us to play only 5 seconds out of 30, so we decided to save the preview on our server and use ffmpeg to re-encode the files to make them last only 5 seconds.
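The trimming step might look like the sketch below, assuming ffmpeg is installed and on the PATH. The helper names and file paths are hypothetical; `-t` limits the output duration and `-y` overwrites an existing output file.

```javascript
const { spawn } = require('child_process');

// Builds the ffmpeg arguments that keep only the first `seconds` of a file.
function trimArgs(inputPath, outputPath, seconds = 5) {
  return ['-y', '-i', inputPath, '-t', String(seconds), outputPath];
}

// Hypothetical helper: runs ffmpeg and resolves with the output path.
// Rejects if ffmpeg cannot be started or exits with a non-zero code.
function trimPreview(inputPath, outputPath, seconds = 5) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', trimArgs(inputPath, outputPath, seconds), { stdio: 'ignore' });
    ffmpeg.on('error', reject);
    ffmpeg.on('close', (code) => {
      if (code === 0) {
        resolve(outputPath);
      } else {
        reject(new Error(`ffmpeg exited with code ${code}`));
      }
    });
  });
}
```

A call such as `trimPreview('preview.mp3', 'preview-5s.mp3')` would then produce the short file that the <audio> tag points to.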
We could have gone through the whole code, but we prioritized cleaning it up and open-sourcing it as soon as possible. The idea here was to show how, from a simple idea, we could build a functional prototype.
We started looking for a Firebase solution, using the Cloud Functions to execute our actions (samples available here) and the database/authentication system to link users’ Spotify accounts.
UPDATE: At Google I/O 2017, Google introduced a new Actions on Google platform integrated with Firebase.
Ideally, we’d like to go further into the prototype concept by adding visual feedback (once again, just announced at Google I/O 2017), but I will talk about that in a future post ;)