Building your first Action for Google Home (part 2)
In the first article, we set up an environment on Google Cloud, deployed our first Action on Google using Google Cloud Functions, and tested it using a text or web simulator or a Google Home device. Were you able to get the sample app working with the simulator? If so, congrats: that was the hardest part, for now. If not, message me (Jonathan Eisenzopf) on Twitter and I’ll be happy to help you get it up and running.
Have you reviewed Nandini Stocker’s Design Checklist? If not, do that now before continuing. We will continue to touch on design best practices throughout this series of articles.
In this article, we’re going to take a quick look at the code and better understand how it works. Then I will walk you through the JSON responses in more detail.
To hear the application in production, say,
“Hey Google, I want to talk to Crazy House”.
OK, that’s cute, right? It’s a random room generator that has three doors to pick from. It’s a very basic app and somewhat entertaining. We’re going to use it for learning purposes, but just know that it lacks the conversational interaction and user engagement needed to create a great Action on Google.
There are several different ways to write an Action for Google:
- Roll your own code using the HTTP POST/JSON RESTful API
- Write an app using API.AI tools
- Use a third party tool that supports Actions on Google
The code we deployed in part 1 of Building your first Action for Google Home, which we’re going to look at now, is based on the first approach.
If you haven’t downloaded the code yet, you can get it from https://github.com/eisenzopf/google-action-three-doors or simply run:
As a refresher, we saw a sequence diagram in part 1 that shows the HTTP/JSON interaction between a Google Home device and our Google Cloud Function (Node.js) application. Let’s look at that again.
Notice how the Google Home device above takes a user utterance in step 1 (“Hey, Google…”), how the Google Action Platform passes that to our Google Cloud Function, and then how in step 2 we pass back a question and the user picks a door. We’re going to look at the code now and see how that works.
If you’ve interacted with the application, you might be surprised to find out that the app is less than 50 lines of code (39 to be exact). I’ve kept the code compact and short for learning purposes, so just be aware that a real production-grade Action on Google will need to have more robust conversational capabilities.
Quick summary of index.js source
Now to the code (contained in index.js). Take a few moments to read through it. We will be referring back to it by line number throughout the article.
Here’s a quick description of each block of code:
- Lines 4–11 load default Conversation API JSON responses and the content that we will use to create prompts for the user. I’ve separated the content into rooms (different room descriptions), objects (that exist in the rooms), greetings (one of which we play at the beginning of the app), nomatch (that we use for conversational repair), pickDoor (different wording asking the user to pick a door), and doors (different phrases that describe what happens when the user opens the door).
- Line 12 loads the lodash library, which contains a great set of general-purpose utility functions.
- Lines 14–16 export our three_doors function, which is how we containerize things for the Google Cloud Functions platform. Note that the content type must be set to application/json and we need to set the API version to v1 in the HTTP response header. You can find more docs on how to create Google Cloud Functions at: https://cloud.google.com/functions/docs/writing/
- Line 18 creates a reference to the part of the HTTP JSON request coming from the Google Home device that contains the user input string. This is done simply for convenience. You can see full samples of the JSON request at: https://developers.google.com/actions/reference/conversation#request
- Lines 20–23 check the value of req.body.conversation.type, which is the element of the HTTP JSON request that tells us whether this is the first call to the app for this user session. If it is 1 (the first call for this session), we create a prompt that will be played back to the user via text-to-speech. The prompt is a concatenated string that includes a random value picked from the pickDoor JSON objects using the sample() function. If it’s 2, that means it’s a subsequent call for this user session.
- Lines 25–26 match if the user says anything that includes the word stop. In that case, the app sends back the action_final_response JSON, which ends the session.
- Lines 28–31 are the tricky bit because this is the NLP part, which can get difficult fast. This is the code that handles user input when the app asks the user to select a door. For the sake of brevity, I’ve used a regular expression to match a myriad of ways that a user might select a door (1–3). This regex is not perfect, and it may not be the right NLP approach in some cases. However, you can play with it and see what it matches at http://regexr.com/3fjd6, where I’ve saved my session, and refine it further to better match inputs and eliminate false positives. Please send me your improvements. If there is a match, that means the user (probably) selected one of the three doors and we randomly generate another room on line 30.
- Lines 33–37 handle the condition where it’s NOT the first call to the app AND the user’s input does not match a door AND they’ve not said stop. That means that they’ve said something else. Given that this is a sample application, we are not going to try advanced conversational repair like we normally would; instead we are going to select a random prompt from the nomatch JSON file and ask them again to select a door. Just to re-emphasize here, this is bad design practice; if you do it in the wild your users will punish you.
- Line 38 sets the HTTP response code to 200, which tells the client that the request was successful. This also ends the Google Cloud Function code block.
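The branch order described in the list above can be sketched as a small pure function. This is a minimal illustration and not the code from index.js: the names classifyTurn and parseDoor, the simplified door regex, and the word-to-number table are all hypothetical stand-ins, and the regex saved at regexr.com covers more phrasings than this one does.

```javascript
// Hypothetical sketch of the branching described above; not the actual index.js.

// Assumed, simplified stand-in for the article's door-matching regex.
const DOOR_PATTERN = /\b(one|two|three|first|second|third|1|2|3)\b/i;

// Maps each matched word or digit to a door number.
const DOOR_NUMBERS = {
  one: 1, first: 1, '1': 1,
  two: 2, second: 2, '2': 2,
  three: 3, third: 3, '3': 3,
};

// Returns the selected door (1-3), or null when no door is mentioned.
function parseDoor(userInput) {
  const match = DOOR_PATTERN.exec(userInput);
  return match ? DOOR_NUMBERS[match[1].toLowerCase()] : null;
}

// Decides which branch handles the turn, in the same order as the list:
// first call, then stop, then door selection, then the no-match fallback.
// Number() tolerates the type arriving as either 1 or "1".
function classifyTurn(conversationType, userInput) {
  if (Number(conversationType) === 1) return 'greeting';
  if (/stop/i.test(userInput)) return 'stop';
  if (parseDoor(userInput) !== null) return 'door';
  return 'nomatch';
}

console.log(classifyTurn(2, 'I will take the second door')); // prints: door
```

Because the stop check runs before the door check, an utterance such as "stop at door two" ends the session instead of opening a door. Whether that ordering is right for a given app is a design choice worth testing with real users.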
That’s it, that’s the code. This gives you a basic start to play with and expand on. We will delve into debugging and testing in the next article. The rest of this article provides further explanation of the JSON response files to help you better understand how the Actions on Google Conversation API works.
Constant Declarations (lines 4–12)
The first constant, action_final_response, loads a JSON file that is returned by the application if the user says something with the word stop as part of their response to a prompt for input. For example, I might say “I think I’m going to stop playing this game” as my response when the game asks me to select a door.
For a full description of the JSON request (what Google Home sends to our app) and response format (what our app sends back to the Google Home), see https://developers.google.com/actions/reference/conversation#request
The contents of the action-final-response.json file are below:
The above is the JSON response that we send back to the Google Home device when a user says stop; it conforms to the Actions on Google Conversation API. On lines 25–26 of index.js, you’ll find the code that runs the regex match() function on the userInput and writes action_final_response to the Google Cloud Functions res object, which is the object we use to write our HTTP JSON response. We use the res.json function to explicitly send the action_final_response object in JSON format (even though it was loaded from a JSON file to begin with).
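To make the stop branch concrete, here is a minimal sketch of a handler that sets the response headers and sends the final response with res.json. Several things here are assumptions and not taken from index.js: the helper name handleStop, the shape of the stand-in final response object, the header name Google-Assistant-API-Version, and the fake res object that lets the handler run outside Cloud Functions.

```javascript
// Hypothetical stand-in for the object loaded from action-final-response.json.
// The field names are assumptions; check the Conversation API reference.
const actionFinalResponse = {
  expected_user_response: false,
  final_response: { speech_response: { text_to_speech: 'Thanks for playing!' } },
};

// Sends the final response when the input contains "stop".
// Returns true if the request was handled, false otherwise.
function handleStop(userInput, res) {
  if (!/stop/i.test(userInput)) return false;
  res.setHeader('Content-Type', 'application/json');
  // Assumed header name for the v1 API version.
  res.setHeader('Google-Assistant-API-Version', 'v1');
  // res.json serializes the plain object and sends it as the response body.
  res.status(200).json(actionFinalResponse);
  return true;
}

// Minimal fake of an Express-style res object, for local experiments only.
function createFakeResponse() {
  const fake = { headers: {}, statusCode: null, body: null };
  fake.setHeader = (name, value) => { fake.headers[name] = value; };
  fake.status = (code) => { fake.statusCode = code; return fake; };
  fake.json = (payload) => { fake.body = JSON.stringify(payload); return fake; };
  return fake;
}

const res = createFakeResponse();
handleStop('I think I am going to stop playing this game', res);
console.log(res.statusCode, res.body);
```

In a deployed function, the res object is supplied by the platform, so the fake is only needed for trying the logic locally.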
The action_response constant loads a JSON file with a generic Conversation API JSON response that we will override and customize for each prompt we play back to the user.
Refer to https://developers.google.com/actions/reference/conversation#http-response for more details on the JSON response format.
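One way to customize a generic response per prompt is to copy the template and fill in the prompt text, so that the shared template is never modified between requests. The sketch below assumes a simplified template shape. The field names (expected_inputs, input_prompt, initial_prompts, text_to_speech) and the helper name buildPrompt are illustrative assumptions, not a copy of action-response.json.

```javascript
// Hypothetical, simplified stand-in for the generic response template.
const actionResponseTemplate = {
  conversation_token: '',
  expected_user_response: true,
  expected_inputs: [{
    input_prompt: {
      initial_prompts: [{ text_to_speech: '' }],
    },
  }],
};

// Returns a copy of the template with the spoken prompt filled in.
// The JSON round trip makes a deep copy, which is safe for plain JSON data.
function buildPrompt(template, promptText) {
  const response = JSON.parse(JSON.stringify(template));
  response.expected_inputs[0].input_prompt.initial_prompts[0].text_to_speech = promptText;
  return response;
}

const reply = buildPrompt(actionResponseTemplate, 'Which door would you like to open?');
console.log(reply.expected_inputs[0].input_prompt.initial_prompts[0].text_to_speech);
```

Copying matters because a Cloud Function instance can serve many requests; mutating a module-level object would leak one user's prompt into the next response.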
The contents of the action-response.json file are below:
Here is a summary of the important bits of this file:
- Line 2: conversation_token is a state variable that you control. The client will pass back whatever you assign to it in its next request. You can use this to track where a user is in a conversation flow. We don’t cover using it in this example, however.
- Line 3: expected_user_response can be set to true or false. Setting it to true means that the application will expect (and the Google Home will collect) user input.
- Lines 8–11 contain the prompt that will be played to the user before collecting input.
- Lines 13–23 contain (up to) 3 no-input prompts. These are played when input is not received from the user after 5 seconds.
- Lines 25–29 tell the client that we will be expecting text from the user. We will cover other options in future articles.
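Although the sample does not use conversation_token, a minimal sketch of how it could carry state between turns might look like this. The helper names encodeState and decodeState and the roomsVisited field are hypothetical, and the sketch assumes the token is an opaque string that the platform returns unchanged on the next request.

```javascript
// Serializes a small state object into a string suitable for conversation_token.
function encodeState(state) {
  return JSON.stringify(state);
}

// Restores state from a returned token, falling back to a fresh state
// when the token is missing or is not valid JSON.
function decodeState(token, fallback = { roomsVisited: 0 }) {
  if (!token) return fallback;
  try {
    return JSON.parse(token);
  } catch (err) {
    return fallback;
  }
}

// Simulate two turns: the token set on one response comes back on the next request.
const firstResponse = { conversation_token: encodeState({ roomsVisited: 1 }) };
const stateOnNextTurn = decodeState(firstResponse.conversation_token);
stateOnNextTurn.roomsVisited += 1;
console.log(stateOnNextTurn.roomsVisited); // prints: 2
```

Keeping the state in the token means the function itself can stay stateless, which suits Cloud Functions, where successive requests may land on different instances.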
That wraps up part 2 of our series on creating your first Action on Google. In part 3, we will look at tools that we can use to debug applications quickly.
Feel free to reach out with questions, edits, or requests for future topics. Just send a message on Twitter to Jonathan Eisenzopf. If you enjoyed this article, please make it a favorite and share it with your friends.