Getting started with Voxa: Creating an Alexa skill Part 3
Check part 1 and part 2 of this blog post series.
A little intro to Voxa
As the documentation puts it, Voxa provides a way to organize a skill into a state machine. With Voxa we can jump into specific states depending on what the user says, always relying on the intents and utterances we specify for our skill. Using a state machine lets us, the developers, reuse code easily and not lose our minds over where the user needs to go in our skill flow.
MVC Pattern
Voxa uses the MVC pattern to organize the core code of your skill. You have probably used or at least heard of the pattern, but let’s review it quickly and see how Voxa uses it to organize the code for a voice user interface app.
MVC stands for Model View Controller, a pattern that separates the code for our data, our business logic and the presentation of our data. It is a very common pattern in web development, so I will explain how Voxa uses it for voice by comparing it with how most web frameworks use it to build websites and web apps.
The View in a web framework represents the webpage that we return to the user’s browser. In Voxa the View is the response that we return to the user’s Alexa-powered device, i.e. Alexa says the response that our skill returns to the user. The views in Voxa are stored in the views.json file.
The Controller in a web framework is the code that gets the data from our model, maybe does something with it, and specifies which view we are going to return. In the web development world, we separate that code into actions. In Voxa, the controller is where the state machine code lives: we separate our code into “states”, and each state returns the response for the user through a transition object. The controllers in Voxa live inside a states folder that holds your state files. By convention, the files in that folder use a “.states.js” suffix. It’s not required, but it’s nice to have when searching for state files in your code editor.
The Model is much the same in web development and in Voxa: it is the layer that retrieves, stores and deletes data in our database, or in whatever we use to persist data. In Voxa, the model also persists session data, which lasts as long as the user doesn’t exit or close our skill. That way we can build skills without a database if they don’t require one. In fact, the skill that we will build doesn’t use a database.
Can we start coding, please?
All right, let’s dive into the action. You can always check the full source code of this skill if you get stuck in some part.
You’ll need Node v8 and a good understanding of JS ES6+ in order to do and understand this tutorial.
We need some boilerplate code to start our Voxa project. Luckily, we have the voxa-cli to generate it for us.
Open your console and run:
npx voxa-cli create
That will trigger the Voxa project generator; follow along with its questions. Set the name of the project to “Rock Paper Scissors” or whatever you like, and hit enter on all the other questions to accept the default values. That will be enough to be up and running.
Open the “main.states.js” file inside “src/app/states/” and delete the code that’s inside the register function (don’t delete the register function itself). Let’s add the code that will run when the user launches or opens our skill. Inside the register function, copy this block of code (from now on, every block of code goes inside the register function, remember that):
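The handler can be sketched like this, based on the description that follows. The small voxaApp object at the top is only a stand-in that I'm assuming so the snippet runs on its own; in the generated project, the register function receives the real Voxa app instead.

```javascript
// Stand-in for the real Voxa app object (an assumption, so the sketch runs without Voxa installed).
const intentHandlers = {};
const voxaApp = {
  onIntent(name, handler) {
    intentHandlers[name] = handler;
  },
};

function register(app) {
  // Runs when the user opens the skill by saying its invocation name.
  app.onIntent("LaunchIntent", () => ({
    flow: "continue", // move to the next state immediately
    reply: "Welcome", // name of the view to say
    to: "askHowManyWins", // next state
  }));
}

register(voxaApp);
console.log(intentHandlers.LaunchIntent());
```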
Let’s take a moment to understand what that code is about. The code will run when the user launches the skill, because the LaunchIntent is the default intent when a user invokes your skill by saying the invocation name. Inside the arrow function, we return an object containing “flow”, “reply” and “to” attributes; this object is the Transition object.
The “flow” attribute corresponds to the behavior of our state machine. “continue” means it will continue to the next state immediately, and it is actually the default value if the “flow” attribute is not present in the Transition object (you can remove it if you like, but we’ll keep it as a reference). You can check the possible values for the “flow” attribute in the Voxa docs; give them a quick read if you like.
The “reply” attribute corresponds to a Reply object or to the name of a view with a Reply object (I’ll explain the Reply object later), and it contains what we want to say to the user (we’ll create the Welcome view soon).
The “to” attribute corresponds to the name of the next state that our state machine will move to.
So the meaning of that code is: when the user triggers the “LaunchIntent” our skill will say whatever the “Welcome” view contains and move to the “askHowManyWins” state. Pretty straightforward, right?
Let’s add the “askHowManyWins” state code. This one will ask for the number of wins needed to beat, or be beaten by, Alexa in our rock paper scissors skill.
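A minimal sketch of that state, using the state, view and target names mentioned in this post. As before, the voxaApp object is a stand-in I'm assuming so the snippet can run by itself.

```javascript
// Stand-in for the real Voxa app object (an assumption, so the sketch runs without Voxa installed).
const stateHandlers = {};
const voxaApp = {
  onState(name, handler) {
    stateHandlers[name] = handler;
  },
};

voxaApp.onState("askHowManyWins", () => ({
  flow: "yield", // pause here and wait for the user to answer
  reply: "AskHowManyWins", // view with the question
  to: "getHowManyWins", // state that will handle the answer
}));

console.log(stateHandlers.askHowManyWins());
```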
Let’s pause quickly here. There are two key differences between our “LaunchIntent” state handler and this state.
The first one is that we declared our “LaunchIntent” handler using the onIntent method, while the “askHowManyWins” handler is declared using the onState method. The onIntent method is used to declare code that will run when the corresponding intent is triggered at any time in your skill. You don’t have to declare an onIntent handler for every intent in your skill, just for the ones that you need to handle whenever the user triggers them. For example, intents to exit the skill or to ask for help need their own onIntent handler. Intents like choosing the rock option or the paper option are better handled inside a state declared using onState, since we only need to handle them while playing the game, that is, in a specific part of our skill flow. I think you’ll understand this key difference better as we progress in finishing our skill.
The second one is the value of the “flow” attribute of the Transition object. This one has a “yield” value. What does that mean? It means that the skill will pause here and wait for a response from the user. The response will be handled by the “getHowManyWins” state, which is defined in the “to” attribute.
So in short, when the user launches our skill, the skill will say whatever is in the “Welcome” view, then say whatever is in the “AskHowManyWins” view, and then pause, waiting for the user to talk back.
To save some time in the views, I’ll just leave the entire code for our views, since they are the replies that we’ll return. Open the views.json file (inside the src/app/ directory) and paste the following code. You are more than welcome to edit the text inside the “say” attributes the way you want.
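Here is a partial sketch of that structure, limited to the four views whose text appears in this post. It is written as a JavaScript object so it can be run and printed as JSON. Where exactly the launch phrase is split between “Welcome” and “AskHowManyWins” is my assumption; the full project has more views than these.

```javascript
// Partial sketch of the views: language key, then "translation", then one entry per view.
const views = {
  en: {
    translation: {
      Welcome: {
        say: "Welcome to Rock Paper Scissors!",
      },
      AskHowManyWins: {
        say: "How many times do I have to beat you in rock paper scissors to be the ultimate winner?",
      },
      UserWins: {
        say: "Oh, you won this point! You have {userWins} wins and I have {alexaWins} wins.",
      },
      AlexaWins: {
        say: "I won this point! You have {userWins} wins and I have {alexaWins} wins.",
      },
    },
  },
};

// Print the object in the shape it would have inside a JSON file.
console.log(JSON.stringify(views, null, 2));
```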
As you can see, the “Welcome” and “AskHowManyWins” views are at the top of the views file (the order is not important, just referencing the order so you can find them easily). So when launching the skill, Alexa will say: “Welcome to Rock Paper Scissors! How many times do I have to beat you in rock paper scissors to be the ultimate winner?” (Wow, that was cheesy 🙄).
A pause to understand the views and the Reply object
TL;DR: Every view can have a Reply object that needs a “say” attribute, which holds the phrase that Alexa will say in the response. We can also set “cards” and “directives” in the Reply object; these serve as additional content to view on a screen, like your mobile phone or any other device with a screen (the Echo Show, for example). It can also hold a “reprompt” attribute, which is a phrase that Alexa will say after the phrase in the “say” attribute if the user doesn’t respond.
Let’s understand what the Reply object is about so you can understand how views work. The views, as you can see, are inside a json file with the language and a “translation” attribute (this is because of internationalization which is built-in in Voxa, useful if your voice app is deployed in several languages). These two attributes are the only ones required in the views structure, inside of them are the view names which can hold whatever name you want and can be structured as you like. For every view you can set a Reply object.
As the Voxa docs put it, the Reply object “takes all of your statements, cards and directives and generates a proper json response for each platform”. What this means is that here we can define everything we need to return as a response to the voice platform.
Every voice assistant has a mobile application, and we can return “cards” that will render in the mobile app. Some devices have a screen, and we can return “directives” that render something on the screen alongside the response. And of course, there’s what the assistant will say, which is a simple phrase that we set in a “say” attribute. There’s also the “reprompt” value, which is very useful if we need a response back from the user: if Alexa says something, waits for the user to respond, and the user doesn’t say anything, she will say whatever is in the “reprompt” attribute (using reprompts makes for a much better voice experience). What we care about are the “say” and “reprompt” attributes, since our skill is very simple and we just need something to say in the Reply object for every view.
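To make that concrete, here is a small sketch of a view that holds both a “say” and a “reprompt”, plus a helper that lists the views with nothing to say. The reprompt wording, the “Unfinished” view and the viewsWithoutSay helper are all made up for illustration; none of them comes from Voxa or from the project.

```javascript
// Only the "translation" part of the views file is shown here.
const translation = {
  AskHowManyWins: {
    say: "How many times do I have to beat you in rock paper scissors to be the ultimate winner?",
    reprompt: "How many wins should we play to?", // hypothetical wording
  },
  Unfinished: {}, // hypothetical view that is still missing its "say"
};

// Hypothetical helper: returns the names of the views that lack a "say" phrase.
function viewsWithoutSay(viewsByName) {
  return Object.keys(viewsByName).filter((name) => !viewsByName[name].say);
}

console.log(viewsWithoutSay(translation));
```

A check like this is optional, but it is a cheap way to catch a view that would leave Alexa silent.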
Back to code our skill
So Alexa will respond with this: “Welcome to Rock Paper Scissors! How many times do I have to beat you in rock paper scissors to be the ultimate winner?” and then she will wait until the user responds with the number of wins; the response will be handled by the “getHowManyWins” state. If the user responds correctly, Alexa will send the “MaxWinsIntent” (we defined it in part 2) with the slot value, which is the number. Let’s create the “getHowManyWins” state to handle that request:
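A minimal sketch of that state, written as a plain function so it can be tried without Voxa; in the skill it would be the callback passed to onState for “getHowManyWins”. Reading the slot from voxaEvent.intent.params and the model field names (wins, userWins, alexaWins) are assumptions on my part, as is the fake event at the bottom.

```javascript
// In the skill this function would be the callback of onState("getHowManyWins", ...).
function getHowManyWins(voxaEvent) {
  // Only run this logic for the intent we expect at this point of the flow.
  if (voxaEvent.intent.name === "MaxWinsIntent") {
    // Assumption: the "wins" slot value is exposed under intent.params.
    voxaEvent.model.wins = Number(voxaEvent.intent.params.wins);
    voxaEvent.model.userWins = 0;
    voxaEvent.model.alexaWins = 0;

    return { flow: "continue", reply: "StartGame", to: "askUserChoice" };
  }
  // Any other intent: return nothing, so this state does not handle it.
}

// Fake event, only to try the function outside of a real skill.
const fakeEvent = {
  intent: { name: "MaxWinsIntent", params: { wins: "3" } },
  model: {},
};
console.log(getHowManyWins(fakeEvent), fakeEvent.model);
```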
Let’s understand what’s going on in here. First of all, we define our state code just like our previous state, except that there’s a voxaEvent passed to the callback function. That variable holds all the information about the request and the skill session; you can read more about what information you can access in the docs. Since we have our voxaEvent, we can check information like the name of the triggered intent. Why do we do that? In our previous state, we asked for the number of wins needed to win the game. At this point the user can say anything. We expect the user to say an utterance defined for our “MaxWinsIntent”, but what if the user says “Yes” or “What?” or any other phrase that is not among the utterances of our expected intent? Then the code inside that state should not run, and that’s why we use an if statement to check that the intent we’re getting is the one we expect in that state.
After that, as you can see, we start using the model inside our code. The model persists data that we will use throughout our session, and it’s available in the voxaEvent. We initialize the variables inside our model: the number of wins, which we get from our “{wins}” slot, and the user’s and Alexa’s wins, which should be 0 from the start. We return the “StartGame” view and continue to the “askUserChoice” state.
Now, the “askUserChoice” state is one that we will reuse to ask for the user’s choice in every match, so we need to code it in a way that checks whether the user or Alexa has already won. If neither has reached the winning amount, then we get a random choice for Alexa and ask the user for their choice.
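That logic can be sketched as follows, again as a plain function that would be the onState callback for “askUserChoice”. The view names (UserWinsTheGame, AlexaWinsTheGame, AskUserChoice) and the alexaChoice model field are hypothetical names I picked for the sketch.

```javascript
const CHOICES = ["rock", "paper", "scissors"];

// In the skill this function would be the callback of onState("askUserChoice", ...).
function askUserChoice(voxaEvent) {
  const model = voxaEvent.model;

  // Somebody already reached the number of wins: announce it and offer a new game.
  if (model.userWins >= model.wins) {
    return { flow: "continue", reply: "UserWinsTheGame", to: "askIfStartANewGame" };
  }
  if (model.alexaWins >= model.wins) {
    return { flow: "continue", reply: "AlexaWinsTheGame", to: "askIfStartANewGame" };
  }

  // Nobody won yet: pick Alexa's choice at random and wait for the user's choice.
  model.alexaChoice = CHOICES[Math.floor(Math.random() * CHOICES.length)];
  return { flow: "yield", reply: "AskUserChoice", to: "getUserChoice" };
}

const fakeEvent = { model: { wins: 3, userWins: 1, alexaWins: 2 } };
console.log(askUserChoice(fakeEvent), fakeEvent.model.alexaChoice);
```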
After asking the user for their choice, we process it in the “getUserChoice” state. Let’s create it:
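A minimal sketch of that state. The intent names RockIntent, PaperIntent and ScissorsIntent are assumptions (use whatever names you gave them in part 2), and userChoice is a hypothetical model field.

```javascript
// Assumed intent names mapped to the choice each one represents.
const CHOICE_BY_INTENT = new Map([
  ["RockIntent", "rock"],
  ["PaperIntent", "paper"],
  ["ScissorsIntent", "scissors"],
]);

// In the skill this function would be the callback of onState("getUserChoice", ...).
function getUserChoice(voxaEvent) {
  const choice = CHOICE_BY_INTENT.get(voxaEvent.intent.name);

  if (choice) {
    voxaEvent.model.userChoice = choice;
    // No "reply" here: with flow "continue" we go straight to the next state.
    return { flow: "continue", to: "processWinner" };
  }
  // Any other intent: return nothing, so this state does not handle it.
}

const fakeEvent = { intent: { name: "PaperIntent" }, model: {} };
console.log(getUserChoice(fakeEvent), fakeEvent.model);
```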
So in this last block of code, in the “getUserChoice” state, we check which intent we got and define the choice according to it; if the choice is defined, we go on to work out who won in the “processWinner” state. Something important to notice: we didn’t define a view in the Transition object. Why? With flow: “continue”, the machine continues to the next state defined in the “to” attribute, so it’s not necessary to define a view. A view is necessary with flow: “yield” or flow: “terminate”.
Let’s create the code to process our winner:
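Here is a compact sketch of that state, as a plain function that would be the onState callback for “processWinner”. The BEATS lookup table is my own way of deciding the winner, the “Tie” view is a hypothetical name, and the model fields match the ones assumed in the earlier sketches.

```javascript
// Each key beats the value next to it.
const BEATS = { rock: "scissors", paper: "rock", scissors: "paper" };

// In the skill this function would be the callback of onState("processWinner", ...).
function processWinner(voxaEvent) {
  const model = voxaEvent.model;

  if (model.alexaChoice === model.userChoice) {
    // Same choice: nobody scores, ask again.
    return { flow: "continue", reply: "Tie", to: "askUserChoice" };
  }

  if (BEATS[model.userChoice] === model.alexaChoice) {
    model.userWins += 1;
    return { flow: "continue", reply: "UserWins", to: "askUserChoice" };
  }

  model.alexaWins += 1;
  return { flow: "continue", reply: "AlexaWins", to: "askUserChoice" };
}

const fakeEvent = {
  model: { userChoice: "rock", alexaChoice: "scissors", userWins: 0, alexaWins: 0 },
};
console.log(processWinner(fakeEvent), fakeEvent.model);
```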
This is the largest state in our skill, but in short what it does is:
- Get Alexa’s choice and the user’s choice.
- Check if the choices are the same. If they are, it’s a tie, ask again for the user’s choice.
- If the choices are not the same, check who won and increment the “wins” number for the corresponding winner.
- Ask again for the user choice.
Something very important to notice here is the views we return when the user or Alexa wins. Why? Because they hold variables. Let’s check the “AlexaWins” and “UserWins” views to understand what I mean:
UserWins: {
say: 'Oh, you won this point! You have {userWins} wins and I have {alexaWins} wins.',
},
AlexaWins: {
say: 'I won this point! You have {userWins} wins and I have {alexaWins} wins.',
},
As you can see, we can return views with variables in them. In this case, we say the number of wins for Alexa and for the user inside these views. But to set the values for those variables we need to define them inside the variables.js file. Open the file (inside the src/app/ directory) and replace whatever code is in it with this:
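A minimal sketch of those two variables. The model field names are the same ones assumed in the earlier sketches, and how the generated variables.js exports its functions may differ, so keep the export style the generator gave you.

```javascript
// Each variable is a function that receives the voxaEvent and returns the value for the view.
const variables = {
  userWins: (voxaEvent) => voxaEvent.model.userWins,
  alexaWins: (voxaEvent) => voxaEvent.model.alexaWins,
};

// Fake event, only to try the functions outside of a real skill.
const fakeEvent = { model: { userWins: 2, alexaWins: 1 } };
console.log(variables.userWins(fakeEvent), variables.alexaWins(fakeEvent));
```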
The variables.js file has the variables used in your views, and generally we use the values in the model to set the values for those variables. The variables can be functions that need to return something; Voxa makes sure that the voxaEvent is passed as the first param of the function so you can retrieve any value from the model or from any other place in the voxaEvent. If a view has a string inside curly brackets (“{}”) with the same name as a variable, Voxa will replace it with the value that the variable returns.
So in our views there’s “{userWins}” and “{alexaWins}”; Voxa will look in the variables for those names and replace them in the view with whatever value the variable has or returns. In this case, it will look for the values inside our model for the user’s and Alexa’s wins.
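To picture that replacement, here is a tiny illustration. The render function is hypothetical and is not how Voxa implements it internally; it only mimics the behavior described above for one “say” phrase.

```javascript
const variables = {
  userWins: (voxaEvent) => voxaEvent.model.userWins,
  alexaWins: (voxaEvent) => voxaEvent.model.alexaWins,
};

// Hypothetical helper: swaps every {name} for the value returned by the variable called name.
function render(phrase, vars, voxaEvent) {
  return phrase.replace(/\{(\w+)\}/g, (placeholder, name) =>
    Object.prototype.hasOwnProperty.call(vars, name)
      ? String(vars[name](voxaEvent))
      : placeholder // unknown names are left untouched
  );
}

const fakeEvent = { model: { userWins: 2, alexaWins: 1 } };
const phrase = "Oh, you won this point! You have {userWins} wins and I have {alexaWins} wins.";
console.log(render(phrase, variables, fakeEvent));
```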
Now the flow of the state machine goes to the “askUserChoice” state we made before, so we’re reusing code using the logic of state machines, pretty cool right?
In the “askUserChoice” state we check if the user or Alexa won and reply with the corresponding view for the winner. If someone won, then we ask if the user wants to start a new game in the “askIfStartANewGame” state. Let’s create it:
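A minimal sketch of that state; the “AskIfStartANewGame” view name is a hypothetical one.

```javascript
// In the skill this function would be the callback of onState("askIfStartANewGame", ...).
function askIfStartANewGame() {
  return {
    flow: "yield", // wait for the user's yes or no
    reply: "AskIfStartANewGame",
    to: "shouldStartANewGame",
  };
}

console.log(askIfStartANewGame());
```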
Then, in the “shouldStartANewGame” state, we check if the user said “yes” or “no”; depending on the response, we start a new game or say goodbye. Let’s create it:
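A minimal sketch of that state. Sending a “yes” back to “askHowManyWins” and the “Bye” view name are assumptions of mine; the project may restart the game differently.

```javascript
// In the skill this function would be the callback of onState("shouldStartANewGame", ...).
function shouldStartANewGame(voxaEvent) {
  if (voxaEvent.intent.name === "YesIntent") {
    // Assumption: a new game starts by asking for the number of wins again.
    return { flow: "continue", to: "askHowManyWins" };
  }

  if (voxaEvent.intent.name === "NoIntent") {
    // "terminate" closes the session, so there is no "to" attribute.
    return { flow: "terminate", reply: "Bye" };
  }
  // Any other intent: return nothing, so this state does not handle it.
}

console.log(shouldStartANewGame({ intent: { name: "NoIntent" } }));
```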
Very important things to notice in the “shouldStartANewGame” state
- We check for the “YesIntent” and the “NoIntent”. If you remember part 2, when we added those intents, the names were “AMAZON.YesIntent” and “AMAZON.NoIntent”, so why don’t we check for those names? This is something that Voxa does: it removes the “AMAZON.” part from the built-in Amazon intents. This is because Voxa is a multi-platform framework; you can create voice apps for Google Assistant too. So, in your Google Assistant app you can have “YesIntent” and “NoIntent” and they can be handled by the same “if” statement. It’s not necessary to check for the “NoIntent” in Alexa and the “NoIntent” in Google separately (😖); you can safely reuse the same logic for those intents on both platforms (😁).
- If the user triggers the “NoIntent”, the “flow” attribute has a “terminate” value. If you read about the “flow” attribute in the Voxa docs, skip this point. The “terminate” value is used to finish the skill session, it’s the way to exit the skill.
- The response that exits the skill doesn’t have a “to” attribute. I think you can guess why we don’t define the state: we’re closing the skill, so there’s no point in defining a new state to go to.
We can also add code if the user wants to quit the skill in the middle of a game. We create handlers for the “CancelIntent” and the “StopIntent”. Those intents are added by default in every Alexa skill.
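Those handlers can be sketched like this. The voxaApp object is again a stand-in assumed so the snippet runs on its own, and “Bye” is a hypothetical view name.

```javascript
// Stand-in for the real Voxa app object (an assumption, so the sketch runs without Voxa installed).
const intentHandlers = {};
const voxaApp = {
  onIntent(name, handler) {
    intentHandlers[name] = handler;
  },
};

// Both intents do the same thing: say goodbye and close the session.
const sayGoodbye = () => ({ flow: "terminate", reply: "Bye" });

voxaApp.onIntent("CancelIntent", sayGoodbye);
voxaApp.onIntent("StopIntent", sayGoodbye);

console.log(intentHandlers.StopIntent());
```

Since they are declared with onIntent, they apply at any point of the game.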
We finished our skill!
You can check the source code on Github for the complete project with all the code we just wrote. If you get stuck at some point, check the corresponding file you’re editing.
Test your skill
To test our skill we’ll start a local server by running yarn watch or npm run watch in the root of our skill. Make sure you create a local.json file using the same data from the local.example.json file inside the config folder.
We’ll use ngrok so Alexa can communicate with our local server. Follow the steps to get ngrok up and running on your machine and start a tunnel by running ngrok http 3000. Then copy the URL that starts with https and go to the Alexa console of your skill.
Inside your skill console, click Endpoint in the side menu. Choose HTTPS in Service Endpoint Type. In the Default Region text input, paste the URL from ngrok with “/alexa” appended at the end, so it’ll be something like “https://f0dafff0.ngrok.io/alexa”. Then choose “My development endpoint is a sub-domain of a domain that has a wildcard certificate from a certificate authority” in the Select SSL certificate type dropdown. Click on the Save Endpoints button at the top.
Now go to the Test tab in the top navigation and enable testing for your skill by choosing the “Development” option in the dropdown (you can give the page permission to use your microphone if you want to, but it’s not necessary).
Now you can test your skill by typing (or saying if you use your microphone) “start rock paper scissors”.
Play with it for a while; if Alexa says “An unrecoverable error occurred”, then something in the skill code is wrong. Check the GitHub repo to compare the code, and if you’re stuck let me know in the comments.
Conclusion
Wow, thanks for getting through all the tutorials! I hope you take this knowledge and build a great voice experience using Voxa. Remember to check the documentation (we’ll be building a better docs site very soon 🙏).
Voxa is a voice framework in constant development and has been used to build skills for brands like Campbell’s Kitchen, MARS food, Headspace and many more.