Creating a cross-platform voice app using Jovo

mario menti
14 min read · Nov 23, 2017


I’m currently freelancing, and one of the projects I’m working on involves extending an existing solution to voice interfaces, on both Amazon Alexa and Google Home. So when I came across Jovo, an “open source framework for developers to build cross-platform voice apps for Amazon Alexa and Google Home”, it naturally piqued my interest.

To kick its tires, I thought I’d create a very simple voice demo using the trusted Songkick API, my go-to API for quick hacks ;)

For the purposes of this demo, I just wanted to implement a couple of simple features:

  • ask the app to tell me about today’s gigs in a certain location
  • ask the app to tell me when an artist or band is next playing

Prerequisites

Jovo is a framework based on Node.js, so it requires that you have Node.js (version 4 or later) and NPM (the node package manager) installed.
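If you’re not sure what you have installed, a quick check on the command line will tell you (your output will obviously vary):

$ node --version
$ npm --version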

Install Jovo

The easiest way is to install the Jovo CLI, like so:

$ npm install -g jovo-cli

You should then be able to run “jovo” on the command line, and see something like this:

$ jovo

  Usage: jovo [command] [options]

  Options:

    -V, --version      output the version number
    -h, --help         output usage information

  Commands:

    new <directory>    Creates new project in directory
    run [webhookFile]  Creates a public proxy to your local development.
    help [cmd]         display help for [cmd]

  Examples:

    jovo new HelloWorld
    jovo new HelloWorld --template helloworld
    jovo run
    jovo run index.js
    jovo run --bst-proxy

Now we can create a Jovo project; let’s call it “gigsearch”:

$ jovo new gigsearch

This clones the Jovo “Hello World” sample app, and installs all necessary dependencies, so we should now have a functioning “hello world” project called “gigsearch”.

Set up the project in Amazon Alexa and Google Home

Before we look into Jovo any further, let’s create and configure our “gigsearch” project in Amazon Alexa and Google Home.

Amazon Alexa

Starting with Alexa, go to https://developer.amazon.com/alexa-skills-kit and sign in (if you have an Alexa device, make sure you sign in with the same account your device is registered with, so you can do local testing of the project with your device), then click the blue “Start a Skill” button.

On the next screen, make sure skill type “Custom Interaction Model” is selected, choose the language (I’m using English UK), and name the skill “Gig Search”. The name doesn’t really matter unless we’re going to publish the skill (which, given this is just a simple demo, we won’t); if we did publish, this is what users would see in the Alexa app. Let’s also use “gig search” as the invocation name — the invocation name is what a user has to say in order to launch the skill (for example, “Alexa, ask Gig Search who’s playing in London” or “Alexa, use Gig Search”).

Click “Save”, followed by “Next”, where we’ll define the Interaction Model for the skill. Let’s use the Skill Builder GUI (you could also define the Intent Schema in JSON format, but the Skill Builder is probably easier to get started with). Click the black “Launch Skill Builder” button.

In the Skill Builder, we want to define 3 custom Intents for our skill (there are also 3 built-in Intents for Cancel/Stop/Help, which you’ll see have been added automatically). Click “Add an Intent +” within the blue “Intents” box.

Let’s call our first Intent “ArtistIntent”. This is the Intent that will come into play when the user asks for gigs by an artist (e.g. “when is Bob Dylan playing?”). Click “Create Intent”, and the next screen lets us define some Sample Utterances. The Sample Utterances are what actually lets Alexa map a user’s request to an Intent; in other words, whatever a user tells Alexa, we want to be able to match it to an Intent, so we know what to do in response.

Let’s type a first Sample Utterance, “when is {artist} playing”, and click “+” to add it. This tells Alexa that if someone says “when is Deafheaven playing”, they want to invoke the ArtistIntent of our skill, with the slot “artist” being filled by “Deafheaven”. Once added, you’ll see that a new slot called “artist” has been created:

Next we have to define the slot type for “artist”. Amazon provides a number of built-in slot types (you can see them when clicking “Choose a slot type…” underneath the artist Intent Slot), but in this case nothing really matches what we’re after, so we just create our own custom slot type:

Let’s call it “skartist” (for “songkick artist”) and click “+” to add. Once added, you can see “skartist” listed underneath “Slot Types” in the left-hand sidebar. As it’s a custom slot type, we need to add at least one slot value. Note that the slot values we define here are simply examples; we’re not restricting the user to them. Alexa is more likely to understand values you list, but it will pass back whatever the user says, whether or not you’ve defined it here. To define a few values, click on “skartist” on the left and add a few random values:

Once these are added, let’s go back to our ArtistIntent (click “ArtistIntent” underneath “Intents” on the left) and add a few more Sample Utterances.

In reality, you’ll probably have a very large number of these sample utterances, to cover the numerous variations in which a user could ask the same question (and this is where the “Code Editor” part of the Interaction Model comes in handy, as it’s much quicker to add these in JSON than through a GUI), but since this is just a demo, we’ll only add a small number of example utterances.
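For reference, the JSON behind the Skill Builder model looks roughly like this. I’m reconstructing the shape from memory of the Skill Builder beta, so treat the wrapper and field names as approximate rather than gospel:

{
  "languageModel": {
    "invocationName": "gig search",
    "intents": [
      {
        "name": "ArtistIntent",
        "samples": [
          "when is {artist} playing",
          "when is {artist} playing next",
          "when does {artist} play next"
        ],
        "slots": [
          { "name": "artist", "type": "skartist" }
        ]
      }
    ],
    "types": [
      {
        "name": "skartist",
        "values": [
          { "name": { "value": "Bob Dylan" } },
          { "name": { "value": "Deafheaven" } }
        ]
      }
    ]
  }
}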

That’s our first Intent complete, so let’s click “Save Model”.

Next, we want to create an Intent for users asking about gigs in a specific location, for example “who’s playing in Tokyo?”. We call this intent “LocationIntent”, and the process of defining this is exactly the same as creating the ArtistIntent above, just with slightly different Sample Utterances, and a “location” slot with custom type “sklocation”. Once you’ve done this, it will look something like this:

And finally, we want to create an Intent that handles users saying “next”, so they can skip to the next search result. All we need to do is create the Intent with name “NextIntent”, and a Sample Utterance of “next”:

Once we’ve done this, click “Save Model”, followed by “Build Model” — this will build the Interaction Model for our Alexa Skill. It will take a few minutes to complete.

That’s it for now, we’ve built our Alexa Skill Interaction Model, and we’ll come back for some more configuration options later on. Let’s move on to Google Home, where we essentially have to create the equivalent.

Google Home

Firstly, we create a Dialogflow agent for our project. Go to https://dialogflow.com/ and click “Go To Console”. This will prompt you to sign in — sign in using your Google account (again, if you have a Google Home device, or a phone running Google Assistant, sign in with the same Google account, so you’ll be able to test the project on your devices).

Click “Create Agent”, and let’s call it “GigSearchAgent”:

Once you create the agent, you’ll see that Dialogflow also has the concept of “Intents”, with a default WelcomeIntent and FallbackIntent provided. What we need to do now is add the “ArtistIntent”, “LocationIntent” and “NextIntent” we already defined for Alexa. Make sure you give these Intents exactly the same names as the ones you defined in Alexa: Jovo matches intents by name, so it can seamlessly handle requests originating from both Alexa and Google Home.

Click “+” next to “Intents”, and create the ArtistIntent. Then we can add the “user expressions” (Dialogflow’s name for Alexa’s sample utterances). To do this, type the first one, e.g. “where is artist playing”, then highlight the word “artist”. This will bring up a pop-up, and again similarly to Alexa, Dialogflow comes with a number of pre-defined entities, but again none of them really match, so we use the “sys.any” generic entity. Type “sys.any” into the Filter box at the top, then click “@sys.any”.

The next screen should then look like this:

Change the parameter name from “any” to “artist”, and save the Intent. Make sure the parameter names (“artist” and “location” in our case) match the names of the slots we defined for the skill at Amazon Alexa, as again Jovo will pick up parameter values by name.

Next, create the LocationIntent in the same way, so it will look something like this:

Finally create the “NextIntent”, which just as it was in Alexa, is super simple:

Save the Intent, and we have finished setting up the language model for the project. Next, we have to connect the agent (a Dialogflow agent isn’t platform-specific and could be used on many different platforms, not just Google Home) to Google Assistant. Click on “Integrations” in the left-hand sidebar, and toggle “Google Assistant” to “on”:

Click “Update Draft”, and on the success screen, “Visit Console”.

So far so good, we now have both an Alexa and a Google Home project set up, and can go back to our Jovo project and look at the code.

Our Jovo project

Looking at index.js (for the purpose of this demo, everything we do is done in index.js), we specify our application logic underneath the “App Logic” banner (the code above this is the app configuration, which we don’t need to change for now). Currently it will look like this:

// =================================================================
// App Logic
// =================================================================

const handlers = {

    'LAUNCH': function() {
        app.toIntent('HelloWorldIntent');
    },

    'HelloWorldIntent': function() {
        app.tell('Hello World!');
    },
};

You can see a built-in LAUNCH intent, and a HelloWorldIntent. The built-in LAUNCH intent is what gets triggered when the app is first invoked without a specific intent (e.g. user saying “Alexa, use gig search”). Let’s change this so it will give the user an idea of what our app can do:
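(The following is just a sketch of how this could read; tweak the wording to taste.)

const handlers = {

    'LAUNCH': function() {
        // greet the user and explain what they can ask for
        app.ask('Welcome to Gig Search. You can ask me when an artist is playing next, '
            + 'or who is playing today in a particular city. What would you like to know?',
            "For example, you could say: when is Bob Dylan playing next?");
    },

    // ... more intents to follow ...
};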

app.ask(speech, reprompt) first says what’s been defined in “speech”, and if the user doesn’t reply, follows up with the “reprompt” value.

Next, let’s delete the HelloWorldIntent, since we don’t use that, and instead implement the ArtistIntent and LocationIntent we defined in the interaction models at Alexa and Google Home:
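Here’s a simplified sketch of what that looks like. The getJson helper, the SONGKICK_API_KEY constant and the callback names are my own shorthand rather than the exact code from the repo, the Songkick search endpoints are the standard ones from their API documentation, and I’m treating the slot argument Jovo passes in as a plain string:

const https = require('https');
const SONGKICK_API_KEY = 'YOUR_SONGKICK_API_KEY'; // get your own key from Songkick

// small helper: GET a URL and hand the parsed JSON body to a callback
function getJson(url, callback) {
    https.get(url, (res) => {
        let body = '';
        res.on('data', (chunk) => body += chunk);
        res.on('end', () => callback(JSON.parse(body)));
    });
}

const handlers = {

    // ... LAUNCH as above ...

    'ArtistIntent': function(artist) {
        // remember what the user asked for, so we can refer back to it later in the session
        app.setSessionAttribute('artist', artist);
        // step 1 of 2: look up the Songkick artist ID for the name the user gave us
        getJson('https://api.songkick.com/api/3.0/search/artists.json?apikey=' + SONGKICK_API_KEY
            + '&query=' + encodeURIComponent(artist), artistsearch_callback);
    },

    'LocationIntent': function(location) {
        app.setSessionAttribute('location', location);
        // step 1 of 2: look up the Songkick metro area ID for the location the user gave us
        // (locationsearch_callback works just like the artist version; see the github repo)
        getJson('https://api.songkick.com/api/3.0/search/locations.json?apikey=' + SONGKICK_API_KEY
            + '&query=' + encodeURIComponent(location), locationsearch_callback);
    },
};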

All the example code shown (and some that isn’t shown) is available on Github here BTW. A couple of things to point out in the code snippet above:

  • you can see the parameter values (artist and location respectively) are passed straight into the Jovo functions that implement the intents. This is why the parameter and slot names defined in Alexa and Dialogflow, as well as the Intent names, need to match.
  • you can save session information with the Jovo app.setSessionAttribute(key, value) syntax, so you can refer back to these values later, within the same session. Once a session is terminated, either by the app or by the user, these values won’t be available anymore. Jovo also comes with features to store data permanently for a user, but I won’t go into this here.

Other than that, this is pretty standard stuff — we’re calling the Songkick API to look up an artist ID or location ID. This is a 2-step process; the callback code will be something like this (I’m just showing the artist-related callbacks; the location intent works in a very similar way, and you can check the github repo to see the full demo code).

First we process the response from the Songkick API we called to search by artist name, and if we got a result, we call the Songkick API again, this time requesting any upcoming gigs for the artist ID we found:
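Again a simplified sketch, reusing the getJson helper from above; the resultsPage/results field names are how Songkick structures its API responses:

function artistsearch_callback(data) {
    const artists = data.resultsPage.results.artist;
    if (artists && artists.length > 0) {
        // step 2 of 2: ask Songkick for the upcoming events of the first matching artist
        getJson('https://api.songkick.com/api/3.0/artists/' + artists[0].id
            + '/calendar.json?apikey=' + SONGKICK_API_KEY, eventsearch_callback);
    } else {
        app.tell("Sorry, I couldn't find " + app.getSessionAttribute('artist') + ' on Songkick.');
    }
}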

(You can see we can now use app.getSessionAttribute(‘artist’) to retrieve the artist name the user was asking us to search for.)

Next, the callback from this API call (eventsearch_callback) processes any events that Songkick knows about for this artist:
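Here’s a sketch of that callback. The json-mask string and the event fields I read out (start date, venue name, city) are my reading of the Songkick response format, and the spoken text is just an example; the notes below explain the reasoning:

const mask = require('json-mask');

function eventsearch_callback(data) {
    const events = data.resultsPage.results.event;
    if (!events || events.length === 0) {
        app.tell("I couldn't find any upcoming shows for " + app.getSessionAttribute('artist') + '.');
        return;
    }
    // keep only the fields we actually need, so the stored session attribute stays small
    const trimmed = mask(data, 'resultsPage(results(event(displayName,start(date),venue(displayName),location(city))))');
    app.setSessionAttribute('events', JSON.stringify(trimmed));
    // the next time the user says "next", we should read out result number 1 (the second event)
    app.setSessionAttribute('resultNum', 1);

    const event = events[0];
    app.ask('The next show for ' + app.getSessionAttribute('artist') + ' is on ' + event.start.date
        + ' at ' + event.venue.displayName + ' in ' + event.location.city
        + ". To hear the next show, say 'next'.",
        "Say 'next' to hear the next show, or 'stop' to end our session.");
}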

A few things to note here:

  • since this API call returns all upcoming concerts for a band or artist, there may well be multiple results. We don’t really want to tell the user about all of them at the same time, so for the purposes of this demo (yes, I realise this is really simplistic), we tell them about the first event, and then prompt them to say “next”, which will tell them about the next event in the list, and so on. (This is why we created the NextIntent in the interaction models at Alexa and Google Home, and it nicely shows off the “State” concept in Jovo; more on this in a minute.)
  • Because the API response from Songkick contains a number of upcoming events, we don’t want to call the API every time the user says “next”. Instead, let’s save the API response as a session attribute, so we can retrieve it later without having to call the Songkick API again. This works in principle, but I found that if the JSON string (i.e. the API response) gets too large, Amazon Alexa no longer handles it. To work around this, I’m using the lovely json-mask Node module to extract only the parts of the JSON response we actually need to store, and then save the much smaller resulting JSON string as a session attribute.
  • We’re also storing a parameter called “resultNum”, so next time the same JSON is processed, we know which element inside the results we should use.

So the idea is that we tell the user about the first result we found, and then wait for the user to say “next”, which will trigger the NextIntent we specified. There’s only one problem with this — how do we know what the user is saying “next” to? Are they saying “next” because they want to see the next gig for the artist they were asking for, or are they saying “next” to see the next gig in the location they were asking for?

This is where Jovo’s concept of States comes in handy:

Rather than using app.ask, we can use app.followUpState(state).ask, which will pass on the “state” value as part of the request sent to Alexa and Google Home and save it as a session attribute:

app.followUpState('ArtistState').ask("The next show for " + app.getSessionAttribute('artist') + " is on " + date + " at " + venue + " in " + city + ". To hear the next show, say 'next'.", "Say 'next' to hear the next show, or 'stop' to end our session.");

The way we then handle the responses coming back to Jovo is to stick the Intents inside States — in our case we want our ‘NextIntent’ implemented separately for both ‘ArtistState’ and ‘LocationState’, since they behave differently.
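The handler structure ends up looking roughly like this (a trimmed-down sketch; the LocationState version and the edge cases are in the github repo):

const handlers = {

    // ... LAUNCH, ArtistIntent, LocationIntent etc. as before ...

    'ArtistState': {
        'NextIntent': function() {
            // pull the trimmed event list and our position in it back out of the session
            const events = JSON.parse(app.getSessionAttribute('events')).resultsPage.results.event;
            const resultNum = app.getSessionAttribute('resultNum');
            if (resultNum >= events.length) {
                app.tell("That's all the shows I know about for " + app.getSessionAttribute('artist') + '.');
                return;
            }
            const event = events[resultNum];
            app.setSessionAttribute('resultNum', resultNum + 1);
            // stay in ArtistState, so another "next" comes back here
            app.followUpState('ArtistState').ask('On ' + event.start.date + ', '
                + app.getSessionAttribute('artist') + ' plays ' + event.venue.displayName
                + ' in ' + event.location.city + ". Say 'next' to hear the next show.",
                "Say 'next' to hear the next show, or 'stop' to end our session.");
        },
    },

    'LocationState': {
        'NextIntent': function() {
            // same pattern, but walks through the events for the location the user asked about
        },
    },
};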

… so now we have defined and implemented a “NextIntent” inside both the ArtistState and the LocationState. In the above example, if the user replies with “next” to app.followUpState(‘ArtistState’).ask(), the NextIntent implementation specified inside ArtistState will be executed, and not the one inside LocationState. Cool huh.

This is pretty much all the coding we’re going to do — of course we have barely scratched the surface of what Jovo can do, but if you’re interested in doing something more fully featured rather than a quick and dirty demo, their documentation is pretty extensive, and contains a number of tutorials too.

There are just two things left to do now — hook up our app with Alexa and Google Home, and give it a test.

Putting it all together

By default, the Jovo app runs locally on port 3000. You can start it simply by typing “node index.js”.

$ node index.js
Local development server listening on port 3000

Of course, this will only be reachable on your own machine; in order to make it available to Amazon Alexa and Google Home, we can use the ngrok NPM module.

$ npm install ngrok -g

Once installed, start it:

$ ngrok http 3000

… and it will show something like this:

ngrok by @inconshreveable                                   (Ctrl+C to quit)

Session Status                online
Version                       2.2.8
Region                        United States (us)
Web Interface                 http://127.0.0.1:4040
Forwarding                    http://b39cb958.ngrok.io -> localhost:3000
Forwarding                    https://b39cb958.ngrok.io -> localhost:3000

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00

Once ngrok is running, if you start your Jovo project again, it will show output like this:

$ node index.js
Local development server listening on port 3000.
This is your webhook url: https://b39cb958.ngrok.io/webhook

It’s this webhook URL we’ll be using to configure our Alexa and Google Home projects, so they will send their requests to our app.

Amazon Alexa

Go back to the Amazon Developer console, and click on “Configuration” for our Alexa Skill:

Select “HTTPS” as the Service Endpoint Type, and enter the ngrok webhook URL under “Default”.

Click “Next” and select “My development endpoint is a sub-domain of a domain that has a wildcard certificate from a certificate authority”:

Click “Next”, and bingo, we’re ready to test, either in the service simulator on the page you’ll see, or via any Alexa device that is connected to the Amazon account you’ve used to set up the Alexa Skill.

Google Home

We go back to Dialogflow, and click on “Fulfillment” in the left-hand sidebar. Enable “Webhook”, and enter our ngrok webhook URL, then save:

Finally, we need to tell Dialogflow to use our webhook for all the Intents we’ve specified: go into each of “Default Welcome Intent”, “ArtistIntent”, “LocationIntent” and “NextIntent”, scroll down to Fulfillment, check the “Webhook” checkbox, then click “Save”. If you skip this for any of these Intents, requests for them won’t be sent to our Jovo app.

That’s it — you can now test your app in the web-based “Simulator” you can see in your “Actions for Google” tab, or say “talk to my test app” to any connected Google Assistant device (e.g. a Google Home device, or an Android phone with Google Assistant).

Live demo!

And here’s what all this sounds like :)
