An adventure with Alexa in Java land.
Alexa and the Amazon Echo seem to be a hot item in tech circles. Recently, I got my hands on an Echo Dot (2nd generation) and found it a great device, especially for listening to music. Nevertheless, I believe that its voice recognition (you might need to say things more than once) and its “intelligence” (it doesn’t seem to understand natural language very well) still need work. I might not be fully convinced yet that voice-controlled devices in their current form will be the next big thing, but that doesn’t mean that playing with and developing for them isn’t fun.
In this post, I will attempt to guide you through the basics of how an Alexa skill (this is what “apps” are called) can be created using Java. I won’t get deep into technical details or provide step-by-step instructions. My aim is to give you a higher-level overview of how a skill works. You can get more technical by looking into the excellent samples repository that Amazon offers.
Skill description
Let’s say we want a skill that accepts a specific movie genre and returns some movies matching that genre. For each movie, you will be able to ask for more details, like rating and/or plot. Cards containing the spoken information will appear in the Alexa app at the same time. Let’s call this skill “Movie Genius”.
Basic skill lifecycle
In the Java SDK, a skill implements the Speechlet interface, whose callbacks are invoked by Alexa at different points of an interaction. The two most important ones are shown below.
@Override
public SpeechletResponse onLaunch(LaunchRequest request, Session session) throws SpeechletException {
[...]
}
This will be called if you launch your skill with no intent. For example, saying “Alexa, open Movie Genius” would call the function above.
@Override
public SpeechletResponse onIntent(IntentRequest request, Session session) throws SpeechletException {
[...]
}
Every other interaction with Alexa calls the function above. For example, it will be called when you say “Alexa, ask Movie Genius for a comedy” or when you respond to a question.
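To make the routing concrete, here is a toy model of the dispatching the Alexa service performs on your behalf. This is only an illustration — the class and strings below are made up, and in a real skill the SDK calls your onLaunch(..)/onIntent(..) for you:

```java
// Toy model of how incoming Alexa requests are routed.
// A LaunchRequest arrives for "Alexa, open Movie Genius";
// an IntentRequest arrives for everything else in the session.
public class ToyDispatcher {

    String handleRequest(String requestType, String intentName) {
        if ("LaunchRequest".equals(requestType)) {
            return onLaunch();
        } else if ("IntentRequest".equals(requestType)) {
            return onIntent(intentName);
        }
        throw new IllegalArgumentException("Unknown request type: " + requestType);
    }

    String onLaunch() {
        // No intent yet: greet the user and wait for input
        return "Welcome to Movie Genius! What kind of movie do you want?";
    }

    String onIntent(String intentName) {
        // A concrete intent was matched; dispatch on its name
        return "Handling intent: " + intentName;
    }
}
```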
The building blocks
The intents
If onIntent(..) handles almost every interaction, how do you distinguish what the user wants each time? The skill declares the intents it supports in the Intent Schema (this is a text box in the Amazon Developer Console):
{
  "intents": [
    {
      "intent": "FirstResultIntent",
      "slots": [
        {
          "name": "genre",
          "type": "LIST_OF_MOVIE_GENRE"
        }
      ]
    },
    [...]
    {
      "intent": "AMAZON.HelpIntent"
    }
  ]
}
In this example there is one custom intent (FirstResultIntent) and one built-in intent (AMAZON.HelpIntent). Slots are a mechanism for passing additional parameters to an intent. For example, in the phrase “Alexa, ask Movie Genius for a comedy”, “comedy” will be mapped to the “genre” slot. LIST_OF_MOVIE_GENRE is just a custom type listing the possible values for the slot. Now, onIntent(..) would look like this:
@Override
public SpeechletResponse onIntent(IntentRequest request, Session session) throws SpeechletException {
    Intent intent = request.getIntent();
    String intentName = (intent != null) ? intent.getName() : null;
    [...]
    if ("FirstResultIntent".equals(intentName)) {
        [...]
    } else if ("AMAZON.HelpIntent".equals(intentName)) {
        [...]
    } else {
        throw new SpeechletException("Invalid Intent");
    }
}
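Inside the FirstResultIntent branch you would read the slot value (with the SDK, roughly intent.getSlot("genre").getValue()). A slot can come back null when Alexa didn’t capture it, so it pays to read defensively. Here is a small self-contained sketch where a plain Map stands in for the intent’s slots:

```java
import java.util.Map;

// Sketch of defensive slot reading. A Map stands in for the intent's
// slots here; in a real skill the value can be null when Alexa did
// not manage to fill the slot from the user's speech.
public class SlotReader {

    static String slotValueOrDefault(Map<String, String> slots, String name, String fallback) {
        String value = slots.get(name); // null when the slot was not captured
        return (value != null) ? value : fallback;
    }
}
```

When the slot is missing, a conversational skill would typically re-ask the question instead of failing.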
The sample utterances
OK, so how do you map the user’s actual speech to these intents? Using the Sample Utterances (again, this is a text box in the Amazon Developer Console):
FirstResultIntent I want a {genre}
FirstResultIntent find me {genre}
FirstResultIntent for a {genre}
FirstResultIntent a {genre}
These show Alexa how to map the phrase “Alexa, ask Movie Genius for a comedy” to the FirstResultIntent and “comedy” to the “genre” slot. For the built-in intents, you don’t need to provide sample utterances.
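Note that this matching happens in Alexa’s cloud service — your code never parses raw speech. Purely as an illustration of the idea, a {slot} placeholder in a sample utterance behaves roughly like a wildcard in the slot position:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: Alexa's service performs this matching for you.
// A sample utterance like "for a {genre}" acts roughly like a pattern
// with a named wildcard where the slot sits.
public class UtteranceSketch {

    static String extractSlot(String sampleUtterance, String slotName, String spoken) {
        // Quote the literal parts, then swap the placeholder for a capture group
        String regex = Pattern.quote(sampleUtterance)
                .replace("{" + slotName + "}", "\\E(?<slot>\\w+)\\Q");
        Matcher m = Pattern.compile(regex).matcher(spoken);
        return m.matches() ? m.group("slot") : null;
    }
}
```

So “for a comedy” against the sample “for a {genre}” yields “comedy” — which is the value your onIntent(..) receives in the slot.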
Asking for input
SpeechletResponse askResponse(String speechText) {
    // Create the reprompt
    String repromptText = "Can you say that again?";
    Reprompt reprompt = new Reprompt();
    PlainTextOutputSpeech repromptSpeech = new PlainTextOutputSpeech();
    repromptSpeech.setText(repromptText);
    reprompt.setOutputSpeech(repromptSpeech);

    // Create the question
    PlainTextOutputSpeech speech = new PlainTextOutputSpeech();
    speech.setText(speechText);

    return SpeechletResponse.newAskResponse(speech, reprompt);
}
// Make Alexa ask a question
askResponse("Hello! What kind of movie do you want?");
The SpeechletResponse object is what you return from onIntent(..) and what Alexa transforms into speech to communicate with the user. Using the snippet above, you can make Alexa ask the user a question to request input. After Alexa plays back your question, it will start listening for the user’s answer. The reprompt text is used when Alexa is not sure about the user’s input and wants them to say it again. This is how you make a conversational skill.
Responding
SpeechletResponse tellResponse(String speechText) {
    PlainTextOutputSpeech speech = new PlainTextOutputSpeech();
    speech.setText(speechText);
    return SpeechletResponse.newTellResponse(speech);
}
This can be used when you want to speak your results without expecting further input from the user. A tell response ends the conversation.
Cards in app
SimpleCard card = new SimpleCard();
card.setTitle("Movie Genius");
card.setContent(cardText);
[...]
return SpeechletResponse.newAskResponse(speech, reprompt, card);
Alexa comes with a companion app for iOS and Android that is used to set the device up. The app can also be used to view Alexa’s activity. Your skill can send useful information to the app that the user might want to see in written form, in addition to hearing it spoken.
Retaining state
You might want to retain state between responses so you can remember what the user told you in previous questions. To do this, use the Session object that is provided in onIntent(..) and onLaunch(..).
// Retain a movie so you can access it when the user
// asks for more info about it in the next interaction
session.setAttribute("ATTR_MOVIE", theMatrix);
// Retrieve the movie on the next onIntent() call
MovieSet theMatrix = (MovieSet) session.getAttribute("ATTR_MOVIE");
More complicated objects, such as collections, might not be serialized/deserialized correctly when set/retrieved as attributes. In that case, you can serialize them on your own (for example, using Gson to convert them to JSON).
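The safest shape for a session attribute is a plain string that you encode and decode yourself. As a stdlib-only sketch of the idea (with Gson you would simply use toJson/fromJson instead — the helper names below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

// Stdlib-only sketch: flatten a collection to a single string before
// storing it as a session attribute, and restore it on the next turn.
// A real skill would rather use a JSON library, since this delimiter
// scheme breaks if a title itself contains the '|' character.
public class SessionCodec {

    static String encode(List<String> titles) {
        return String.join("|", titles);
    }

    static List<String> decode(String stored) {
        return Arrays.asList(stored.split("\\|"));
    }
}
```

You would then store the encoded string with session.setAttribute(..) and decode it in the next onIntent(..) call.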
Setup the infrastructure
An Alexa skill can be hosted on your own hosting provider or on AWS Lambda. The latter is the most popular option for hosting Alexa skills, since Amazon offers the first 1 million requests every month for free. If you are not familiar with AWS Lambda, it’s a cloud service that hosts your code and bills you only while the code is being called and running. That means you are not charged for all the idle time your code spends waiting to be called. It also doesn’t need any server configuration (except for the amount of RAM needed and the maximum time your code is allowed to run).
To prepare a Java program for deployment, you need to create a JAR file with all the necessary libraries/dependencies included (a “fat” JAR). This is a sample build.gradle file that creates such a JAR (look for the fatJar task).
Then you need to set up AWS Lambda to serve your JAR. To do this, follow the instructions in the README.md of one of Amazon’s Java samples (look for the AWS Lambda Setup section).
Finally, you need to set up the Alexa skill in the Amazon Developer Console. Again, follow the instructions in the README.md (look for the Alexa Skill Setup section).
Now you will be able to test your skill on your actual device, from the Amazon Developer Console (look for the Test section), or from the Echo simulator.
Conclusion
I know this was not a complete tutorial, but I hope it gave you a glimpse of how to make your own skill using Java. The full source code for the skill is on GitHub, and the skill is live as well if you want to enable it on your Echo. In my opinion, the samples are the easiest way to learn how to do things in your Alexa skill. Don’t forget that (if you live in the US or the UK) you can get something for just playing around with your device! Happy coding :)