Conversation to Code (Part 2)

Allen Firstenberg
Dec 18, 2018


In the first article in this series, “Thinking for Voice: Design conversations, not logic”, we talked about how the best first step in creating conversational interfaces is to build the conversation. People still wanted to skip this step and get to the code, but we held firm. When last we left our intrepid developers, in “Conversation to Code (Part 1)”, we looked at how to use Dialogflow to handle the Natural Language Processing to determine the Intent that the user expressed in the conversation. Now, in our last chapter, we’ll take those Intents, apply some programming logic (finally!), and return the results.

The bulk of the work in Dialogflow is to take the user utterances, phrases in everyday language, and boil them down to discrete Intents. These Intents distill the many ways a user might phrase something into a single representation of exactly what they are trying to say or do.

When we designed the conversation, we concluded that these Intents could potentially change the current state of the system, and that our replies are based on the current state of the system (which may have just changed) and the Intent that triggered that change. We now need to build the logic that will change the state and determine the reply.

Dialogflow can’t do this itself. Its job is to determine the Intent, and then hand the information off to something else to apply this logic and otherwise process it. This processing is known in Dialogflow as “fulfillment”, and the code that implements the logic runs on a web server and is known as a “webhook”. Dialogflow calls this webhook by issuing an HTTPS request, sending a JSON body with the Intent name and other information, and expects JSON in response containing the state to be maintained and the reply that should be sent to the user. Dialogflow forwards this reply to the Assistant.
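To make that concrete, here is a heavily abbreviated sketch of the exchange. The real payloads carry many more fields, and the exact shape depends on the Dialogflow API version; the field names here follow the v2 webhook format:

// Abbreviated request from Dialogflow (most fields omitted):
{
  "session": "projects/your-project/agent/sessions/abc123",
  "queryResult": {
    "queryText": "yes",
    "intent": { "displayName": "yes" },
    "outputContexts": [ /* state we asked Dialogflow to maintain */ ]
  }
}

// Abbreviated response from our webhook:
{
  "fulfillmentText": "42 is the answer. Interested in another?",
  "outputContexts": [ /* updated state */ ]
}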

Our task at this point, then, is to create a webhook that will get the Intent and current state, change the state, and send back a reply. Fortunately, we have a lot of tools available to help us with this. We’ll look at two of those tools, the actions-on-google library for node.js and the multivocal library for node.js. We’ll also briefly touch on how to do this in other programming languages. (There is a third popular library for node.js called dialogflow-fulfillment. It works very similarly to actions-on-google, at least in concept.)

Fulfillment with the actions-on-google library

There is some boilerplate that we’ll be setting up when we use the actions-on-google library. We need to import the conversation object that can work with Dialogflow and set up the component that listens for HTTPS messages using Firebase Cloud Functions.

// Import the Dialogflow service function
const {dialogflow} = require('actions-on-google');

// The action object will route requests to our Intent handlers
const action = dialogflow();

// Listen for HTTPS requests using Firebase Cloud Functions
const functions = require('firebase-functions');
exports.webhook = functions.https.onRequest(action);

We’ll also use the action object to register all of our Intent handlers. We have one for each Intent, registered against the name of the Intent, and calling a function that takes a conv parameter representing specific state about this conversation with the user (if you’re familiar with HTTP, it contains both the request and the response objects).

For the “welcome” Intent, it might look like this:

action.intent('welcome', conv => {
  let replyState = setReplyState( conv, 'prompt' );
  let intent = getIntentName( conv );
  sendReply( conv, intent, replyState );
});

In this handler, we call three functions, which we’ll discuss in a little bit, but they do exactly what we determined they would do as we designed the conversation:

  1. Change our state to “prompt” (and store what it is for use in the reply).
  2. Get what our current Intent name is to use in the reply.
  3. Send a reply (through the conv object) based on our Intent and state.

Most of the other Intent handlers look similar. You might remember that our “yes” Intent is slightly different. Instead of changing our state, we just need to get what it is. So it looks a little different:

action.intent('yes', conv => {
  let replyState = getReplyState( conv );
  let intent = getIntentName( conv );
  sendReply( conv, intent, replyState );
});

The methods that set and get the reply state, and the method that gets the Intent name are all fairly straightforward. They simply access properties that are set on the conv object by the library. In the actions-on-google library, there is a conv.data property that contains data that we want preserved for this session.

function getReplyState( conv ){
  return conv.data['replyState'];
}

function setReplyState( conv, state ){
  conv.data['replyState'] = state;
  return state;
}

function getIntentName( conv ){
  return conv.intent;
}

The sendReply() function is a little larger. It references an allReplies object that contains an array of replies for each possible state (plus a special one for the welcome Intent). We have more than one reply for each state because we don’t want to sound like a robot, saying the same thing every time; it is best practice to vary our replies. Our sendReply() function will pick one of these.

const numberReplies = [
  "42 is the answer. Interested in another?",
  "21 is blackjack. Would you like to hear another?"
];

const allReplies = {
  welcome: welcomeReplies,
  prompt: promptReplies,
  letter: letterReplies,
  number: numberReplies
};

(Other reply arrays have been omitted for brevity.)

Finally, we get to the sendReply() function. As the comments indicate, it determines which set of replies to use from allReplies, picks one at random, and sends it back to the user.

function sendReply( conv, intent, replyState ){

  // Replies are usually based on the reply state,
  // unless this is the welcome intent.
  let repliesNamed = replyState;
  if (intent === 'welcome' && replyState === 'prompt'){
    repliesNamed = 'welcome';
  }

  // Get the replies associated with this name
  let replies = allReplies[repliesNamed];

  // Pick one of them randomly
  let pick = Math.floor( Math.random() * replies.length );
  let reply = replies[pick];

  // Send it
  conv.add( reply );
}

You can see all the code at https://github.com/afirstenberg/examples/tree/master/conversation-to-code-2-aog.

There are a few things about this approach that are worth mentioning:

  • The Intent handlers directly map to our design and to the conversation. We can go back and look at the conversations we talked about and verify that we are doing what we said we would do in each step.
  • Our Intent handlers don’t, directly, send a reply. They collect the information needed to create a reply. A separate function is responsible for actually making the reply based on these components.
  • A result of this is that our business logic is separate from the specific messages we have to say. This is highly desirable.
  • Intent handlers can call other functions. There is nothing magical about a handler, since it is just a function itself.

If we were using the dialogflow-fulfillment library, we would find much of the code would be very similar. The biggest differences are in the boilerplate code that is necessary and the parameters that are sent to the handlers. The logic, however, can easily be adapted for either library.
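For the curious, the dialogflow-fulfillment boilerplate looks roughly like this. This is a sketch rather than a drop-in replacement for the code above; the handler bodies would carry the same logic we wrote earlier:

const functions = require('firebase-functions');
const { WebhookClient } = require('dialogflow-fulfillment');

exports.webhook = functions.https.onRequest((request, response) => {
  // The agent object plays the role that conv played above
  const agent = new WebhookClient({ request, response });

  // Handlers are registered in a Map keyed by Intent name
  const intentMap = new Map();
  intentMap.set('welcome', agent => {
    // ...the same logic as before, replying with agent.add( reply )
  });

  agent.handleRequest(intentMap);
});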

There are some things that point to improvements that we could make:

  • This works well in English, but how will we handle Internationalization?
  • Our replies are all strings. What do we do if some portion of the reply needs to come from the user or a database? Is there an easy way we can template this?
  • If we need to add or change our replies, do we need to re-release our code?
  • All of our replies give an answer, but then prompt the user for something. Is there a way we can separate these two parts where appropriate?
  • The sendReply() function looks fairly generic, so in theory we can use it for other Actions we may write, but there are some parts of it that are very tailored for this one (handling the “welcome” Intent). Is there a way we could improve this process?

While we can certainly write more code to address these issues, and the code we’ve written here can be adapted to do so, they also suggest that perhaps these and other best practices can be encoded directly in a library.

A different approach with multivocal

The multivocal library seeks to codify many of these best practices, making them easier for developers to take advantage of. Some design elements of the library encourage this. For example:

  • Replies are meant to be done through configuration, rather than code. This configuration can be in the code, stored in a database, or some combination.
  • In addition to replies, developers can also configure a list of suffixes to the replies that will be appended.
  • Both replies and suffixes are actually templates, drawing their values from an “environment” that is initially populated by the library based on the values passed from Dialogflow and the configuration (see the sketch after this list).
  • Developers are encouraged to add values to the environment that are used by the templates to determine which replies and suffixes should be used and what information should be filled into the templates. Once eligible replies/suffixes are determined, the library picks one and sends it to Dialogflow.
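For example, a reply template can pull a value straight out of the environment. A hypothetical sketch (the {{...}} placeholders are evaluated against the environment; answer is a lowercase name we would set ourselves in a handler):

const templatedNumberReplies = [
  "{{answer}} is the answer. Interested in another?",
  "I picked {{answer}}. Would you like to hear another?"
];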

Despite these additions, we’ll see that many aspects are very similar between the actions-on-google library and multivocal. If anything, multivocal has taken the concepts and simplified them.

For example, the setup boilerplate has been reduced to a bare minimum.

const Multivocal = require('multivocal');
exports.webhook = Multivocal.processFirebaseWebhook;

All Intent handlers are passed the environment and are expected to return a Promise that resolves into an environment. If necessary, the handler may modify the environment, and several default handlers exist to do this.

function stateDefault( env ){
  env.Outent = `Outent.${env.Session.State.replyState}`;
  return Multivocal.handleDefault( env );
}

Multivocal.addIntentHandler('welcome', env => {
  env.Session.State.replyState = 'prompt';
  return stateDefault( env );
});
Multivocal.addIntentHandler('yes', env => stateDefault(env));

We also see how the environment env is used in stateDefault() and in the Intent Handler to build additional values in the environment from other values that have been set.

  • The environment setting “Session/State” contains state that we (and multivocal) preserve during this session with the user. We can set it to a particular value by setting env.Session.State.replyState to the value we want in the handler.
  • We’ll then use this in stateDefault() to set the “Outent” setting to a string based on the value in the environment (the new replyState that we may have just set).

You may have noticed that some environment settings start with capital letters. Multivocal reserves environment settings that start with a capital letter for its own clearly defined uses, guaranteeing that names starting with a lowercase letter can be used by you without conflicting with anything in the future. In the example above, “State” is defined by Multivocal to be used to store the user or session state, while “replyState” is guaranteed to be available for us to use for our own purposes.

The “Outent” environment defines which replies will be used. Multivocal will search for replies based on what is set in the “Outent” environment and, if not found, will then search for replies based on what is in “Action” and “Intent”. Since we are defining “Outent” in all of our handlers, we’ll need configuration that stores this. To look at just a part:

const undResponsePrompt = [
  {
    Base: {Set: true},
    Criteria: "{{is IntentName 'welcome'}}"
  },
  "Welcome!",
  "Ahoy!",

  {
    Base: {Set: true},
    Criteria: "{{isnt IntentName 'welcome'}}"
  },
  "Ok.",
  "Gotcha."
];

const undResponseLetter = [
  "The letters A and I are also words.",
  "The letter W is called a double V in French."
];

// (Other response and suffix arrays, such as undResponseNumber
// and undSuffixPrompt, have been omitted for brevity.)
const config = {
  Local: {
    und: {
      Response: {
        "Outent.letter": undResponseLetter,
        "Outent.number": undResponseNumber,
        "Outent.prompt": undResponsePrompt
      },
      Suffix: {
        "Outent.letter": undSuffixLetterNumber,
        "Outent.number": undSuffixLetterNumber,
        "Outent.prompt": undSuffixPrompt
      }
    }
  }
};
new Multivocal.Config.Simple(config);

All the code and configuration is available at https://github.com/afirstenberg/examples/tree/master/conversation-to-code-2-multivocal.

Although this may look complex, it is structured to allow flexibility and to follow some internationalization best practices. This part of the configuration contains localized data (hence “Local”) for an undefined (“und”) language and region, so if no other locale matches, this will be used. We are defining both responses and suffixes for specific Outents. While this is a fairly simple configuration, stored in our program, other configuration options exist that allow us to store some or all of our configuration in a database such as Firestore.
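To illustrate the internationalization point, supporting another language should mostly be a matter of adding another locale block alongside und. A sketch, assuming locales are keyed by their language codes the same way und is (the French strings are our own):

const config = {
  Local: {
    und: {
      // ...the fallback configuration shown above...
    },
    fr: {
      Response: {
        "Outent.number": [
          "La réponse est 42. Une autre ?",
          "21, c'est le blackjack. Une autre ?"
        ]
        // ...responses and suffixes for the other Outents...
      }
    }
  }
};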

The undResponseLetter seems simple enough, but why is undResponsePrompt more complicated? And where do we explicitly define the prompts for our welcome Intent? We have moved the logic for determining which prompts to use during the “prompt” state into the configuration itself. The objects containing the Base and Criteria parameters indicate that the values following them should only be considered in certain cases: here, when the IntentName environment setting is (or isn’t) set to “welcome”.

The multivocal library has all the same advantages that we saw with our earlier design using actions-on-google, plus we see several additional advantages:

  • Internationalization is built-in. All we need to do is add additional templates for additional locales. The library takes care of the rest.
  • Replies and suffixes are templates, rather than strings, that are populated based on values we can set in the environment.
  • We can also change the templates without having to redeploy code by storing the configuration in a database.
  • The library takes care of boilerplate, letting us focus on our business logic.

Although both libraries have efforts underway to provide support for other programming languages, these efforts are still in their early stages. Until they are realized, can you build an Action in another language, or are you forced to use node.js? (Spoiler: any language!)

Logic in any language!

If you’re not using node.js, however, you may be wondering how you can apply these same concepts to your own language. Since the concepts themselves are pretty straightforward, you should be able to adapt them easily to your environment of choice while the library support catches up.

There are a few things you should be armed with that are language specific:

  • You’re going to need to be able to set up a web application or webhook on a public HTTPS server somewhere.
  • You should know how to get the request that was made to this webhook and how to send back a reply.
  • The request body will be encoded using JSON, so you should know how to decode this into something your language will understand (usually an associative array or map object) and be able to generate JSON as a reply.

Some examples of the JSON formatted requests and responses through Dialogflow have been documented, and serve as a good start to deciphering what you get, and ensuring that your responses are valid.

As part of the response, make sure you set the conversation state so you’ll be passed this state on the next round. This is best done using a Dialogflow Context.

Beyond this, the procedure is straightforward, but will depend a little on your exact programming environment (a minimal sketch follows the list):

  1. You’ll get the Intent name and the response state that you’ve stored in a Context, both as part of the JSON.
  2. Depending on the Intent name, possibly change the response state.
  3. Based on the response state and the Intent name, pick the response from a list of possible responses. You can store this list in the code, in a database, or wherever is convenient.
  4. Build the JSON, including the response you’ve chosen and the new value for the Context, and send this JSON back.
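To show how little magic is involved, here is a minimal, library-free sketch of those four steps, written in plain node.js since that is the language the rest of this article uses, although every line translates directly to other environments. The field names follow the Dialogflow v2 webhook format, and the Context name session-state is our own choice:

const http = require('http');

// One list of replies per state; other states omitted for brevity
const replies = {
  prompt: ["Would you like to hear about a letter or a number?"],
  number: ["42 is the answer. Interested in another?"]
};

http.createServer((req, res) => {
  let body = '';
  req.on('data', chunk => body += chunk);
  req.on('end', () => {
    const request = JSON.parse(body);

    // 1. Get the Intent name and the state stored in our Context
    const intent = request.queryResult.intent.displayName;
    const contexts = request.queryResult.outputContexts || [];
    const stateContext = contexts.find(c => c.name.endsWith('/contexts/session-state'));
    let state = stateContext ? stateContext.parameters.replyState : undefined;

    // 2. Depending on the Intent, possibly change the state
    if (intent === 'welcome') {
      state = 'prompt';
    }

    // 3. Pick a response based on the state
    const candidates = replies[state] || ["I'm not sure what to say."];
    const reply = candidates[Math.floor(Math.random() * candidates.length)];

    // 4. Build the JSON, including the reply and the updated Context
    const json = {
      fulfillmentText: reply,
      outputContexts: [{
        name: `${request.session}/contexts/session-state`,
        lifespanCount: 99,
        parameters: { replyState: state }
      }]
    };
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(json));
  });
}).listen(8080);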

Where do we go from here?

There are many things left to explore about building conversations and turning them into code including:

  • What happens if the user says something we don’t expect?
  • What if the user says nothing at all?
  • These conversations used very simple replies. How can we accept more flexible responses from the user about a range of things (colors, products, etc)? What if we need a lot of information from the user?
  • How can we get a reply by calling another API or getting data out of a database? If the API we’re calling is protected for each user (their calendar, for example), how can we get permission?
  • Should we treat new users differently than repeat visitors? How do we handle that?
  • Are there ways we can have a mostly one-sided conversation?
  • How can we handle multiple languages and regions?

We’ll talk about all of these conversational questions, and their technical solutions, in later posts. We’ll also look at some specifically technical issues about working with the Assistant, but keep in mind that all of them need to be used in a conversational context. Even if you think you want to dive into the code, I hope these articles have shown you how starting on the right foot by designing the conversation makes your coding a lot easier.
