Tips and gotchas using Alexa custom slots

Published in voiceflow · 7 min read · Oct 16, 2017

I see a lot of similar questions about custom slots in the Alexa Developer Forums, so I am summarizing the key points here to help new Alexa developers as well as to clarify my own understanding.

1. Custom Slots are not restricted to the Values you provide.

It’s not initially clear or intuitive that a custom slot is not restricted to the sample slot values you define for it. In fact, the sample values you provide for a Custom Slot Type are just a ‘guideline’ to Alexa.

For example, consider a Skill (My Butler) with just one intent (Paint) which the user might invoke with:

“Alexa, ask My Butler to paint the house blue/green/red”

Rather than define a separate Intent for each possible colour, we use a Custom Slot (named ‘colour’) to store the value requested, and define a Custom Slot Type for it. The resulting Intent Schema would be:

{
  "intents": [
    {
      "slots": [
        {
          "name": "colour",
          "type": "COLOUR_TYPE"
        }
      ],
      "intent": "PaintIntent"
    }
  ]
}

And we provide initial values for the Custom Slot Type (COLOUR_TYPE): a handful of one-word colours such as blue, green and red.

Finally, we have a single utterance for the intent:

PaintIntent to paint the house {colour}

Given the scenario above, what will the slot value be when the user says “Alexa, ask My Butler to paint the house red” ?

Yes, it will be ‘red’.

However, if the user asks for a colour not defined for the slot, Alexa will happily send that colour too. For example, “Alexa, ask My Butler to paint the house pink” sends ‘pink’ even though that was not one of the initial values.
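Since Alexa can pass through values outside your sample list, the skill code has to decide what to do with them. Here is a minimal sketch in Python, assuming the standard IntentRequest JSON layout (request → intent → slots → name → value). The helper names `get_slot_value` and `handle_paint`, the `KNOWN_COLOURS` set and the reply strings are made up for illustration.

```python
# Assumed sample values for COLOUR_TYPE; adjust to match your own slot type.
KNOWN_COLOURS = {"red", "green", "blue"}


def get_slot_value(event, slot_name):
    """Return the raw slot value from an IntentRequest, or None if it is absent."""
    slots = event.get("request", {}).get("intent", {}).get("slots") or {}
    return (slots.get(slot_name) or {}).get("value")


def handle_paint(event):
    """Hypothetical PaintIntent handler that checks the slot value itself."""
    colour = get_slot_value(event, "colour")
    if colour is None:
        return "Which colour would you like?"
    if colour.lower() not in KNOWN_COLOURS:
        # Alexa passed through a value outside the sample list (e.g. 'pink').
        return f"Sorry, I don't have {colour} paint."
    return f"Painting the house {colour}."


# A trimmed-down request, as it might arrive for "paint the house pink".
event = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "PaintIntent",
            "slots": {"colour": {"name": "colour", "value": "pink"}},
        },
    }
}
print(handle_paint(event))
```

The point is simply that validation is your job: never assume the slot value is one of the values you listed.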

2. The number of words in the values provided for your Custom Slot Type is important.

Building on the previous example, what do you think the slot value will be if the user says “Alexa, ask My Butler to paint the house bright red” ?

Sorry if you were expecting it to be ‘bright red’… Alexa just sends ‘red’ !

What’s happening here? It seems that since we only provided one-word slot values when defining the custom slot, Alexa concludes that we only want a one-word value.

This can easily be fixed by providing some 2-word or 3-word phrases as the initial values for the Custom Slot Type.

Once we provide 2-word (‘bright red’) and 3-word (‘dark sea green’) values for the slot type, Alexa can detect multiple words in the slot.

Note that even though we only defined 1-word, 2-word and 3-word slot values, Alexa will gladly pass through 4 words or more.

Bottom-line: if you only define 1-word slot values for a Custom Slot Type, expect Alexa to only return a single word for the corresponding slot. Define some 2-word or 3-word slot values to pass through everything the user says.
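A quick way to check whether a slot type’s sample values cover a mix of word counts is to count the words in each value. This is just a sketch: the function names and the `max_words=3` default are my own choices, not anything Alexa provides.

```python
from collections import Counter


def word_count_histogram(values):
    """Map number of words -> how many sample values have that many words."""
    return Counter(len(value.split()) for value in values)


def missing_word_counts(values, max_words=3):
    """List the word counts from 1 to max_words that no sample value covers."""
    seen = word_count_histogram(values)
    return [n for n in range(1, max_words + 1) if seen[n] == 0]


values = ["red", "green", "blue"]
print(missing_word_counts(values))  # [2, 3]

values += ["bright red", "dark sea green"]
print(missing_word_counts(values))  # []
```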

3. How do you just capture all input from the user?

The previous example shows that defining a Custom Slot Type with a mix of 1-word, 2-word and 3-word values effectively allows Alexa to pass through any text which matches those initial values as well as anything else said by the user.

However, the ‘traditional’ way to do this is with the AMAZON.LITERAL slot-type which is supposed to send through everything which the user says.

Defining the Intent Schema is straightforward:

{
  "intents": [
    {
      "slots": [
        {
          "name": "CatchAllSlot",
          "type": "AMAZON.LITERAL"
        }
      ],
      "intent": "CatchAllIntent"
    }
  ]
}

The key gotcha is that a slot of type AMAZON.LITERAL requires sample slot values within the curly brackets that define the slot in the utterance, as mentioned here: https://developer.amazon.com/docs/custom-skills/custom-interaction-model-reference.html#literal

i.e. normally we would define an utterance as follows (but the Alexa Developer Console will throw an error when you try to hit ‘Save’):

CatchAllIntent check flight {CatchAllSlot}      <-- THIS IS WRONG !

Instead, you actually have to use something like this:

CatchAllIntent check flight {flight number|CatchAllSlot}

What’s up with placing representative slot values as part of your sample utterances? This is apparently due to legacy reasons. However, the paragraph which follows in the AMAZON.LITERAL documentation is pure gold and essentially explains the earlier points:

Note the following rules and recommendations.
Include samples with different numbers of words for the slot value:
- Samples with the minimum number of words you expect for the slot value.
- Samples with the maximum number of words for the slot value.
- Samples with all varying numbers of words between the minimum and the maximum expected.
These samples should always include only slot values that represent actual phrases the user might say. Do not use meaningless placeholder words in the sample phrase just to fill the slot with the right number of words. Instead, fill the sample slot value with real-world examples of the data you want to collect in the slot.

[Sidenote: I actually find it hard to abide by the ‘Do not use meaningless placeholder words in the sample phrase just to fill the slot with the right number of words’ rule all of the time.]

How do you squeeze all of the sample slot values into the curly brackets? You don’t. You have to create multiple utterances. An example is provided in the documentation, which also notes:

If you are using the AMAZON.LITERAL type to collect free-form text with wide variations in the number of words that might be in the slot, note the following:
- Covering this full range (minimum, maximum, and all in between) will require a very large set of samples. Try to provide several hundred samples or more to address all the variations in slot value words as noted above.
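Writing hundreds of these utterances by hand gets old quickly, so it can help to generate them from a list of sample phrases. Here is a small sketch, assuming the `{sample value|SlotName}` syntax shown earlier. The function name is hypothetical, and every sample phrase apart from ‘flight number’ is invented for illustration.

```python
def literal_utterances(intent, carrier_phrase, slot_name, samples):
    """Build one sample utterance per phrase, in {sample|SlotName} form."""
    return [
        f"{intent} {carrier_phrase} {{{sample}|{slot_name}}}"
        for sample in samples
    ]


# Made-up sample phrases with varying word counts.
samples = [
    "status",
    "flight number",
    "united two forty",
    "the next departure to boston",
]

for line in literal_utterances("CatchAllIntent", "check flight", "CatchAllSlot", samples):
    print(line)
```

You would still want the samples themselves to be real phrases users might say, per the documentation’s advice above.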

NOTE:

Comparing the AMAZON.LITERAL interaction definition with the Custom Slot Type described in part 2, you can see that they are essentially the same thing. A Custom Slot Type requires you to provide samples in the slot type definition, while AMAZON.LITERAL requires you to provide samples in the utterance. Or as Michael Palermo writes on the Alexa blog, “Why a Custom Slot is the Literal Solution”.

4. Don’t try to take extended input from the user

Want to build a Skill which you can dictate text to — say, to send messages or for note-taking? Don’t bother, it’s not possible yet.

Alexa uses an algorithm to determine when the user has finished speaking, based principally on duration of silence between words, and perhaps also considering speech cadence, words spoken, last word(s) detected.

Amazon obviously strives to get the balance just right — use too short a break to signify end-of-utterance and more people will be cut off mid-sentence; use too long a break to signify end-of-utterance and it introduces unnecessarily long pauses in even the shortest of commands, which would ruin the user experience.

For the most part, Alexa gets it right. However, as anybody who has tried dictating voice notes to Siri knows, we pause a lot when speaking and those pauses are interpreted as completion of task.

This too is tucked away in the AMAZON.LITERAL documentation:

- Keep the phrases within slots short enough that users can say the entire phrase without needing to pause.
- Lengthy spoken input can lead to lower accuracy experiences, so avoid designing a spoken language interface that requires more than a few words for a slot value. A phrase that a user cannot speak without pausing is too long for a slot value.

Also, the support team in the Alexa Developer Forum have mentioned that there is an 8–10 second hard cutoff when taking user input.

Bottom-line: don’t design a Skill which allows users to dictate long messages to Alexa. Not yet, anyway.

(I hope for a day when the length of the user input will be programmable, perhaps by way of voice procedure.)

5. Expect high inaccuracy for spelling and acronyms.

Alexa has a LOT of difficulty recognizing individual letters. But then again so do humans — spell your name over the phone and you’ll be asked whether that was a D or a T, a B or a V or perhaps even P.

It’s why we have the military phonetic alphabet.

Combining this and the previous point, spelling games aimed at young kids are just a disaster — many children don’t enunciate the letters correctly, and Alexa interprets their pauses (as they think of the next letter) to mean they’ve finished speaking. It’s simply an exercise in frustration for the developer and the poor child alike !
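If a skill really does need letter-by-letter input, one possible workaround is to ask users to spell with the phonetic alphabet and decode the words in the handler. I haven’t tested how well Alexa recognizes these words, so treat the following as a rough sketch only. The `decode_spelling` name and the accepted spelling variants are my own assumptions.

```python
# NATO phonetic alphabet, plus a few common spelling variants.
PHONETIC = {
    "alfa": "a", "alpha": "a", "bravo": "b", "charlie": "c", "delta": "d",
    "echo": "e", "foxtrot": "f", "golf": "g", "hotel": "h", "india": "i",
    "juliett": "j", "juliet": "j", "kilo": "k", "lima": "l", "mike": "m",
    "november": "n", "oscar": "o", "papa": "p", "quebec": "q", "romeo": "r",
    "sierra": "s", "tango": "t", "uniform": "u", "victor": "v",
    "whiskey": "w", "xray": "x", "x-ray": "x", "yankee": "y", "zulu": "z",
}


def decode_spelling(slot_value):
    """Turn 'delta oscar golf' into 'dog'; return None if a word is not recognised."""
    letters = []
    for word in slot_value.lower().split():
        if word in PHONETIC:
            letters.append(PHONETIC[word])
        elif len(word) == 1 and word.isalpha():
            # Accept a bare letter if Alexa happened to hear it cleanly.
            letters.append(word)
        else:
            return None
    return "".join(letters)


print(decode_spelling("delta oscar golf"))  # dog
```

This does nothing about the pause problem, of course, and it is hardly suitable for young children.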

OK, that’s enough for now. Thanks for reading and hope this helps somebody.

p.s. There’s actually something else important you should know about Custom Slots… they can affect the Intent-mapping. This is a huge potential gotcha:

How Utterances & Slot samples affect Intent-matching in Alexa Skills
