“Computer! Tea, Earl Grey, Hot”: Offline Voice on NodeJS, Part II

David Bartle
Published in Picovoice
Feb 11, 2021

This article continues our journey of creating the voice interface for the food replicator from Star Trek, which started in Part I.

In Part I, we created a new NodeJS project and used the Porcupine wake word engine to detect the trigger word “Computer”. Now we’ll handle the follow-on command: “Tea, Earl Grey, Hot”.

Running the “Replicator” Voice AI in NodeJS

If you wish, you can use Porcupine to wake up the application, then forward all the subsequent audio data to Amazon or Google’s cloud services. Instead, we’re going to create a Speech-to-Intent context: a bespoke speech model that is tuned for this purpose, using minimal resources. The audio data never needs to leave the device, providing intrinsic privacy and reliability.

Creating the Replicator Context

We’ll use Picovoice Console to create a custom speech recognition model that understands “replicator” commands, using the Rhino Speech-to-Intent engine.

Go to the “Rhino” section. Enter “Replicator” for the context name, select English (en) for the language, and leave the template at its default, “Empty”. Click “Create”, then click the newly-created context name to open the Rhino editor.

Creating a new Rhino context in Picovoice Console

The Rhino editor lets you create intents. Each intent is composed of expressions, which are defined in a straightforward grammar. When a speech utterance matches an expression, the intent is detected (and optionally, specific details of the utterance are also included).

By default, a Rhino context understands nothing. This is in direct contrast to Speech-to-Text, which must comprehend every possible phrase in a natural language and transcribe it into text. The contrast also explains why a bespoke speech domain, like food ordering, can be dramatically optimized in both resource requirements and accuracy.

The Rhino editor lets you design a context in your browser, with live pronunciation and grammar checking. For brevity, I’ve already created and exported a Rhino context using Picovoice Console.

Below is the basic context (in YAML format) that will handle the phrasing we want and allow for a variety of temperatures and teas. For the purposes of demonstration, it is skeletal, but Rhino has no inherent technical limit on the size of a context.

context:
  expressions:
    makeTea:
      - tea $flavor:flavor $temperature:temp
  slots:
    flavor:
      - orange pekoe
      - green
      - jasmine
      - black
      - earl grey
    temperature:
      - cold
      - cool
      - warm
      - toasty
      - hot

Use the YAML import feature at the bottom-right of the console to import the context:

Importing a pre-built context in YAML format

The “makeTea” intent listens only for the (somewhat robotic!) phrasing used by Picard. You can expand the list of expressions to allow more natural phrasing, as sketched below, but the terse version will suffice for our demo.
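Rhino’s expression grammar supports optional words in square brackets and alternatives in parentheses. Purely as a sketch (it isn’t part of the demo context, and the editor’s live validation should have the final say), a looser expression reusing the same slots might look like:

makeTea:
  - tea $flavor:flavor $temperature:temp
  # expressions that begin with an optional group must be quoted in YAML
  - "[please] (make, brew) [me] [a] [cup of] $temperature:temp $flavor:flavor tea"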

The slot types (“flavor”, “temperature”) are enumerations of the phrases allowed in those positions. When an utterance is recognized, we capture the specific slot values that were spoken (e.g. “earl grey”, “hot”).

Using the Rhino editor to define the voice command structure

Try using the microphone button on the right-hand side to run the context in the browser. The context will save, rebuild, and then start listening. After it displays “Listening for microphone input…”, say “tea, earl grey, hot”. You should see the following result:

Testing the context in the Console

You’ll notice that if you try saying something else, like “tell me a joke”, it won’t understand. This speech context is focused on the task at hand.

Click “Train”, select your platform, and then click “Train” to confirm.

Selecting the target platform for our model training

The model will complete training in about 5 seconds. You can download it via the “Models” tab. Extract the .rhn file from the Zip archive and move it to the replicator folder. Let’s rename it to replicator.rhn to match the code in the forthcoming gist.
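At the command line, the extract-and-rename step looks something like this (the actual file names vary by platform and model version, so the angle-bracketed names below are placeholders):

cd replicator
unzip <downloaded-model>.zip
mv <extracted-name>.rhn replicator.rhn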

Swapping Porcupine for Picovoice SDK

The Picovoice SDK for NodeJS includes Porcupine and Rhino as dependencies. It’s possible to use each engine individually, but the Picovoice SDK simplifies things by making a few assumptions about how to use them together.

Swap out the Porcupine dependency for Picovoice:

npm remove @picovoice/porcupine-node
npm install @picovoice/picovoice-node

Update index.js with this gist, which uses the Picovoice SDK:

Using the Picovoice SDK for NodeJS to handle both “Computer” and beverage choice

Make sure to replace ${ACCESS_KEY} with your AccessKey, obtained from Picovoice Console as described in Part I.
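The gist boils down to the Picovoice object (shown under “Understanding the Code” below) plus a processing loop. As a minimal sketch of that loop, where getNextAudioFrame() is a stand-in for the microphone capture code from Part I:

// Sketch of the gist's processing loop (not verbatim).
// `getNextAudioFrame()` is a placeholder for the mic capture from Part I;
// it must return frames of `picovoice.frameLength` 16-bit samples,
// recorded single-channel at `picovoice.sampleRate` (16 kHz).
while (isListening) {
  const frame = getNextAudioFrame();
  picovoice.process(frame); // feeds Porcupine or Rhino, whichever is active
}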

Running the Demo

Let’s try running the upgraded demo:

$ node index
Listening for 'COMPUTER'...
Press ctrl+c to exit.
Detected 'COMPUTER'
What beverage would you like?
context:
  expressions:
    makeTea:
      - tea $flavor:flavor $temperature:temp
  slots:
    flavor:
      - orange pekoe
      - green
      - jasmine
      - black
      - earl grey
    temperature:
      - cold
      - cool
      - warm
      - toasty
      - hot
  macros: {}
Inference:
{
  isUnderstood: true,
  intent: 'makeTea',
  slots: { flavor: 'earl grey', temp: 'hot' }
}

Thus, we have our basic replicator voice interface. The application will automatically switch back to listening for “Computer”, so you can try out different combinations if hot earl grey isn’t your favorite.

Inference:
{
  isUnderstood: true,
  intent: 'makeTea',
  slots: { flavor: 'green', temp: 'toasty' }
}
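And if you say something outside the context (“tell me a joke” again), the inference callback still fires, but the result reports that the utterance wasn’t understood, with no intent or slot values, along these lines:

Inference:
{
  isUnderstood: false
}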

Understanding the Code

The Picovoice SDK offers more abstraction on top of the underlying Porcupine and Rhino engines.

We provide the Picovoice constructor with callback functions for wake word and inference events:

const porcupineCallback = (keyword) => {
  console.log(`Detected 'COMPUTER'`);
  console.log(`What beverage would you like?`);
  // Print the context source YAML as a cheat sheet of supported commands
  console.log(picovoice.contextInfo);
};

const rhinoCallback = (inference) => {
  console.log("Inference:");
  console.log(inference);
  console.log();
  console.log();
  console.log(`Listening for 'COMPUTER'...`);
};

// `Picovoice`, `BuiltinKeyword`, and `contextPath` (the path to
// replicator.rhn) are defined earlier in the gist.
const picovoice = new Picovoice(
  "${ACCESS_KEY}", // replace with your AccessKey from Picovoice Console
  BuiltinKeyword.COMPUTER, // built-in "Computer" wake word
  porcupineCallback, // fires when the wake word is detected
  contextPath, // path to replicator.rhn
  rhinoCallback // fires when Rhino concludes an inference
);

We also print out “contextInfo”, the source YAML we used to create the context, as a cheat sheet for the commands it understands.
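One housekeeping note if you adapt this code: the engines hold native resources, so release them when you’re done. A minimal sketch, assuming you exit via ctrl+c:

// Free the native resources held by Porcupine and Rhino before exiting.
process.on("SIGINT", () => {
  picovoice.release();
  process.exit();
});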

From here, you can boldly go and create your own Voice UI and run it with the Picovoice SDK for NodeJS.
