Make Your Own Conversational AI/Social Robot with robokit — A Simple Approximation of Jibo

Andrew Rapo
15 min read · Mar 7, 2019


UPDATE: August 16, 2019 — Updated to use the latest Azure speech API (v0.0.3). Details about the use of Azure Cognitive Services are in this article: https://medium.com/@andrew.rapo/robokit-the-cognitive-services-parts-34f57218febf

robokit

[Not too long ago…] I was working at Jibo, Inc. and developing tools to help designers explore voice-driven, character-based, social robot interactions. Then in May (2018), Jibo (the company) suddenly shut down and left many of us wondering how we would fill the void when Jibo (the robot) stopped responding. I made robokit as an homage to Jibo and as a way to continue experimenting with conversational AI. (https://wwlib.org/robokit/)

robokit has an interaction model that is similar to Jibo’s. Voice interactions are initiated with a wake word (“Hey, robo”). Audio is sent to a cloud ASR service (Automatic Speech Recognition) for transcription. The text transcript is sent to a cloud NLU service (Natural Language Understanding) which returns the most likely intents and entities from the transcript. Then a logical Hub analyzes the NLU result and determines which Skill to launch, if any. Skills respond by triggering screen animations and utilizing a cloud TTS service (Text to Speech) to generate natural voice output. robokit has a rudimentary, on-screen eye that appears to scan its environment and has animation states that indicate when robokit is listening, thinking, etc.
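In code, a single pass through this loop can be sketched roughly as follows. The helper names here are purely illustrative (the actual entry points, startHotword(), startRecognizer() and startNLU(), are walked through in the source-code section below):

// Simplified interaction loop (sketch only; these helper functions are hypothetical)
async function handleOneInteraction(): Promise<void> {
    await waitForWakeWord();                                    // "Hey, robo" detected
    const audio: Buffer = await recordUtterance();              // capture the user's request
    const transcript: string = await cloudASR(audio);           // speech -> text
    const { intent, entities } = await cloudNLU(transcript);    // text -> intent + entities
    const skill = hub.findSkillForIntent(intent);               // the Hub picks a Skill, if any
    if (skill) {
        const prompt: string = await skill.respond(intent, entities); // the Skill composes a reply
        await playAudio(await cloudTTS(prompt));                       // text -> speech -> speaker
    }
}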

Note: One of the most important aspects of a true social robot’s interaction model is proactivity. A robot like Jibo can initiate interactions in response to sensory input from cameras, microphones and touch sensors. For now, robokit only responds when the wake word is detected. Integrating a camera and implementing proactivity is on the todo list.

Demo video

The demo video (below) shows a few example interactions that are handled by simple skills:

  • Hey, robo. What time is it? [ClockSkill.ts]
  • Hey, robo. Tell me a joke. [JokeSkill.ts]
  • Hey, robo. Who is your favorite robot? [FavoriteRobotSkill.ts]
robokit demo

Github Repo

The source code and documentation for robokit are available on github: https://github.com/wwlib/robokit

What it is

More specifically, robokit is a very simple, straightforward Electron app that can turn a device (e.g. a Mac, a Raspberry Pi, etc.) into a voice-driven “robot”. It is a work in progress. Its key features include:

  • Hotword detection using Snowboy
  • Cloud ASR using Microsoft’s Bing Speech API
  • Cloud NLU using Microsoft LUIS (and Google)
  • Cloud Text To Speech using Microsoft Bing TTS
  • Screen animation using Pixi.js — authored using Adobe Animate
  • A simple, extensible model for developing skills — with examples
  • Remote Operation Mode to enable Woz prototyping (Wizard of Oz)
  • All wrapped in a cross-platform Electron app

Getting Started

robokit is intended to make learning about — and experimenting with — voice-driven, character-based interactions as easy as possible. Getting up and running takes just a few steps. The following instructions assume the easiest scenario: installation on an up-to-date Mac. But robokit also runs well on Linux-based devices, including the Raspberry Pi.

Prerequisites

There are a few prerequisites which include:

  • node v8.11 or better
  • yarn (latest)
  • sox (a cross-platform command-line audio utility)
  • a Microsoft Azure Cognitive Services account (a free trial works)

Setting up Azure Cognitive Services

robokit uses Microsoft Azure Cognitive Services for ASR (Automatic Speech Recognition, aka Speech to Text), TTS (Text to Speech) and NLU (Natural Language Understanding). For details about how to set up and configure these services, see: https://medium.com/@andrew.rapo/robokit-setting-up-azure-cognitive-services-bing-speech-luis-nlu-fbb39f5dc957

Setting up sox

sox is used to pipe audio to the Snowboy wake word listener and to record the 16kHz mono audio needed by the cloud ASR service. Installing it on a Mac is easiest using Homebrew:

brew install sox

To help verify the sox installation, there is a Jupyter notebook, hello-sox-python-mac.ipynb, in the repo’s docs/jupyter folder.

For help setting up jupyter on a mac, see: https://www.davidculley.com/installing-python-on-a-mac/

If sox is installed correctly, the following command should produce audio of a rising sine wave:

sox -r 8000 -n -d synth 3 sine 300-3000

Setting up a Microsoft LUIS NLU Agent

The easiest way to set up a LUIS NLU agent for use with robokit is to use the LUIS-knowledge-graph.json file found in the repo’s docs folder. This JSON file contains the definition for an NLU agent that will recognize phrases like “what time is it?” and “tell me a joke”, and it can be imported as a new app in the LUIS portal (luis.ai).

Installing

With the prerequisites out of the way, installing and running robokit is simple:

yarn
yarn rebuild

Note: The yarn rebuild script is necessary because Snowboy’s native module needs to be recompiled to match the version of Node bundled with Electron.

Running

yarn start

When robokit starts, the Electron app will display the “eye” in its idle state:

robokit eye idle

Open Electron’s console window using Cmd-Option-i and verify that the output looks like:

If there are no errors (i.e. red text) and the message about Recording 8192 bytes is updating continuously, then Snowboy is listening and everything is working correctly.

Talking to Robo

To interact with robokit, say, “Hey, robo.” If robokit hears the wake word the on-screen eye should change to display a blue outline:

Note: There are a number of buttons along the top of the robokit screen. Several of these can be used to manually trigger the eye’s animations. These include: Idle, Listen, Blink, LookLeft and LookRight. Verify that the animation system is working by clicking these buttons. (For now, ignore the Speech, Hotword and Music buttons.)

Launching the Clock Skill

Let the eye return to its idle state and then say: “Hey, robo. What time is it?” For best results, pause for a moment after “Hey, robo” and then say, “What time is it?” slowly and clearly. If robokit hears you correctly, you will hear a female TTS voice announce the time. That’s Bing TTS’s default voice. Ideally, the Electron console output will look like this:

To get the desired result, a lot of things have to work correctly, including the Azure Speech APIs and the LUIS NLU service. So if you did hear the female voice announcing the time — CONGRATULATIONS! If not, there is no better way to really learn how something works than to troubleshoot errors. See the Troubleshooting section at the end of this article for help with this.

The robokit Architecture and Source Code

robokit architecture

The robokit source tree looks like:

src
├── main
│   └── main.ts
└── renderer
    ├── AsyncToken.ts
    ├── HotwordController.ts
    ├── NLUController.ts
    ├── ASRController.ts
    ├── TTSController.ts
    ├── log.ts
    ├── microsoft
    │   ├── BingSpeechApiController.ts
    │   ├── BingTTSController.ts
    │   └── LUISController.ts
    ├── pixijs
    │   └── PixijsManager.ts
    ├── renderer.ts
    ├── rom
    │   ├── ClientCertificate.ts
    │   ├── RomManager.ts
    │   ├── SocketClient.ts
    │   ├── SocketServer.ts
    │   ├── commands
    │   │   ├── BlinkCommandHandler.ts
    │   │   ├── CommandHandler.ts
    │   │   ├── IdentCommandHandler.ts
    │   │   ├── LookAtCommandHandler.ts
    │   │   └── TtsCommandHandler.ts
    │   └── log.ts
    ├── skills
    │   ├── ClockSkill.ts
    │   ├── FavoriteRobotSkill.ts
    │   ├── Hub.ts
    │   ├── JokeSkill.ts
    │   └── Skill.ts
    ├── snowboy
    │   └── SnowboyController.ts
    ├── utils
    │   └── Log.ts
    └── ww
        └── WwMusicController.ts

Because robokit is an Electron app there are two subtrees: main and renderer. The renderer subtree contains the important code, including renderer.ts, which serves as the main entry point. This is where all subsystems are initialized and managers are started, including:

  • PixijsManager — screen animation rendering
  • Hub — central nervous system (i.e. skill life cycle management)
  • RomManager — remote operation via SocketServer

When all of the UI setup is complete, the startHotword() function instantiates a new HotwordController (SnowboyController) and starts the wake word recognizer.

When the wake word is detected, the startRecognizer() function instantiates a new ASRController (BingSpeechApiController) which records the next 3 seconds of audio and sends it to the Bing cloud ASR service. (Yes, this is a very simplistic approach to End of Speech (EOS) handling.)
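A sketch of what startRecognizer() does, by analogy with startNLU() shown next (the actual code in renderer.ts may differ in detail):

function startRecognizer() {
    const asrController: ASRController = new BingSpeechApiController();
    let token: AsyncToken<string> = asrController.RecognizerStart();
    token.complete
        .then((utterance: string) => {
            console.log(`ASR utterance: `, utterance);
            startNLU(utterance);
        })
        .catch((error: any) => {
            console.log(error);
        });
}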

Then the startNLU(utterance) function is called which sends the transcribed utterance from Bing ASR to the LUIS cloud NLU service:

function startNLU(utterance: string) {
    const nluController: NLUController = new LUISController();
    let t: AsyncToken<NLUIntentAndEntities> = nluController.getIntentAndEntities(utterance);
    t.complete
        .then((intentAndEntities: NLUIntentAndEntities) => {
            console.log(`NLUIntentAndEntities: `, intentAndEntities);
            Hub.Instance().handleLaunchIntent(intentAndEntities, utterance);
        })
        .catch((error: any) => {
            console.log(error);
        });
}

The result (intentAndEntities) returned from LUIS is passed to the Hub, the “central nervous system” of robokit:

Hub.Instance().handleLaunchIntent(intentAndEntities, utterance);

The Hub analyzes the NLU result and launches the appropriate skill:

handleLaunchIntent(intentAndEntities: NLUIntentAndEntities, utterance: string): void {
    let launchIntent = intentAndEntities.intent;
    let skill: Skill | undefined = this.launchIntentMap.get(launchIntent);
    if (skill) {
        skill.launch(intentAndEntities, utterance);
        skill.running = true;
    }
}

Skills are registered with the Hub to include them in the launchIntentMap:

registerSkill(skill: Skill): void {
    console.log(`HUB: registerSkill: `, skill);
    this.skillMap.set(skill.id, skill);
    this.launchIntentMap.set(skill.launchIntent, skill);
}

Skills manage robokit’s output in the form of TTS audio prompts. TTS prompts are generated by calling the Bing TTS cloud service via the BingTTSController. Audio received from Bing TTS is then played back via Electron/Chrome’s WebAudio API.
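The WebAudio part of that playback looks roughly like the following sketch, assuming the synthesized audio has been retrieved into an ArrayBuffer (the controller handles the Bing TTS request itself, and the helper name here is hypothetical):

function playTTSAudio(audioData: ArrayBuffer): void {
    // Decode the synthesized audio and play it through the default output
    const audioContext: AudioContext = new AudioContext();
    audioContext.decodeAudioData(audioData, (buffer: AudioBuffer) => {
        const source: AudioBufferSourceNode = audioContext.createBufferSource();
        source.buffer = buffer;
        source.connect(audioContext.destination);
        source.start();
    });
}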

Pixi.js screen animations are controlled via PixijsManager with animation directives like:

eyeInstance.gotoAndStop('idle');
eyeInstance.eye.eye_blue.visible = false;

Skills

robokit skills extend the Skill class (src/skills/Skill.ts) which looks like:

export default abstract class Skill {

    public id: string;
    public launchIntent: string = '';
    public running: boolean = false;

    constructor(id: string, launchIntent: string) {
        this.id = id;
        this.launchIntent = launchIntent;
    }

    abstract launch(intentAndEntities: NLUIntentAndEntities, utterance: string): void;
}

The launchIntent property is used by the Hub to decide when to launch a particular skill. The launch(intentAndEntities: NLUIntentAndEntities, utterance: string) method, which must be overridden, is called by the Hub when the skill’s declared launchIntent matches the NLU result.

The Clock Skill

The ClockSkill class extends the Skill class and implements the launch() method. ClockSkill’s constructor() sets its id property to ‘clockSkill’ and sets the launchIntent to ‘launchClock’, allowing it to be identified by the Hub when NLU results are received.

export default class ClockSkill extends Skill {

    constructor() {
        super('clockSkill', 'launchClock');
    }

    launch(intentAndEntities: NLUIntentAndEntities, utterance: string): void {
        let time: Date = new Date();
        let hours: number = time.getHours();
        if (hours > 12) {
            hours -= 12;
        }
        let minutes: number = time.getMinutes();
        let minutesPrefix: string = (minutes < 10) ? 'oh' : '';
        let timePrompt: string = `The time is ${hours} ${minutesPrefix} ${minutes}`;
        Hub.Instance().startTTS(timePrompt);
    }
}

This very simple approach can be used to implement a wide range of skills.
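For example, here is a sketch of a hypothetical GreetingSkill. It assumes a matching launchGreeting intent has been added to the LUIS agent (trained on phrases like “say hello”) and that the skill is registered with the Hub as shown earlier:

export default class GreetingSkill extends Skill {

    constructor() {
        super('greetingSkill', 'launchGreeting');
    }

    launch(intentAndEntities: NLUIntentAndEntities, utterance: string): void {
        // Pick a random greeting and speak it via TTS
        const greetings: string[] = ['Hello there.', 'Hi. Nice to see you.', 'Hey. How can I help?'];
        const prompt: string = greetings[Math.floor(Math.random() * greetings.length)];
        Hub.Instance().startTTS(prompt);
    }
}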

The Speech API Classes

The src/microsoft folder contains three classes that manage communication with the cloud ASR, NLU and TTS services: BingSpeechApiController, BingTTSController, and LUISController. The two speech controllers use Microsoft’s bingspeech-api-client module. Each of these classes extends a base class, making it easier to incorporate other cloud integrations like Google’s Dialogflow. The corresponding base classes (at the top level of src/renderer) are: ASRController, TTSController and NLUController.

All of the speech API controller classes make use of an AsyncToken class to help manage asynchronous cloud results. AsyncToken extends EventEmitter and adds a property called complete, a reference to the Promise returned by async cloud calls. This makes it easy to report status events while waiting for the promise to complete. The AsyncToken class looks like:

import { EventEmitter } from 'events';

export default class AsyncToken<T> extends EventEmitter {

    public complete: Promise<T>;

    constructor() {
        super();
    }
}
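A minimal usage sketch (hypothetical; the real controllers follow this same pattern inside their RecognizerStart()/SynthesizerStart() methods):

let token = new AsyncToken<string>();
token.on('Listening', () => console.log('status: listening...'));
token.complete = new Promise<string>((resolve, reject) => {
    process.nextTick(() => { token.emit('Listening'); });
    // ... start the cloud request here and resolve (or reject) with its result
    resolve('what time is it');
});
token.complete.then((transcript: string) => console.log('transcript:', transcript));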

ASRController looks like:

export default abstract class ASRController {
    abstract RecognizerStart(options?: any): AsyncToken<string>;
}

TTSController looks like:

export default abstract class TTSController {
    abstract SynthesizerStart(text: string, options?: any): AsyncToken<string>;
}

And NLUController looks like:

export type NLUIntentAndEntities = {
    intent: string;
    entities: any;
}

export type NLURequestOptions = {
    languageCode?: string;
    contexts?: string[];
    sessionId?: string;
}

export enum NLULanguageCode {
    en_US = 'en-US'
}

export default abstract class NLUController {

    constructor() {
    }

    abstract set config(config: any);

    abstract call(query: string, languageCode: string, context: string, sessionId?: string): Promise<any>;

    abstract getEntitiesWithResponse(response: any): any | undefined;

    abstract getIntentAndEntities(utterance: string, options?: NLURequestOptions): AsyncToken<NLUIntentAndEntities>;
}

The NLUIntentAndEntities type is used to map an intent (string) to an object containing NLU results.

The NLURequestOptions type is used to provide the set of options that most cloud NLU services require (Microsoft, Google, etc.).

The abstract methods, set config(config), call(query, languageCode, context, sessionId), getEntitiesWithResponse(response), and getIntentAndEntities(utterance, options), must be overridden to handle the specific requirements of the cloud NLU service.

For example: LUISController

As mentioned above, the LUISController class extends NLUController. getIntentAndEntities() instantiates an AsyncToken to manage the asynchronous cloud NLU request and then executes the LUIS-specific call() method:

getIntentAndEntities(utterance: string, options?: NLURequestOptions): AsyncToken<NLUIntentAndEntities> {
    options = options || {};
    let defaultOptions: NLURequestOptions = {
        languageCode: NLULanguageCode.en_US,
        contexts: undefined,
        sessionId: undefined
    }
    options = Object.assign(defaultOptions, options);
    let token = new AsyncToken<NLUIntentAndEntities>();
    token.complete = new Promise<NLUIntentAndEntities>((resolve, reject) => {
        this.call(utterance)
            .then((response: LUISResponse) => {
                let intentAndEntities: NLUIntentAndEntities = {
                    intent: '',
                    entities: undefined
                }
                if (response && response.topScoringIntent) {
                    intentAndEntities = {
                        intent: response.topScoringIntent.intent,
                        entities: this.getEntitiesWithResponse(response)
                    }
                }
                resolve(intentAndEntities);
            })
            .catch((err: any) => {
                reject(err);
            });
    });
    return token;
}

call(query) uses request() to send the query (utterance) to the appropriate LUIS cloud endpoint with the required authentication tokens in the query string parameters:

call(query: string): Promise<any> {
    let endpoint = this.endpoint;
    let luisAppId = this.luisAppId;
    let queryParams = {
        "subscription-key": this.subscriptionKey,
        "timezoneOffset": "0",
        "verbose": true,
        "q": query
    }
    let luisRequest = endpoint + luisAppId + '?' + querystring.stringify(queryParams);
    return new Promise((resolve, reject) => {
        request(luisRequest, (error: string, response: any, body: any) => {
            if (error) {
                console.log(`error:`, response, error);
                reject(error);
            } else {
                let body_obj: any = JSON.parse(body);
                resolve(body_obj);
            }
        });
    });
}

Then getEntitiesWithResponse(response) parses the LUIS-specific response into the more general form expected by the Hub and Skills.

getEntitiesWithResponse(response: LUISResponse): any {
    let entitiesObject: any = {
        user: 'Someone',
        userOriginal: 'Someone',
        thing: 'that',
        thingOriginal: 'that'
    };
    response.entities.forEach((entity: LUISEntity) => {
        entitiesObject[`${entity.type}Original`] = entity.entity;
        if (entity.resolution && entity.resolution.values) {
            entitiesObject[`${entity.type}`] = entity.resolution.values[0];
        }
    });
    return entitiesObject;
}
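For example, given a hypothetical LUIS response containing a single entity of type thing (the entity names and values here are illustrative only), the mapping would produce:

const response: any = {
    entities: [
        { type: 'thing', entity: 'jibo', resolution: { values: ['Jibo'] } }
    ]
};
// getEntitiesWithResponse(response) returns:
// {
//     user: 'Someone',
//     userOriginal: 'Someone',
//     thing: 'Jibo',
//     thingOriginal: 'jibo'
// }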

The HotwordController and Snowboy

Snowboy is a lightweight, free-for-developers hotword/wake-word detector (https://snowboy.kitt.ai). It provides one of the key robot-like aspects of robokit — the ability to automatically respond to voice interactions. Like the speech API controller classes, the HotwordController class provides an abstract interface that is extended by the implementation-specific SnowboyController. The base class, HotwordController, looks like:

export type HotwordResult = {
    hotword: string;
    index?: number;
    buffer?: any;
}

export default abstract class HotwordController {
    abstract RecognizerStart(options?: any): AsyncToken<HotwordResult>;
}

SnowboyController looks like:

const record = require('node-record-lpcm16');
import { Detector, Models } from 'snowboy';
...
const modelPath: string = path.resolve(root, 'resources/models/HeyRobo.pmdl');
...
export default class SnowboyController extends HotwordController {

    public models: Models;
    public detector: Detector;
    public mic: any;

    constructor() {
        super();
        this.models = new Models();
        this.models.add({
            file: modelPath,
            sensitivity: '0.5',
            hotwords: 'snowboy'
        });
    }
    ...
}

SnowboyController uses the npm module, node-record-lpcm16, to pipe audio to the snowboy Detector. The Detector and Models classes are imported from the npm snowboy module. The Models class loads a recognition model file which is used by the Detector to analyze the audio stream. There are several models included in the resources/models folder, including HeyRobo.pmdl. This model was generated using the tools at snowboy.kitt.ai and is based on three wave files containing examples of “Hey, robo.” The wave files are in the audio folder: hey-robo1.wav, hey-robo2.wav, and hey-robo3.wav. Because the HeyRobo.pmdl model is based on a dataset of just three wave files (and just one user, me) it will likely not work well for everyone. A customized, user-specific model can be generated from wav files using the snowboy tools. Or, to try a more general model, edit this line to use the ‘snowboy.pmdl’ file instead:

const modelPath: string = path.resolve(root, 'resources/models/snowboy.pmdl');

And remember to say “Snowboy” instead of “Hey, Robo” to get robokit’s attention.

The RecognizerStart(options: any) method instantiates the snowboy Detector class and uses record to pipe audio to the Detector. Again, record uses sox to access the host’s microphone.

RecognizerStart(options: any): AsyncToken<HotwordResult> {
    let sampleRate = 16000;
    if (options && options.sampleRate) {
        sampleRate = options.sampleRate;
    }
    let token = new AsyncToken<HotwordResult>();
    token.complete = new Promise<HotwordResult>((resolve: any, reject: any) => {
        process.nextTick(() => { token.emit('Listening'); });
        this.detector = new Detector({
            resource: commonResPath,
            models: this.models,
            audioGain: 2.0,
            applyFrontend: true
        });
        this.detector.on('silence', () => {
            token.emit('silence');
        });
        this.detector.on('sound', (buffer) => {
            token.emit('sound');
        });
        this.detector.on('error', (error: any) => {
            console.log('error', error);
            reject(error);
        });
        this.detector.on('hotword', function (index, hotword, buffer) {
            record.stop();
            token.emit('hotword');
            resolve({hotword: hotword, index: index, buffer: buffer});
        });
        this.mic = record.start({
            threshold: 0,
            sampleRate: sampleRate,
            verbose: true,
        });
        this.mic.pipe(this.detector as any);
    });
    return token;
}

SnowboyController emits several events via the AsyncToken that is returned by RecognizerStart(). The events include: Listening, silence, sound, error and hotword. The hotword event is used in renderer.ts to activate the eye’s blue outline animation.
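A sketch of how renderer.ts can consume these events (illustrative; the actual startHotword() code and the animation frame names may differ):

function startHotword() {
    const hotwordController: HotwordController = new SnowboyController();
    let token: AsyncToken<HotwordResult> = hotwordController.RecognizerStart();
    token.on('hotword', () => {
        // show the blue "listening" outline on the eye
        eyeInstance.gotoAndStop('listen');
        eyeInstance.eye.eye_blue.visible = true;
    });
    token.complete
        .then((result: HotwordResult) => {
            console.log(`HotWord: result: `, result);
            startRecognizer();
        })
        .catch((error: any) => console.log(error));
}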

Troubleshooting

For a vastly simplified approximation of a social robot, robokit is still fairly complex. If everything doesn’t work the first time, there are several key integration points that can be tested/validated independently. A good strategy is to test in the following order:

Presentation Layer

First, does the app start up and display the on-screen eye? If not, the console log should provide some indication of why. The presentation layer of the app is a very simple Pixi.js renderer with some HTML buttons to trigger/test the animations. Any issues here can likely be resolved by Googling the error messages.

Sox

sox is a well-known tool and there is plenty of helpful information about it online. As mentioned above, the hello-sox-python-mac.ipynb Jupyter notebook demonstrates a number of ways to validate the sox installation from the command line.

Snowboy

To test the snowboy wake word recognizer by itself, there is a JavaScript file in the tools folder, test-snowboy.js, that will instantiate the recognizer, start listening and indicate when the wake word (“Hey, robo”) is heard. Run this script like:

cd [robokit]/tools
node test-snowboy.js

Then say, “Hey, robo.” The console output should look like this:

$ node test-snowboy.js 
/Users/.../github/wwlib/robokit/resources/models/HeyRobo.pmdl
/Users/.../github/wwlib/robokit/resources/common.res
Recording 1 channels with sample rate 16000...
Recording 8192 bytes
Recording 8192 bytes
Recording 8192 bytes
Recording 8192 bytes
Recording 8192 bytes
Recording 8192 bytes
Recording 8192 bytes
renderer: startHotword: on hotword:
HotWord: result: { hotword: 'snowboy',
index: 1,
buffer: <Buffer 9f fe b4 fe ca fe e5 fe 07 ff 39 ff 74 ff aa ff de ff 0b 00 46 00 6c 00 95 00 bf 00 dd 00 f4 00 fc 00 fe 00 eb 00 d1 00 c6 00 b9 00 9c 00 7e 00 64 00 ... > }
Recording 442 bytes
End Recording: 1940.446ms

If there are errors, they can be addressed more easily using the snowboy test script.

The Cloud Speech API Calls

Issues with the cloud speech API calls will most likely be caused by errors in the cloud setup and/or in the config information used to authenticate API calls. Most importantly, the data/config.json file must contain valid authentication information. Note: use data/config-example.json as a starting point and save it as data/config.json.

{
    "Microsoft": {
        "BingSubscriptionKey": "<YOUR-BING-SUBSCRIPTION-KEY>",
        "nluLUIS_endpoint": "<ENDPOINT-URL>",
        "nluLUIS_appId": "<YOUR-LUIS-APP-ID>",
        "nluLUIS_subscriptionKey": "<YOUR-LUIS-SUBSCRIPTION-KEY>"
    }
}

In the tools folder there are three JavaScript files that can be used to test the ASR, TTS and NLU apis.

test-bing-speech.js
test-bing-tts.js
test-luis-nlu.js

Details for setting up robokit’s NLU and using these testing tools are in the article: https://medium.com/@andrew.rapo/robokit-setting-up-azure-cognitive-services-bing-speech-luis-nlu-fbb39f5dc957

PixiAnimate and Pixi.js

robokit’s eye animations are authored using Adobe Animate’s familiar timeline animation features and rendered using Pixi.js. For details see: https://medium.com/@andrew.rapo/robokit-using-adobe-animate-and-the-pixianimate-extension-to-create-webgl-pixi-js-a862a0f0296a

Remote Operation Mode (Rom)

robokit includes a socket server that allows it to receive connections from, and be operated remotely by, a remote controller app. Remote control is a powerful way to test and study human-robot interactions while fully autonomous software is being designed and developed. Real-time remote control by a human operator (puppeteering) is often referred to as Woz prototyping (“pay no attention to that man behind the curtain…”). robokit is Woz-ready.

robokit is also designed to enable Remote Operation Mode development, where an autonomous client controller app acts as the robot’s brain by intercepting UI events from robokit and then controlling robokit remotely based on its own logic. For an example of a Remote Operation Mode controller that is compatible with robokit, see: robocommander (http://robocommander.io)

robokit Remote Operation Mode (i.e. via robocommander)

Summary

If you have always dreamed of creating your own social “robot” — one that you can interact with in a natural, voice-driven way — then robokit is a great place to start. This simple framework demonstrates how to leverage the powerful, cloud-based cognitive services that are now readily available. And by getting robokit working you will be a long way toward understanding what is going on inside a more sophisticated robot like Jibo.


Andrew Rapo

AI Designer/Developer. Formerly at Disney, Warner Bros., Hasbro, Jibo, MIT Media Lab and Nuance. Now at NTT Disruption as Conversational/Character AI lead.