The Conversation Factory

Advanced Architecture and Development Practices When Engineering Skills for Multiple Device Types

Eric Miles
Capital One Tech
8 min read · Jun 21, 2017

The front door to the Internet is moving and multiplying. Voice and multi-modal conversational interfaces are exploding as AI and human/machine interfaces undergo an early-stage revolution. A year and a half ago, when my team at Capital One started working with reactive turn-based conversational UIs, we found that nearly all of them shared similar constructs. We took those constructs and created what we call the Conversation Factory: a set of patterns and abstractions that can be applied to any conversational UI that fits this paradigm. These patterns and abstractions allow us to maintain speed and quality as we develop for new channels and multi-modal devices that combine voice and visual experiences.

Our codebase was designed using solid engineering principles and practices, allowing us to introduce the Conversation Factory with no ripple effect throughout the Skill. The rest of the Skill’s design is outside the scope of this blog, but it’s imperative to understand that thoughtful design must be applied throughout all layers of the code base.

Using a sequence diagram, we can show what the interaction between the Conversation Factory components looks like:

  • Intent Middleware — Express.js middleware component responsible for routing the incoming request to the appropriate intent handler. It uses the AdapterFactory to create the appropriate InteractionAdapter (and its channel specific Speech and Display factories) for the request’s channel. This adapter is passed to the matching IntentHandler.
  • AdapterFactory — This component is responsible for inspecting the incoming request to determine which channel specific InteractionAdapter should be created; a minimal sketch follows this list.
  • InteractionAdapter — Abstracts out interactions required to obtain information about the incoming request: user context, intent, slots/entities, etc.
  • DisplayFactory — Interface for creating platform specific displays.
  • SpeechFactory — Interface for creating platform specific SSML/speech.
  • IntentHandler — A handler for each intent exists to determine the appropriate response to return based off user context and inbound entities.
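
To make the flow concrete, here is a minimal sketch of how the AdapterFactory and the intent middleware might fit together. This is illustrative only: the channel check, the AlexaInteractionAdapter name, and the intentHandlers map are assumptions, while MsbotInteractionAdapter is shown later in this post.

import * as express from "express";

// A minimal sketch, not production code. Only MsbotInteractionAdapter appears later in this
// post; the channel check, AlexaInteractionAdapter, and intentHandlers map are hypothetical.
export class AdapterFactory {
    public static create(requestBody: any): InteractionAdapter {
        // Inspect the incoming payload to decide which channel specific adapter to build.
        if (requestBody.channelId === "cortana") {            // hypothetical channel marker
            return new MsbotInteractionAdapter(requestBody.session, requestBody.args);
        }
        return new AlexaInteractionAdapter(requestBody);      // hypothetical Alexa adapter
    }
}

// Express-style intent middleware: build the adapter, then hand it to the matching handler.
export function intentMiddleware(req: express.Request, res: express.Response, next: express.NextFunction) {
    const adapter = AdapterFactory.create(req.body);
    const handler = intentHandlers[adapter.getIntent()];      // hypothetical map: intent name -> IntentHandler
    if (!handler) {
        return next(new Error(`No handler registered for intent ${adapter.getIntent()}`));
    }
    handler(adapter);                                         // the handler responds via the adapter
}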

Factories

A class diagram quickly shows how our implementations would look. For brevity’s sake, this isn’t a complete class diagram; it’s merely meant to show some implementation details:

Some TypeScript code snippets follow to illustrate our abstractions.

SpeechFactory

While all voice platforms support SSML, not all support SSML equally. For our Skill, we say the last four digits of an account number for multi-account responses: “…your Venture account ending in 1 2 3 4.” SSML allows us to tell the voice engine to say the numbers “one two three four” as individual characters, rather than as a number (“one thousand two hundred thirty-four”). In addition to this Capital One-specific need, we have found that date representations are also spoken differently on each platform. We’ve abstracted only these two so far, but with the SpeechFactory in place we can easily add new implementations as the need arises.

export interface SpeechFactory {
    speakLastFour(lastFour: string): string;
    speakDate(date: string, format?: string): string;
}
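
As an illustration, an implementation for an SSML engine that supports the say-as tag might look like the sketch below. The AlexaSpeechFactory name is our label for the sketch, not from the post; the say-as interpret-as values are standard SSML.

// A hypothetical implementation; only the SpeechFactory interface above comes from the post.
export class AlexaSpeechFactory implements SpeechFactory {
    public speakLastFour(lastFour: string): string {
        // Read "1234" as individual characters rather than "one thousand two hundred thirty-four".
        return `<say-as interpret-as="characters">${lastFour}</say-as>`;
    }

    public speakDate(date: string, format: string = "mdy"): string {
        // Let the voice engine render the date naturally for the given format.
        return `<say-as interpret-as="date" format="${format}">${date}</say-as>`;
    }
}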

DisplayFactory

With Cortana in General Availability and announcements for multi-modal devices such as the Echo Show, it was imperative that we also have an abstraction for how our Skill displays related information to our customers. It can be as simple as text on a Skill card or a richer UI that you would find on the Cortana Canvas or the screen of an Echo Show. We abstracted all our use cases and represented them in an interface, so we can add a multitude of implementations dependent on channel and display support on each of those channels.

export interface DisplayFactory {
    showTransactionList(transactions: Transaction[]): any;
    showAccountSummary(accounts: Account[], currencyType?: CurrencyType, balanceType?: string): any;
    showOwe(accounts: Account[]): any;
    showPayBill(account: Account, payBillData?: PayBillData): any;
    showPayConfirm(account: Account, payBillData?: PayBillData): any;
    showPaySuccess(confirmationCode: string): any;
    showSpent(transactionResponse: TransactionResponse, sessionData: TransactionSessionData): any;
    showFailure(): any;
    showHelp(): any;
    showAccountSelector(accounts: Account[]): any;
    showPayoff(accountList: Account[]): any;
    showEasterEgg(): any;
    showGoodbye(): any;
    showWelcomePin(): any;
    showWrongPin(): any;
    showLockout(): any;
    showWelcome(): any;
    showAbandonment(): any;
    showEntitlementsLockout(): any;
}

Our design team has conducted numerous User Labs with our customers to determine how they interact with a visual display in conjunction with a voice experience; the findings have been fascinating. We have found that how the customer interacts with the Skill largely depends on the proximity of the device and the UI capabilities. With an abstraction layer, we can provide the appropriate experience dependent on channel and device features.
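
As a simplified illustration of one such implementation, a display factory for a card-capable channel might look like the sketch below. The CardDisplayFactory name and the card structure are assumptions for illustration; the real factories build channel native payloads (Cortana cards, Echo Show templates, and so on).

// A hypothetical, simplified factory for a card-capable channel; the remaining
// DisplayFactory methods are elided. The card shape shown here is illustrative only.
export class CardDisplayFactory {
    public showAccountSummary(accounts: Account[], currencyType?: CurrencyType, balanceType?: string): any {
        return {
            type: "card",
            title: "Account Summary",
            items: accounts.map(account => ({
                name: account.productName,
                balance: account.accountBalance,
            })),
        };
    }

    // ... showTransactionList(), showPayBill(), and the rest of the DisplayFactory methods ...
}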

InteractionAdapter

The heart and soul of the Conversation Factory is the InteractionAdapter. It is the tie between the channel agnostic code, which retrieves the data needed to provide a response or performs a task on behalf of our customer, and the channel specific implementation details of interacting with each platform.

We have found common behavior between responsive turn-based paradigms for voice platforms and have abstracted that out into this base class.

The ask() method accepts the voice response and signals to the adapter that the conversation should remain open, as we are expecting a response.

public ask(speech: string) {
    this.safeAppendSpeech(speech);
    this.askTell = "Ask";
}

The tell() method accepts the voice response but has no notion of whether the conversation should remain open; that’s determined by the channel specific implementation and usually depends on how the Skill was invoked (Ask vs. Open).

public tell(speech: string) {
    this.safeAppendSpeech(speech);
    this.askTell = "Tell";
}

Additionally, we have found the re-prompt is common across channels and have introduced a method to accept that speech, reprompt().

public reprompt(repromptSpeech: string) {
    this.repromptSpeech = repromptSpeech;
}

In addition to these basic methods, we have a few others to help support specific use cases for various reasons such as keeping the dialog open, suppressing re-prompt, and sending a callback once the response is sent.

public sendCallback(callback: any) {
    this.callback = callback;
}

public shouldNotReprompt(value: boolean): void {
    this.doNotReprompt = value;
}

public shouldKeepDialogOpen(value: boolean): void {
    this.askTell = value ? "Ask" : "Tell";
}

We defer to channel specific implementations to provide details about the inbound message such as intent, arguments (entities/slots), conversation context, and user context. In addition to providing these details about the inbound message, the Adapter is responsible for providing the channel/feature specific factories for both Speech and Display.

public abstract getIntent(): string;
public abstract getArgument(key: string): any;
public abstract clearArgument(key: string): void;
public abstract getState(key?: string): any;
public abstract clearState(): void;
public abstract getUserContext(): UserContext;
public abstract send(speech?: string): void;
public abstract getSpeechFactory(): SpeechFactory;
public abstract getDisplayFactory(): DisplayFactory;
public abstract getPlatformId(): string;
public abstract showLinkingCard(): void;
public abstract display(display: any): void;
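
Pulling the snippets above together, the base class scaffolding might look roughly like the sketch below. The post doesn’t show safeAppendSpeech() or the field declarations, so those details (simple string concatenation, the default values) are assumptions.

// A rough sketch of the base class that holds the methods shown above; the
// safeAppendSpeech() body and the field defaults are assumptions for illustration.
export abstract class InteractionAdapter {
    protected speech = "";
    protected repromptSpeech: string;
    protected askTell: "Ask" | "Tell" = "Tell";
    protected doNotReprompt = false;
    protected callback: any;

    // Append speech defensively so multiple ask()/tell() calls build up a single utterance.
    protected safeAppendSpeech(speech: string): void {
        this.speech = this.speech ? `${this.speech} ${speech}` : speech;
    }

    // ... ask(), tell(), reprompt(), and the abstract methods shown above ...
}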

Example Implementation

In the previous section, we identified how we abstract the channel specific implementations to allow us to add new channels and features quickly with minimal impact to existing code. In the following sections, we’ll look at a partial concrete example and after that, how it’s all tied together.

MsbotInteractionAdapter

The constructor accepts the Bot Framework Session object and arguments, which are provided on an incoming intent request. We use these to extract intent, entities, and user information to help satisfy our contractual requirements for the InteractionAdapter. We also create the appropriate Display and Speech factories. While we have only a single Display and Speech factory for Cortana, we could interrogate the incoming request for the type of device, supported features, or user preferences to instantiate a wide variety of implementations to give the exact experience we desire.

constructor(protected session: builder.Session, protected botArgs: any) {
    super();
    this.intent = botArgs.intent;
    this.state = JSON.stringify(botArgs.privateConversationData);
    this.speechFactory = new MsbotSpeechFactory();
    this.displayFactory = new MsbotDisplayFactory(this, session);
    this.context = {
        clientCorrelationId: session.message.sourceEvent.clientActivityId,
        userId: session.message.user.id,
    } as UserContext;
}

The other InteractionAdapter methods that must be implemented all contain code that bridges our non-channel specific method calls and the MS Bot Framework specific Session and botArgs objects. Please look at the Bot Builder framework examples for guidance on interrogating the Session object to support your Skill’s needs.
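
For a flavor of that bridging, getIntent(), getArgument(), and send() might be implemented roughly as sketched below. This is our illustrative approximation using common Bot Builder v3 calls (EntityRecognizer.findEntity, Message.speak, Session.send), not the post’s actual code.

// Illustrative only: an approximation of the bridge methods using Bot Builder v3 APIs.
public getIntent(): string {
    return this.intent;
}

public getArgument(key: string): any {
    // Look up a LUIS entity by type from the args the intent dialog handed us.
    const entity = builder.EntityRecognizer.findEntity(this.botArgs.entities, key);
    return entity ? entity.entity : undefined;
}

public send(speech?: string): void {
    if (speech) {
        this.safeAppendSpeech(speech);
    }
    const message = new builder.Message(this.session)
        .speak(this.speech)
        .inputHint(this.askTell === "Ask"
            ? builder.InputHint.expectingInput
            : builder.InputHint.acceptingInput);
    this.session.send(message);
}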

In Practice

Now that we’ve seen the abstractions, let’s look at how it all comes together with no channel specific details bleeding into our intent handlers.

Account Balance Intent Handler

We’ll use our base use case of account balance to illustrate. Out of scope for this discussion is how we orchestrate over our APIs, cache clusters, and databases to get the necessary information to populate the response.

  • Interrogating the Inbound Message

In this snippet of code, we’re extracting the ProductType and LastFour arguments (entities), as these drive how and what we present to the user in the response. We then use a service to retrieve the necessary data and feed the result into building the response for the user.

// accountType comes from elsewhere in the handler (not shown in this snippet)
const lastFour = adapter.getArgument("LastFour");
const productType = adapter.getArgument("ProductType");
accounts.getAccount(adapter.getUserContext(), {accountType, lastFour, productType})
    .then(accountList => buildAccountBalanceResponse(adapter, accountList, accountType, adapter.getArgument("CurrencyType")))
    .catch(error => handleDataRetrievalError(error, adapter, true));

  • Building the Response

While building the response, a call is made to the display factory provided by the interaction adapter so we can build a channel appropriate visual response. We then call another helper method to pull together the account totals.

function buildAccountBalanceResponse(adapter: InteractionAdapter, accountList: Account[], accountType: AccountType, currencyType: CurrencyType) {
    adapter.getDisplayFactory().showAccountSummary(accountList, currencyType);
    adapter.send(accountList
        .map(account => readAccountTotal(adapter.getSpeechFactory(), account, accountList.length > 1, isBank(account.accountType) ? currencyType : undefined))
        .join(""));
}

This helper method is responsible for looking at all the different account types and pulling together the speech for that response. We’re hiding some code here to keep it targeted on credit cards:

function readAccountTotal(speechFactory: SpeechFactory, account: Account, useMulti: boolean, currencyType: CurrencyType) {
    const balance = formatCurrency(account.accountBalance, currencyType);
    const balanceName = isBank(account.accountType) ? "available" : isCardAccountType(account.accountType) ? "current" : "principal";
    const suffix = getSuffix(account.accountType);
    const productName = isLoan(account.accountType) ? "" : ` ${account.toSpeech(speechFactory, useMulti)}`;
    return `Your ${balanceName}${productName} balance is ${balance}${suffix}.  `;
}

Finally, we defer to the Account model’s toSpeech() method, passing in the Speech Factory to ensure last four is read appropriately:

public toSpeech(speechFactory: SpeechFactory, sayLast4 = false) {
    const last4Speech = sayLast4 ? ` ending in ${speechFactory.speakLastFour(this.lastFour)}` : "";
    if (this.accountType === AccountType.CREDIT) {
        return `${this.productName} card${last4Speech}`;
    }
    ...
}

Summary

Multi-modal conversational interfaces are exploding and many AI/NLP providers are emerging in this space. Building software for re-use, speed, and quality in the context of multi-modal conversations and various AI/NLP providers requires thoughtful decisions on several levels. At Capital One, we have leveraged solution architecture along with patterns and abstractions to allow us to continue to maintain speed and quality as we engage in this rapidly changing landscape.

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2017 Capital One.

For more on APIs, open source, community events, and developer culture at Capital One, visit DevExchange, our one-stop developer portal. https://developer.capitalone.com/
