
CAVA: Conversational Assistant, Virtual Agent
A Framework for Getting Things Done in the Modern World
The CAVA design provides one conversational (chat) assistant and one virtual agent to everyone.
Why the two?
Think of how busy people get stuff done in a way that does not interfere with their primary focus.
Such people often have both a personal assistant to attend to immediate needs, and an agent doing the time-consuming stuff that does not require the person’s immediate interaction.
The assistant is responsive, light, and fast, and uses a Moleskine® or iPad to keep track of things.
The agent is methodical, slower, persistent, quirky, and thorough, and prefers spreadsheets or a database on a desktop computer.
Agents are really good at communicating with other agents.
Assistants are really good at communicating with the person they work for.
Often the agent and the assistant communicate with each other more than the person communicates with either of them directly. These two roles are the basis of the CAVA model.
Domain Model Glossary
- A ConversationalAssistant and a VirtualAgent both implement the Responder interface and are composed of Parts, Responses, and Actions, all encapsulated behind the RespondTo(MessageEvent) MessageEvent operation.
- A Responder is simply anything that implements the required RespondTo(MessageEvent) MessageEvent operation. This interface ensures Responder pipelines can easily be created.
- A Message is any CAVAML text that can be exchanged, using natural language and including emojis. Messages are strictly limited to 207 characters to ensure reasonable vocalization and display on even the smallest of screens.
- A MessageEvent is a Message encapsulated with context (State) and methods.
- State is a fluid, schema-less, key-value data bucket linked to MessageEvents, providing the full context of the event to Parts, Responses, Tasks, and Actions. State implementations need not be concurrency safe because no state read or write may ever be done in a way that requires such safety. Values are restricted to strings, which implementations can cast to whatever actual values are needed.
- An Utterance is a Message that is voiced.
- A Response handles a MessageEvent immediately by returning a Message (true) or nothing (false). These equate to the blocks of a more explicit if/elseif/.../else implementation.
- A Part handles a MessageEvent immediately by returning the name of the next Part.
- An Action immediately handles a MessageEvent by returning a Task identifier (true) or nothing (false). Actions are initiated within Responses or Parts.
- A Task is any computing task that must be done, be it small or large, and each specific instance of a task is universally uniquely identified. An Action returns a Task identifier. Task execution must be asynchronous. Task implementations may include priority, timeouts, and other properties similar to a process on an operating system. Indeed, Tasks are designed to map 1-for-1 with such processes to allow multi-core processing and load distribution.
- CAVAML is an extremely simplified, linear markup language for representing conversational Messages such that they can include emojis, effects, and other in-line variables and event triggers. Emojis are assumed to begin and end with a colon with no whitespace, as in `:grinning:` (as on Emojipedia and Medium). Both variable substitution and event triggers use the single, wrapping brace notation, as in `{name}` or `{shake}`. The four main Markdown formatting styles are supported (*italic*, **bold**, ***bold italic***, `code`). Inline Markdown links are supported, but the URL must be shorter than 30 characters (use a shortener). Inline Markdown images are also supported but must be the only thing in the message. In addition, the Markdown image notation can be used to point to videos that will be embedded if possible. Unlike inline links, image and video URLs can be as long as 200 characters. HTML is specifically prohibited. Any run of consecutive capital letters will automatically have dashes inserted between the letters when passed to any Utterance player. Messages that begin with a render fence (three or more backticks, ```) indicate special content to be rendered however the user, Assistant, or Agent prefers. These messages are usually not voiced but contain content to be visually consumed. The maximum message size limitation remains, however, which promotes linking to longer, multi-line content instead of sending it as messages.
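To make the glossary concrete, here is a minimal sketch in JavaScript of how Responses could compose into a Responder. All names (`makeResponse`, `makeResponder`, and so on) are illustrative assumptions, not a published CAVA API; the sketch only demonstrates the if/elseif/.../else equivalence and the 207-character limit described above.

```javascript
// Hypothetical sketch of the Responder model; names are illustrative.
const MAX_MESSAGE = 207; // displayed-character limit on Messages

// A Response handles a MessageEvent immediately or declines (null).
function makeResponse(pattern, reply) {
  return (event) => (pattern.test(event.message) ? reply : null);
}

// A Responder tries each Response in order, like an if/elseif/else chain.
function makeResponder(responses) {
  return {
    respondTo(event) {
      for (const respond of responses) {
        const message = respond(event);
        if (message !== null) {
          if (message.length > MAX_MESSAGE) {
            throw new Error("Message exceeds 207 characters");
          }
          return { message, state: event.state };
        }
      }
      return null; // no Response handled the event
    },
  };
}

// Usage: a tiny assistant with two Responses.
const assistant = makeResponder([
  makeResponse(/^hi\b/i, "Hello! :grinning:"),
  makeResponse(/help/i, "Try asking me to *schedule* something."),
]);

const out = assistant.respondTo({ message: "hi there", state: {} });
// out.message is "Hello! :grinning:"
```

Note that every Response returns immediately; anything slow belongs in a Task started by an Action instead.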
Key Design Constraints
- The CAVA framework is designed for light-weight, human-friendly messaging. Any exchange of structured or larger amounts of data must not be done through CAVA Messages. For this reason, specific constraints have been placed on the design to force better approaches and solutions for larger data exchange.
- At the lowest level, everything is synchronous.
- Only ONE logical VirtualAgent instance and ONE ConversationalAssistant instance exist per person.
- Do not complicate implementations, particularly State, which can usually just be a map/object/dictionary. For example, a concurrency-safe Map is not needed in Go implementations because the implementation of the agent and assistant itself must never require such concurrency.
- The execution of Parts, Responses, and Actions must be synchronous. None of these should ever be invoked while another is happening within a single VirtualAgent or ConversationalAssistant.
- The execution of Tasks (by Actions) must be asynchronous.
- The Assistant must be reasonably light and dependent on the Agent for things of larger scope.
- The ConversationalAssistant State must not exceed 5 megabytes total, enabling simplified implementations (using just LocalStorage instead of a full IndexedDB, for example) and forcing designs that rely on the VirtualAgent for larger and more persistent storage. No specific State storage implementation is defined, however.
- The Assistant and Agent must speak natural language to each other passing Messages over the Internet. This allows the maximum flexibility when collaborating with other Assistants and Agents out there. Natural language is the one API to rule them all.
- Messages (both received and sent) are strictly limited to 207 displayed characters to ensure reasonable vocalization and display on even the smallest of screens.
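The string-only, schema-less State constraint above can be sketched as a plain object wrapper. This is only one possible shape, assuming names like `makeState`; the point is that no locking is needed and values stay strings that callers cast themselves.

```javascript
// Minimal sketch of State: a schema-less, string-valued key-value bucket.
// No concurrency safety is needed because Parts, Responses, and Actions
// run synchronously. All names here are illustrative assumptions.
function makeState(initial = {}) {
  const data = { ...initial };
  return {
    get: (key) => data[key] ?? "",
    // Values are restricted to strings; callers cast as needed.
    set(key, value) {
      data[key] = String(value);
    },
    // Rough size check against the 5-megabyte ConversationalAssistant limit.
    bytes: () => new TextEncoder().encode(JSON.stringify(data)).length,
  };
}

const state = makeState();
state.set("retries", 3);                      // stored as "3"
const retries = Number(state.get("retries")); // cast back when needed
```

Because values serialize trivially, such a State drops straight into LocalStorage on the Assistant side.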

Thoughts on Implementation
Implementors are encouraged to first implement their models in modern JavaScript to get the best sense of the asynchronous nature of most of the elements. This helps avoid implementations of Parts, Responses, and Actions that do not return immediately (as required) but hang around, potentially blocking, while they do what should have been implemented as a Task instead.
The same implementor using imperative Go may not recognize the blocking nature of something and unwittingly violate the design, making ports and further development performance-bound.
Specific Implementation Choices
The core CAVA reference implementation will contain the following specific technologies:
The ConversationalAssistant will be implemented as an offline-first, optionally voice-enabled (SpeechSynthesis) modern web app (not progressive, because non-evergreen browsers are not supported), using simple LocalStorage for State, with notifications enabled so it can initiate conversations, such as when reporting back after a Task has completed.
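One voice-path detail worth sketching is the CAVAML rule that runs of consecutive capital letters get dashes inserted before vocalization (so acronyms are spelled out). This is an assumed helper, not reference code; `SpeechSynthesisUtterance` and `speechSynthesis` are the standard browser APIs named above.

```javascript
// Sketch of an Utterance-player step: consecutive capitals become dashed,
// e.g. "NASA" is spoken as "N-A-S-A". dashCaps is a hypothetical name.
function dashCaps(text) {
  return text.replace(/[A-Z]{2,}/g, (run) => run.split("").join("-"));
}

// Voicing via the browser's SpeechSynthesis API (browser-only).
function speak(message) {
  const utterance = new SpeechSynthesisUtterance(dashCaps(message));
  window.speechSynthesis.speak(utterance);
}

// dashCaps("Ask NASA about CAVA") yields "Ask N-A-S-A about C-A-V-A"
```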
The VirtualAgent will be implemented as a simple web service that will intelligently sense the client type and respond appropriately.
Motivation
There are many tools out there for creating assistants and actions for specific platforms, but not much that standardizes the creation of independent applications and platforms.
Specifically, this model strongly encourages the use of regular expressions, which are regrettably absent from Alexa and other platforms.
Future Development
CAVA is currently just a design pattern, but I am actively developing its architecture and primary components (since I have already had great success with them in the classroom).
Sharing
I share CAVA’s design to flesh out the particulars with other members of the voice-first and edtech community—particularly since voice integration stands to revolutionize interactive learning. I’m not particularly afraid of others stealing my ideas (as hoarder academics and corporate types tend to do). I just want to further the potential adoption of these ideas. If you would like to help us realize them in any way, please let me know.
CAVA pairs well with the SOIL project we have spearheaded, since SOIL modules can easily be rendered using rich-media conversational user interfaces.
