The Three Eras of Conversational AI — a framework for explanation

Sam Bobo
Speaking Artificially
Feb 16, 2023 · 11 min read

There’s a holy trinity in machine learning: models, data, and compute. Models are algorithms that take inputs and produce outputs. Data refers to the examples the algorithms are trained on. To learn something, there must be enough data with enough richness that the algorithms can produce useful output. Models must be flexible enough to capture the complexity in the data. And finally, there has to be enough computing power to run the algorithms. — The generative AI revolution has begun — how did we get here? | Ars Technica

The referenced Ars Technica piece captures elegantly the progressive evolution of Artificial Intelligence. In fact, Artificial Intelligence capabilities have existed dating well before the 1990s, however, compute and data capture were difficult to obtain, and therefore, the technological capabilities never popularized. This reality fascinated me as my experience in the field as an Artificial Intelligence Product Manager continued to grow and deepen. When public displays of intelligence emerge and cause societal hype, skepticism, fear, and general discourse, that is the indication of a new A.I “era” and inflection point for the technology.

Thereafter, the technology typically follows the Gartner Hype Cycle.

(Figure: the Gartner Hype Cycle and its phases — Wikipedia)

Since the start of my career in Artificial Intelligence at IBM Watson, I’ve experienced three distinct eras in AI, specifically in Conversational AI / “Cognitive Computing,” and I can conceptualize these eras into a logical explanatory framework, which I seek to explain in this piece.

At heart, I am a board game geek! One card drafting game I’ve come to enjoy is 7 Wonders (and its 2-player version, 7 Wonders Duel), whereby players draft cards spanning multiple categories (raw materials, buildings, science, war, market, etc.) to construct their Wonders of the World. The game takes place across three ages, whereby each card constructed in a previous age can build into other cards in the following age and still have unique relevance on a player’s tableau. The three eras of Conversational AI follow the same logic: each era builds upon the previous in terms of domain knowledge and contextual power, but each still maintains unique relevance within existing solution sets, narrowing to more niche use cases over time as it moves through the “Plateau of Productivity.”

Let’s outline them here:

Eras:

  1. Intent Intelligence — AI that understands what you say and want

The Conversational AI industry entered its infancy and started centralizing around a common set of frameworks collaborated on by a consortium of large enterprises and researchers. During this period, AI systems were programmed against rigid instructional frameworks in rich text editors. Systems were required to be highly accurate and precise, the scope of their applicability and input had to be constrained, and responses needed to contain limited deviation to maintain consistency.

Interactive Voice Response (“IVR”) applications, specifically to service common customer tasks or route customers accordingly prior to interacting with an agent, were the common display of this type of conversational AI.

Conversational AI Engines during this time period included:

Speech Recognition — Speech recognition capabilities were performed against constrained lists, often referred to as “grammars.” These grammars adhered to a standardized specification, the Speech Recognition Grammar Specification (SRGS), a W3C standard whose specification was initially edited by representatives of ScanSoft (now Nuance Communications) and HP. The SRGS specification allows speech scientists to define a constrained set of words or phrases for a speech recognizer to interpret upon receiving inbound audio.

Grammar-based constrained speech recognition was largely used for Interactive Voice Response (“IVR”) applications that powered front-end intercepts to call centers prior to a caller being transferred to a live agent. These applications prompted a user for input using a recorded prompt or speech synthesizer (text-to-speech) and in turn captured an utterance. This turn-based cycle gathered information from a caller to make an informed action, transfer point, or new route within the call logic, an approach often called Directed Dialog. Other applications of the technology within IVRs included menu trees, similar to “Press 1 or say ‘checking’” types of options.

Traditional recognizers are optimal for the aforementioned use cases and provide a high level of accuracy in speech recognition. The solutions are fairly straightforward to implement, although tuning grammars can be cumbersome.
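To make grammar-constrained recognition concrete, here is a minimal, hypothetical sketch in Python rather than actual SRGS markup: the “grammar” is simply the closed set of phrases a prompt will accept, and anything outside it is treated as a no-match.

```python
# Illustrative sketch of grammar-constrained recognition (not a real SRGS engine).
# The "grammar" here is just the closed set of phrases an IVR prompt will accept.

ACCOUNT_TYPE_GRAMMAR = {
    "checking": "CHECKING",
    "savings": "SAVINGS",
    "credit card": "CREDIT_CARD",
}

def interpret(utterance: str) -> str | None:
    """Return a semantic tag if the utterance matches the grammar, else None (no-match)."""
    normalized = utterance.lower().strip()
    return ACCOUNT_TYPE_GRAMMAR.get(normalized)

# Directed-dialog turn: anything outside the grammar triggers a re-prompt.
for heard in ["checking", "savings please"]:
    tag = interpret(heard)
    print(heard, "->", tag if tag else "no-match, re-prompt the caller")
```

This is what makes these systems highly accurate but narrowly scoped: the recognizer never has to guess beyond the phrases the designer enumerated.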

Text to Speech — Speech synthesis capabilities relied on an underlying algorithm called Unit Selection. With Unit Selection, a corpus of pre-recorded audio and its orthographic transcription was entered as training data and ground truth. The backend system broke apart the phonetic structure of the ground truth to capture the underlying voice of the voice talent. When called upon, the speech synthesizer would interpret the text and fetch a series of phonetic matches from the training set, concatenating the sounds together to make the audio output. This model of speech synthesis relied heavily on rulesets for logic, including the following (illustrated in a brief sketch after this list):

  • Pronunciation dictionaries — these dictionaries took specific words and mapped them to a phonetic alphabet. Upon finding the word within the dictionary, the synthesizer would fetch the phonetic spelling and utilize it for playback.
  • Rulesets — Oftentimes, the synthesizer would fail to produce audio playback that matched spoken convention. For example, when commentating on a sports game, the score “1–3” should be spoken as “one to three,” yet a synthesizer at this point would say “one dash three,” so a ruleset was required to map the dash to the word “to.”
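As a rough illustration of how these pieces fit together, the following toy Python sketch (with hypothetical dictionary entries, rules, and audio units) normalizes text with a ruleset, expands words via a pronunciation dictionary, and concatenates matching units:

```python
# Toy sketch of unit-selection synthesis; the dictionary, rules, and units are hypothetical.
NORMALIZATION_RULES = [("–", " to "), ("1", "one"), ("3", "three")]  # text normalization rulesets
PRONUNCIATION_DICT = {                       # word -> phonetic spelling
    "one": ["W", "AH", "N"],
    "to": ["T", "UW"],
    "three": ["TH", "R", "IY"],
}
UNIT_STORE = {p: f"<unit:{p}>" for phones in PRONUNCIATION_DICT.values() for p in phones}

def synthesize(text: str) -> str:
    for pattern, replacement in NORMALIZATION_RULES:   # apply rulesets (e.g. "1–3" -> "one to three")
        text = text.replace(pattern, replacement)
    phonemes = []
    for word in text.lower().split():
        phonemes += PRONUNCIATION_DICT.get(word, [])   # pronunciation dictionary lookup
    # Fetch the matching pre-recorded units and concatenate them into the output audio.
    return "".join(UNIT_STORE[p] for p in phonemes)

print(synthesize("1–3"))  # -> concatenated units for "one to three"
```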

Application Logic — Application logic typically relied on industry standards set forth by large enterprises. For example, the VoiceXML language was developed by IBM, AT&T, Lucent, and Motorola. This logic used a voice browser to interpret the incoming telephony audio and run a pre-defined logic tree or script. Any text-based application followed suit. Enterprises did, however, build proprietary parameters and functionality on top of the standard, as is typical practice, but they largely followed a common standard.
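VoiceXML itself is an XML document interpreted by a voice browser; purely to illustrate the kind of logic tree such an application encodes, here is a hypothetical Python sketch of a directed-dialog menu:

```python
# Hypothetical menu tree illustrating the kind of logic a VoiceXML-style application encodes.
MENU = {
    "prompt": "Say 'balance', 'transfer', or 'agent'.",
    "choices": {
        "balance": {"action": "play_balance"},
        "transfer": {"action": "start_transfer_flow"},
        "agent": {"action": "route_to_live_agent"},
    },
}

def handle_turn(menu: dict, caller_input: str) -> str:
    """Map a recognized utterance onto the next action, re-prompting on no-match."""
    choice = menu["choices"].get(caller_input.lower().strip())
    return choice["action"] if choice else "reprompt"

print(handle_turn(MENU, "transfer"))  # -> start_transfer_flow
print(handle_turn(MENU, "mortgage"))  # -> reprompt (outside the defined logic)
```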

2. Conversational Intelligence — AI engines that engage in conversation

This era can be largely defined by trainable engines underpinned by supervised machine learning, whose foundational model can be augmented with additional training that adjusts the underlying weights to hone the engine’s accuracy and precision around a particular subject matter. Additional training comes in the form of an annotated dataset that augments the ground truth of the underlying conversational engine. From a technological perspective, these conversation-based engines utilize various Neural Network algorithms for training, including Convolutional Neural Networks and Recurrent Neural Networks, along with natural language processing structures such as bi-grams, tri-grams, and n-grams more generally.
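As a rough illustration of how n-gram features feed a supervised intent model, here is a minimal sketch using scikit-learn; the utterances, intents, and choice of classifier are illustrative assumptions, not any particular vendor’s implementation:

```python
# Minimal supervised intent classifier over uni-/bi-gram features (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = ["what is my balance", "show my account balance",
              "book a flight to boston", "i need to fly to denver"]
intents = ["check_balance", "check_balance", "book_flight", "book_flight"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # uni-grams and bi-grams as features
    LogisticRegression(),
)
model.fit(utterances, intents)

print(model.predict(["fly me to chicago"]))        # -> ['book_flight']
print(model.predict_proba(["fly me to chicago"]))  # probability score per intent
```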

The popularization of conversational AI engines was sparked by the IBM Watson debut on Jeopardy! and catalyzed by the growth of cloud computing Platform as a Service (“PaaS”) offerings from large hyperscalers such as IBM, Amazon, Google, and Microsoft around the 2016 timeframe. These platforms lowered the barrier to entry for companies seeking to build and train conversational systems due to the accessible nature of compute over the cloud. Furthermore, “Do It Yourself” tooling, ranging from pro-code to no-code, democratized the ability to augment Neural Network based models through visual editors used to enter training data, tag it, and redeploy.

AI systems started gaining industry- and company-specific knowledge added to their ground truth, and models soon became differentiated based on augmented training. Additionally, tooling and reporting capabilities opened the aperture to human-in-the-loop feedback models, whereby speech and data scientists gained the ability to analyze the efficacy of intents or the accuracy of models and reinforce the underlying training data.

Conversational AI engines work largely on probabilistic outcomes, whereby the engine generates output with a probability score, and the responsibility of writing logic to set a confidence threshold or define a graceful failure scenario falls to the conversational or application designer.
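A minimal sketch of that designer-owned logic, with a hypothetical result payload and threshold value:

```python
# Hypothetical confidence-threshold handling around a probabilistic engine result.
CONFIDENCE_THRESHOLD = 0.70

def route(nlu_result: dict) -> str:
    """Accept the top intent only when the engine is confident; otherwise fail gracefully."""
    if nlu_result["confidence"] >= CONFIDENCE_THRESHOLD:
        return f"fulfill:{nlu_result['intent']}"
    return "fallback: ask the user to rephrase or hand off to an agent"

print(route({"intent": "book_flight", "confidence": 0.91}))
print(route({"intent": "book_flight", "confidence": 0.42}))
```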

The proliferation of Conversational AI engines as à la carte SaaS services sparked the creation of an ecosystem outside of traditional channels. Text-to-Speech was used to generate synthetic audio for online lecture videos, text was mined for keywords to augment eCommerce stores, sentiment analysis aided in responding to online reviews, corpora were trained on legal papers to find prior art, and many more unique use cases emerged. More interestingly, popular combinations of Conversational AI systems created reusable patterns for others to replicate.

Examining the Conversational AI engines during this era:

Speech to Text — Speech to Text models became capable of real-time transcription. These engines often came with a set of language packs as part of the foundational layer of the model to provide the system with a baseline grammatical and lexical understanding of the language being spoken and transcribed. For domain-specific training, Domain Language Models as well as Wordsets (literals, etc.) could be layered on top to provide the additional level of industry knowledge required to bolster higher accuracy for a specific use case.
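Conceptually, the layering can be pictured as below; the vocabulary sets and wordset names are hypothetical, and real engines expose this through their own training tooling rather than raw Python:

```python
# Hypothetical layering of a base language pack with domain vocabulary and wordsets.
base_language_pack = {"locale": "en-US", "vocabulary": {"account", "balance", "transfer"}}
domain_language_model = {"vocabulary": {"reinsurance", "subrogation"}}   # industry terms
wordset = {"agent_names": ["Ayesha", "Bjorn", "Carmela"]}                # literal values

effective_vocabulary = (base_language_pack["vocabulary"]
                        | domain_language_model["vocabulary"]
                        | set(wordset["agent_names"]))
print(sorted(effective_vocabulary))  # the recognizer's expanded effective vocabulary
```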

Natural Language Understanding (NLU) — NLU capabilities, in tandem with Speech to Text, unlocked the capability to recognize and categorize base intents from an utterance. Visual tooling enabled intents to be built, requiring a minimum set of utterances a user might say or type during an interaction, including as many variations of those statements as possible. Within those statements, entities could be defined explicitly for required information to be captured. For example, when booking an airline flight using an automated system, the system would need to identify (1) origin, (2) destination, (3) time preference, (4) airline, and so on, all of which could be defined as Entities. NLU capabilities allowed for intent-based mapping within conversational logic, slot-filling prompts to gather missing information, and the triggering of lookups or other actions based on the inputs collected.
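To illustrate slot filling for the flight-booking example, here is a simplified sketch; the slot names and prompting logic are assumptions, and the entity extraction itself (handled by the NLU engine) is out of scope:

```python
# Simplified slot-filling loop for the flight-booking example; extraction is a stand-in.
REQUIRED_SLOTS = ["origin", "destination", "travel_time", "airline"]

def missing_slots(filled: dict) -> list[str]:
    return [slot for slot in REQUIRED_SLOTS if slot not in filled]

def next_prompt(filled: dict) -> str:
    """Prompt for the first missing entity, or confirm once everything is captured."""
    missing = missing_slots(filled)
    if not missing:
        return f"Booking {filled['airline']} from {filled['origin']} to {filled['destination']}."
    return f"What is your {missing[0].replace('_', ' ')}?"

state = {"destination": "Boston", "origin": "Seattle"}  # entities captured so far
print(next_prompt(state))  # -> "What is your travel time?"
```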

Text to Speech — Speech synthesis continued to rely on Unit Selection backend algorithms; however, with the advent of Neural Networks, a level of audio smoothing could be layered on top to soften the concatenative nature of the audio being streamed out. Furthermore, the ability to create custom voices and/or fine-tune audio output using a semantics-based approach was unlocked during this era. To build a custom voice, a voice actor/actress could record approximately 40 hours within a professional studio, reading a crafted script that captured the talent’s vocal intonation, prosody, pitch, and other characteristics used to create a vocal profile. Alongside the orthographic transcription of that script, both files were inputted as ground truth, generating a custom voice to be used in dynamic speech synthesis.

Dialog — Companies started shifting away from standards-based logic and started developing proprietary bot logic frameworks that were optimized for their AI Engine integrations. This created a level of lock-in and high switching cost for adopters when selecting a framework of choice.

Sentiment Analysis & Emotion Analysis — Advancements in natural language processing unlocked the ability to read strings of words and assign polarity (positive, negative, neutral) to either a set of words or a document as a whole. Emotion analysis took an additional pass at assigning primitive emotions such as angry, content, neutral, joyful, etc. These engines and their associated confidence scores were popular within the eCommerce, marketing, and sales verticals.
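A toy, lexicon-based version of polarity scoring is sketched below; production engines of this era used trained models with calibrated confidence scores, so this is only meant to convey the idea of assigning polarity:

```python
# Toy lexicon-based polarity scorer; production engines use trained models instead.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"terrible", "hate", "poor", "angry"}

def polarity(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this product and the support was excellent"))  # -> positive
print(polarity("Terrible experience and I hate the new interface"))   # -> negative
```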

Image Classification and Analysis — Image classification capabilities emerged whereby systems could identify general objects within a bounding box in an image, each with a confidence score. Generalized models as well as trainable ones were available and started making strides within the medical field, for example, in identifying particular tumors. This engine also extended to facial recognition capabilities within similar confines.

Over time, Neural Network capabilities grew more mature and became further integrated into Conversational AI systems, particularly:

Transcription — the ability to perform batch transcription of long sets of audio files en masse as computing power increased

Generative End-to-End Speech Synthesis — backend model improvements significantly cut down the training time of a synthetic voice and converged its sound to near human-like levels, as measured on a Mean Opinion Score (MOS) benchmark. Furthermore, Neural backend algorithms allowed synthetic voices to have cross-lingual adaptation, whereby a voice talent speaking a native language could create a virtualized voice that speaks fluently in a completely different language, for example, an English speaker sounding fluent in Mandarin Chinese.

Natural Language Processing — next-word prediction capabilities started proliferating and became integrated into text editors such as email, blog platforms, and the like to help expedite text drafting.

3. Generative Intelligence — Artificial Intelligence that generates knowledge and converses with humans

Generative AI has been led largely by the introduction of the transformer, an encoder/decoder architecture that parallelizes the training of large language models — models with billions of parameters trained over billions of tokens (words) — and by the proliferation of attention mechanisms, which allow the model to maintain context not just over the previous n words but throughout the entire conversation.
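The attention mechanism at the heart of the transformer can be sketched compactly; below is a minimal NumPy implementation of single-head scaled dot-product attention, softmax(QKᵀ/√d)·V, without masking or multi-head projections:

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V  (single head, no masking).
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # context-weighted mixture of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))  # 5 tokens, dimension 8
print(attention(Q, K, V).shape)  # -> (5, 8): one context vector per token
```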

More specifically, Generative AI reinvents the modality in which humans interact with the machine. Previously, logic was inputted into the system and models were trained with industry knowledge; with Generative AI, however, the foundational layer is hyper-expanded with a vast set of knowledge trained on the public internet, namely Wikipedia and other general knowledge sources, and interaction happens through prompts: carefully worded sentences, phrases, and paragraphs written in natural language. These models can be further guided by providing examples within the prompt, often called few-shot (or multi-shot) learning, or with no examples at all, called zero-shot learning.
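As a plain illustration of the difference between zero-shot and few-shot (multi-shot) prompting, independent of any particular model or API, the prompts themselves might look like this:

```python
# Prompt construction only; the call to an actual model or API is intentionally out of scope.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The checkout process was painless.\n"
    "Sentiment:"
)

few_shot_prompt = (
    "Review: The app crashes every time I open it.\nSentiment: negative\n"
    "Review: Support resolved my issue in minutes.\nSentiment: positive\n"
    "Review: The checkout process was painless.\nSentiment:"  # the model completes this line
)

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```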

Generative AI plays a foundational role in the transformation of natural language with provided content, from summarizing long documents to generating new forms of text with contextual applicability (e.g., writing a haiku, pretending to be {role or person}, or rephrasing something in terms of {example metaphor}). Generative AI largely serves to expedite the creation of content, from new dialog flows, to ideas, to images.

While Generative AI is currently emerging as the next era in Artificial Intelligence, many new core engines are solidifying:

Image Generation — systems receive a prompt as input and gradually use an encoder/decoder to either transform existing knowledge within a corpus or create new images from scratch, making quick iterative changes until the final image is rendered.

Chat — conversational capabilities allowing for the human-like exchange of words with a system containing a broad knowledge base. These chat interfaces can generate documents (cover letters, mortgage pre-approval letters, step-by-step instructions, etc.), impersonate/act as others, as well as reframe the explanation of a particular subject matter metaphorically.

Translation — real-time translation of inputted spoken word or text into a target language

Programming — prompts can be inputted as comments into an integrated development environment (IDE) and code will automatically be generated as a template for the engineer to modify. One example is GitHub Copilot.
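The interaction pattern looks roughly like this: the engineer writes a comment describing the desired behavior and the assistant proposes a completion to review and modify. The completion below is written by hand purely to illustrate the shape of the exchange, not actual Copilot output:

```python
# Prompt written by the engineer:
# "Return the n-th Fibonacci number iteratively."

# A completion of the kind a code assistant might suggest (hand-written here for illustration):
def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(10))  # -> 55
```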

Summarization — search engines can scrape web pages and general knowledge of the internet to provide answers and associated references to inputted requests / inquiries

Evolution of Conversational AI Engines through the Eras

Consolidation of Thought

While GPT-3, and transformer models more broadly, are running their course through the “Peak of Inflated Expectations” (with ChatGPT arguably the “Technology Trigger”), organizations are juggling the applicability of AI engines and capabilities across the three eras, many of which currently sit in the “Plateau of Productivity.”

I argue, based on the above framework, that all forms of AI capabilities have applicability in today’s solution set based on their intended functionality, the probabilistic nature of their output, and the level of integration amongst them. As organizations rush to incorporate Generative AI capabilities and ride the current hype cycle wave, I urge and encourage tactfulness in its incorporation, understanding of the underlying model (including its assumptions, training, and limitations), as well as attention to the level of receptiveness of the population at large.

Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/