Sam Bobo
Speaking Artificially
3 min read · Dec 30, 2022

Image by Midjourney

Speechlessness in a Digital World — the true cost of taking away voice

Take a moment to reflect on the world of digital engagement, specifically among people and with the brands they interact with. Too often, people resort to posting a text-based message on social media or sending an SMS to a friend; likewise, companies strive to deflect inbound calls to digital modalities like chatbots and web pages so that fewer customer problems require an agent’s help. “Voice is a legacy modality!” claim analysts who study the Customer Engagement space. “The future is digital!”

The question I pose within this blog post is: “What is the power of Voice modalities, and what is lost when engaging through text-based and digital channels?” As a Product Manager specializing in Conversational AI technologies, I’ve found this question quite intriguing, especially as I began engaging with colleagues in strategy-based conversations on the matter.

Voice: the inflection, tone, prosody, rate, and volume with which voice-based communication is delivered carry significant context that today’s conversational systems fail to capture. As opposed to text, which I’ve claimed is a first-dimension method of communication, voice is a second dimension, delivering richer contextual and conversational cues that are vital for engagement.

Whether it is the message people are trying to convey, the real-time thought process that arrives at that message, or the underlying feelings driving it, voice conveys a significant amount of context unique to its modality:

Message

  • Punctuation: Humans almost naturally imply punctuation in speech, most notably for questions. When we pose a question, our vocal inflection typically rises toward the end of the sentence. To transcribe the question mark, many speech recognition systems rely on energy detection and a base language model.
  • Sarcasm: While not typical within brand engagement, human-to-human engagement can involve sarcasm as a form of comedy or disgust. Sarcasm can often be detected through voice, whereas in text-based mediums it can get lost or miscommunicated. People and systems need to respond differently when facing sarcasm, and speech makes it easier to detect.
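
To make the punctuation point concrete, here is a minimal Python sketch of how a rising inflection could turn the same words into a question. It is a simplified heuristic of my own for illustration, not how production speech recognizers work (the energy detection and language model approach mentioned above is more involved). It assumes a pitch contour has already been extracted as one fundamental-frequency value in Hz per frame, with 0 marking unvoiced frames; the function names, the `tail_fraction` parameter, and the `rise_threshold` value are all hypothetical.

```python
from statistics import mean


def final_pitch_slope(f0_hz, tail_fraction=0.3):
    """Least-squares slope (Hz per frame) over the last part of a pitch contour."""
    # Assumed convention: a value of 0 marks an unvoiced frame, so drop those.
    voiced = [f for f in f0_hz if f > 0]
    n_tail = max(2, int(len(voiced) * tail_fraction))
    tail = voiced[-n_tail:]
    if len(tail) < 2:
        return 0.0  # not enough voiced frames to estimate a trend
    xs = range(len(tail))
    x_bar, y_bar = mean(xs), mean(tail)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, tail))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def punctuate(transcript, f0_hz, rise_threshold=2.0):
    """Append '?' when the contour ends on a rise, otherwise '.'."""
    # rise_threshold is an illustrative value, not a tuned one.
    mark = "?" if final_pitch_slope(f0_hz) > rise_threshold else "."
    return transcript.rstrip(".?! ") + mark


# Two hypothetical contours for the same words.
rising = [180, 182, 181, 183, 190, 200, 215, 230]
falling = [220, 215, 210, 200, 190, 180, 170, 160]

print(punctuate("you shipped it already", rising))
print(punctuate("you shipped it already", falling))
```

The words are identical in both calls; only the contour differs. A text-only channel sees one string, while the voice channel carries the difference between a statement and a question.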

Thought

  • Stream of Consciousness: One’s stream of consciousness, the continual portrayal of a thought process, is far more prevalent in voice-based communication than in text, where it is typically hidden behind three ellipsis dots or a typing indicator. Because voice is communicated in real time, a listener can follow mental pivots, sudden realizations, and logical rationalization as they happen, rather than seeing only the captured intent or the finished text of a digital modality.
  • Fillers: Sounds such as “umm” and “hmm” are often inserted into pauses in speech to fill an awkward void. They often signal incomplete thoughts or changes in thought, which can provide underlying context for the problem or statement being made.

Feeling

  • Emotion: Whether crying or laughing, using a stronger or softer tone, or speaking faster or slower, the human voice is emotive. When faced with emotion, people tailor their responses toward empathy (if comforting) or anger (if escalating further). In text-only mediums, that emotion can only be inferred from diction.

As illustrated, voice is a powerful medium! It’s understandable that, in the realm of customer engagement, long-tail questions — less common requests or questions that involve higher levels of research and thought — are handled via voice mediums and typically involve human engagement. This is demonstrated further as humans prefer to engage in deeper conversation in person (physical or digital) and using voice rather than text.

I’d argue that our systems today inadequately capture the true value of the vocal medium. Simply transcribing speech into text loses degrees of freedom and restricts the information available for solving the problem or inquiry at hand with the correct manner, tone, and diction of response. The work done by Hume.ai on Speech Prosody starts to drive toward this type of vocal analysis for conversational systems. I personally believe that emerging models that focus heavily on the uniqueness of speech will start to gain advantages in the market. The next time the decision comes to communicate via text, or to deflect to that medium, please think about what value is being lost, regardless of the cost.

Product Manager of Artificial Intelligence, Conversational AI, and Enterprise Transformation | Former IBM Watson | https://www.linkedin.com/in/sambobo/