Speech Metadata Unleashes the Power of Context-Aware Conversational Agents

Conversational input is also about speech metadata: the people, emotions, environment, and background sounds

Hicham Tahiri
A Cloud Guru
3 min read · Feb 20, 2017


Imagine the possibilities of conversational agents if developers had access to speech metadata.

I was recently asked by my friend Bret Kinsella from voicebot.ai for my predictions on AI and Voice. You can find my two cents in the post 2017 Predictions From Voice-first Industry Leaders. In that article, I mentioned the concept of speech metadata, which I want to explore here in more detail.

As a Voice App developer, when you deal with voice input coming from an Amazon Echo or a Google Home, the best you can get today is a transcription of what the user said.

While it’s cool to finally have access to efficient speech-to-text engines, it’s a bit sad that so much valuable information is lost in the process!

The reality of conversational input is much more than just a sequence of words. It’s also about:

  • the people: is it John or Emma speaking?
  • the emotions: is Emma happy? angry? excited? tired? laughing?
  • the environment: is she walking on a beach or stuck in a traffic jam?
  • local sounds: a door slamming? a fire alarm? some birds tweeting?

Now imagine the possibilities, and how much smarter conversations could be, if we had access to all this information. Huge!
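To make this concrete, here is a rough sketch of what a voice request enriched with speech metadata could look like. This is purely hypothetical: the interfaces, field names, and values below are my own invention, and no platform exposes anything like this today.

```typescript
// Hypothetical shape of a voice request enriched with speech metadata.
// None of these fields exist in today's Alexa or Google Assistant payloads;
// this is only a sketch of what "speech metadata" could look like.
interface SpeechMetadata {
  speaker?: {
    id: string;             // e.g. "emma", resolved by voice identification
    confidence: number;     // 0..1
  };
  emotion?: {
    label: "happy" | "angry" | "excited" | "tired" | "laughing" | "neutral";
    confidence: number;
  };
  environment?: {
    scene: string;          // e.g. "beach", "car_in_traffic", "kitchen"
    noiseLevelDb?: number;
  };
  localSounds?: Array<{
    label: string;          // e.g. "door_slam", "fire_alarm", "birds"
    timestampMs: number;    // offset from the start of the utterance
  }>;
}

interface VoiceRequest {
  transcript: string;         // what we already get today
  metadata?: SpeechMetadata;  // what we could get tomorrow
}

// A context-aware agent could then branch on more than the words alone:
function respond(req: VoiceRequest): string {
  const emotion = req.metadata?.emotion;
  if (emotion && emotion.label === "tired" && emotion.confidence > 0.7) {
    return "You sound tired. Want me to keep this short?";
  }
  return `You said: ${req.transcript}`;
}
```

Even a simple payload like this would let an app adapt its tone, its verbosity, or its whole behavior to the person and the situation, instead of reacting to the words alone.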

But we could go even further!

It’s well known in communication that, when interacting with someone, non-verbal communication is as important as verbal communication.

So why are we sticking to the verbal side of the conversation while interacting with Voice Apps?

Speech metadata is all about the non-verbal information, which is, in my opinion, the submerged part of the iceberg and thus the most interesting part to explore!

A good example of speech metadata is the combination of vision and voice processing in the movie Her. With the addition of a camera, new conversations can happen, such as discussing the beauty of a sunset, the origin of an artwork, or the composition of a chocolate bar!

The main interface in the film “Her” is voice, delivered through a discreet earpiece

One of the many startups beginning to offer this kind of rich interaction is Asteria, an “artificial intelligent companion that you carry with you. It sees what you see, hears what you hear, takes in life as you do, and gets smarter all along the way.”

Hello Asteria 🗣

Unleash the Speech Metadata

I think this is the way to go, and the availability of conversational metadata would unleash a tremendous number of innovative apps.

In particular, I hope Amazon, Google & Microsoft will release some of this data in 2017 so developers can work on fully context-aware conversational agents.

Hicham Tahiri is a French software engineer with extensive professional experience in voice interfaces in both the automotive and the robotics industries.

In 2012, Hicham launched Smartly.ai, a startup that provided voice-based interfaces. When the Alexa Skills Kit arrived in 2015, he crafted a developer toolbox that offered a visual conversation design tool with automatic code generation, a community-generated intents library and a voice simulator.

Hicham also created the Alexa skill Blind Test, which plays a short sample of a song and asks you to guess the artist. With more than 100 song samples, it’s a fun multiplayer game you can enjoy with friends to challenge your music knowledge.
