Ushering in the era of speech-to-meaning

Spoken language is an extraordinary thing. Over millions of years, we have evolved the ability to formulate complex feelings and thoughts, and communicate them in all their rich nuance by simply talking to the people around us. Considering the wonderful lexical complexity of the language we use in our daily relationships makes it easy to forget a simple fact: fundamentally, we are emotional beings, and we express ourselves as such. We sigh, we grunt, we scream, we laugh. From the songs we sing to the arguments we have, from the complaint call we make to a business hotline to the one we give our relatives to announce the arrival of a newborn baby: our spoken interactions are literally loaded with emotions.

“Why can’t I talk to my computer the way I talk to my friends?”

As machines become ever more present in our daily lives, so grows our need for natural interaction with them. We’ve heard it for decades, we’ve seen it in countless movies: voice is the future of human-computer interfaces. Machines can now transcribe noisy conversations with over 95% accuracy¹; they can generate speech that is virtually indistinguishable from that of a human being². So why aren’t we already talking to our devices the way we talk to one another? Simply put, because these devices currently fail to understand us beyond WHAT we say, and thus miss the meaning that lies in HOW we say it.

If I am really bleeding to death, I’d expect my AI assistant to be a lot more alarmed!

Again, we’re animals! There’s more in our speech than mere words; in fact, they are just the tip of the iceberg. One year ago, Teo and I decided it was time to move from speech-to-text to speech-to-meaning, and we started OTO (more on our story, and the importance of tone, in Teo’s previous post).

Our master plan revolves around three things:

  1. Create amazing technology that allows machines to understand how we speak;
  2. Leverage the ever-growing trove of conversational data we create through our interactions;
  3. Combine the two into products people love to augment humans with AI.

And iterate.

Technology: towards human-level understanding of speech nuances

The main technology we spun off from SRI International is called SenSay³: at its core, an acoustic processing engine that allows us to do real-time modelling of a speaker’s latent state (think emotions, engagement, intent, …) with an unprecedented level of accuracy.

The result of decades of expertise and years of development at SRI International, SenSay allows us to transform a spoken conversation into thousands of acoustic properties every second, which builds a live “map” of how an interaction is unfolding, and allows us to drill down into the second-by-second acoustic structure of a conversation. Over the past year, we have deeply integrated SenSay into our tech stack, and allowed it to scale to the analysis of thousands of parallel streams.
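
To give a feel for what turning audio into per-frame acoustic properties looks like, here is a minimal sketch. It is not SenSay: the function name `frame_features`, the 25 ms window, the 10 ms hop and the two features (RMS energy and zero-crossing rate) are all assumptions chosen for illustration, and a production engine would compute far more properties per frame.

```python
import math

def frame_features(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Return one (rms_energy, zero_crossing_rate) pair per overlapping frame.

    Illustrative only: window sizes and features are assumptions,
    not a description of SenSay.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        # Loudness of the frame.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((rms, crossings / (frame_len - 1)))
    return features

# One second of a synthetic 220 Hz tone stands in for real speech.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
features = frame_features(tone, sample_rate)
print(len(features))  # 98 frames for one second of audio
```

Even with only two features, one second of audio already yields close to two hundred numbers; stacking many more descriptors per frame is how a per-second count in the thousands comes about.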

OTO’s SenSay technology classifying spoken emotions in real time

Once we have created a rich representation (an “embedding”) of the acoustic space of a conversation, we can pair it with the words therein to augment traditional Natural Language Processing (NLP) methods, which mostly focus on text. Merging the acoustic and lexical dimensions of an interaction into a multimodal embedding is a novel approach we call Acoustic Language Processing (ALP). Early results already show that combining the acoustic and lexical dimensions allows us to halve the error in classifying emotions in speech, showing the complementarity of these two sources of information (these results will be detailed in a future post!). By allowing the automation of text mining at scale, modern NLP methods brought us breakthroughs in thorny problems such as machine translation. Similarly, the introduction of ALP as a unified modelling framework for human interactions will unlock the meaning hidden in HOW we communicate with others, and allow machines to distinguish a heartfelt “thank you” from an irritated one⁴.
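
As a rough illustration of why a multimodal embedding helps, the sketch below fuses an acoustic vector and a lexical vector by normalising each and concatenating them. This is plain early fusion, assumed here for simplicity; it is not necessarily how ALP combines the two modalities, and the vectors and the names `fuse` and `cosine` are made up for the example.

```python
import math

def l2_normalise(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def fuse(acoustic, lexical):
    """Early fusion: normalise each modality, then concatenate."""
    return l2_normalise(acoustic) + l2_normalise(lexical)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the same words ("thank you"), two different tones.
lexical_thank_you = [0.3, 0.8]
acoustic_heartfelt = [0.9, 0.1, 0.2]
acoustic_irritated = [0.1, 0.9, 0.7]

text_only = cosine(lexical_thank_you, lexical_thank_you)
fused = cosine(
    fuse(acoustic_heartfelt, lexical_thank_you),
    fuse(acoustic_irritated, lexical_thank_you),
)
print(f"text only: {text_only:.2f}, fused: {fused:.2f}")
```

On text alone the two utterances are identical, so no classifier could tell them apart; once the acoustic half is attached, their similarity drops and the difference in tone becomes something a model can learn from.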

Data: learning the subtleties of human communication from millions of hours of conversations

You’ve heard it before: if artificial intelligence is the planet we’re headed to, machine learning is the rocket, and data is the fuel. The analogy, corny as it is, underlines a hard truth: nowadays there is no AI without data, and more/better data usually beats better models. We bootstrapped OTO’s ALP models with a proprietary data set of emotional utterances by human actors. At hundreds of hours and thousands of speakers, its size and diversity dwarf any public data set of human emotions.

While we started with raw emotions, our real goal is to build a deep understanding of real-world conversations with an emotional undertone. What better place to start than call centres, which receive millions of such calls every day? This is where OTO’s treasure trove lies: over the past few months, we have already accumulated over 10’000 hours of business conversations with associated metadata (for example success of sales, satisfaction, churn), with the aim of reaching 1 million hours + metadata by 2020.

In the same way that the machine translation models mentioned above required millions of documents to learn linguistic pairings, we are building a huge reference corpus to understand how humans communicate in business conversations. This is how we aim to enable what I call the virtuous data circle (a network effect): more data begets better models (more diverse predictions, higher accuracy), built into better products, which in turn unlock more data.

Elevating user experience, and delivering voice-intelligence-as-a-service

Every journey starts with a small step: while our eyes are set on the horizon, our hands are busy working on very concrete use cases. One of our first projects consisted of predicting the conversion of a sale from the acoustic structure of the conversation. To this end, we analysed 4’000 hours of inbound sales conversations with an approximately 50% conversion rate. We trained our deep learning models to capture the “acoustic signature” of a successful sale in the combination of the agent’s input and the customer’s reaction, and evaluated them on recordings OTO had never heard. Remarkably, we reached 94% accuracy in predicting the outcome of a call from its acoustics alone, which you could compare to listening to a muffled conversation through a closed door (you hear the intonation but not the words).
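
To put the 94% figure in context: with roughly half of the calls converting, always guessing the more common outcome would be right only about 50% of the time. The sketch below works through that comparison on toy labels (100 made-up calls, not OTO’s data); the helper names `accuracy` and `majority_baseline` are ours.

```python
def accuracy(predictions, labels):
    """Fraction of calls whose predicted outcome matches the real one."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def majority_baseline(labels):
    """Accuracy of always predicting the most common outcome."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

# Toy held-out set: 100 calls, half of them converted (1) and half not (0).
labels = [1] * 50 + [0] * 50

# Toy predictions: correct everywhere except six calls.
predictions = list(labels)
for index in (0, 1, 2, 50, 51, 52):
    predictions[index] = 1 - predictions[index]

model_accuracy = accuracy(predictions, labels)
baseline = majority_baseline(labels)
# Share of the baseline's mistakes that the model avoids.
error_reduction = 1 - (1 - model_accuracy) / (1 - baseline)
print(f"model {model_accuracy:.0%}, baseline {baseline:.0%}, "
      f"errors avoided {error_reduction:.0%}")
```

On a balanced set like this one, a 94% accuracy means the model avoids close to nine out of ten of the mistakes that blind guessing would make.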

As we found out that we could accurately model the outcome of a call from its acoustic structure, we turned this use case into a real-time coach for call centre agents, helping them improve their tone of voice during sales conversations.

OTO’s real-time “Engagement Coach”, helping call centre agents sound upbeat and empathetic

We ran a controlled trial across 40 agents earlier this year, and the results were unequivocal: agents using OTO as a coach saw their engagement score increase by up to 20% compared to those who didn’t, and their associated sales conversions by about 5% as a result of their improved tone. We are now deploying a scaled-up version of our system to simultaneously coach about a thousand agents across the US.

A view of the OTO live dashboard showing engagement statistics for the past hour

Creating such an acoustic understanding at the conversation level is just a first step: we are currently at work on use cases that will allow us to precisely identify the turning points in an interaction. Which specific sentence, which exact word produced excitement or frustration? How did an interaction result in an affective decision? Through the analysis of business conversations, we are building our vision of creating a full ecosystem around voice-intelligence-as-a-service, with a standardised interface usable by anyone who wants to humanise their voice application.

Join the voice intelligence revolution

Build innovative technology, use it to unlock data, merge the two into great products, and iterate. I make it sound simple, but in reality it requires the patient dedication of a talented team to build and ship quality software; day in, day out. If you’re passionate about any combination of speech technology, signal processing, machine learning, and product in general, we’re hiring on the Zurich and San Francisco teams: reach out at

Our commitment doesn’t stop at our team and our customers. We are aware that today’s rapid progress in AI is made possible by the huge efforts of the community, and to contribute back we will release some of the building blocks of our tech stack as open-source software. We are also committing to the long game by creating partnerships with public research institutions, so the impact of our technology goes beyond the market we serve, and ultimately benefits society as a whole. More on all this, soon… We can’t wait to share what we have in store!

Nico & the OTO team

1. “The Business of Artificial Intelligence”. Harvard Business Review. July 18, 2017.
2. For example: “Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone”. Google AI Blog. May 8, 2018.
3. “Customer Service Bots Are Getting Better at Detecting Your Agitation”. MIT Technology Review. September 14, 2016.
4. More in our previous post: “Introducing OTO”. Medium. July 24, 2018.