What’s in a Voice — Speech Research at ObEN

ObEN Writer · Published in ObEN · Apr 12, 2018

ObEN is actively looking for visionary talent to join our team and help develop the next generation of speech technology. In this blog we offer an intimate look at the technology that powers our Personal AI and some of the projects our speech team is working on. Want to know what it’s like to work as a researcher at ObEN? Read our blog here. If you want to be a part of our world-class team of scientists, researchers, and engineers, visit the ObEN career page to learn more about our job perks and current openings.

Speech is an extraordinary thing. Through a complex web of muscle movements in our face and vocal cords we readily transfer our deepest emotions and wildest imaginings into the mind of another, where they take root and give birth to new thoughts, new sounds, and new speech. For almost a decade, the rapid advancement and adoption of mobile and personal smart devices have engendered new layers of complexity in the study of speech and how we communicate with each other in an increasingly connected, increasingly digital world.

Team ObEN — including our groundbreaking Speech Research Team

Speech is an integral part of ObEN’s Personal AI (PAI) — which allows users to create intelligent avatars that look, talk and behave like them. In creating new ways for people to control and use their digital identity, we’ve taken great care to ensure that speech — that intimate communication through your own voice — is not lost in the digitization of self. We do this through a combination of cutting edge research, world-class talent, and an indelible respect for the power of words given voice.

This work has led to our breakthrough technology, which allows our speech engine to create a user’s personalized TTS using only minutes of speech data. Developed in house by our speech research team, with guidance from advisers including Dr. Abeer Alwan of UCLA and Dr. Simon King of the University of Edinburgh, the engine can be quickly adapted to any voice. It consists of two key components — a voice model and an AI system that learns to control the model to produce the appropriate voice.

What’s it like to work in Speech Research at ObEN?

This adaptation process is much more complex than simply altering volume or even pitch. It means fitting the timbre of a person’s voice, their natural speaking rhythm, intonation, and unique vocal characteristics. Capturing the characteristics that make a voice come alive — and doing it quickly and accurately — involves layers of research not found in traditional, neutral TTS voices like those employed by most smart assistants on the market. Our research is about returning personality to the voice — specifically your personality.
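To give a feel for why pitch alone is the easy part: even the most basic acoustic characteristic, fundamental frequency, already takes some signal processing to estimate. The toy sketch below (our own illustration, not ObEN’s actual pipeline; the `estimate_pitch` helper is hypothetical) estimates the pitch of a voiced frame by autocorrelation — timbre, rhythm, and intonation each require richer analyses on top of this.

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) of a voiced frame via autocorrelation."""
    signal = signal - signal.mean()
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]          # keep non-negative lags only
    lag_min = int(sample_rate / fmax)     # shortest plausible pitch period
    lag_max = int(sample_rate / fmin)     # longest plausible pitch period
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag

# A 200 Hz sine wave stands in for a short voiced speech frame.
sr = 16000
t = np.arange(0, 0.05, 1 / sr)
frame = np.sin(2 * np.pi * 200 * t)
print(round(estimate_pitch(frame, sr)))   # → 200
```

Copying a voice is not just matching this one number over time: the same pitch contour spoken through two different vocal tracts sounds like two different people, which is why timbre modeling matters.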

The voice model in ObEN’s speech engine is based on statistical parametric modeling of the human voice production system. This model is highly flexible and can reproduce a wide range of voices. An AI system is trained to control the model — like an apprentice puppeteer, the AI learns from text–audio examples to drive the voice model so that it produces speech corresponding to a given text input, with the vocal characteristics of the individual user. By feeding the system a greater variety of data from speakers with different speech patterns, the system continually improves its knowledge of how various elements affect a voice, how voices differ from one another, and how to adapt to a new voice from smaller data samples. The resulting technology means that instead of sitting in a booth for hundreds of hours recording scripts for a custom TTS, our users can just speak a few lines into their mobile phone.
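The division of labor described above can be sketched in a few lines. This is a deliberately simplified stand-in (all names and the linear controller are our own illustration, not ObEN’s implementation): a fixed parametric “voice model” synthesizes from acoustic parameters, while a learned controller maps text features to those parameters, fit from paired text–audio examples. Adapting to a new speaker then means refitting only the controller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "voice model": deterministic synthesis from acoustic parameters.
# A real parametric model would generate a waveform; this just reports settings.
def voice_model(params):
    pitch, duration = params
    return {"pitch_hz": float(pitch), "duration_s": float(duration)}

# Paired training data: per-unit text features -> acoustic parameters
# observed for one speaker. In practice these come from recorded speech.
text_features = rng.normal(size=(200, 8))
true_weights = rng.normal(size=(8, 2))
observed_params = text_features @ true_weights \
    + rng.normal(scale=0.01, size=(200, 2))

# "Apprentice" controller: least-squares fit from text features to
# acoustic parameters (a linear stand-in for a neural network).
weights, *_ = np.linalg.lstsq(text_features, observed_params, rcond=None)

# Adapting to a new speaker = refitting (or fine-tuning) this controller on
# that speaker's few examples; the voice model itself stays unchanged.
new_text = rng.normal(size=8)
synthesized = voice_model(new_text @ weights)
print(synthesized)
```

Because only the controller is speaker-specific, a small amount of new data can steer an already-capable voice model — which is the intuition behind adaptation from minutes rather than hours of recordings.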

Improving this system requires data acquisition and analysis that is greatly facilitated by AI technology. In the near future, the training of the AI system that controls the model will be facilitated by blockchain technology as well. ObEN is an early adopter of blockchain technology in the area of speech research: because users who create their PAI with ObEN’s technology can secure it on the Project PAI blockchain, they have a quick and easy way to securely and continuously share new voice data (though of course always at the user’s discretion). This constant flow of data helps improve sound quality, fit to the user’s voice, and natural expressiveness.

ObEN utilizes its speech engine in two key ways. First, the consumer version of our technology allows any user to create their own customized TTS by installing one of our applications and speaking just a few lines. This version of the technology can mimic their voice prosody, allowing their PAI to speak in a good approximation of their actual voice. In addition, we also use the speech engine in the creation of our celebrity PAIs. In this case, our team captures more voice data to create a high-quality TTS that captures both the prosody and color of a celebrity’s voice. What traditionally takes days in the recording studio, our technology can achieve with only one hour of voice data.

ObEN’s speech technology is unique in that it not only enables users to create Personal AI that speaks in their voice, it also enables them to convert their speaking voice into their singing voice, and also speak in multiple languages. We’ll cover the unique speech to singing (STS) technology in another blog. As for multilingual voices, the research involves matching the phonetic and prosodic spaces from one language to another. This allows the user to have a TTS in their own voice even in languages they don’t speak. For example, through audio samples in English, we are currently exploring conversions into Chinese, Japanese, and Korean.
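One way to picture the phonetic-space matching described above: represent each phoneme in both languages as a point in a shared feature space, then map every source-language phoneme to its nearest target-language neighbor. The sketch below is purely illustrative — the inventories, the two-dimensional features, and the `nearest` helper are all hypothetical, and real systems use much richer acoustic/articulatory representations.

```python
# Toy phoneme inventories, each phoneme described by two
# articulatory-style features (all values hypothetical).
english = {"AE": (0.9, 0.2), "IY": (0.1, 0.9), "UW": (0.2, 0.1)}
mandarin = {"a": (0.95, 0.25), "i": (0.12, 0.88), "u": (0.2, 0.05)}

def nearest(features, inventory):
    # Squared Euclidean distance in the shared feature space.
    return min(
        inventory,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(features, inventory[p])),
    )

# Map each English phoneme to its closest Mandarin counterpart.
mapping = {en: nearest(feats, mandarin) for en, feats in english.items()}
print(mapping)   # → {'AE': 'a', 'IY': 'i', 'UW': 'u'}
```

With such a mapping (plus a corresponding alignment of prosodic patterns), a voice learned from English recordings can be asked to pronounce target-language phoneme sequences — which is the intuition behind a TTS speaking languages its owner does not.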

Our CEO’s Multilingual Personal AI

ObEN’s TTS system affords anyone the opportunity to create a Personal AI that sounds like them, and can also serve as a way to personalize nearly any type of voice or smart assistant. Imagine, instead of a default voice on your smartphone, laptop, or smart home product, you can personalize it with the voice of your family, friends, or even your favorite celebrities. As individuals create their PAI on the Project PAI blockchain, they can harness the open-source technology and create such personalized experiences.

The holy grail of this research is the ability to power expressive speech in the digital realm. Ultimately, our researchers plan to have our fast, personalized, multilingual TTS recognize and produce expression in speech much like a human would. Using an array of acoustic and linguistic cues, our speech team is developing even more ways to personalize and improve our technology, bringing soul and spirit back to the words of our digital selves.

Join our Community

Our newsletter subscribers get exclusive access to beta applications and news updates. Subscribe here. Follow our journey on Twitter.
