As Google’s DeepMind releases WaveNet, a technology that produces computer-generated voices nearly indistinguishable from a human’s, should DeepMind be creating this at all? This week at Google I/O we saw Google’s demo of a very human-sounding phone call to a hairdresser, and the mixed feelings it caused among those who watched it.
Should we be mimicking the human voice?
We have seen voice assistants create more and more human-like interactions. Amazon recently announced Echo for Kids (a filtered subset of kid-friendly voice skills) and is also testing a reward system for when kids say please and thank you. If the objective is for Echo/Alexa to communicate like a human, then mirroring social constructs is good. But should kids be able to tell the difference between a computer’s voice and a person’s?
** Only yesterday, Amazon started allowing developers to pick one of eight voices when building an Amazon Alexa skill.
Humanistic behaviour is appearing everywhere, even in our toilets. That’s right, I said toilets. On a recent Virgin train service, I encountered one such example. The audio announcement on entering was both human and funny, as you would expect from Virgin. It made me actually listen to what it was saying, unlike a robotic voice that doesn’t connect with me, such as the one on the train that reads out the next station, whose repetitive tone can make you zone out and miss your stop. Once a voice passes the audible Turing test, the focus shifts to the actual content, the words themselves, because if a person kept saying the same thing to you over and over, you would tune out.
Virgin could have recorded several variations of that audio and rotated them at random every 20 minutes. If Virgin had a generated “Virgin” brand voice, though, it could say whatever it needed. Being able to create a computer-generated voice for your brand has a multitude of benefits.
If you have ever worked with voice artists, you will know how difficult it is to get the right tone, speed and flow from the speaker, to avoid sound interference, or to capture the same feeling when recording over multiple days. A generated voice removes these problems: there is no need to book a voice artist to record the clip, no expensive equipment and no need for a quiet place to set up.
With your computer-generated voice, you can open an app, type in your sentence or phrase, export it within seconds, then quickly add it to your YouTube video, commercial or voice assistant, or even have it act as your podcast interviewer. Your voice can be tweaked for mood, emotion and language. Your brand voice may need to work across languages, dialects, situations and locations; maybe we can even get rid of all those poorly dubbed makeup adverts.
When voice is too good
The issue comes when the voice is too good. What happens when you call a customer support helpline and speak with Jane Smith? She answers your query, but later there is a problem, so you call back to complain and ask to speak with Jane Smith. Will companies identify their computer-generated voices? Then again, simply sparing human agents from ranting, negative customer complaints could have a dramatic positive effect on the well-being of customer service agents worldwide.
One of the most prominent areas for voice skills right now is health and wellbeing. Motivation to succeed lands far better in a human-sounding voice than in an automated push notification awarding virtual kudos points. These voice experiences are getting plenty of positive reviews and engagement, but the next step has to be personalised voice experiences.
Your motivation coach has to send you a personalised message. This was easy when it was a push notification: “Dear [NAME], you’re awesome”.
However, a human motivation coach can’t record personalised voice messages for 100,000 users every day to keep their audience engaged. With a computerised version of their voice, though, they could train a model on their voice and generate those 100,000 unique voice experiences within a few minutes.
Anyone will be able to have their own voice generated.
We have seen fake news and, more recently, fake videos on the internet of famous people, such as Obama, saying things they never said. Startups like Lyrebird have shown that they can mimic a human voice from as little as 90 seconds of recorded speech. So, in an increasingly digital world, will it become as hard to judge whether audio is real as it is with a news headline today?
I believe that, as an industry, we will continue to make computer-generated voices more human in their language, tone and verbal emotion, because the more human they are, the more we will interact and engage with them. Just like video games and their near-photorealistic graphics, voice will follow the money. I am not sure I agree that this should happen, but it is going to, so how can we build a future where we are aware that not all voices are human?
Would you want a computer-generated version of your voice?