Google Wavenet vs Amazon Polly
For the last few days I’ve been busy migrating Read2Me from AWS Polly to Google Wavenet.
Was it an easy process? Certainly not. The terribly broken playground, 400 Bad Request responses with no reason given for the failure, and the lack of native support for concurrency (needed to speed up the conversion process) made this a far more difficult implementation than AWS Polly, whose SDK is of the highest quality and supports a breadth of features.
Google was kind enough to give us a free, modern-looking, Material Design playground for its TTS services (AWS did the same thing, but its UI is far less aesthetically pleasing).
The reality is that the thing barely works for Wavenet voices. It might work if you supply a short sentence or two, but for any meaningful testing it’ll simply hang. Not a great starting experience, and it is what actually put me off implementing it when Google first offered it as a SaaS about a month ago.
The implementation chronicles
Google is a hi-tech company that produces high-quality software, so why its PHP SDK sucks so much ass is beyond me. It lacks clear documentation on how to achieve what you’re after.
It doesn’t even give you a built-in way of issuing API requests concurrently. Even after I wrote the async generator, I had to spend another full day figuring out how to handle rejected promises: the SDK’s Guzzle client uses its own handler, so 400 Bad Request responses don’t bubble up as exceptions (which is how Guzzle figures out that a promise has been rejected).
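To make the idea concrete, here is a minimal sketch of the pattern: fire off the chunks concurrently and collect failures next to successes instead of letting them vanish. It is written in Python rather than PHP and is not Read2Me’s actual code. `fake_synthesize`, `SynthesisError` and `synthesize_all` are hypothetical names made up for illustration, and the fake call simply pretends the API rejects any chunk containing a raw ampersand.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class SynthesisError(Exception):
    """Hypothetical error that keeps the HTTP status and the API's message."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def fake_synthesize(chunk):
    # Stand-in for a real TTS request (assumption: no real API is called).
    # It rejects chunks with a raw ampersand to simulate a 400 Bad Request.
    if "&" in chunk:
        raise SynthesisError(400, "Invalid SSML")
    return f"audio({chunk})".encode()


def synthesize_all(chunks, synthesize, max_workers=4):
    """Run synthesize() on every chunk concurrently.

    Returns (results, failures), both keyed by the chunk's index, so a
    rejected request is recorded with its reason instead of being lost.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(synthesize, c): i for i, c in enumerate(chunks)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except SynthesisError as err:
                failures[index] = err
    return results, failures


results, failures = synthesize_all(["Hello.", "Q&A time.", "Bye."], fake_synthesize)
print(sorted(results))                              # [0, 2]
print({i: str(e) for i, e in failures.items()})     # {1: '400: Invalid SSML'}
```

Keying both dictionaries by chunk index means the audio can be stitched back together in order, and the failed chunks can be retried or reported with the reason attached.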
If your request is invalid, the API will tell you what was wrong, but there’s no way of getting that information from the SDK. I spent 4 hours figuring out why my custom SSML tags were getting ignored; it turned out to be an unescaped ampersand, which I actually tracked down using AWS’ Polly playground.
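The ampersand problem is cheap to guard against: escape the text before wrapping it in SSML, and check that the result parses as XML before sending it anywhere. The sketch below uses only Python’s standard library. `to_ssml` and `is_well_formed` are hypothetical helper names, and it assumes the input is plain text with no SSML tags of its own (those would get escaped too).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(text):
    """Wrap plain text in <speak>, escaping &, < and > on the way."""
    return "<speak>" + escape(text) + "</speak>"


def is_well_formed(ssml):
    """Return True if the SSML string parses as XML, False otherwise."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


print(to_ssml("Q&A: is 3 < 5?"))                     # <speak>Q&amp;A: is 3 &lt; 5?</speak>
print(is_well_formed("<speak>Tom & Jerry</speak>"))  # False
print(is_well_formed(to_ssml("Tom & Jerry")))        # True
```

Running the well-formedness check locally gives you the error before the request goes out, which is exactly the feedback that was missing from the SDK.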
The SDK doesn’t even have a TTS example! AWS, with concurrency and automatic retrying of rejected promises out of the box, makes this a breeze, so much so that you don’t even need service-specific examples. It even provides progress hooks so you can easily build a live progress bar on your frontend (which is what I’ve done on Read2Me).
Wavenet’s voice quality
Put simply, Wavenet outperforms Polly in terms of lifelikeness.
I have created a Wavenet vs Polly sample: https://soundcloud.com/nino-kopac-237040096/sets/google-wavenet-vs-aws-polly-a-demo-created-on-read2me
For a more scientific read you should head to this URL: https://cloud.google.com/text-to-speech/docs/wavenet
Polly was the first really good lifelike TTS, but unless AWS makes it better, it’s not going to be able to compete quality-wise (but it does compete price-wise — Polly is 4x cheaper than Wavenet).
Wavenet is more lifelike than Polly because of its more dynamic speaking range (i.e. prosody), but Wavenet still doesn’t seem to understand context (please correct me if I’m wrong, I haven’t gone deep into DeepMind’s paper yet). This is changing with the next iteration of TTS (also coming from Google), called Tacotron, which I’m eager to implement into Read2Me when it becomes available.
Note that even though Wavenet’s docs recommend OGG Vorbis instead of MP3 as the output format (which I fully support, since Vorbis is an open source codec), that won’t work in browsers on iOS devices (both Safari and Chrome) or in iTunes, so I decided to stay with MP3. I wish I had known this before I wrote a Vorbis metadata extractor, which I’m giving away for free.
Keep in mind that as of the time of writing, Wavenet’s endpoints are still in beta, so if you do decide to implement it, keep an eye on its release notes for any breaking changes.
Head to Read2Me and give the new Wavenet voice a go — whether it’s by listening to Daily Curated Articles, an article or a document of your own choosing or converting your existing news or blog publication into a podcast using Read2Me’s plug and play widget (no coding required).
Finally, if you’re looking for someone to integrate Wavenet into your PHP app or someone who’s got a lot of experience regarding audio on the web, I may be available for hire on Upwork.