It’s Show Time: AWS Polly Vs Google Cloud Text-To-Speech

Francesco Malatesta
Published in Francesco Codes
Apr 11, 2018

This is the English “translation” of an article I wrote for my blog. You can find the original on francesco.codes. Credits to Luca Matteis for the proofreading and Alessio Biancalana for his advice.

A few months ago I launched a little side-project / experiment, cryptoaud.io. Basically it’s a news aggregator for the cryptocurrency world that creates an audio version of the digest of every crawled article, using the AWS Polly text-to-speech APIs. The idea: listen to the news instead of reading it.

Such waves, much spectrum.

I always try to keep up with new developments in text-to-speech technology because I think it’s one of the most interesting fields in information technology. When I stumbled upon AWS Polly I was really amazed at how well it does the job.

Then it happened: I was in the middle of my morning news reading session when an article on The Verge totally captured my attention. Google had just released a new cloud service for text-to-speech.

Guess the name? Google Cloud Text-to-Speech. Wow.

The originality of the name aside, I had to try it.

So here I am. In this article I will show you:

  • what this new service does, and how;
  • a quick comparison with AWS Polly;
  • a first basic implementation in PHP.

What is Google Cloud Text-to-Speech / First Test

Let’s see what Google Cloud Text-to-Speech does, and what impressed me most given what I was looking for:

  • more natural-sounding speech than its competitors, thanks to WaveNet under the hood;
  • everything can be done through dedicated REST APIs;
  • an interesting pricing model: the service starts with a free tier of one million characters. After that, you pay $16 per million characters. The price goes down to $4 per million for non-WaveNet voices (not interested).
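
To get a feel for those numbers, here is a minimal sketch of a cost estimate. It assumes that characters beyond the free tier are billed pro rata at the per-million price, which the figures above suggest but which I have not verified against an invoice. The function name estimateTtsCost is made up for this example.

```php
<?php

// Rough cost estimate using the prices quoted above.
// Assumption: usage beyond the free tier is billed per character, pro rata.
function estimateTtsCost(int $characters, float $pricePerMillion = 16.0, int $freeCharacters = 1000000): float
{
    // Only characters above the free tier are billable.
    $billable = max(0, $characters - $freeCharacters);

    return round($billable / 1000000 * $pricePerMillion, 2);
}

// 2.5 million characters: 1.5 million of them are billable at $16 per million.
printf("%.2f\n", estimateTtsCost(2500000));
```

Passing 4.0 as the second argument gives the same estimate for the non-WaveNet price.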

On the Cloud Text-to-Speech project home page you can find a form to test its power. So, I went to one of my sources (Cointelegraph) and picked a paragraph from the latest article.

“Cisco, a worldwide leader in IT and networking, is developing a method of confidential group communications based on Blockchain technology, according to a patent application released by the US Patent and Trademark Office (USPTO) March 29.”

I put it in the form and clicked on “Speak It”.

Here’s the AWS Polly Version.

Here’s the Cloud Text-to-Speech Version.

There’s not much to say: you can easily hear the difference. The two results don’t only differ in how they sound, though: the Polly audio file is ~90kB, while the Cloud TTS one is ~750kB. Obviously you can tune many parameters in the process, but this is what you get with the simplest API call.

Ok, now we know something more about it. We just need to create…

The First Test PHP Script

Now it’s time to set up a basic PHP script to create an .mp3 file from a text using the Cloud TTS APIs.

If you’re lazy don’t worry: everything can be found in this GitHub repository.

In order to work with the Google APIs we need at least an API key, which you can create from this dedicated page on Google.

Here are the contents of the script.php file.

<?php

require 'vendor/autoload.php';

$googleAPIKey = 'YOUR_API_KEY_HERE';
$articleText = 'Cisco, a worldwide leader in IT and networking, is developing a method of confidential group communications based on Blockchain technology, according to a patent application released by the US Patent and Trademark Office (USPTO) March 29.';

$client = new GuzzleHttp\Client();

// Request body: the text to read, the voice to use and the audio output settings.
$requestData = [
    'input' => [
        'text' => $articleText,
    ],
    'voice' => [
        'languageCode' => 'en-US',
        'name' => 'en-US-Wavenet-F',
    ],
    'audioConfig' => [
        'audioEncoding' => 'MP3',
        'pitch' => 0.00,
        'speakingRate' => 1.00,
    ],
];

try {
    $response = $client->request(
        'POST',
        'https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=' . $googleAPIKey,
        ['json' => $requestData]
    );
} catch (Exception $e) {
    die('Something went wrong: ' . $e->getMessage());
}

// The audio comes back base64-encoded in the "audioContent" field.
$fileData = json_decode($response->getBody()->getContents(), true);
file_put_contents('tts.mp3', base64_decode($fileData['audioContent']));

No rocket science, as you can see. Just a simple POST request with a JSON body and well-defined parameters. You can find their reference on this page on the official docs.
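
If you want to play with those parameters, a small helper makes it easier. The sketch below keeps the defaults from the script and lets you override the audio settings. The helper name buildSynthesisRequest is mine, not part of any library, and the OGG_OPUS encoding value is the one I remember from the API reference, so double-check it against the docs before relying on it.

```php
<?php

// Builds the JSON body for the text:synthesize call.
// Defaults mirror the script above; $audioOverrides replaces individual audioConfig keys.
function buildSynthesisRequest(string $text, array $audioOverrides = []): array
{
    $audioConfig = array_merge([
        'audioEncoding' => 'MP3',
        'pitch' => 0.0,
        'speakingRate' => 1.0,
    ], $audioOverrides);

    return [
        'input' => ['text' => $text],
        'voice' => ['languageCode' => 'en-US', 'name' => 'en-US-Wavenet-F'],
        'audioConfig' => $audioConfig,
    ];
}

// Example: slightly faster speech, encoded as Ogg Opus instead of MP3 (assumed enum value).
$body = buildSynthesisRequest('Hello, world.', [
    'audioEncoding' => 'OGG_OPUS',
    'speakingRate' => 1.15,
]);

echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
```

The returned array can be passed as the 'json' option of the Guzzle request, exactly like $requestData in the script.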

This is the response format:

{
    "audioContent": "file_contents"
}

What you find in “file_contents” is basically the file contents (you don’t say…) encoded in base64. This is why I decode it before saving the final .mp3 file.
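
The script trusts the response blindly. A slightly more defensive version of the decoding step could look like the sketch below. The function name decodeAudioContent is hypothetical, and the fake response only stands in for a real API call so the example runs on its own.

```php
<?php

// Extracts the raw audio bytes from a text:synthesize JSON response.
// Throws if the JSON is malformed, the field is missing or the payload is not valid base64.
function decodeAudioContent(string $json): string
{
    $data = json_decode($json, true);

    if (!is_array($data) || !isset($data['audioContent']) || !is_string($data['audioContent'])) {
        throw new RuntimeException('Response has no audioContent field.');
    }

    // Strict mode makes base64_decode return false on characters outside the base64 alphabet.
    $audio = base64_decode($data['audioContent'], true);

    if ($audio === false) {
        throw new RuntimeException('audioContent is not valid base64.');
    }

    return $audio;
}

// Stand-in response, so the sketch runs without calling the API.
$fakeResponse = json_encode(['audioContent' => base64_encode('fake mp3 bytes')]);

echo strlen(decodeAudioContent($fakeResponse)), " bytes decoded\n";
```

In the real script you would pass $response->getBody()->getContents() to the function and write the result to tts.mp3.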

Well… So?

Time’s up! After some basic tests I think I can say it: I like Cloud TTS a lot. The final result is definitely good and, in the end, I implemented it on cryptoaud.io really easily.

Yep, $16 for a million characters is not $4… but hey, “The Adventures of Huckleberry Finn” is only 600,000 characters long.

Not so much after all, huh?

Francesco Malatesta
Francesco Codes

Developer @ AdEspresso/Hootsuite, Founder @ Laravel-Italia.it, Editor @ Sitepoint. Developer, Curious, Enthusiast. — http://francesco.codes