How To Make Unity Speak
Using The Tech Behind Amazon Alexa for Text-To-Speech in a Unity Game
This post is about using synthetic speech generated by Amazon Polly to read out text in our game Askutron Quiz. To find out how the synthetic speech audio is actually generated, check out this blog post by my brother:
When we at Goldsaucer decided to build Askutron: The Quiz Show Game we had to decide how we would get the voice over… (medium.com)
Making use of the synthetic speech generated on our server using Amazon Polly is relatively easy in Unity. For this we first need a URL that tells us where to download the audio file for each question and answer. While you could download or even stream the audio directly from Amazon’s web service, we’re using a Ruby app on our server as a mediator. It caches audio files once they have been generated, which reduces costs and improves performance.
Our game gets these URLs from the quizzes, which are supplied to the game as JSON files.
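To give a rough idea of what such quiz data could look like on the C# side, here is a minimal sketch. The class and field names (`Quiz`, `Question`, `Answer`, `text`, `audio`, `language`) and the sample paths in the comment are assumptions made for illustration, not the actual Askutron format.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a quiz file; every field name here is an assumption.
// A JSON document matching these classes could look like this:
// {
//   "language": "en",
//   "questions": [
//     {
//       "text": "What is the capital of France?",
//       "audio": "/audios/1.mp3",
//       "answers": [ { "text": "Paris", "audio": "/audios/2.mp3" } ]
//     }
//   ]
// }
[Serializable]
public class Answer
{
    public string text;
    public string audio; // URI of the pre-generated audio, empty if there is none
}

[Serializable]
public class Question
{
    public string text;
    public string audio; // URI of the pre-generated audio, empty if there is none
    public List<Answer> answers = new List<Answer>();
}

[Serializable]
public class Quiz
{
    public string language;
    public List<Question> questions = new List<Question>();
}
```

In Unity, classes like these can be filled with `JsonUtility.FromJson<Quiz>(json)`, which maps JSON keys onto the public fields of `[Serializable]` types.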
For questions that we created ahead of time, the audio files have already been generated and a URI reference to each of them is found in the JSON data. For questions created by users, the audio is generated on the fly. In that case the game dynamically generates a URL such as “/audios/playback?ssmd=You+said+*what*?&language=en” to query the audio for arbitrary phrases.
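Building such a URL safely mostly comes down to escaping the phrase so that characters like `?` or `&` cannot break the query string. The following is a minimal sketch of that step; the `SpeechUrl` class and its `ForPhrase` method are hypothetical names, while the path and the `ssmd` and `language` parameters are taken from the example URL above.

```csharp
using System;

public static class SpeechUrl
{
    // Path taken from the example URL in this post.
    private const string PlaybackPath = "/audios/playback";

    // Builds a playback URL for an arbitrary phrase.
    // Uri.EscapeDataString percent-encodes spaces and reserved characters
    // such as '?' and '&', so the phrase cannot break the query string.
    public static string ForPhrase(string ssmd, string language)
    {
        if (string.IsNullOrEmpty(ssmd))
        {
            throw new ArgumentException("Phrase must not be empty.", "ssmd");
        }

        return PlaybackPath
            + "?ssmd=" + Uri.EscapeDataString(ssmd)
            + "&language=" + Uri.EscapeDataString(language);
    }
}
```

Note that this encodes spaces as `%20` rather than the `+` seen in the example URL. Servers that decode query strings in the usual way treat both as a space.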
A simplified way to load and play the questions like this is a TextToSpeech class that uses our server to query audio as needed.
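Here is a minimal sketch of what such a TextToSpeech class could look like. It is not the code from the game. The server address, the token, the `Authorization` header name and the assumption that audio URIs are relative to the server are all placeholders. It uses Unity's `WWW` class as described in this post; newer Unity versions mark `WWW` as obsolete in favour of `UnityWebRequest`.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class TextToSpeech : MonoBehaviour
{
    // Hypothetical server address and token; replace with your own.
    public string serverUrl = "https://example.com";
    public string authToken = "secret-token";
    public string language = "en";
    public AudioSource audioSource;

    // Downloads the audio for a phrase and plays it once it is ready.
    // audioUri may be null or empty for phrases without pre-generated audio,
    // in which case the playback URL is built from the text itself.
    public IEnumerator Speak(string text, string audioUri)
    {
        string path = string.IsNullOrEmpty(audioUri)
            ? "/audios/playback?ssmd=" + Uri.EscapeDataString(text)
                + "&language=" + Uri.EscapeDataString(language)
            : audioUri;

        // The header name is an assumption; send whatever your server expects.
        var headers = new Dictionary<string, string>
        {
            { "Authorization", "Token " + authToken }
        };

        // Passing null as the post data turns this into a GET request.
        var www = new WWW(serverUrl + path, null, headers);
        yield return www; // wait for the download to finish

        if (!string.IsNullOrEmpty(www.error))
        {
            Debug.LogWarning("Could not load speech audio: " + www.error);
            yield break;
        }

        audioSource.clip = www.GetAudioClipCompressed();
        audioSource.Play();
    }
}
```

From another MonoBehaviour this could be started with `StartCoroutine(tts.Speak("You said *what*?", null));`, where `tts` is a reference to the component.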
A class similar to this is used in Askutron Quiz. The audio is either downloaded using the supplied audio URI or using a dynamically generated URI that contains the text as a query argument. For security reasons, the server only accepts these requests when the right HTTP headers, such as an authentication token, are provided.
Obviously this approach works for any audio downloaded from the server, not just synthetic speech. If you don’t want to generate the audio on a server with Ruby as described in Markus’ blog post, there is an AWS C# SDK you can use to do this with C#, even directly in your game. For an example of how to achieve this, see Chris Bitting’s great blog post about using Amazon Polly from .NET / C#.
The AudioClips are downloaded and created using Unity’s built-in WWW class and the GetAudioClipCompressed extension method. It doesn’t matter if the audio is loaded from the server or directly from the file system (using file:/// URLs). During the introduction of each round, all questions and answers are loaded. Afterwards the audio files are played in order to read the questions and answers aloud.
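Playing the loaded clips one after another can be done with a simple coroutine. This is only a sketch of the idea; the `SpeechPlaylist` class and `PlayInOrder` method are hypothetical names, and the clips are assumed to have been loaded already.

```csharp
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class SpeechPlaylist : MonoBehaviour
{
    public AudioSource audioSource;

    // Plays the given clips one after another on the same AudioSource.
    public IEnumerator PlayInOrder(IList<AudioClip> clips)
    {
        foreach (AudioClip clip in clips)
        {
            if (clip == null)
            {
                continue; // skip clips that failed to load
            }

            audioSource.clip = clip;
            audioSource.Play();

            // Wait until the current clip has finished before starting the next one.
            while (audioSource.isPlaying)
            {
                yield return null;
            }
        }
    }
}
```

Polling `isPlaying` once per frame keeps the sketch short. Waiting for `clip.length` seconds would also work, but it depends on the game's time scale.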
The actual code used in our game is naturally a bit more complicated than that, though this is what it boils down to. Among other things, it adds error handling and, most importantly, a local file cache to avoid downloading the same audio file twice.
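One way such a local file cache could work is to derive a stable file name from the audio URL and check for that file before downloading. The sketch below shows that idea with plain .NET file APIs. The `AudioFileCache` class, its method names and the `.audio` file extension are hypothetical and not taken from the game.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public class AudioFileCache
{
    private readonly string directory;

    // In Unity, Application.persistentDataPath would be a natural base directory.
    public AudioFileCache(string directory)
    {
        this.directory = Path.GetFullPath(directory);
        Directory.CreateDirectory(this.directory); // does nothing if it already exists
    }

    // Maps a URL to a stable file path by hashing it, so the same URL
    // always ends up in the same file no matter which characters it contains.
    public string PathFor(string url)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            string name = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return Path.Combine(directory, name + ".audio");
        }
    }

    // True if the audio for this URL has been stored before.
    public bool Contains(string url)
    {
        return File.Exists(PathFor(url));
    }

    // Writes downloaded audio bytes to the cache.
    public void Store(string url, byte[] data)
    {
        File.WriteAllBytes(PathFor(url), data);
    }

    // Reads previously stored audio bytes from the cache.
    public byte[] Load(string url)
    {
        return File.ReadAllBytes(PathFor(url));
    }
}
```

Before a download the game would check `Contains(url)`. On a hit, the file at `PathFor(url)` can be loaded through a file:/// URL with the same loading code that is used for the server.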
Amazon Polly currently supports around 18 languages and some variants for English, Portuguese, Spanish and French. Thanks to this, our game can read content in any of these languages, and we can fully voice even quizzes created by players using the editor that is included with the game.
Overall I’m pretty happy with the quality of the synthetic speech generated by Amazon Polly. There are some rough edges here and there, but at times people didn’t even realize the speech was synthetic when we showcased the game. Mostly it just works. And if not, we can still fine-tune the pronunciation using SSMD.
If you want to see this used in action, feel free to check out Askutron Quiz on Steam!