Voice Recognition, Translation, and Text-to-Speech on Mobile

4 min read · May 31, 2018

Not multilingual? That’s OK, there’s an app for that. Check out this video to learn how to build it yourself!

YouTube Video — Voice Recognition, Translate, and Text-to-Speech on Mobile

You can now add a professional translator and a friendly voice to any mobile app using the iOS Speech API, Amazon Translate, and Amazon Polly. If you haven’t tried AWS yet, these two managed services have possibly the easiest APIs to implement that I’ve seen to date.

In this article, we’ll build a mobile app that will recognize our voice and convert it to text (speech-to-text), translate the text to a language of our choice, and convert our translated text to synthesized speech (text-to-speech).

By building this solution, we’re applying machine learning (ML) and natural language processing (NLP) to our mobile app by using API-driven managed cloud services. No data collection, no model building, and no training. We’ll simply call APIs to make our app smarter.

Here’s an architectural diagram of this solution:

There are two easy steps to building this solution:

1. Configure the backend by creating an Amazon Cognito Identity Pool and IAM Roles, and adding permissions to those roles for accessing Amazon Translate and Amazon Polly directly from a mobile app.

2. Create a mobile app to showcase natural language processing by cloning my sample app from GitHub and configuring it to use the values created in step 1.

Let’s get started!

Part #1: Configure Backend (1 minute)

I created a CloudFormation stack template to automate the creation of the Cognito Identity Pool and the IAM Roles and policies, so we can start playing with the app right away! The other services (Amazon Translate and Amazon Polly) do not require any backend configuration and will be called directly from our mobile app.

1. Click on the Launch Stack button

Launch CloudFormation

This will launch the AWS CloudFormation Console and create a CloudFormation stack that automates the creation of a Cognito Identity Pool and the associated unauthenticated and authenticated IAM Roles, along with policies for accessing Amazon Translate and Amazon Polly directly from a mobile app.

2. Click Next on the Select Template page

3. Click Next

4. On the Options page, leave all the defaults and click Next

5. On the Review page, check the box to acknowledge that CloudFormation will create IAM resources and click Create.

6. Wait for the speechtranslator-stack (this is the name passed by the template) stack to reach a status of CREATE_COMPLETE

7. With the speechtranslator-stack selected, click on the Outputs tab and you should see three rows.

8. Copy the Value for each of the three resources as we’ll be pasting those values into our service config in the AppDelegate of our mobile app.

For this application, we’ll utilize Amazon Cognito, Amazon Translate, and Amazon Polly.

For Amazon Translate and Amazon Polly, no backend configuration is required! However, we do need to create an Amazon Cognito Identity Pool to allow our mobile users to call Amazon Translate and Amazon Polly directly from the app. With an identity pool, you can obtain temporary AWS credentials with permissions defined in IAM Roles to directly access AWS services.
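As a rough idea of what that service config might look like, here is a minimal sketch using the AWS Mobile SDK for iOS. The helper name `configureAWS`, the placeholder identity pool ID, and the region are assumptions for illustration only. Use the values from your own stack Outputs, and follow the README in the sample app for where each one goes.

```swift
import AWSCore

/// Hypothetical helper (not taken from the sample app).
/// Call it once from application(_:didFinishLaunchingWithOptions:) in the AppDelegate.
func configureAWS() {
    // Placeholder values: replace them with the ones copied from the stack Outputs tab.
    let identityPoolId = "us-east-1:00000000-0000-0000-0000-000000000000"
    let region = AWSRegionType.USEast1

    // Cognito hands out temporary AWS credentials scoped by the IAM Roles
    // attached to the identity pool.
    let credentialsProvider = AWSCognitoCredentialsProvider(
        regionType: region,
        identityPoolId: identityPoolId
    )

    // Make these credentials the default for every AWS service client in the app.
    let configuration = AWSServiceConfiguration(
        region: region,
        credentialsProvider: credentialsProvider
    )
    AWSServiceManager.default().defaultServiceConfiguration = configuration
}
```

With the default configuration in place, service clients obtained through their `default()` accessors, such as `AWSTranslate.default()`, pick up the Cognito credentials without any further setup.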

That’s it for the backend configuration! Let’s move on to the mobile app.

Part #2: Create a Mobile App (3 1/2 minutes)

To get you going quickly, I uploaded a full solution iOS Swift app on GitHub here.

Follow the instructions in the README and you’ll be up and running in just a few minutes.

https://github.com/mobilequickie/AmazonSpeechTranslator

Now that we’ve configured our backend resources and got the app running, let me explain how the voice interaction works. On the surface, this seems like a very simple application, but it only turned out that way because we use some really powerful, yet simple, cloud services and built-in Apple APIs.

For voice recognition, we’re using the Apple Speech API to turn our voice into text. The app then passes the transcribed text to Amazon Translate, which translates it into the language of our choice and returns the translated text to the app. Once the app receives the translated text, it passes it to Amazon Polly, which then provides synthesized speech as streamed .mp3 audio.
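The flow above can be sketched in Swift roughly as follows. This is a simplified illustration under stated assumptions, not the code from the sample app: the class name `SpeechTranslator` and its method names are hypothetical, the language codes and the Polly voice are example values, and it assumes the default AWS service configuration has been set, the user has already granted speech-recognition and microphone permission, and the audio session has been configured for recording. The request initializer for Amazon Translate is treated as optional here, as in the AWS examples I'm aware of; adjust if your SDK version differs.

```swift
import AVFoundation
import Speech
import AWSTranslate
import AWSPolly

/// Hypothetical class: voice -> text -> translation -> synthesized speech.
final class SpeechTranslator {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let audioEngine = AVAudioEngine()
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private var player: AVPlayer?

    /// Step 1: feed microphone audio to the Apple Speech API.
    func startListening() throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        recognitionRequest = request

        let input = audioEngine.inputNode
        input.installTap(onBus: 0, bufferSize: 1024,
                         format: input.outputFormat(forBus: 0)) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        recognitionTask = recognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result, result.isFinal {
                // Example language pair: English to Spanish.
                self?.translate(result.bestTranscription.formattedString, from: "en", to: "es")
            } else if let error = error {
                print("Speech recognition failed: \(error)")
            }
        }
    }

    /// Stop recording so the recognizer can deliver its final transcription.
    func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()
    }

    /// Step 2: send the transcribed text to Amazon Translate.
    func translate(_ text: String, from source: String, to target: String) {
        guard let request = AWSTranslateTranslateTextRequest() else { return }
        request.text = text
        request.sourceLanguageCode = source
        request.targetLanguageCode = target

        AWSTranslate.default().translateText(request) { [weak self] response, error in
            guard let translated = response?.translatedText else {
                print("Translate failed: \(String(describing: error))")
                return
            }
            // Example voice; pick one that matches the target language.
            self?.speak(translated, voice: .penelope)
        }
    }

    /// Step 3: ask Amazon Polly for a pre-signed URL and stream the .mp3 audio.
    func speak(_ text: String, voice: AWSPollyVoiceId) {
        let input = AWSPollySynthesizeSpeechURLBuilderRequest()
        input.text = text
        input.outputFormat = .mp3
        input.voiceId = voice

        AWSPollySynthesizeSpeechURLBuilder.default().getPreSignedURL(input)
            .continueOnSuccessWith { [weak self] (task: AWSTask<NSURL>) -> Any? in
                guard let url = task.result else { return nil }
                DispatchQueue.main.async {
                    self?.player = AVPlayer(url: url as URL)
                    self?.player?.play()
                }
                return nil
            }
    }
}
```

In a view controller, you could call `try translator.startListening()` when the user taps a record button and `translator.stopListening()` when they release it. Streaming from a pre-signed URL with `AVPlayer` is one way to play Polly's output; it keeps the app from having to buffer the whole audio file before playback starts.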

Final Thoughts

Pretty straightforward, right? In this article, we applied machine learning and natural language processing to our mobile app by using API-driven managed cloud services, namely Amazon Translate and Amazon Polly. We left all the data collection and model training behind and quickly deployed our mobile app using managed cloud services to provide translation and speech. Oh, and the best part is that we only pay for what we use and don’t have to manage any servers.

Written by Dennis Hills