Some of my best friends from my time as a missionary (yep, I’m the dorky guy on the far right)

Creating a simple speech to speech translation pipeline

Logan Van Wagoner
Published in Analytics Vidhya
5 min read · Mar 5, 2021


Using Microsoft Azure services, we will create the following pipeline:

  1. ASR (automatic speech recognition)
  2. MT (machine translation)
  3. TTS (Text-to-speech)
  4. (optional) Use neural voices to improve the output speech

My desire to speak other languages came when I was 18 and went to live in Accra, Ghana. I was a missionary for the Church of Jesus Christ of Latter-day Saints, called to serve the people I would meet there. In Ghana, over 50 languages are recognized, and where I lived (I moved four times) there were three or four main languages spoken. I wanted to connect with and serve these super great people, so I did what I could to learn their languages.

Almost six years after getting back, I'm studying machine translation and working toward my first attempt at making MT a reality for languages I spoke there, like Twi. This article isn't those efforts, but a starting point for people interested in machine translation.

This isn't an in-depth or from-scratch article, either. It shows how to use some very powerful tools in a basic way, just to get you started. Most of this can be found by following along with Microsoft's own quickstarts for the Azure Speech service or PyAudio's documentation.

Honestly, I'm very impressed with Microsoft Azure. In a machine translation course I'm taking, we found that for several languages (maybe more, but we haven't tested them all) it was outpacing Google Translate in accuracy (mainly measured with BLEU scores). Plus, the TTS allows you to improve the voice, which is a ton of fun!

Let’s get started with code and a basic understanding of what is happening.

ASR

Automatic speech recognition takes some pretty serious work. If you want a slightly more exposed example, you can try using PyAudio; it's awesome for experimenting with. Let me show you an example of that, then the Microsoft one, and you can decide which you want to proceed with.
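Here's a minimal sketch of that kind of file, using PyAudio's blocking API. The function names, file names, and parameters below are just my choices for illustration.

```python
# record_play.py -- a minimal sketch of a PyAudio record/playback helper
import wave
import pyaudio

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000  # 16 kHz mono works well with most speech services


def record(filename="output.wav", seconds=5):
    """Record `seconds` of audio from the default mic into a .wav file."""
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * seconds))]
    sample_width = p.get_sample_size(FORMAT)
    stream.stop_stream()
    stream.close()
    p.terminate()

    with wave.open(filename, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    return filename


def play(filename="output.wav"):
    """Play a .wav file back through the default output device."""
    p = pyaudio.PyAudio()
    with wave.open(filename, "rb") as wf:
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(), rate=wf.getframerate(),
                        output=True)
        data = wf.readframes(CHUNK)
        while data:
            stream.write(data)
            data = wf.readframes(CHUNK)
    stream.stop_stream()
    stream.close()
    p.terminate()
```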

The file above lets you record and listen to .wav files. It's nice in that you see a little more of what is going on behind the scenes. There are lots of wonderful articles that explain this better than I will take the time to here. One little technicality to note: if you are using VS Code on a Mac and you try to call record, your mic won't turn on. Try this if that is the case.

Now let's move on to Azure's ASR. To set up your speech service, you'll need at least a free Azure account (for testing, this is plenty). Once you have that, go here, search for 'speech', and click 'create'. You don't have to change anything here other than selecting your subscription and pricing.

Once it finishes setting up, head to the resource, click 'keys and endpoints' on the left to get one of your two keys, and copy it in place of 'yourkeyhere' below.
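Here's a minimal sketch of such a class, assuming the azure-cognitiveservices-speech package; the SimpleASR name and the from_mic/from_file methods mirror how it's used later on, and the key and region are placeholders.

```python
# asr.py -- a sketch of a simple Azure speech-to-text wrapper
import azure.cognitiveservices.speech as speechsdk


class SimpleASR:
    def __init__(self, key="yourkeyhere", region="eastus"):
        self.speech_config = speechsdk.SpeechConfig(subscription=key, region=region)

    def from_mic(self):
        """Recognize a single utterance from the default microphone."""
        recognizer = speechsdk.SpeechRecognizer(speech_config=self.speech_config)
        result = recognizer.recognize_once()
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            return result.text
        return ""

    def from_file(self, filename):
        """Recognize a single utterance from a .wav file."""
        audio_config = speechsdk.audio.AudioConfig(filename=filename)
        recognizer = speechsdk.SpeechRecognizer(speech_config=self.speech_config,
                                                audio_config=audio_config)
        result = recognizer.recognize_once()
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            return result.text
        return ""
```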

Boom! Now you can transcribe your speech into text! And with the backing of a multi-billion dollar company, it’s pretty accurate.

You could import this class, create a SimpleASR instance, and call from_mic() or from_file(filename), and away you go. 🥲

MT

Machine translation is a massive field of research all on its own, but that is a discussion for another time. What is so cool is that the result of training and fine-tuning a very powerful NMT is accessible to us, and it can help us start our MT journey. So, for now, rather than building an NMT from scratch, we will focus on using Azure's translation service. This requires a 'translation' resource, which you set up the same way you set up your ASR. A reminder: wait while it finishes setting up so that you can find it later (I made that mistake and spent a long time wondering where it went, only to have to make it again).

Once you have your API key, the following code will get you started.
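Here's a minimal sketch of that function, assuming the Translator REST API (v3.0) and the requests library; the key, region, and language codes are placeholders for your own values.

```python
# translate.py -- a sketch of a basic call to the Azure Translator REST API
import uuid
import requests

TRANSLATOR_KEY = "yourkeyhere"
TRANSLATOR_REGION = "eastus"  # the region of your translation resource
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def translate(phrases, from_lang="en", to_lang="es"):
    """Take a list of phrases and return a list of translated strings."""
    params = {"api-version": "3.0", "from": from_lang, "to": to_lang}
    headers = {
        "Ocp-Apim-Subscription-Key": TRANSLATOR_KEY,
        "Ocp-Apim-Subscription-Region": TRANSLATOR_REGION,
        "Content-type": "application/json",
        "X-ClientTraceId": str(uuid.uuid4()),
    }
    body = [{"text": phrase} for phrase in phrases]
    response = requests.post(ENDPOINT, params=params, headers=headers, json=body)
    response.raise_for_status()
    # each item holds a list of translations, one per target language
    return [item["translations"][0]["text"] for item in response.json()]
```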

It's a very basic function that takes a list of phrases and returns the translated text. This will be called from a master file in a little bit, but first we will finish constructing our components.

TTS

Text-to-speech is a very fun part of this, especially if you continue to option four. You may be able to use your original Speech API key, but you'll need a specific region set to use neural voices. You can find your best region at that link and set it when creating this speech resource.

Now for the code:
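Here's a minimal sketch of what that class might look like, again assuming the azure-cognitiveservices-speech package; the method names (speak, to_file, speak_custom, to_file_with_ssml) mirror the description that follows, and the key and region are placeholders.

```python
# tts.py -- a sketch of a simple Azure text-to-speech wrapper
import azure.cognitiveservices.speech as speechsdk


class SimpleTTS:
    def __init__(self, key="yourkeyhere", region="eastus"):
        self.speech_config = speechsdk.SpeechConfig(subscription=key, region=region)

    def speak(self, text):
        """Speak plain text through the default speaker."""
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=self.speech_config)
        synthesizer.speak_text_async(text).get()

    def to_file(self, text, filename="tts_output.wav"):
        """Write plain-text synthesis to a .wav file."""
        audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=self.speech_config,
                                                  audio_config=audio_config)
        synthesizer.speak_text_async(text).get()

    def speak_custom(self, ssml):
        """Speak SSML (e.g. a neural voice) through the default speaker."""
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=self.speech_config)
        synthesizer.speak_ssml_async(ssml).get()

    def to_file_with_ssml(self, ssml, filename="tts_output.wav"):
        """Write SSML synthesis (e.g. a neural voice) to a .wav file."""
        audio_config = speechsdk.audio.AudioOutputConfig(filename=filename)
        synthesizer = speechsdk.SpeechSynthesizer(speech_config=self.speech_config,
                                                  audio_config=audio_config)
        synthesizer.speak_ssml_async(ssml).get()
```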

This class gives you the ability to write to a file, speak, or both, with some customization included. The to_file_with_ssml and speak_custom functions both use neural voices. I'd definitely try both to see the difference!

Combined pipeline

Now, to put it all together, a simple file helps us easily call each service. Some parts have been commented out that you can play with to see both how PyAudio works and the difference other options make.
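Here's a minimal sketch of such a master file, assuming the module names from the sketches above (asr.py, translate.py, tts.py) and an English-to-Spanish run.

```python
# main.py -- a sketch of the master file tying the pieces together
from asr import SimpleASR
from translate import translate
from tts import SimpleTTS

asr = SimpleASR()
tts = SimpleTTS()

# 1. ASR: grab an English utterance from the mic
# (you could also record with the PyAudio helper and call asr.from_file instead)
english_text = asr.from_mic()
print("Heard:", english_text)

# 2. MT: translate it to Spanish
spanish_text = translate([english_text], from_lang="en", to_lang="es")[0]
print("Translated:", spanish_text)

# 3. TTS: speak the translation
tts.speak(spanish_text)

# 4. (optional) use the Spanish Dalia neural voice via SSML instead
# tts.speak_custom(build_dalia_ssml(spanish_text))  # see the Neural Voices section
```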

You'll note that you can use PyAudio data with your Microsoft ASR or just use the asr.py class to record; both will work! You may have also noticed calls to some Spanish Dalia SSML, which leads us to neural voices.

Neural Voices

Microsoft is working on neural voices. You can even create a custom voice and add it to this (you might need a lot of resources for now, though). What they already have is a lot of fun to play with. Here is a simple example of the SSML you can use to enhance your Spanish translations.
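Here's a minimal sketch that builds the SSML as a Python string, so it can be handed straight to speak_custom or to_file_with_ssml; the es-MX-DaliaNeural voice name comes from Azure's list of neural voices, and the helper name is my own.

```python
# build a Dalia SSML document for a given piece of Spanish text
def build_dalia_ssml(text):
    return f"""
    <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='es-MX'>
        <voice name='es-MX-DaliaNeural'>
            {text}
        </voice>
    </speak>
    """
```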

I’d recommend checking out both to see what the difference is, or even trying another Spanish voice like ‘Jorge’.

Also, if you want to take it further, keep reading in their documentation to see how you can add pauses and inflections and give the voice emotion!

There you go! A basic STS pipeline for you to play with. One of the reasons I didn't just use Microsoft's existing speech-to-speech pipeline is that the modular build gives those who want to make changes or use custom components the freedom to do that. Plus, if this is your first time, setting up each part of the pipeline helps build a more complete understanding of what is going on.

Enjoy translation! A next step to try would be rating the translation with something like a BLEU score (NLTK and sacreBLEU can help you there), then seeing which pieces of this pipeline you can implement on your own, removing the Microsoft components.
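If you want a head start on that, here's a quick sketch of scoring with sacreBLEU; the sentences below are made up purely for illustration.

```python
# score a pipeline output against a human reference with sacreBLEU
import sacrebleu

hypotheses = ["Hola, ¿cómo estás hoy?"]             # what your pipeline produced
references = [["Hola, ¿cómo te encuentras hoy?"]]   # one reference string per hypothesis

score = sacrebleu.corpus_bleu(hypotheses, references)
print(score.score)  # BLEU on a 0-100 scale
```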

As for my work mentioned above, I'm currently working on low-resource language NMT, specifically a project for English-Twi translation. If you'd like to join in or see what is going on, let me know, and have fun!
