Real-time speech translation in 5 minutes. Using Python, AssemblyAI, DeepL and OpenAI
Here is a quick tutorial on building a real-time language translator from English to practically any popular language. It integrates several services, yet the whole thing fits in a single file. You can skip directly to the code here.
📘 How it works
My goal was to build a real-time solution that translates spoken English into Polish, ideally producing speech as the result. After some investigation I found several ML/AI services that do the job nearly perfectly and pretty fast:
- AssemblyAI for real-time speech-to-text transcription;
- DeepL for a quality text translation;
- OpenAI for expert text-to-speech.
Solution in action:
What’s going on inside:
- The Python code grabs speech via the mic and sends it to AssemblyAI.
- Once a chunk of text is formed, it’s sent to DeepL for translation.
- Once the translated text is ready, OpenAI is asked to turn it into an mp3 file.
- Finally, the mp3 file is played through the speakers.
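The steps above can be sketched as a dry run with each service stubbed out. This is purely illustrative: the function names below are made up and belong to no SDK, and no API keys are needed.

```python
# Hypothetical dry run of the pipeline: each stage is a stub so the
# data flow is visible without calling any real service.
def transcribe(audio: bytes) -> str:
    # stands in for AssemblyAI: speech -> English text
    return "hello world"

def translate(text: str) -> str:
    # stands in for DeepL: English text -> Polish text
    return {"hello world": "witaj świecie"}[text]

def synthesize(text: str) -> bytes:
    # stands in for OpenAI TTS: text -> mp3 bytes
    return b"mp3:" + text.encode("utf-8")

def pipeline(audio: bytes) -> bytes:
    # the whole chain: mic audio in, translated speech out
    return synthesize(translate(transcribe(audio)))

print(pipeline(b"raw-mic-audio"))
```

The real code below follows exactly this chain, except that the middle step is driven by AssemblyAI callbacks rather than a simple function call.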
👨‍💻 The code
Registering accounts with all three services and generating API keys will actually take longer than writing the code. Below, I assume you already have your keys at hand.
Setup routine
Download and install Python if you don’t have it: https://www.python.org/downloads/. It may already ship with your OS. Check the current version in the console:
python --version
It should be at least 3.10. The code below was written in 3.11.
Set up a dedicated virtual environment for the project, so that all dependencies are installed locally.
python -m venv env
source env/bin/activate
Now let’s install all the project dependencies. You can find requirements.txt here.
pip install -r requirements.txt
Put all API key values in a .env file in the following format. It stores secret credentials separately from the code:
OPENAI_API_KEY=your_value
DEEPL_API_KEY=your_value
ASSEMBLY_API_KEY=your_value
Main application
It lives in main.py, the conventional entry point for Python apps. First, import all the necessary libraries:
# OpenAI SDK
import openai
# OS lib for reading secret keys from the environment
import os
# lib for reading secret keys from .env and saving in environment
from dotenv import load_dotenv
# AssemblyAI SDK
import assemblyai as aai
# DeepL SDK
import deepl
# libraries needed to play mp3 locally
from pydub import AudioSegment
from pydub.playback import play
# inbuilt package to measure time
import time
Second, load the secret values (stored as environment variables) and initialise the API keys and the OpenAI client:
load_dotenv()
translator = deepl.Translator(os.environ["DEEPL_API_KEY"])
aai.settings.api_key = os.environ["ASSEMBLY_API_KEY"]
openai.api_key = os.environ["OPENAI_API_KEY"]
client = openai.OpenAI()
Now write the standalone functions. The first one generates an mp3 file via OpenAI. Note that we use the “nova” voice and the standard-quality “tts-1” model. When the audio is done, we print how many seconds the generation took:
def gen_speech_file(speech_file_path, text):
    st = time.time()
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=text
    )
    response.stream_to_file(speech_file_path)
    print('to speech for:', (time.time() - st), 'sec')
    return speech_file_path
A function for playing any mp3 file through the speakers:
def play_audio(speech_file_path):
    audio_clip = AudioSegment.from_mp3(speech_file_path)
    play(audio_clip)
Now it’s time for the main logic, which lives in a single function. AssemblyAI calls it whenever it has transcribed the full phrase just pronounced, or a part of it. We don’t want to react to every new word; we act only once AssemblyAI has finalised the phrase:
def on_data(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    # if the sentence is final, let's act
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # ask DeepL to translate into Polish
        result = translator.translate_text(transcript.text, target_lang="PL")
        print(transcript.text, end="\r\n")
        print("PL: " + result.text)
        # call the function to generate an audio file
        gen_speech_file("speech.mp3", result.text)
        # play it through the speakers
        play_audio("speech.mp3")
    else:
        print(transcript.text, end="\r")
Since the AssemblyAI SDK is event-driven, we need to define several utility callbacks:
def on_open(session_opened: aai.RealtimeSessionOpened):
    print("Session ID:", session_opened.session_id)

def on_close():
    print("Closing Session")

def on_error(error: aai.RealtimeError):
    print("An error occurred:", error)
on_open/on_close wrap the translation session, and on_error is called when something unexpected happens.
And now it’s time to define the main AssemblyAI transcriber object and run it:
# init the main object
transcriber = aai.RealtimeTranscriber(
    on_data=on_data,
    on_error=on_error,
    sample_rate=44_100,
    on_open=on_open,
    on_close=on_close
)
# connect to AssemblyAI servers
transcriber.connect()
# open a stream from the mic
microphone_stream = aai.extras.MicrophoneStream()
# stream the audio into the main object
transcriber.stream(microphone_stream)
transcriber.close()
Again, the full code can be found in this repository.
🏁 Now it’s time to run and enjoy your personal real-time translator!
python3 main.py
🤗 Results
Speech-to-text transcription and automatic translation work very quickly, and modern neural models are highly accurate. It really operates like a human interpreter: AssemblyAI returns text portions once it’s sure the sentence is complete and well understood. And thanks to the great speed of both services (AssemblyAI + DeepL), the translation appears in the blink of an eye.
But text-to-speech is a bit laggy. It works pretty well when the speaker pauses between words and sentences, but once the speaking pace rises above a certain threshold, the audio starts lagging behind. I can understand that: it’s a tough task. And at the time of writing, only OpenAI supports such a wide variety of output languages.
💰 The cost:
- $0.47 for 1 hour of real-time recognition by AssemblyAI;
- 500k characters free per month for DeepL, then $25 per additional 1M characters;
- $15 for 1M characters when generating a voice by OpenAI.
Example: a non-stop 1-hour speech costs about $1.64 while you are under DeepL’s first free 500k characters, and about $3.20 plus the $5 monthly fee once you exceed 500k.
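For a back-of-the-envelope estimate, the rates above can be wired into a small calculator. This is only a sketch: the helper name is made up, and the character count per hour of speech is an assumed figure for illustration, not a measured one.

```python
# Rough session-cost estimator built from the article's published rates.
ASSEMBLYAI_PER_HOUR = 0.47       # $ per hour of real-time recognition
DEEPL_PER_MILLION = 25.0         # $ per 1M characters beyond the 500k free tier
OPENAI_TTS_PER_MILLION = 15.0    # $ per 1M characters of generated speech

def estimate_cost(hours: float, chars: int, past_deepl_free_tier: bool = False) -> float:
    """Estimated dollars for `hours` of speech producing `chars` characters of text."""
    cost = hours * ASSEMBLYAI_PER_HOUR
    cost += chars / 1_000_000 * OPENAI_TTS_PER_MILLION
    if past_deepl_free_tier:
        # DeepL starts billing only after the 500k free monthly characters
        cost += chars / 1_000_000 * DEEPL_PER_MILLION
    return round(cost, 2)

# e.g. one hour of speech producing an assumed ~50,000 characters of text
print(estimate_cost(1, 50_000))        # inside the DeepL free tier -> 1.22
print(estimate_cost(1, 50_000, True))  # past the free tier (excl. monthly fee) -> 2.47
```

The per-hour cost is dominated by the text-to-speech step, which matches the pricing list: OpenAI’s $15 per 1M characters outweighs the flat $0.47 transcription fee for any chatty hour.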
This level of automation wasn’t possible a year or two ago. But thanks to fast AI services and the ease of use of Python, you can now build your own customised online translator literally in minutes 🥳
Photo by Nick Fewings on Unsplash