Real-time speech translation in 5 minutes. Using Python, AssemblyAI, DeepL and OpenAI

May 2, 2024

Here is a quick tutorial on building a real-time speech translator from English to practically any popular language. It relies on several integrations, yet all the code fits in a single file. You can skip directly to the code here.

📘 How it works

My goal was to build a real-time solution that translates from English to Polish, ideally with speech as the output. After some investigation I found several ML/AI services that do the job nearly perfectly and pretty fast:

AssemblyAI for real-time speech-to-text conversion;
DeepL for quality text translation;
OpenAI for expert text-to-speech.

Solution in action:

What’s going on inside:

The Python code captures speech from the microphone and sends it to AssemblyAI.
Once a chunk of text is formed, it is sent to DeepL for translation.
Once the translated text is ready, the OpenAI service is asked to generate an mp3 file.
Finally, the mp3 file is played through the speakers.
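
The steps above can be sketched as one function that wires the stages together. This is only an illustration of the flow, not the code from this article: translate, synthesize and play are hypothetical callables standing in for the DeepL, OpenAI and playback steps, so the flow can be tried without any API keys.

```python
from typing import Callable


def handle_phrase(
    text: str,
    translate: Callable[[str], str],
    synthesize: Callable[[str], str],
    play: Callable[[str], None],
) -> str:
    """Run one finished phrase through translate -> synthesize -> play."""
    translated = translate(text)         # stand-in for the DeepL call
    audio_path = synthesize(translated)  # stand-in for the OpenAI text-to-speech call
    play(audio_path)                     # stand-in for local mp3 playback
    return translated


# Dry run with fake stages (no network and no audio device needed)
played = []
result = handle_phrase(
    "hello",
    translate=lambda t: f"[PL] {t}",
    synthesize=lambda t: "speech.mp3",
    play=played.append,
)
print(result, played)
```

Passing the stages in as arguments keeps each service replaceable, which is handy if you later want to try another translation or voice provider.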

👨‍💻 The code

Registering accounts with all 3 services and generating API keys will actually take more time than writing the code. From here on I assume that you already have your keys at hand.

Setup routine

Download and install Python if you do not have it: https://www.python.org/downloads/. It probably already ships with your OS. Check the current version in the console:

python --version

It should be at least 3.10. The code below was written in 3.11.
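
If you prefer the script to check this by itself, here is a minimal sketch using only the standard library. The helper name python_is_recent_enough is mine and is not part of the original main.py.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated above


def python_is_recent_enough(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not python_is_recent_enough():
    print("Python", ".".join(map(str, MIN_VERSION)), "or newer is required")
```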

Set up a dedicated virtual environment for the project so that all dependencies are installed locally:

python -m venv env
source env/bin/activate

Now let’s install all dependencies for the project. You can find requirements.txt here.

pip install -r requirements.txt

Write all API key values to a .env file in the following format. It stores secret credentials separately from the code:

OPENAI_API_KEY=your_value
DEEPL_API_KEY=your_value
ASSEMBLY_API_KEY=your_value
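
A missing or empty key otherwise only shows up later as a KeyError or an authentication error, so it can help to check all three names up front. Below is a small sketch; the helper missing_keys is a hypothetical name, and in the real script you would call it right after the .env file has been loaded.

```python
import os

# The three variable names used in the .env file above
REQUIRED_KEYS = ("OPENAI_API_KEY", "DEEPL_API_KEY", "ASSEMBLY_API_KEY")


def missing_keys(env=os.environ) -> list[str]:
    """Return the names of required keys that are absent or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with a plain dict instead of the real environment
example = {"OPENAI_API_KEY": "x", "DEEPL_API_KEY": ""}
print(missing_keys(example))
```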

Main application

It’s located in main.py, a conventional entry point for Python apps. First, import all necessary libs:

# OpenAI SDK
import openai
# OS lib for reading secret keys from environment
import os
# lib for reading secret keys from .env and saving in environment
from dotenv import load_dotenv
# AssemblyAI SDK
import assemblyai as aai
# DeepL SDK
import deepl
# libraries needed to play mp3 locally
from pydub import AudioSegment
from pydub.playback import play
# inbuilt package to measure time
import time

Second, load the secret values (stored as environment variables), set the API keys and create the OpenAI client:

load_dotenv()

translator = deepl.Translator(os.environ["DEEPL_API_KEY"])
aai.settings.api_key = os.environ["ASSEMBLY_API_KEY"]
# pass the key to the client explicitly
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

Write standalone functions. The first one generates an mp3 file by calling OpenAI. Note that we use the “nova” voice and the standard-quality “tts-1” model. When the audio is ready, the function prints how many seconds the generation took:

def gen_speech_file(speech_file_path, text):
    st = time.time()
    response = client.audio.speech.create(
        model="tts-1",
        voice="nova",
        input=text
    )
    response.stream_to_file(speech_file_path)
    print('to speech for:', (time.time() - st), 'sec')
    return speech_file_path
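
If you want to time other steps too (for example the DeepL call), the same measurement can be pulled out into a small reusable helper. This is a sketch, not part of the original main.py: the names timed and report are hypothetical, and time.perf_counter is used instead of time.time because it is a monotonic clock meant for measuring durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, report=print):
    """Measure how long the wrapped block takes and report it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # runs even when the wrapped block raises
        report(f"{label}: {time.perf_counter() - start:.2f} sec")


# Usage with a stand-in for the slow API call
with timed("to speech for"):
    time.sleep(0.1)
```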

Function for playing any mp3 file using speakers:

def play_audio(speech_file_path):
    audio_clip = AudioSegment.from_mp3(speech_file_path)
    play(audio_clip)

Now it’s time for the main logic, which lives in a single function. AssemblyAI calls it whenever it has text for the phrase just spoken, either partial or complete. We don’t want to react to every new word, so we act only once AssemblyAI has finalised the phrase:

def on_data(transcript: aai.RealtimeTranscript):

    if not transcript.text:
        return

    # if the sentence is final, let's act
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # ask DeepL to bring the translation to Polish language
        result = translator.translate_text(transcript.text, target_lang="PL")
        print(transcript.text, end="\r\n")
        print("PL: " + result.text)

        # call function to generate an audio file
        gen_speech_file("speech.mp3", result.text)
        # play it in speakers
        play_audio("speech.mp3")
    else:
        print(transcript.text, end="\r")

Since the AssemblyAI SDK is event-driven, we need to define several utility callbacks:

def on_open(session_opened: aai.RealtimeSessionOpened):
    print("Session ID:", session_opened.session_id)

def on_close():
    print("Closing Session")

def on_error(error: aai.RealtimeError):
    print("An error occurred:", error)

on_open and on_close are called at the start and at the end of the transcription session. on_error is called when something unexpected happens.

And now it’s time to define the main AssemblyAI transcriber object and run it:

# init the main object
transcriber = aai.RealtimeTranscriber(
    on_data=on_data,
    on_error=on_error,
    sample_rate=44_100,
    on_open=on_open,
    on_close=on_close
)

# connect to AssemblyAI servers
transcriber.connect()

# open a stream for mic
microphone_stream = aai.extras.MicrophoneStream()

# stream the audio into the main object
transcriber.stream(microphone_stream)
transcriber.close()
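
One caveat with the last two lines: if streaming is interrupted (for example with Ctrl+C), transcriber.close() may never run. A possible way around it is to wrap the same three calls in try/finally. The sketch below assumes only the connect, stream and close methods used above; run_session and FakeTranscriber are hypothetical names, and the fake object exists only so the snippet runs without a microphone or an API key.

```python
def run_session(transcriber, audio_stream) -> None:
    """Connect, stream audio, and always close the session afterwards."""
    transcriber.connect()
    try:
        transcriber.stream(audio_stream)
    except KeyboardInterrupt:
        print("Stopped by user")
    finally:
        transcriber.close()  # runs on normal exit, on Ctrl+C and on errors


class FakeTranscriber:
    """Stand-in that records calls and simulates Ctrl+C during streaming."""

    def __init__(self):
        self.calls = []

    def connect(self):
        self.calls.append("connect")

    def stream(self, audio_stream):
        self.calls.append("stream")
        raise KeyboardInterrupt

    def close(self):
        self.calls.append("close")


fake = FakeTranscriber()
run_session(fake, audio_stream=[])
print(fake.calls)
```

With the real objects the call would be run_session(transcriber, microphone_stream).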

Again, the full code can be found in this repository.

🏁 Now it’s time to run and enjoy your personal real-time translator!

python3 main.py

🤗 Results

Speech-to-text transcription and automatic translation work very quickly, and modern neural models are highly accurate. The pipeline really behaves like a human interpreter: AssemblyAI returns a portion of text once it is confident the sentence is complete and well understood. Thanks to the speed of both services (AssemblyAI + DeepL), the translation appears in the blink of an eye.

But text-to-speech is a bit laggy. It works pretty well when the speaker pauses between words and sentences. But when the speaking rate exceeds some threshold, the sound starts lagging. I can understand that, since it’s a tough task. And at the time of writing, only OpenAI supports such a wide variety of output languages.
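
One thing that may contribute to the lag: on_data generates and plays the audio before it returns, so the callback is busy while the next phrase arrives. An option I have not measured is to hand phrases to a background worker through a queue. The sketch below uses only the standard library; start_speech_worker and speak are hypothetical names, and in the real script speak would generate the mp3 file and play it.

```python
import queue
import threading


def start_speech_worker(speak):
    """Start a thread that calls speak(text) for each queued phrase, in order."""
    phrases = queue.Queue()

    def worker():
        while True:
            text = phrases.get()
            if text is None:  # sentinel value: stop the worker
                break
            speak(text)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return phrases, thread


# Dry run: "speaking" just records the phrase
spoken = []
phrases, thread = start_speech_worker(spoken.append)
for text in ["Dzień dobry", "Jak się masz?"]:
    phrases.put(text)  # returns immediately, the caller is not blocked
phrases.put(None)
thread.join()
print(spoken)
```

The queue keeps phrases in order, so the playback can fall behind the speaker but never overlaps or skips a phrase.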

💰 The cost:

$0.47 per hour of real-time recognition by AssemblyAI;
500k characters per month free for DeepL, then $25 per additional 1M characters;
$15 per 1M characters for voice generation by OpenAI.

Example: one hour of non-stop speech costs about $1.64 while DeepL usage stays within the free 500k characters, and about $3.2 plus a $5 monthly fee once the text exceeds 500k characters.
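
The free-tier figure can be reproduced with a few lines of arithmetic. The prices come from the list above; the 78,000 characters per hour is my assumption, picked because it reproduces the $1.64 figure, so plug in your own speaking rate.

```python
ASSEMBLYAI_PER_HOUR = 0.47     # USD per hour, from the list above
OPENAI_TTS_PER_MILLION = 15.0  # USD per 1M characters, from the list above
CHARS_PER_HOUR = 78_000        # assumption, not a measured value


def cost_within_free_tier(hours=1.0, chars_per_hour=CHARS_PER_HOUR):
    """Cost in USD while DeepL usage stays inside its free monthly quota."""
    chars = hours * chars_per_hour
    recognition = hours * ASSEMBLYAI_PER_HOUR
    voice = chars / 1_000_000 * OPENAI_TTS_PER_MILLION
    return recognition + voice


print(round(cost_within_free_tier(), 2))
```

At this rate the 500k free DeepL characters last for roughly six hours of non-stop speech per month.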

This level of automation wasn’t possible a year or two ago. But thanks to fast AI services and Python’s ease of use, you can now build your own customised online translators literally in minutes 🥳

Photo by Nick Fewings on Unsplash