Google Multimodal Live API Audio Transcription
Add real-time input and output audio transcription to your AI applications.
Welcome back to this Multimodal Live API article series. 📚 To explore previous articles in this series, head over to the article series overview.
This article dives into how to use the Live API to get real-time text transcriptions of your spoken input and Gemini’s audio responses. These features make your AI interactions more transparent, more accessible, and significantly more powerful. Want to know why?
Why Transcribe?
You might be thinking, “I can hear it, why do I need to read it too?” Great question. But think about it: adding transcription to your live, multimodal applications opens up a bunch of benefits:
- Many users are used to a familiar chat experience when interacting with AI. Providing both spoken audio and a written transcript offers a comfortable way for users to listen, read, or both.
- Transcription significantly improves accessibility, offering a text alternative for users who are deaf, hard of hearing, or in noisy environments.
- It allows for easy record keeping, as conversations can be logged for later review, analysis, or compliance, which is vital for customer support or important meetings (hello, EU AI Act).
- Text transcripts boost searchability, enabling users to locate specific parts of a past conversation quickly.
- For developers, transcripts are invaluable for debugging and analysis, providing clear insight into what the user said and how the model responded.
- Transcribed text can be a valuable asset for feeding other systems, serving as input for further AI models or workflows like summarization, translation, or sentiment analysis of your customer conversations.
With the Live API, getting these transcripts lets us extend our application with a range of valuable and useful features.
You can see the transcriptions of a live conversation in this demo.
Getting Gemini’s Speech Transcription
Let’s start with the most common scenario. You want a text version of what Gemini says in its audio response.
Remember our LiveConnectConfig from the previous articles? When setting up your connection, you’ll modify that configuration to enable output transcription with the Live API.

That little output_audio_transcription={} is the key. By including it, you’re signaling to the API that you want transcripts for the audio it generates.
config = LiveConnectConfig(
    ...,
    # Enable output transcription
    output_audio_transcription={},
)
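To show where this sits in practice, here is a minimal, self-contained sketch of a connection setup with output transcription enabled. The model name, response modality, and environment-based API key are assumptions for this sketch; use whatever you configured in the previous articles.

import asyncio
from google import genai
from google.genai.types import LiveConnectConfig, Modality

# Assumes your API key is available in the environment (e.g. GOOGLE_API_KEY).
client = genai.Client()

# The model name is an assumption for this sketch; use the Live API model
# from the previous articles.
MODEL = "gemini-2.0-flash-live-001"

config = LiveConnectConfig(
    response_modalities=[Modality.AUDIO],
    # Ask the API to also send text transcripts of its audio output
    output_audio_transcription={},
)

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Send audio and receive responses here, as in the earlier articles
        ...

asyncio.run(main())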
Now, how do you get this transcript? It arrives alongside the audio data in the messages you receive from the server. You’ll check for it in your session.receive() loop:
# You are inside an asynchronous loop,
# specifically `async for response in session.receive():`.
if response.server_content:
    output_transcription = response.server_content.output_transcription
    if output_transcription and output_transcription.text:
        # Collect the chunk and print it as it arrives
        output_transcriptions.append(output_transcription.text)
        print(f"Gemini's Transcript: {output_transcription.text}")
As Gemini speaks, you’ll get the audio chunks to play and the corresponding text transcription. This text arrives in chunks, mirroring the spoken words, giving you a real-time textual feed of the conversation.
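If you would rather print full sentences instead of fragments, one option is to buffer the chunks until the model finishes its turn. Here is a rough sketch, assuming an open session and relying on the turn_complete flag on server_content (how chunk boundaries handle spacing may vary, so treat the join as an approximation):

output_chunks = []

async for response in session.receive():
    server_content = response.server_content
    if server_content is None:
        continue

    # Collect transcript fragments as they stream in
    transcription = server_content.output_transcription
    if transcription and transcription.text:
        output_chunks.append(transcription.text)

    # When Gemini finishes its turn, print the assembled transcript
    if server_content.turn_complete:
        print("Gemini said:", "".join(output_chunks))
        output_chunks.clear()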
Getting Your Speech Transcription
Okay, so we can get Gemini’s words in text. But what about transcribing your voice input? This is also possible and useful, especially for confirming what the AI “heard” or keeping a complete conversational log.
Enabling input transcription is similar to output transcription. You’ll specify it in your connection configuration.
config = LiveConnectConfig(
    ...,
    # Enable output transcription
    output_audio_transcription={},
    # Enable input transcription
    input_audio_transcription={},
)
Once enabled, the input transcriptions are also sent back in the messages from the server. You access them directly from the response object, just as we did with output_transcription.
# You are inside an asynchronous loop,
# specifically `async for response in session.receive():`
if response.server_content:
    input_transcription = response.server_content.input_transcription
    if input_transcription and input_transcription.text:
        # Collect the chunk and print it as it arrives
        input_transcriptions.append(input_transcription.text)
        print(f"Your Transcript: {input_transcription.text}")
With input and output transcription enabled, your AI assistant becomes even more robust. Now you can have a bidirectional voice conversation and get a full, real-time textual log of the entire interaction.
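As a rough sketch of what that log could look like (this is an illustration, not the repository code), you could merge both transcript streams into a single chat-style history:

conversation_log = []  # e.g. [{"role": "user", "text": "..."}, ...]

async for response in session.receive():
    content = response.server_content
    if content is None:
        continue

    # What the API heard from the user
    if content.input_transcription and content.input_transcription.text:
        conversation_log.append(
            {"role": "user", "text": content.input_transcription.text}
        )

    # What Gemini said in its audio response
    if content.output_transcription and content.output_transcription.text:
        conversation_log.append(
            {"role": "model", "text": content.output_transcription.text}
        )

From there, the log can be persisted, searched, or handed to a summarization or sentiment-analysis step, as mentioned earlier.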
Side note: During testing, I discovered that the response sometimes contains a <noise> flag. I don't know whether this is added on the Live API side, whether it's a hallucination from the model, or whether the model naturally recognizes background noise. If you have more information, please ping me. I would like to know whether we could utilize this for further analysis.
Input transcription: <noise> Hello.
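If you want to experiment with this, here is a tiny, hypothetical helper that strips such markers before handing the transcript to downstream analysis. It assumes the markers always look exactly like <noise>, which may not hold:

import re

# Hypothetical helper: remove <noise>-style markers from an input transcript.
def clean_transcript(text: str) -> str:
    return re.sub(r"<noise>\s*", "", text).strip()

print(clean_transcript("<noise> Hello."))  # -> "Hello."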
Check out the code (audio-to-audio.py) to see this in action ⬇️⬇️⬇️. The transcription output will appear in the server terminal logs. In a later article, we will surface these transcripts in our frontend implementation, so stick around for the next article in this series.
Get The Full Code 💻
The concepts and snippets discussed here will be integrated into our ongoing project. You can find the complete Python scripts and follow along with the developments in our GitHub repository:
You Made It To The End. (Or Did You Just Scroll Down?)
Either way, I hope you enjoyed the article.
Got thoughts? Feedback? Discovered a bug while running the code? I’d love to hear about it.
- Connect with me on LinkedIn. Let’s network! Send a connection request, tell me what you’re working on, or just say hi.
- AND Subscribe to my YouTube Channel ❤️