Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Real-Time Audio-to-Audio Streaming with Google's Multimodal Live API

9 min read · May 8, 2025


Welcome back. In the previous article, we talked about the capabilities of Google’s Multimodal Live API, mainly how to get it to generate text and spoken audio from text inputs.

This is the second part of a series. If you haven’t read the first part, I recommend doing that before you dig into this one.

We’re taking a big leap forward by building full-fledged, two-way audio communication with the Live API. Imagine speaking to an AI by capturing your voice from a microphone and having it talk back, like a natural conversation. That’s precisely what we’re aiming for in this article. And trust me, it will be a hell of a lot of fun.

Grab your headsets, because we are about to make your AI applications truly listen, speak back, and even handle interruptions. So yes, you can finally tell your AI to shut up.

The Core Idea: A Bidirectional Audio Stream

To achieve a natural voice conversation, our application needs to perform several tasks simultaneously:

  1. Capture Audio
    Continuously listen to your microphone.
  2. Stream to Gemini
    Send your spoken audio to the Live API in real-time.
  3. Receive from Gemini
    Get the AI’s spoken response as an audio stream.
  4. Playback Audio
    Play Gemini’s voice through your speakers.
  5. Handle Interruptions
    If you start speaking while Gemini is talking, it should politely stop and listen.

This might sound complex, but by breaking it down, we’ll see how the Google Gen AI SDK and asyncio make it manageable.

Want to see a little sneak peek demo?

More is coming.

A little recap in case you haven’t read the first article yet

We use the LiveConnectConfig object, much like in the previous article, but this time we’re explicitly setting up for audio input and output.

from google import genai
from google.genai.types import (
    LiveConnectConfig,
    SpeechConfig,
    VoiceConfig,
    PrebuiltVoiceConfig,
)

# Your Project ID and Location (if using Vertex AI)
PROJECT_ID = "your-google-cloud-project-id"  # Or your Gemini API Key
LOCATION = "us-central1"  # Or your region
MODEL_ID = "gemini-2.0-flash-live-preview-04-09"  # Model optimized for live interaction

# Initialize the client (check out the previous article for Vertex AI vs. API Key options)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

CONFIG = LiveConnectConfig(
    response_modalities=["AUDIO"],  # We want spoken responses
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            # Choose a voice you like!
            prebuilt_voice_config=PrebuiltVoiceConfig(voice_name="Puck")
        )
    ),
    # You can give your AI a personality!
    system_instruction="you are a super friendly, sometimes a bit too friendly assistant",
)

Managing Audio I/O with our AudioManager

To keep our audio input and output logic clean and organized, we create an AudioManager class. This class is responsible for:

  • Initializing PyAudio for microphone input and speaker output.
  • Buffering and playing audio chunks received from Gemini.
  • Handling interruptions.

Let’s look at its structure piece by piece.

First, we define some constants for our audio streams and import the necessary libraries:

import pyaudio
from collections import deque
import asyncio

FORMAT = pyaudio.paInt16 # Audio format: 16-bit PCM
SEND_SAMPLE_RATE = 16000 # Sample rate for audio sent to Gemini (Hz)
RECEIVE_SAMPLE_RATE = 24000 # Sample rate for audio received from Gemini (Hz)
CHUNK_SIZE = 512 # Size of audio chunks to process
CHANNELS = 1 # Mono audio

Next, we start defining the AudioManager class.

class AudioManager:
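The article walks through the class method by method, and the constructor itself isn’t shown. As a rough sketch (not the exact code from the repository), a minimal __init__ consistent with how the class is used below, holding the PyAudio instance, the playback deque, and the sample rates passed in from audio_loop later on, could look like this:

def __init__(self, input_sample_rate, output_sample_rate):
    # Sample rates differ: 16 kHz for audio we send, 24 kHz for audio we receive
    self.input_sample_rate = input_sample_rate
    self.output_sample_rate = output_sample_rate

    # One PyAudio instance drives both the microphone and the speaker streams
    self.pya = pyaudio.PyAudio()
    self.input_stream = None
    self.output_stream = None

    # Buffer for audio chunks received from Gemini, plus playback state
    self.audio_queue = deque()
    self.playback_task = None
    self.is_playing = False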

The initialize method is an asynchronous function that sets up the actual microphone input and speaker output streams using PyAudio. It uses asyncio.to_thread to run the blocking PyAudio calls in a separate thread, preventing them from halting our main asynchronous operations.

async def initialize(self):
    mic_info = self.pya.get_default_input_device_info()
    print(f"microphone used: {mic_info}")

    self.input_stream = await asyncio.to_thread(
        self.pya.open,
        format=FORMAT,
        channels=CHANNELS,
        rate=self.input_sample_rate,
        input=True,
        input_device_index=mic_info["index"],
        frames_per_buffer=CHUNK_SIZE,
    )

    self.output_stream = await asyncio.to_thread(
        self.pya.open,
        format=FORMAT,
        channels=CHANNELS,
        rate=self.output_sample_rate,
        output=True,
    )

The add_audio method is responsible for taking audio data received from Gemini and adding it to our audio_queue. If there isn't an active playback task, it creates one to start playing the queued audio.

def add_audio(self, audio_data):
    """Adds received audio data to the playback queue."""
    self.audio_queue.append(audio_data)
    # If playback isn't running, start it
    if self.playback_task is None or self.playback_task.done():
        self.playback_task = asyncio.create_task(self._play_audio())

The _play_audio method is an asynchronous helper that continuously takes audio chunks from the audio_queue and writes them to the output stream (your speakers) until the queue is empty. Again, asyncio.to_thread is used for the blocking write call.

async def _play_audio(self):
    """Plays audio chunks from the queue."""
    print("🗣️ Gemini talking...")
    while self.audio_queue:
        try:
            audio_data = self.audio_queue.popleft()
            await asyncio.to_thread(self.output_stream.write, audio_data)
        except Exception as e:
            print(f"Error playing audio: {e}")
            break  # Stop playback on error
    print("Playback queue empty.")
    self.playback_task = None  # Reset task when done

The interrupt method is crucial for handling user interruptions. When called, it clears the audio_queue (so any pending audio from Gemini isn't played) and cancels the ongoing playback_task.

def interrupt(self):
    """Handle interruption by stopping playback and clearing the queue."""
    self.audio_queue.clear()
    self.is_playing = False

    # Important: start from a clean state for the next response
    if self.playback_task and not self.playback_task.done():
        self.playback_task.cancel()
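One detail the snippets above don’t cover is shutting the streams down when you exit. As an optional sketch using PyAudio’s standard stop_stream, close, and terminate calls (not part of the original class), a cleanup helper could look like this:

def close(self):
    """Optional: release the audio streams when the conversation ends."""
    for stream in (self.input_stream, self.output_stream):
        if stream is not None:
            stream.stop_stream()
            stream.close()
    self.pya.terminate()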

The main audio_loop

Now, let’s orchestrate the entire logic with an asynchronous function, audio_loop.
This function is the heart of our audio-to-audio application. It initializes our AudioManager, connects to the Gemini Live API, and then uses asyncio.TaskGroup() to manage three crucial concurrent tasks:

  1. listen_for_audio()
    Captures audio from your microphone.
  2. process_and_send_audio()
    Sends your captured audio to Gemini.
  3. receive_and_play()
    Receives Gemini's audio response and plays it.

These tasks must run concurrently (at the same time) because a natural conversation involves listening, speaking, and processing responses simultaneously. If we waited for one task to finish before starting the next (running them sequentially), the interaction would be stilted and unnatural. Imagine having to wait for the AI to completely finish speaking before your microphone even starts listening. asyncio allows these operations to interleave, creating the illusion of simultaneous activity, which is vital for a low-latency, real-time experience.

The asyncio.TaskGroup() will help us run our audio handling tasks concurrently. We will dig deeper into each of the three tasks that we need.

async def audio_loop():
    audio_manager = AudioManager(
        input_sample_rate=SEND_SAMPLE_RATE, output_sample_rate=RECEIVE_SAMPLE_RATE
    )

    await audio_manager.initialize()

    async with (
        client.aio.live.connect(model=MODEL_ID, config=CONFIG) as session,
        asyncio.TaskGroup() as tg,
    ):

        # ... tasks will be defined and started here ...

        # Start all tasks with proper task creation
        tg.create_task(listen_for_audio())
        tg.create_task(process_and_send_audio())
        tg.create_task(receive_and_play())

We also set up an asyncio.Queue called audio_queue. This queue acts as a buffer between capturing your voice and sending it to Gemini, which helps in managing the audio flow smoothly.

audio_queue = asyncio.Queue()

Task 1: Capturing Your Voice

This asynchronous function continuously reads audio data from your microphone (via the audio_manager) in small chunks.

Each chunk is then put into the audio_queue to be processed by the next task.

async def listen_for_audio():
    """Just captures audio and puts it in the queue."""
    while True:
        data = await asyncio.to_thread(
            audio_manager.input_stream.read,
            CHUNK_SIZE,
            exception_on_overflow=False,
        )
        await audio_queue.put(data)

Task 2: Sending Your Voice to Gemini

This task waits for audio chunks to appear in the audio_queue. Once a chunk is available, it sends it to the Gemini Live API using session.send_realtime_input().

The send_realtime_input method is specifically designed for streaming media like audio chunks with low latency, essential for a conversational feel. audio_queue.task_done() signals that the item from the queue has been processed.

async def process_and_send_audio():
    while True:
        # Wait for the next microphone chunk captured by listen_for_audio()
        data = await audio_queue.get()

        # Stream the raw PCM chunk to Gemini with low latency
        # (depending on your google-genai version, the argument may be
        # audio=types.Blob(data=data, mime_type="audio/pcm;rate=16000") instead)
        await session.send_realtime_input(
            media={
                "data": data,
                "mime_type": "audio/pcm",
            }
        )

        # Mark the queue item as processed
        audio_queue.task_done()
Task 3: Hearing Gemini's Response

This is where we listen for and process Gemini’s responses. The function continuously iterates (hence the while True) over the messages arriving asynchronously from the session.receive() stream.

We check if response.server_content.interrupted is True. If so, it means Gemini’s Voice Activity Detection (VAD) on the server side detected you speaking and stopped its own output (more on that later in the article). We then call our local audio_manager.interrupt() to clear any of Gemini’s audio that’s already in our playback queue.

If there’s audio data in part.inline_data.data, we pass it to audio_manager.add_audio() for playback.

The server_content.turn_complete flag signals that Gemini has finally finished talking =).

async def receive_and_play():
    while True:
        async for response in session.receive():
            server_content = response.server_content

            if (
                hasattr(server_content, "interrupted")
                and server_content.interrupted
            ):
                print("🤐 INTERRUPTION DETECTED")
                audio_manager.interrupt()

            if server_content and server_content.model_turn:
                for part in server_content.model_turn.parts:
                    if part.inline_data:
                        audio_manager.add_audio(part.inline_data.data)

            if server_content and server_content.turn_complete:
                print("✅ Gemini done talking")

Running the Show

To bring it all to life, run the complete Python script. The AudioManager is designed to detect and use your system's default microphone automatically.
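The snippets above don’t show the entry point itself; a minimal way to kick everything off (just the standard asyncio entry point, wrapped so Ctrl+C exits cleanly) looks like this:

if __name__ == "__main__":
    try:
        asyncio.run(audio_loop())
    except KeyboardInterrupt:
        print("Conversation ended.")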

🎙️ You can start speaking now, listen to the wonderful AI voice, or rudely interrupt your AI.

But wait, there is a little more on interruptions.

How Interruptions Work with VAD

A key feature of the Gemini Multimodal Live API (I assume it’s now just called the Live API) is its ability to detect when a user speaks while the model is generating a response. This is often referred to as Voice Activity Detection (VAD).

Here’s how it generally works:

  • User Interrupts
    You can interrupt the model’s output at any time by simply starting to speak.
  • VAD on the Server
    The Live API’s VAD system detects this new voice activity from your incoming audio stream.
  • Generation Canceled
    Once the server detects an interruption, the ongoing generation of the model’s current response is canceled and discarded. This means Gemini stops “thinking” about what it was going to say next.
  • History Retained
    Only the information already sent to the client (i.e., the audio chunks of Gemini’s voice that you’ve already received) is retained in the session history. The parts of the response that were cut off are not.
  • Server Notification
    The server then sends a message to your application, specifically flagging that an interruption occurred. You'll see this in the response stream, and this is what we use to properly handle that in our application.
async for response in session.receive():
    if response.server_content and response.server_content.interrupted is True:
        # The generation was interrupted by the user.
        audio_manager.interrupt()

So, the Gemini Live API handles the detection of the interruption and the stopping of its own generation. Our AudioManager’s interrupt() method is then responsible for the application-side cleanup, primarily by clearing its own playback queue so that your application doesn’t continue to play the already-received (now irrelevant) audio chunks from Gemini.

Why This is So Cool

This is more than just a text-to-speech and speech-to-text pipeline glued together. It’s a genuinely interactive voice experience:

  • True Bidirectional Streaming
    Audio flows continuously in both directions, creating a fluid conversation.
  • Low Latency
    The delay between you speaking and Gemini responding is minimal.
  • Interruptibility
    Gemini listens and reacts if you interject, making the conversation feel far more natural and human-like.

Get The Full Code 💻

Want to dive straight into the complete, runnable code? You can find the entire Python script for this audio-to-audio implementation, including the AudioManager and the audio_loop we've discussed, in this GitHub repository:

No comments, no readme, consider this article the documentation.

What’s Next in this series of articles?

With this audio-to-audio solution, we’re much closer to building that helpful assistant for assembling items from the big 🇸🇪 Swedish furniture store where you always buy a candle.

Imagine:

  • You:
    “Okay Gemini, I’m stuck on step 5, the part with the cam locks.” (You’re sending audio).
  • Gemini:
    “Alright, step 5 can be a bit tricky! Are you looking at diagram B on page…” (Gemini is sending audio).
  • You:
    (Interrupting) “Wait, diagram B? I thought it was diagram C.”
  • Gemini:
    (Stops talking) “Here you see! Are you using the cam locks in the right place?”

This kind of natural, interruptible dialogue is now possible.

From here, the next parts of this series will explore how to:

  • Integrate video input to show Gemini the furniture parts.
  • Use the Live API’s “Tools” feature to have Gemini fetch specific instructions.

You Made It To The End! (Or Did You Just Scroll Down?)

Either way, I hope you enjoyed the article.

Got thoughts? Feedback? Discovered a bug while running the code? I’d love to hear about it.

  • Connect with me on LinkedIn. Let’s network! Send a connection request, tell me what you’re working on, or just say hi.
  • AND Subscribe to my YouTube Channel ❤️

Written by Sascha Heyer

Hi, I am Sascha, Senior Machine Learning Engineer at @DoiT. Support me by becoming a Medium member 🙏 bit.ly/sascha-support