Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

A Developer's Guide to Google's Multimodal Live API

7 min read · May 8, 2025


We’re moving beyond simple text prompts into truly dynamic, real-time voice interactions.

AI that can understand your voice, see what you see through your camera, and respond instantly, just like a natural conversation. And yes, you can finally do the rude thing and interrupt your AI. This is what's possible with Google's Multimodal Live API.

If this is all new for you, please watch the two videos below. The first one shows how Google puts this into production at a large scale with Gemini Live on Google's Pixel 9. And the second one is Patrick Marlow at Google Cloud Next 25, showcasing a voice-driven agent for an e-commerce website.

But how do you build such an application? What technical pieces are involved in handling live audio and video streams, managing conversation turns, and integrating external tools?

Let me be frank with you: building a bi-directional multimodal streaming application is more complex than a simple text prompt. But don't worry, I've got you. We'll do this step by step.

Before we get into any code and the API itself, head to https://aistudio.google.com/live and try it. It is amazing. If you are convinced, come back and I will show you how to build this into your own applications.

What are we building in this series?

This is a series of articles, with this one being the first. We'll build up step by step until we have a full working solution. So stick around.

Probably every one of you has already built some furniture from that big Swedish company that also sells a lot of candles. And while everyone always complains that parts are missing, I have never had this issue. However, I sometimes get stuck in the process of building the furniture.

Let's re-imagine the support of that 🇸🇪 company by building an application that takes the furniture instructions, lets us live-stream audio and video of our building process, and helps us along the way with voice tips and instructions.

Little sneak peek of what is coming in this series. Follow me on LinkedIn if you don’t want to miss it.

The API in a Nutshell

It’s Bi-Directional Streaming, meaning data flows continuously in both directions.

From you to the AI and back again over an open connection. This is what creates that low-latency, conversational feel. You can send information, the AI can start responding, and you can even jump in with more data or interrupt it, just like a real chat.

And it’s Multimodal, which is the fancy way of saying it handles multiple types of data.

For input, you can feed the API good old typed messages (Text), your voice live (Audio), and even live streams from a camera or your screen (Video).

For output, the API can respond with classic written responses (Text) and natural-sounding spoken replies (Audio).

I always say keep things simple until you have to make them complicated, and that's precisely what we'll do.

Let's start with the fun part and let the API generate text and audio/voice.

A Quick Word on async and await

You will notice the code below uses async and await. This is Python's way of handling asynchronous operations.

Why do we need this?

When our program communicates over a network (like talking to the Gemini API), it often waits for a response. Instead of freezing everything while waiting (which is what traditional, synchronous code does), async/await allows the program to work on other tasks and come back to the network operation once it's ready. This is crucial for responsive applications, especially for a "Live" API where we want quick back-and-forth interactions.
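Here is a tiny, self-contained sketch (plain asyncio, nothing Gemini-specific) showing the idea: two simulated network calls overlap instead of blocking each other.

import asyncio

async def fake_request(name: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on a network response
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # Both "requests" run concurrently, so this takes ~2 seconds, not 3
    results = await asyncio.gather(
        fake_request("request A", 1),
        fake_request("request B", 2),
    )
    for line in results:
        print(line)

asyncio.run(main())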

Getting Started with the Google Gen AI SDK (and yes, you will already hear Gemini speak, so stay tuned)

The simplest way to start interacting with the Multimodal Live API, especially for features like text-to-speech or basic text generation, is by using the official Google Gen AI SDK for Python. It handles a lot of the underlying complexity for you.

1. Initializing the Client (Vertex AI vs. Gemini API Key)

This is all nicely packed into the new google-genai library. First, install it:

pip install -U -q google-genai

Next, you’ll import the library into your Python script and set up some essential configuration variables. Please pay attention to the model we use, gemini-2.0-flash-live-preview-04-09, a model optimized for live interaction.

from google import genai

PROJECT_ID = "sascha-playground-doit"
LOCATION = "us-central1"
MODEL_ID = "gemini-2.0-flash-live-preview-04-09"

The SDK can connect to Google’s AI services in two main ways:

Through Vertex AI (Recommended for Google Cloud users)
This method uses your Google Cloud project credentials (usually set up via gcloud auth application-default login). It's the standard way to use Gemini models within the Google Cloud ecosystem and the recommended approach for companies.

Using a Gemini API Key (For Google AI Studio users)
If you’re working primarily through Google AI Studio, you might have an API key. If you are using Google AI Studio, you must use a different model ID: gemini-2.0-flash-exp.

# Option 1: Google Cloud (Vertex AI)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Option 2: Google AI Studio (Gemini API key)
client = genai.Client(api_key=GEMINI_API_KEY)
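If you go the API-key route, GEMINI_API_KEY has to come from somewhere. A common pattern (my assumption here, not part of the snippet above) is to read it from an environment variable so the key never ends up in your code:

import os

# Assumes you exported the key beforehand, e.g. export GEMINI_API_KEY="..."
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

client = genai.Client(api_key=GEMINI_API_KEY)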

So far, nothing has been specific to the Live API, so let's change that.

2. Configure Text or Audio

Before establishing the connection and sending/receiving data, we need to tell the API how we want the interaction to behave. This is done using the LiveConnectConfig object from the SDK.

Remember how we mentioned that the API can output text or audio (voice)?

If you want the text-based experience, you configure it like this:

from google.genai.types import LiveConnectConfig

# For TEXT responses
text_config = LiveConnectConfig(
    response_modalities=["TEXT"]
)

If you want Gemini to speak, you need to specify "AUDIO" as the modality and provide some speech configuration details:

from google.genai.types import (
    LiveConnectConfig,
    SpeechConfig,
    VoiceConfig,
    PrebuiltVoiceConfig,
)

voice_name = "Aoede"

# For AUDIO responses
audio_config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(voice_name=voice_name)
        )
    ),
)

Here, response_modalities=["AUDIO"] requests voice output. Even though the SDK accepts a list of response modalities, as of May 2025 it only supports one modality at a time.

The speech_config allows you to customize the voice (voice_name).

We’ll use this config object (either text_config or audio_config) when we establish the connection in the next step.
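If you want to switch between the two while experimenting, a simple flag is enough; this is just my own convention, not anything the SDK requires:

# Toggle between spoken and written responses while experimenting
USE_AUDIO = True

config = audio_config if USE_AUDIO else text_config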

3. Connecting and Sending Your First Message

With our client initialized and the connection configured (let's assume we're using text_config for this example), we're ready to open the communication channel and send our first piece of input.

from google.genai.types import Content, Part

text_input = """I am visiting Berlin in June
what would you recommend to do there?"""

# Establish the asynchronous connection using the client and configuration
# 'async with' ensures the connection is closed properly afterwards
async with client.aio.live.connect(model=MODEL_ID, config=text_config) as session:

    # Send the user's text input to the API
    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

First, we need to establish the connection to the Live API.
That happens with async with client.aio.live.connect(...) as session:.

This uses our initialized client, specifies the model, and passes our connection config (text_config or audio_config).

The session object is the active connection, and the async with block ensures it's properly closed later.

Secondly, we use the session to send our input, structured as a user turn containing the text Part, to the API using await session.send_client_content(...).

After this code runs, the connection is open, and our message is on its way to Gemini. The next step, which we’ll cover shortly, is to listen for the response coming back through the same session.

4. Receiving and Processing Responses

Inside the session.receive() loop, we need to check the message to either process the audio or text response from the Live API.

It's easy for text, but audio data is nested deeper within the message structure, and additionally, we need to work with audio chunks.

import numpy as np
import soundfile as sf

audio_data = []

async for message in session.receive():

    # Handle text output
    if message.text:
        print(message.text, end="", flush=True)

    # Handle audio output (the audio arrives in 16-bit PCM chunks)
    if message.server_content and message.server_content.model_turn and message.server_content.model_turn.parts:
        for part in message.server_content.model_turn.parts:
            if part.inline_data:
                audio_data.append(np.frombuffer(part.inline_data.data, dtype=np.int16))

For this example, we can combine and save the chunks into an audio file.

if audio_data:
    audio = np.concatenate(audio_data)
    # The Live API returns audio at a 24 kHz sample rate
    sf.write("output.wav", audio, 24000)

This is a generated voice response based on our text input from above.
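To tie everything together, here is one way to wrap the snippets from this article into a single runnable script. The helper name generate_audio and the overall glue are my own; the SDK calls are exactly the ones used above.

import asyncio

import numpy as np
import soundfile as sf
from google import genai
from google.genai.types import (
    Content,
    LiveConnectConfig,
    Part,
    PrebuiltVoiceConfig,
    SpeechConfig,
    VoiceConfig,
)

PROJECT_ID = "sascha-playground-doit"
LOCATION = "us-central1"
MODEL_ID = "gemini-2.0-flash-live-preview-04-09"

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

audio_config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
)

async def generate_audio(prompt: str, output_file: str = "output.wav"):
    audio_data = []

    async with client.aio.live.connect(model=MODEL_ID, config=audio_config) as session:
        # Send the prompt as a user turn
        await session.send_client_content(
            turns=Content(role="user", parts=[Part(text=prompt)])
        )

        # Collect the audio chunks as they stream in
        async for message in session.receive():
            if message.server_content and message.server_content.model_turn:
                for part in message.server_content.model_turn.parts or []:
                    if part.inline_data:
                        audio_data.append(
                            np.frombuffer(part.inline_data.data, dtype=np.int16)
                        )

    if audio_data:
        # The Live API returns 16-bit PCM audio at 24 kHz
        sf.write(output_file, np.concatenate(audio_data), 24000)

asyncio.run(generate_audio("I am visiting Berlin in June, what would you recommend to do there?"))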

What’s Next in this series of articles?

Alright, you’ve dipped your toes into Google’s Multimodal Live API.
But let’s be honest, making an AI say “hello” is not exciting enough.

In Part 2 of this series, we implement a full Audio-to-Audio Conversation, a proper real-time voice chat.

Ever wanted to interrupt an AI without feeling rude? We’ll show you how the Live API handles natural interruptions, making conversations feel incredibly fluid.

We’ll get even cozier with Python’s asynchronous capabilities to manage the real-time audio streams.

Head over to the second article:

Get The Full Code 💻

Want to dive straight into the complete, runnable code? You can find the entire Python script for this article series on GitHub (we are adding more code along the way).

You Made It To The End! (Or Did You Just Scroll Down?)

Either way, I hope you enjoyed the article.

Got thoughts? Feedback? Discovered a bug while running the code? I’d love to hear about it.

  • Connect with me on LinkedIn. Let’s network! Send a connection request, tell me what you’re working on, or just say hi.
  • AND Subscribe to my YouTube Channel ❤️


Written by Sascha Heyer

Hi, I am Sascha, Senior Machine Learning Engineer at @DoiT. Support me by becoming a Medium member 🙏 bit.ly/sascha-support