Combining Vertex AI Live API with a User Interface
From a Local Python Script to a Full Application: Integrating Frontend, Backend, and WebSockets
Welcome back. In the second article of this series we implemented real-time, bidirectional audio-to-audio conversation purely in Python. But let’s be honest, running a Python script from the command line isn’t exactly how you’d deploy a user-facing application.
How do we put that cool audio interaction behind a user-friendly interface that anyone can access through their browser?
That’s precisely what we’re tackling today.
We’re moving beyond the local machine and building a complete web application. This involves:
- A frontend web page (HTML, CSS, JavaScript) that runs in the user’s browser, capturing their voice and playing back the AI’s response.
- A backend Python server that acts as the bridge between the user’s browser and the powerful Google Live API.
- WebSockets that enable real-time, two-way data flow between the frontend and backend.
Remember our helpful assistant for assembling furniture from that big 🇸🇪 Swedish store? We’re now building the web interface for that assistant.
Quick side note: The sleek, Swedish furniture store-inspired UI you’ll see in the code was actually vibe-coded entirely using Google Gemini Canvas with the Gemini 2.5 Pro model! It’s incredible how quickly you can prototype these days.
Why a Web Interface? And Why WebSockets?
While our previous Python script worked, it isn’t something you could put in front of real users. A web application provides a familiar browser interface for everyone. However, standard HTTP requests aren’t suitable for real-time audio streaming because of their request-response nature. We need a persistent, two-way communication channel.
This is where WebSockets come in. They establish a continuous connection between the browser (frontend) and our Python server (backend), allowing audio data to flow freely in both directions without the latency of repeated HTTP requests. This is essential for the fluid, conversational experience we need.
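To see how little code the WebSocket part itself needs, here is a minimal, self-contained sketch using the Python websockets library. This is an illustration of the persistent, two-way channel, not the actual server.py from the repo: one long-lived connection per browser, with messages flowing in both directions over it.

```python
# Minimal sketch of a persistent, two-way channel with the `websockets`
# library (pip install websockets). Illustrative only, not the project's server.py.
import asyncio
import websockets


async def handle_client(websocket):
    # The connection stays open: we can receive and send at any time,
    # without a new HTTP request for each exchange.
    async for message in websocket:
        await websocket.send(f"server received {len(message)} bytes")


async def main():
    # Recent versions of the websockets library accept a single-argument handler.
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```

Our real backend follows the same pattern; it just relays audio instead of echoing text.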
For an introduction to WebSockets, check out my previous article.
The New Architecture: Client, Server, and API
Our application now has three main parts working together:
Frontend (index.html, audio-client.js)
The Frontend is responsible for the user interface, displaying the chat messages and buttons. It utilizes the browser’s built-in Web Audio API for capturing audio from the microphone and for playing back the audio responses received from the server. Communication with the backend happens exclusively via WebSockets, sending the user's audio and receiving the AI's audio and text responses.
Backend (server.py)
The Backend runs a WebSocket server that listens for browser connections. For each user, it acts as a Live API Client, connecting to the Google Multimodal Live API via the google-genai SDK. It functions as a Proxy/Bridge, relaying audio data from the browser's WebSocket to the Live API and forwarding responses (audio, text, status) from the Live API back to the correct browser. It also handles Session Management, supporting multiple simultaneous users and using the Live API's session resumption capabilities.
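The full logic lives in server.py in the repo; the sketch below shows just the proxy/bridge idea, assuming the google-genai SDK's async Live API client and the websockets library. The message envelope, model name, and exact SDK method names (client.aio.live.connect, send_realtime_input, session.receive) are assumptions that may differ from the repo and across SDK versions.

```python
# Simplified sketch of the proxy/bridge pattern: one Live API session per
# browser connection, with two concurrent tasks relaying data in each direction.
import asyncio
import base64
import json

import websockets
from google import genai
from google.genai import types

# Assumes project/credentials are configured via environment variables
# (e.g. GOOGLE_GENAI_USE_VERTEXAI, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION).
client = genai.Client()
MODEL = "gemini-2.0-flash-live-001"  # illustrative model name


async def handle_browser(websocket):
    """One browser connection == one Live API session, relayed in both directions."""
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:

        async def browser_to_gemini():
            # The browser sends JSON frames with Base64-encoded PCM audio (assumed envelope).
            async for raw in websocket:
                msg = json.loads(raw)
                pcm = base64.b64decode(msg["audio"])
                await session.send_realtime_input(
                    audio=types.Blob(data=pcm, mime_type="audio/pcm;rate=16000")
                )

        async def gemini_to_browser():
            # receive() yields messages for one model turn at a time, so loop forever.
            while True:
                async for response in session.receive():
                    if response.data:  # audio bytes from the model
                        await websocket.send(json.dumps(
                            {"audio": base64.b64encode(response.data).decode("ascii")}
                        ))

        # Run both relay directions concurrently until the browser disconnects.
        await asyncio.gather(browser_to_gemini(), gemini_to_browser())


async def main():
    async with websockets.serve(handle_browser, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever


if __name__ == "__main__":
    asyncio.run(main())
```

Notice that the server never touches audio hardware; it only moves bytes between two long-lived connections.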
Google Multimodal Live API
This remains the core engine, a dedicated API + multimodal model. It processes the incoming audio and video stream, performs reasoning and generation, and produces the audio response stream. Beyond this core audio interaction, the API offers features like interruption detection (allowing users to speak over the AI), session management (for maintaining context across interactions or reconnections), input and output transcriptions (providing text versions of the conversation), and tool/function calling capabilities (enabling the AI to interact with external systems), many of which we explore throughout this series.
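Several of these features are switched on through the connection config. The sketch below shows how that might look with the google-genai SDK; the exact field names can vary between SDK versions, so treat it as illustrative rather than authoritative.

```python
# Illustrative Live API connection config (google-genai SDK); field names
# may differ across SDK versions.
from google.genai import types

live_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Text transcripts of what the user said and what the model replied.
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
    # Lets the server return a handle so a dropped session can be resumed later.
    session_resumption=types.SessionResumptionConfig(),
    # Tools the model may call during the conversation.
    tools=[{"google_search": {}}],
)
```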
Key Changes From PyAudio to Web Audio
The most significant shift is how we handle audio. In our previous implementation from the second article in the series, our Python script used the PyAudio library and an AudioManager class to directly interact with the microphone and speakers on the machine running the script. In this web application, the responsibilities change:
- Audio Input/Output Moves to the Browser: The audio-client.js script running in the user's browser now handles all audio capture and playback using the standard Web Audio API. This is a huge advantage as it removes the need for special audio libraries (like PyAudio) on the server, leverages the browser's built-in capabilities for a simpler backend, and works across different operating systems as long as the browser supports the Web Audio API.
- Server Becomes a Relay: The Python server.py no longer deals with raw audio hardware. Its job is purely to relay data: receive audio chunks (encoded as Base64 strings) from the browser via WebSocket, send these chunks to the Live API, receive audio chunks back from the Live API, and encode these chunks (again, Base64) to send them back to the browser via WebSocket (see the sketch after this list).
- WebSocket Communication: All interaction between the user’s interface and our Python logic now happens over the persistent WebSocket connection, enabling the real-time flow.
- UI Integration: We now have index.html providing the visual chat interface, styled with Tailwind CSS, making the interaction much more intuitive than the previous command-line approach.
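One small but important detail behind the relay bullet: the WebSocket messages here are JSON text frames, and JSON can't carry raw bytes, which is why each PCM chunk gets Base64-encoded on the way through. A tiny sketch of that round trip (the envelope keys are an assumption, not taken from the repo):

```python
# Why Base64? JSON text frames can't hold raw bytes, so each PCM audio chunk
# is wrapped as a Base64 string before it crosses the WebSocket.
import base64
import json


def browser_message_from_pcm(pcm_chunk: bytes) -> str:
    """Encode a raw PCM chunk into a JSON text frame for the browser."""
    return json.dumps({"type": "audio", "audio": base64.b64encode(pcm_chunk).decode("ascii")})


def pcm_from_browser_message(raw: str) -> bytes:
    """Decode a JSON text frame from the browser back into raw PCM bytes."""
    msg = json.loads(raw)
    return base64.b64decode(msg["audio"])


# Round-trip example: 10 ms of silence at 16 kHz, 16-bit mono.
chunk = b"\x00\x00" * 160
assert pcm_from_browser_message(browser_message_from_pcm(chunk)) == chunk
```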
Conclusion: A More Realistic Implementation
We’ve successfully transitioned our real-time audio AI from a local Python script to a fully functional web application. By leveraging WebSockets for communication and the browser’s Web Audio API for audio handling, we’ve created a more accessible, scalable, and user-friendly experience.
The backend server efficiently bridges the browser and the powerful Google Live API, managing connections and relaying data.
This architecture is much closer to how you’d build and deploy such features in a production environment. In upcoming articles, we will cover deploying this to Google Cloud. And if you want to know how to combine this with Google’s Agent Development Kit, check this out:
To explore previous and future articles in this series, head over to the article series overview.
Get The Full Code 💻
The concepts and snippets discussed here will be integrated into our ongoing project. You can find the complete Python scripts and follow along with the developments in our GitHub repository:
You Made It To The End. (Or Did You Just Scroll Down?)
Either way, I hope you enjoyed the article.
Got thoughts? Feedback? Discovered a bug while running the code? I’d love to hear about it.
- Connect with me on LinkedIn. Let’s network! Send a connection request, tell me what you’re working on, or just say hi.
- AND Subscribe to my YouTube Channel ❤️