Building a Real-Time Voice Assistant Application with FastAPI, Groq, and the OpenAI TTS API

Plaban Nayak · Published in The AI Forum · Dec 1, 2024
Application Workflow

Introduction

In this article, I’ll walk you through building a sophisticated voice chat application that combines real-time audio processing, speech recognition, natural language processing, and text-to-speech synthesis. The application uses FastAPI for the backend and integrates multiple AI services to create a seamless voice-based interaction system.

Technical Architecture

Core Technologies

  • FastAPI: A modern, fast web framework for building APIs with Python
  • WebSocket: For real-time bidirectional communication
  • Groq API: For speech-to-text and chat completions
  • OpenAI API: For text-to-speech synthesis
  • Custom Voice Detection: For real-time voice activity detection

The Processing Pipeline

Our application implements a sophisticated processing pipeline:

  1. Voice Activity Detection

The system continuously monitors audio input for voice activity, efficiently managing audio streaming. Voice Activity Detection (VAD) is a crucial component in our real-time voice chat application. It’s responsible for distinguishing between speech and non-speech segments in an audio stream, allowing the application to process only meaningful voice input.

How Voice Detection Works

* Frame-Based Analysis

The system processes audio in small frames (typically 10–30 ms each) to make real-time decisions about voice activity. Each frame is analyzed for (see the sketch after this list):

  • Energy levels
  • Frequency characteristics
  • Zero-crossing rate
  • Spectral features
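In the application itself the heavy lifting is delegated to webrtcvad (shown in main.py later), but a minimal sketch of two of these per-frame features, short-term energy and zero-crossing rate, looks roughly like this. The function name and the all-zero example frame are illustrative only:

import numpy as np

def frame_features(frame: np.ndarray) -> tuple[float, float]:
    """Illustrative per-frame features: short-term energy and zero-crossing rate."""
    samples = frame.astype(np.float64)
    energy = float(np.mean(samples ** 2))                        # average signal power
    zcr = float(np.mean(np.abs(np.diff(np.sign(samples)))) / 2)  # sign changes per sample transition
    return energy, zcr

# A 30 ms frame at 16 kHz, 16-bit mono is 480 samples
silent_frame = np.zeros(480, dtype=np.int16)
print(frame_features(silent_frame))  # (0.0, 0.0) for pure silence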

* Silence Detection

The system:

  • Tracks periods of silence
  • Uses a configurable threshold (1.5 seconds)
  • Maintains a running buffer of audio data

* Buffer Management

Key aspects:

  • Dynamic buffer sizing
  • Efficient memory management
  • Real-time processing capabilities
voice_detector = VoiceDetector()
voice_detected = voice_detector.detect_voice(data)

2. Speech-to-Text (Groq Whisper)

Audio is transcribed with the Whisper model (whisper-large-v3-turbo) via the Groq API.

async def transcribe_audio(audio_data: bytes):
    """Transcribe audio using Groq's Whisper model"""
    temp_wav = None
    try:
        # Create a unique temporary file
        temp_wav = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
        wav_path = temp_wav.name
        temp_wav.close()  # Close the file handle immediately

        # Write the WAV file
        with wave.open(wav_path, 'wb') as wav_file:
            wav_file.setnchannels(1)      # Mono
            wav_file.setsampwidth(2)      # 2 bytes per sample (16-bit)
            wav_file.setframerate(16000)  # 16 kHz
            wav_file.writeframes(audio_data)

        # Transcribe using Groq
        with open(wav_path, 'rb') as audio_file:
            response = await groq_client.audio.transcriptions.create(
                model="whisper-large-v3-turbo",
                file=audio_file,
                response_format="text"
            )

        return response

    except Exception as e:
        logger.error(f"Transcription error: {str(e)}")
        return None
    finally:
        # Clean up the temporary file
        if temp_wav is not None:
            try:
                os.unlink(temp_wav.name)
            except Exception as e:
                logger.error(f"Error deleting temporary file: {str(e)}")
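If you want to try the transcription step on its own, a quick sketch follows. It assumes a file named sample.wav that is already 16 kHz, mono, 16-bit PCM; the filename is just an example:

import asyncio
import wave

async def main():
    # Read the raw PCM frames from an existing 16 kHz mono 16-bit WAV file
    with wave.open("sample.wav", "rb") as wav_file:
        pcm_data = wav_file.readframes(wav_file.getnframes())

    text = await transcribe_audio(pcm_data)
    print(text)

asyncio.run(main())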

3. Response Generation

The transcribed text is sent to llama-3.1-70b-versatile via the Groq API to generate a response to the user's request.

async def get_chat_response(text: str):
    """Get chat response from Groq"""
    try:
        response = await groq_client.chat.completions.create(
            model="llama-3.1-70b-versatile",
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Please provide a clear, concise, and accurate response."},
                {"role": "user", "content": text}
            ],
            temperature=0,
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"Chat response error: {str(e)}")
        return None

4. Text-to-Speech (OpenAI)

The system converts responses back to speech using OpenAI’s TTS service.

async def generate_speech(text: str):
    """Generate speech using OpenAI TTS"""
    try:
        response = await openai_client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=text
        )

        # Get the speech data directly from the response
        # No need to await response.read() as the response is already the audio data
        return response.content
    except Exception as e:
        logger.error(f"Speech generation error: {str(e)}")
        return None
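A quick way to sanity-check this function in isolation is to write the returned bytes to disk; tts-1 returns MP3 audio by default. The output filename below is just an example:

import asyncio

async def main():
    audio_bytes = await generate_speech("Hello! How can I help you today?")
    if audio_bytes:
        with open("reply.mp3", "wb") as f:  # tts-1 returns MP3 by default
            f.write(audio_bytes)

asyncio.run(main())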

Key Features

1. Real-Time Processing

The application uses WebSocket connections for real-time audio streaming and processing. This enables:

  • Immediate voice activity detection
  • Continuous audio buffering
  • Dynamic silence detection
  • Real-time response generation

2. Intelligent Audio Management

max_silence_duration = 1.5  # seconds
frames_per_second = 1000 / voice_detector.frame_duration

The system intelligently manages audio buffering and silence detection to provide a natural conversation flow.
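With the defaults used here (30 ms frames and a 1.5-second silence threshold), the cut-off works out to roughly 50 consecutive silent chunks before the buffer is flushed:

frame_duration = 30                        # ms per frame (VoiceDetector default)
max_silence_duration = 1.5                 # seconds of silence that ends an utterance
frames_per_second = 1000 / frame_duration  # ~33.3
max_silence_frames = int(max_silence_duration * frames_per_second)
print(max_silence_frames)                  # 50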

Implementation Details

WebSocket Handler

The WebSocket endpoint manages the entire conversation flow:

  1. Accepts incoming audio streams
  2. Buffers audio data
  3. Detects voice activity
  4. Processes complete utterances
  5. Generates and returns responses

Audio Processing

The system handles audio processing with careful attention to:

  • Proper file handling
  • Memory management
  • Resource cleanup
  • Format conversion (see the sketch below)
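The format-conversion point deserves a concrete illustration: browsers typically capture audio as 32-bit floats in [-1.0, 1.0], while webrtcvad and the WAV writer expect 16-bit signed integers. A minimal sketch of that conversion step is shown below; the VoiceDetector._convert_audio_data helper in the full listing does this, plus fallbacks for data that is already int16:

import numpy as np

def float32_to_pcm16(audio_data: bytes) -> np.ndarray:
    """Convert 32-bit float samples in [-1.0, 1.0] to 16-bit signed PCM."""
    float_array = np.frombuffer(audio_data, dtype=np.float32)
    return (float_array * 32767).astype(np.int16)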

Error Handling and Logging

Comprehensive error handling and logging ensure system reliability and ease of debugging.

Best Practices Implemented

1. Asynchronous Processing

  • Uses FastAPI’s async capabilities
  • Implements efficient resource management
  • Handles concurrent connections

2. Error Handling

  • Comprehensive try-except blocks
  • Proper resource cleanup
  • Detailed error logging

3. Configuration Management

  • Environment variable handling
  • External configuration file
  • Secure API key management

Code Implementation

Code folder structure

project_root/
├── main.py # Main FastAPI application
├── config.py # Configuration settings
├── requirements.txt # Python dependencies

├── static/ # Static files directory
│ ├── css/
│ ├── js/
│ └── assets/

├── templates/ # Jinja2 templates
│ └── index.html # Main template file

├── voice_modules/ # Voice processing modules
│ ├── __init__.py
│ └── realtime_voice_detection.py # Voice detection implementation

└── README.md # Project documentation

This structure follows FastAPI’s recommended practices for organizing web applications with:

  • Clear separation of concerns
  • Modular organization
  • Easy maintenance and scalability
  • Secure configuration management

Set up API keys

The OpenAI and Groq API keys are stored in config.py as key-value pairs:

config = {"OPENAI_API_KEY":"...","GROQ_API_KEY":"..."}
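Hard-coding keys works for a local demo, but since main.py already calls load_dotenv(), a slightly safer variant (a sketch, not part of the original project) is to keep the keys in a .env file that stays out of version control and have config.py read them from the environment:

# config.py -- alternative sketch: pull the keys from a .env file / the environment
import os
from dotenv import load_dotenv

load_dotenv()  # reads a local .env file if present

config = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
    "GROQ_API_KEY": os.getenv("GROQ_API_KEY", ""),
}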

FastAPI implementation

The static folder is an essential part of a FastAPI application, providing a structured way to manage and serve static assets. By organizing CSS, JavaScript, and other media files, developers can create a responsive and visually appealing user interface while maintaining a clean separation between frontend and backend code.

static/
├── css/
│ └── style.css # Styling for the voice chat interface

├── js/
│ └── app.js # Client-side WebSocket handling and audio processing

└── assets/
└── audio/ # Audio-related assets (if any)

Key Components:

1. JavaScript (app.js):

  • WebSocket connection management
  • Audio recording and streaming
  • Voice activity visualization
  • Handling server responses (transcriptions and AI responses)

2. CSS (style.css):

  • Voice chat interface styling
  • Visual feedback for voice detection
  • Response display formatting

The static files are mounted in main.py using:

app.mount("/static", StaticFiles(directory="static"), name="static")

This setup enables the real-time voice chat interface to function smoothly with WebSocket connections and provide visual feedback for voice detection and responses.

Code Logic Implementation

main.py

from fastapi import FastAPI, WebSocket, Request, WebSocketDisconnect
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
from fastapi.responses import HTMLResponse
import uvicorn
import json
import asyncio
import logging
import numpy as np
from openai import AsyncOpenAI
import os
from dotenv import load_dotenv
import tempfile
import wave
from config import config
#
import webrtcvad
import struct
#
# Load environment variables
load_dotenv()

os.environ["OPENAI_API_KEY"] = config.get("OPENAI_API_KEY")
os.environ["GROQ_API_KEY"] = config.get("GROQ_API_KEY")


## Audio Detection
######################################################################
class VoiceDetector:
    def __init__(self, sample_rate=16000, frame_duration=30):
        self.vad = webrtcvad.Vad(2)  # Reduced aggressiveness for better continuous speech detection
        self.sample_rate = sample_rate
        self.frame_duration = frame_duration
        self.frame_size = int(sample_rate * frame_duration / 1000)
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
        self.silence_frames = 0
        self.max_silence_frames = 15  # Allow more silence between words
        self.min_speech_frames = 3    # Require minimum speech frames to avoid spurious detections
        self.speech_frames = 0
        self.is_speaking = False

    def _frame_generator(self, audio_data):
        """Generate audio frames from raw audio data."""
        if len(audio_data) < self.frame_size:
            self.logger.warning(f"Audio data too short: {len(audio_data)} bytes")
            return []

        n = len(audio_data)
        offset = 0
        frames = []
        while offset + self.frame_size <= n:
            frames.append(audio_data[offset:offset + self.frame_size])
            offset += self.frame_size
        return frames

    def _convert_audio_data(self, audio_data):
        """Convert audio data to the correct format."""
        try:
            # First try to interpret as float32
            float_array = np.frombuffer(audio_data, dtype=np.float32)
            # Convert float32 [-1.0, 1.0] to int16 [-32768, 32767]
            int16_array = (float_array * 32767).astype(np.int16)
            return int16_array
        except ValueError:
            try:
                # If that fails, try direct int16 interpretation
                return np.frombuffer(audio_data, dtype=np.int16)
            except ValueError as e:
                # If both fail, try to pad the data to make it aligned
                padding_size = (2 - (len(audio_data) % 2)) % 2
                if padding_size > 0:
                    padded_data = audio_data + b'\x00' * padding_size
                    return np.frombuffer(padded_data, dtype=np.int16)
                raise e

    def detect_voice(self, audio_data):
        """
        Detect voice activity in audio data.

        Args:
            audio_data (bytes): Raw audio data

        Returns:
            bool: True if voice activity is detected, False otherwise
        """
        try:
            if audio_data is None or len(audio_data) == 0:
                self.logger.warning("Audio data is empty or None")
                return False

            # Convert audio data to the correct format
            try:
                audio_array = self._convert_audio_data(audio_data)
                if len(audio_array) == 0:
                    self.logger.warning("No valid audio data after conversion")
                    return False
            except ValueError as e:
                self.logger.error(f"Error converting audio data: {str(e)}")
                return False

            # Process frames
            frames = self._frame_generator(audio_array)
            if not frames:
                self.logger.warning("No frames generated from audio data")
                return False

            # Count speech frames in this chunk
            current_speech_frames = 0
            for frame in frames:
                try:
                    # Pack the frame into bytes
                    frame_bytes = struct.pack("%dh" % len(frame), *frame)

                    # Check for voice activity
                    if self.vad.is_speech(frame_bytes, self.sample_rate):
                        current_speech_frames += 1
                        self.speech_frames += 1
                        self.silence_frames = 0
                    else:
                        self.silence_frames += 1

                except struct.error as se:
                    self.logger.error(f"Error packing frame data: {str(se)}")
                    continue
                except Exception as e:
                    self.logger.error(f"Error processing frame: {str(e)}")
                    continue

            # Update speaking state
            if current_speech_frames > 0:
                if not self.is_speaking and self.speech_frames >= self.min_speech_frames:
                    self.is_speaking = True
                return True
            elif self.silence_frames > self.max_silence_frames:
                if self.is_speaking:
                    self.is_speaking = False
                    self.speech_frames = 0
                return False

            # Keep current state if in transition
            return self.is_speaking

        except Exception as e:
            self.logger.error(f"Error in voice detection: {str(e)}")
            return False


# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize OpenAI and Groq clients
openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
#
groq_client = AsyncOpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.getenv("GROQ_API_KEY")
)

app = FastAPI()

# Mount static files
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")

@app.get("/", response_class=HTMLResponse)
async def get_index(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})

async def transcribe_audio(audio_data: bytes):
    """Transcribe audio using Groq's Whisper model"""
    temp_wav = None
    try:
        # Create a unique temporary file
        temp_wav = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
        wav_path = temp_wav.name
        temp_wav.close()  # Close the file handle immediately

        # Write the WAV file
        with wave.open(wav_path, 'wb') as wav_file:
            wav_file.setnchannels(1)      # Mono
            wav_file.setsampwidth(2)      # 2 bytes per sample (16-bit)
            wav_file.setframerate(16000)  # 16 kHz
            wav_file.writeframes(audio_data)

        # Transcribe using Groq
        with open(wav_path, 'rb') as audio_file:
            response = await groq_client.audio.transcriptions.create(
                model="whisper-large-v3-turbo",
                file=audio_file,
                response_format="text"
            )

        return response

    except Exception as e:
        logger.error(f"Transcription error: {str(e)}")
        return None
    finally:
        # Clean up the temporary file
        if temp_wav is not None:
            try:
                os.unlink(temp_wav.name)
            except Exception as e:
                logger.error(f"Error deleting temporary file: {str(e)}")

async def get_chat_response(text: str):
    """Get chat response from Groq"""
    try:
        response = await groq_client.chat.completions.create(
            model="llama-3.1-70b-versatile",
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Please provide a clear, concise, and accurate response."},
                {"role": "user", "content": text}
            ],
            temperature=0,
            max_tokens=500
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"Chat response error: {str(e)}")
        return None

async def generate_speech(text: str):
    """Generate speech using OpenAI TTS"""
    try:
        response = await openai_client.audio.speech.create(
            model="tts-1",
            voice="alloy",
            input=text
        )

        # Get the speech data directly from the response
        # No need to await response.read() as the response is already the audio data
        return response.content
    except Exception as e:
        logger.error(f"Speech generation error: {str(e)}")
        return None

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    logger.info("WebSocket connection established")

    voice_detector = VoiceDetector()
    audio_buffer = bytearray()
    silence_duration = 0
    max_silence_duration = 1.5  # seconds
    frames_per_second = 1000 / voice_detector.frame_duration  # frames per second
    max_silence_frames = int(max_silence_duration * frames_per_second)

    try:
        while True:
            try:
                data = await websocket.receive_bytes()

                if not data:
                    logger.warning("Received empty data frame")
                    continue

                # Check for voice activity
                voice_detected = voice_detector.detect_voice(data)

                if voice_detected:
                    # Reset silence counter and add to buffer
                    silence_duration = 0
                    audio_buffer.extend(data)
                    await websocket.send_json({"type": "vad", "status": "active"})
                else:
                    # Increment silence counter
                    silence_duration += 1

                    # If we were collecting speech and hit max silence, process the buffer
                    if len(audio_buffer) > 0 and silence_duration >= max_silence_frames:
                        logger.info(f"Processing audio buffer of size: {len(audio_buffer)} bytes")

                        # Process the complete utterance
                        transcription = await transcribe_audio(bytes(audio_buffer))
                        if transcription:
                            logger.info(f"Transcription: {transcription}")
                            await websocket.send_json({
                                "type": "transcription",
                                "text": transcription
                            })

                            # Get chat response
                            chat_response = await get_chat_response(transcription)
                            if chat_response:
                                logger.info(f"Chat response: {chat_response}")
                                await websocket.send_json({
                                    "type": "chat_response",
                                    "text": chat_response
                                })

                                # Generate and send voice response
                                voice_response = await generate_speech(chat_response)
                                if voice_response:
                                    logger.info("Generated voice response")
                                    await websocket.send_bytes(voice_response)

                        # Clear the buffer after processing
                        audio_buffer = bytearray()
                        await websocket.send_json({"type": "vad", "status": "inactive"})
                    elif len(audio_buffer) > 0:
                        # Still collecting silence, add to buffer
                        audio_buffer.extend(data)

            except WebSocketDisconnect:
                logger.info("WebSocket disconnected")
                break
            except Exception as e:
                logger.error(f"Error processing websocket frame: {str(e)}")
                continue

    except Exception as e:
        logger.error(f"WebSocket connection error: {str(e)}")
    finally:
        logger.info("Closing WebSocket connection")
        await websocket.close()

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

Invoke the application

python main.py
INFO:     Will watch for changes in these directories: ['D:\\Voice_detection_and_response']
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Started reloader process [32716] using StatReload
INFO: Started server process [33080]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 127.0.0.1:57903 - "GET / HTTP/1.1" 200 OK
INFO: 127.0.0.1:57903 - "GET /favicon.ico HTTP/1.1" 404 Not Found

Voice Assistant App

INFO:     ('127.0.0.1', 56153) - "WebSocket /ws" [accepted]
INFO:main:WebSocket connection established
INFO: connection open
INFO:main:Processing audio buffer of size: 165888 bytes
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/audio/transcriptions "HTTP/1.1 200 OK"
INFO:main:Transcription: Hello, can you explain me photosynthesis?
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:main:Chat response: Photosynthesis is the process by which plants, algae, and some bacteria convert light energy from the sun into chemical energy in the form of glucose (a type of sugar). This process is essential for life on Earth, as it provides energy and organic compounds for plants to grow and thrive.

Here's a simplified overview of the photosynthesis process:

**Step 1: Light absorption**
Light from the sun is absorbed by pigments such as chlorophyll, which is present in the cells of plants and other photosynthetic organisms.

**Step 2: Water absorption**
Water is absorbed from the soil through the roots of plants.

**Step 3: Carbon dioxide absorption**
Carbon dioxide is absorbed from the air through small openings on the surface of leaves called stomata.

**Step 4: Light-dependent reactions**
The absorbed light energy is used to convert water and carbon dioxide into a molecule called ATP (adenosine triphosphate), which is a source of energy for the plant.

**Step 5: Light-independent reactions (Calvin cycle)**
The ATP produced in the light-dependent reactions is used to convert carbon dioxide into glucose (C6H12O6) through a series of chemical reactions.

**Step 6: Oxygen release**
As a byproduct of photosynthesis, oxygen is released into the air through the stomata.

**Overall equation:**
6 CO2 (carbon dioxide) + 6 H2O (water) + light energy → C6H12O6 (glucose) + 6 O2 (oxygen)

In summary, photosynthesis is the process by which plants and other organisms convert light energy into chemical energy, releasing oxygen as a byproduct and producing glucose, which is used to fuel their growth and development.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:main:Generated voice response
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:main:Chat response: Decoder-based models in transformers are a type of neural network architecture that primarily relies on the decoder component of the transformer model. Here's a concise overview:

**What is a Transformer?**

A transformer is a type of neural network architecture introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017. It's primarily used for sequence-to-sequence tasks, such as machine translation, text summarization, and text generation.

**Decoder Component**

In a transformer model, the decoder component is responsible for generating the output sequence, one token at a time. The decoder takes the output of the encoder component and uses it to generate the next token in the sequence.

**Decoder-Based Models**

Decoder-based models are a type of transformer model that focuses primarily on the decoder component. These models use the decoder to generate text, without the need for an encoder component. The input to the decoder is typically a sequence of tokens, such as a prompt or a prefix, and the output is a generated sequence of tokens.

**Key Characteristics**

Decoder-based models have the following key characteristics:

1. **Autoregressive**: Decoder-based models are autoregressive, meaning that they generate text one token at a time, based on the previous tokens in the sequence.
2. **No Encoder**: Decoder-based models do not have an encoder component, which means they do not have a separate component for encoding the input sequence.
3. **Self-Attention**: Decoder-based models use self-attention mechanisms to attend to different parts of the input sequence and generate the next token.

**Examples of Decoder-Based Models**

Some examples of decoder-based models include:

1. **Language Models**: Language models, such as BERT and RoBERTa, use a decoder-based architecture to generate text.
2. **Text Generation Models**: Text generation models, such as transformer-XL and XLNet, use a decoder-based architecture to generate text.
3. **Chatbots**: Chatbots, such as those built using the transformer-XL architecture, use a decoder-based architecture to generate responses to user input.

**Advantages**

Decoder-based models have several advantages, including:

1. **Flexibility**: Decoder-based models can be fine-tuned for a variety of tasks, such as text generation, language translation, and text summarization.
2. **Efficiency**: Decoder-based models can be more efficient than encoder-decoder models, since they do not require a separate encoder component.
3. **Improved
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:main:Generated voice response
INFO:main:Processing audio buffer of size: 143360 bytes
INFO: 127.0.0.1:57312 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/audio/transcriptions "HTTP/1.1 200 OK"
INFO:main:Transcription: E aí
INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:main:Chat response: Tudo bem! Como posso ajudar você hoje?
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
INFO:main:Generated voice response
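If you want to exercise the /ws endpoint without opening the browser UI, a rough headless client can stream PCM bytes over the WebSocket and print whatever comes back. This is only a sketch: it assumes the websockets package is installed and that sample.wav is 16 kHz, mono, 16-bit PCM (the format transcribe_audio writes out); whether the VAD fires on your particular file depends on how _convert_audio_data interprets the bytes, so treat it mainly as a connectivity check. The bundled app.js browser client remains the intended way to use the app.

# test_client.py -- rough sketch of a headless client for the /ws endpoint
import asyncio
import json
import wave

import websockets  # pip install websockets

async def main():
    with wave.open("sample.wav", "rb") as wav_file:
        pcm = wav_file.readframes(wav_file.getnframes())

    async with websockets.connect("ws://localhost:8000/ws") as ws:
        chunk_size = 960  # ~30 ms of 16 kHz, 16-bit mono audio
        for i in range(0, len(pcm), chunk_size):
            await ws.send(pcm[i:i + chunk_size])

        # Pad with silence so the server's silence counter can flush the buffer
        for _ in range(80):
            await ws.send(b"\x00" * chunk_size)

        # Print VAD events, transcription, chat response; note any TTS audio bytes
        try:
            while True:
                message = await asyncio.wait_for(ws.recv(), timeout=30)
                if isinstance(message, bytes):
                    print(f"received {len(message)} bytes of TTS audio")
                else:
                    print(json.loads(message))
        except asyncio.TimeoutError:
            print("no more messages")

asyncio.run(main())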

Future Improvements

1. Scalability

  • Implement connection pooling
  • Add rate limiting
  • Optimize resource usage

2. Features

  • Multiple voice options
  • User session management
  • Response customization
  • Audio format optimization

3. Security

  • Enhanced authentication
  • Request validation
  • Rate limiting

Conclusion

This project demonstrates the power of combining modern web frameworks with AI services to create sophisticated voice-based applications. The architecture provides a solid foundation for building scalable, real-time voice processing systems.

The combination of FastAPI’s performance, WebSocket’s real-time capabilities, and state-of-the-art AI services creates a powerful platform for voice-based interactions. This architecture can be extended for various applications, from virtual assistants to accessibility tools.
