Building an Advanced Voice Assistant with Langchain

Meir Michanie
3 min read · Mar 11, 2024


Jarvis, as interpreted by DALL·E 3

As a passionate developer and enthusiast for AI technologies, I recently embarked on an exciting project to create an advanced voice assistant named Jarvis. Inspired by the endless possibilities of conversational AI, my goal was to develop an assistant that not only understands and responds to voice commands but also provides a customizable and interactive experience. In this article, I’ll walk you through the journey of building Jarvis, showcasing the code piece by piece. I encourage you to test Jarvis out for yourself and explore your own improvements, such as adding Retrieval Augmented Generation (RAG) capabilities.

Setting the Foundation

The first step in creating Jarvis was to set up our primary dependencies and environment. I used several powerful libraries, including langchain_openai for leveraging OpenAI's GPT models, pyttsx3 for text-to-speech functionality, and speech_recognition for converting spoken language into text. To manage sensitive data like the OpenAI API key, I utilized python-dotenv.
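
If you want to follow along, the dependencies can be installed with pip. The repository may pin exact versions, but something along these lines should work (note that SpeechRecognition needs PyAudio to read from the microphone):

pip install langchain langchain-openai pyttsx3 SpeechRecognition PyAudio python-dotenv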

Using an Alternative LLM Running on Your Computer

You can pass the --base_url argument with the URL of an alternative, OpenAI-compatible chat endpoint:

python jarvis --base_url http://localhost:1234/v1

Here’s how I started the project, ensuring all necessary libraries were imported and the environment was correctly set up:

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
...
import pyttsx3
import speech_recognition as sr
import os
import argparse  # needed for the command-line configuration below
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # Read the local .env file (OpenAI API key, etc.)

Configuring the Voice Assistant

To make Jarvis as flexible and adaptable as possible, I implemented a configuration system using command-line arguments. This allows users to easily customize the OpenAI model, the voice of the text-to-speech engine, and various other settings without modifying the code directly.

parser = argparse.ArgumentParser()
parser.add_argument("--list_voices", action="store_true", help="List the available voices for the text-to-speech engine")
...
args = parser.parse_args()
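
Beyond --list_voices, the script exposes flags for the settings used throughout the rest of the article. The argument names below are illustrative rather than a verbatim copy of my script, but they cover the model, temperature, voice, base URL, and push-to-talk options:

# Illustrative flags; the real script's argument names and defaults may differ
parser.add_argument("--model", default="gpt-3.5-turbo", help="Chat model to use")
parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature for the model")
parser.add_argument("--voice", type=int, default=0, help="Index of the pyttsx3 voice to use")
parser.add_argument("--base_url", default=None, help="URL of an alternative OpenAI-compatible endpoint")
parser.add_argument("--ptt", action="store_true", help="Enable push-to-talk instead of continuous listening")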

Integrating OpenAI’s API

The core of Jarvis’s intelligence lies in its integration with OpenAI’s GPT models. By leveraging the langchain_openai library, I was able to utilize these models for generating responses to user commands. I set up a flexible prompt system to allow Jarvis to understand its role as an assistant and respond accordingly.

llm = ChatOpenAI(temperature=temperature, model=llm_model, base_url=base_url, api_key=api_key)

prompt = ChatPromptTemplate.from_messages([
("system", "You're an assistant who's good at {ability}. Respond in 20 words or fewer"),
...
])
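
For completeness, here is a minimal sketch of how the prompt and model can be chained and invoked. The generate_response function used later in the main loop does something along these lines, although the exact chain and variable names in my script may differ (the "ability" and "input" placeholders here are assumptions):

from langchain_core.output_parsers import StrOutputParser

# Pipe the prompt into the model and parse the reply down to plain text
chain = prompt | llm | StrOutputParser()

def generate_response(text):
    # Fill in the prompt variables and run the chain
    return chain.invoke({"ability": "being a helpful voice assistant", "input": text})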

Speech Recognition and Synthesis

Understanding and responding to human speech is crucial for any voice assistant. I used the speech_recognition library to convert spoken commands into text and pyttsx3 to turn Jarvis's text replies back into speech. This setup allowed Jarvis to listen for user commands and respond with spoken words.

# Set up the speech recognition engine
r = sr.Recognizer()

def listen():
    with sr.Microphone() as source:
        audio = r.listen(source, phrase_time_limit=5)
        ...

# Set up the text-to-speech engine
engine = pyttsx3.init()
...
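
The elided parts above turn the captured audio into text and speak replies back. A minimal version might look like the following; I am assuming the free Google web recognizer here, and the error handling in the real script may be more thorough:

def listen():
    with sr.Microphone() as source:
        audio = r.listen(source, phrase_time_limit=5)
        try:
            # Send the recorded audio to Google's free speech recognition service
            return r.recognize_google(audio)
        except sr.UnknownValueError:
            # Nothing intelligible was heard
            return None

def speak(text):
    # Queue the text and block until pyttsx3 has finished speaking it
    engine.say(text)
    engine.runAndWait()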

Bringing Jarvis to Life

With all the pieces in place, the final step was to create a loop where Jarvis continuously listens for commands, processes them, and speaks the response. I also implemented a push-to-talk feature, enabling users to control when Jarvis listens.

speak("Hello, I am Jarvis. How can I help you today?")

while True:
if ptt:
input("Press Enter to start recording...")
...
prompt = listen()
if prompt is not None:
response = generate_response(prompt)
speak(response)
else:
speak("I'm sorry, I didn't understand that.")

Use the force, get the code

You can get the latest code at https://github.com/meirm/jarvis

Conclusion and Next Steps

Building Jarvis has been an incredibly rewarding experience, showcasing the power of modern AI and speech technologies. However, this is just the beginning. One exciting direction for further improvement is the integration of Retrieval Augmented Generation (RAG), which would allow Jarvis to pull information from external sources to enrich its responses.
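
As a taste of what that could look like, here is a rough sketch of a retrieval step built on LangChain's FAISS integration, reusing the llm defined earlier. It assumes the faiss-cpu package is installed, and the example documents are placeholders for whatever knowledge you want Jarvis to draw on:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Placeholder documents; in practice these could be your notes, manuals, or indexed files
texts = ["Jarvis supports push-to-talk.", "Jarvis can talk to a local OpenAI-compatible model."]
retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

def format_docs(docs):
    return "\n".join(d.page_content for d in docs)

rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using this context:\n{context}"),
    ("human", "{question}"),
])

# Retrieve relevant documents, inject them into the prompt, then ask the model
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

# Example: rag_chain.invoke("Can Jarvis use a local model?")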

I encourage you to dive into the code, experiment with Jarvis, and explore your own modifications. Whether it’s enhancing its conversational abilities, integrating with other APIs, or experimenting with different language models, the possibilities are endless. Happy coding, and may your journey with Jarvis be as enlightening as mine has been!
