How to build your own version of Iron Man’s voice-activated assistant J.A.R.V.I.S., using ChatGPT

Jasmine Plows
7 min read · Jan 11, 2023
J.A.R.V.I.S., as depicted in Age of Ultron (2015)

Note: I used a Raspberry Pi for this, so this tutorial is tailored for Raspberry Pi OS systems. However, instructions should be similar/the same for macOS and Linux and should require only minor changes for Windows.

Recently, I was re-watching Iron Man and the excellent new Star Wars series Andor, and was struck by how helpful the voice-activated assistants are in these universes. Both J.A.R.V.I.S., from Iron Man, and Class 3 droids in Star Wars act as fully intelligent robotic assistants. They understand language, perform a useful function, and then perfectly articulate a response. Of course, we’ve had voice-activated virtual assistants such as Amazon’s Alexa and Apple’s Siri for about a decade now, but they each leave something to be desired. For one, they never seem to understand when you’re talking to them (I find myself yelling “ALEXA” over and over again until I get a response), and when they do, they often interpret your question incorrectly. However, their biggest limitation, in my opinion, is how rarely they can actually answer questions. I’ll often have a question I want answered, but Alexa will be unable to meaningfully help me answer it. Instead, she’ll bring up a Google search on my phone, requiring me to pull out my phone and manually scroll through the results page. To me, this isn’t all that useful, and it is a far cry from what J.A.R.V.I.S. and the droids of the Star Wars universe offer.

J.A.R.V.I.S. stands for Just A Rather Very Intelligent System

A.I.s are getting more advanced, though. Last November, OpenAI launched ChatGPT. The “GPT” in ChatGPT stands for Generative Pre-trained Transformer, and it is an A.I.-powered chatbot that can do all sorts of things. For starters, it can act as a search engine that answers your question in full sentences rather than requiring you to click on links, find the relevant information, and interpret it yourself. But it can also be creative, and can write its own poems, essays, and screenplays, all within seconds. Since its launch, I’ve been amazed at its functionality, and I have almost completely replaced Google searches with it (something that Google itself has been afraid of). Any question I have, even coding questions, it answers effortlessly and within seconds. Try it for yourself, here.

The limitation, however, is that you must log in to the OpenAI website and type your question each time you want to use it. What if you could have your own personal ChatGPT that could understand you, answer your question, and then read the answer aloud, so you would never have to pull out your phone? You would essentially have your own J.A.R.V.I.S.: a virtual assistant that is highly intelligent, understands language, and audibly relays information back to you. When I received a Raspberry Pi 4B (an inexpensive single-board computer that is great for projects like these) for Christmas, I thought this would be a great first project.

I started in a very meta manner: I asked ChatGPT itself for the best way to do my project, and here’s what it said:

I used ChatGPT to figure out how to make a ChatGPT virtual assistant

Amazed at its detailed answer, I decided to follow ChatGPT's advice, and I also asked it questions when I got stuck. Overall, the entire process took about 5 hours of trial-and-error before I had a fully working prototype, and the code is surprisingly simple. Here it is:

UPDATE 2023/01/17: I have added “Hey, Jarvis!” capability, and have also changed the text-to-speech library so it has a male (albeit very robotic, unlike J.A.R.V.I.S.) voice. Feel free to check out previous commits of the code at the GitHub repo.

import os
import time

import openai
import pyttsx3
import speech_recognition as sr


# Function to transcribe audio, send to ChatGPT, and read aloud
def listen_and_respond(after_prompt=True):
    """
    Transcribes audio, sends to ChatGPT, and responds in speech

    Args:
        after_prompt: bool, whether to wait for the user to say
            "Hey, Jarvis!" before listening for a question

    Returns:
        bool, True if a question was transcribed and answered
    """
    # Default is don't start listening, until I tell you to
    start_listening = False

    with microphone as source:

        if after_prompt:
            recognizer.adjust_for_ambient_noise(source)
            print("Say 'Hey, Jarvis!' to start")
            audio = recognizer.listen(source, phrase_time_limit=5)
            try:
                transcription = recognizer.recognize_google(audio)
                start_listening = transcription.lower() == "hey jarvis"
            except sr.UnknownValueError:
                start_listening = False
        else:
            start_listening = True

        if start_listening:
            try:
                print("Listening for question...")
                audio = recognizer.record(source, duration=5)
                transcription = recognizer.recognize_google(audio)
                print(f"Input text: {transcription}")

                # Send the transcribed text to the OpenAI Completion API
                # (openai.Completion is the pre-1.0 openai package interface)
                response = openai.Completion.create(
                    engine="text-davinci-003",
                    prompt=transcription,
                    temperature=0.9,
                    max_tokens=512,
                    top_p=1,
                    presence_penalty=0.6
                )

                # Get the response text from the API
                response_text = response.choices[0].text

                # Print the response from the API
                print(f"Response text: {response_text}")

                # Say the response
                engine.say(response_text)
                engine.runAndWait()
                return True

            except sr.UnknownValueError:
                print("Unable to transcribe audio")

    return False


# pyttsx3 engine parameters
engine = pyttsx3.init()
engine.setProperty('rate', 150)
engine.setProperty('voice', 'english_north')

# My OpenAI API Key
openai.api_key = os.environ["API_KEY"]

recognizer = sr.Recognizer()
microphone = sr.Microphone()

# First question
first_question = True

# Initialize last_question_time to current time
last_question_time = time.time()

# Set threshold for time elapsed before requiring "Hey, Jarvis!" again
threshold = 60  # 1 minute

while True:
    # Require "Hey, Jarvis!" at startup and after a minute without a question
    needs_prompt = first_question or (time.time() - last_question_time > threshold)
    if listen_and_respond(after_prompt=needs_prompt):
        first_question = False
        # Restart the one-minute window after every answered question
        last_question_time = time.time()

# Can run in terminal with following command to suppress warnings:
# python jarvis.py 2>/dev/null
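
Stripped of the hardware details, the script is a three-stage pipeline: transcribe speech, send the text to the model, and speak the reply. Here is a minimal sketch of that flow with each stage passed in as a plain function, so it can be tried without a microphone or an API key. The name answer_once and the stand-in lambdas are hypothetical and are not part of the repository.

```python
from typing import Callable, Optional


def answer_once(
    transcribe: Callable[[], Optional[str]],
    complete: Callable[[str], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one listen -> ask -> speak cycle and return the spoken reply.

    All three callables are hypothetical stand-ins for the speech-to-text,
    completion, and text-to-speech calls used in jarvis.py.
    """
    question = transcribe()
    if not question:
        # Nothing intelligible was heard, so skip the API call entirely.
        return None
    # Trim surrounding whitespace before handing the reply to the speaker.
    reply = complete(question).strip()
    speak(reply)
    return reply


# Stand-ins so the flow can be exercised without audio hardware.
spoken = []
reply = answer_once(
    transcribe=lambda: "what is a raspberry pi",
    complete=lambda q: f"\n\nYou asked: {q}",
    speak=spoken.append,
)
print(reply)
```

Keeping the stages separate like this also makes it easy to swap one library for another, which is what I ended up doing with the text-to-speech step.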

I tried a few different speech-to-text libraries, but SpeechRecognition was by far the simplest and most accurate (I found pocketsphinx was often unable to transcribe my New Zealand accent). I then used this tutorial to access OpenAI’s models via Python (there is no official OpenAI ChatGPT API yet, so the script calls the text-davinci-003 completion model instead) and the pyttsx3 library to read aloud the response (Note: I previously used gTTS, but it doesn’t allow you to change the voice, so I switched). Here are the step-by-step instructions if you want to get this working (all in about 10 mins!):

Step 1: Clone or fork my GitHub repository here.

Step 2: Create a virtual environment (on a Raspberry Pi, here are the instructions):

$ python3 -m venv jarvis 

Step 3: Activate your new virtual environment:

$ source jarvis/bin/activate

Step 4: Install requirements.txt into your virtual environment:

$ cd <path-to-jarvis-folder-where-requirements.txt-is>

$ pip install -r requirements.txt

Step 5: Plug in a USB or Bluetooth microphone (if you don’t already have one built-in to your machine) and make sure it is set as your default microphone.

Step 6: Plug in a USB or Bluetooth speaker (if you don’t already have one built in to your machine/monitor).

Step 7: Go to openai.com and create an account (it will prompt you to do this when you try to use ChatGPT on the web).

Step 8: Go to settings and click on API keys. You get $18 in free API credit that can be used during your first three months of having an account. After that, you will have to pay for API calls. But you don’t need to enter your credit card information to get started, which is great (Note: if you made an OpenAI account a long time ago, your free credits may have expired. In this case, either pay or set up a new account).

Step 9: Copy your key. Create a new one if you can’t access it.

Step 10: Now you need to add your API key so that your script can access it. NOTE: DO NOT SHARE YOUR API KEY ANYWHERE ELSE AND BE CAREFUL NOT TO ACCIDENTALLY MAKE IT PUBLIC.

$ nano ~/.bashrc

Then add the following line at the end of the .bashrc file:

export API_KEY="your_key_here"

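Press Ctrl + X to exit, “Y” to save, then “Enter”. Then reload the file (or simply open a new terminal) so the variable is available in your current session:

$ source ~/.bashrc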
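Before starting the script, it can help to confirm that the key is actually visible to Python, since a missing variable otherwise shows up as a bare KeyError. Below is a small sketch of such a check; the helper name load_api_key is hypothetical, and the quote-stripping is an assumption about a common copy-paste mistake (typographic quotes pasted from a web page become part of the value).

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "API_KEY") -> str:
    """Return the key stored under `name`, or raise a descriptive error."""
    key = env.get(name, "").strip()
    # Quotes that were pasted literally into .bashrc end up inside the value.
    key = key.strip("“”\"'")
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add 'export {name}=...' to ~/.bashrc "
            "and open a new terminal."
        )
    return key


# Hypothetical usage in jarvis.py:
# openai.api_key = load_api_key()
```

Passing the environment in as an argument is only there to make the check easy to try with a plain dictionary.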

Step 11: Now that you have everything set up, all you need to do is this:

$ cd <path-to-jarvis-folder>/src

$ python jarvis.py

If you get a bunch of annoying warnings, type this command instead to suppress them:

$ python jarvis.py 2>/dev/null

Note, however, that doing this will suppress errors too, so you won’t know what broke if something breaks.
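
A narrower alternative is to silence stderr only around the noisy calls, so that later errors still reach the terminal. The sketch below assumes the warnings come from native audio libraries writing directly to file descriptor 2 (I have not confirmed this for every setup), which is why it redirects the descriptor itself rather than Python's sys.stderr. The helper name silence_fd is hypothetical.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def silence_fd(fd: int = 2):
    """Temporarily point file descriptor `fd` (default: stderr) at os.devnull."""
    sys.stderr.flush()
    saved = os.dup(fd)  # remember where fd currently points
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, fd)  # writes to fd are discarded from here on
        yield
    finally:
        os.dup2(saved, fd)  # restore the original target
        os.close(devnull)
        os.close(saved)


# Hypothetical usage: wrap only the calls that print the warnings.
# with silence_fd():
#     noisy_audio_setup()  # placeholder, not a real function
```

Anything printed outside the with block, including tracebacks, is shown as usual.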

Step 12: The program will commence with the following prompt:

Say 'Hey, Jarvis!' to start

All you need to do is say “Hey, Jarvis” aloud.
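
The script compares the transcription to the exact string "hey jarvis", so stray punctuation or extra words from the recognizer would cause a miss. Here is a sketch of a slightly more forgiving check; the function name heard_wake_phrase is hypothetical, and the idea of accepting trailing words is my own assumption rather than what the repository does.

```python
import re

WAKE_PHRASE = "hey jarvis"


def heard_wake_phrase(transcription: str, phrase: str = WAKE_PHRASE) -> bool:
    """Return True if the transcription starts with the wake phrase.

    Case, punctuation, and extra spaces are ignored, so "Hey, Jarvis!"
    and "hey jarvis what time is it" both count.
    """
    # Replace anything that is not a letter, digit, or space, then split.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", transcription.lower())
    words = cleaned.split()
    target = phrase.split()
    # Compare whole words so "hey jarvises" does not match.
    return words[: len(target)] == target


print(heard_wake_phrase("Hey, Jarvis!"))
print(heard_wake_phrase("say hey jarvis"))
```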

When the program picks up on that prompt, you will see the following text:

Listening for question...

Now ask your question, and wait for the answer! The terminal will also print your question as well as the response.

You can continue asking questions without saying “Hey, Jarvis” after that (unless you let more than a minute pass, in which case you’ll have to say “Hey, Jarvis” again).
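
The one-minute rule can be expressed as a small piece of state that is separate from the audio code. The sketch below is one way to do it; the class name WakeGate and the injectable clock are hypothetical, and the clock parameter exists only so the timing can be checked without actually waiting a minute.

```python
import time
from typing import Callable, Optional


class WakeGate:
    """Tracks whether the wake phrase is needed before the next question."""

    def __init__(
        self,
        threshold: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.threshold = threshold
        self.clock = clock
        self.last_answer: Optional[float] = None  # no question answered yet

    def needs_wake_phrase(self) -> bool:
        # Always required before the first answered question.
        if self.last_answer is None:
            return True
        # Required again once more than `threshold` seconds have passed.
        return self.clock() - self.last_answer > self.threshold

    def mark_answered(self) -> None:
        # Restart the window each time a question is answered.
        self.last_answer = self.clock()


# Simulated clock to walk through the three cases.
now = [0.0]
gate = WakeGate(threshold=60, clock=lambda: now[0])
print(gate.needs_wake_phrase())  # before any question
gate.mark_answered()
now[0] = 30.0
print(gate.needs_wake_phrase())  # 30 seconds after an answer
now[0] = 61.0
print(gate.needs_wake_phrase())  # 61 seconds after an answer
```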

Now you have a fully working J.A.R.V.I.S. prototype! Of course, all it can do at this point is answer questions that ChatGPT can answer, but personally I find it much more useful than Alexa or Siri for that purpose.

However, I have some ideas for ways to improve it which I will add as additional parts of this blog in the future:

  1. It’s a little annoying to have to wait after saying “Hey, Jarvis” before asking the first question. Alexa and Siri allow you to continue speaking immediately after the initial prompt. This shouldn’t be too hard to add, so I’ll update here when I can. I would also like to have this running using a Jabra system, which is a fully wireless microphone and speaker all-in-one. As long as the program is running, the Jabra is in Bluetooth range, and the wi-fi is working, I will be able to carry my Jabra around the house as my own portable wireless J.A.R.V.I.S.
  2. I have a customizable R2 unit droid that I built at Disneyland’s Galaxy’s Edge in 2019. Currently, it can only move around (via remote control) and make noises/light up. I am thinking of opening it up, adding a Raspberry Pi Zero W with a small microphone and speaker (or re-wiring the existing speaker) and a battery pack, and then running the program from there. Then I won’t need the Raspberry Pi 4B at all, and I could have my own little intelligent droid!
  3. I would, of course, like to add functionality to help me with things other than question-answering. Right now, it can only answer questions, but I could use it to turn on lights, adjust the thermostat, or tell me which of my plants need to be watered. The possibilities are endless!

I’ll add to this blog as I complete these projects, but I hope this is helpful in the meantime!
