Building an AI Assistant for Vision Pro!

A guide to creating SpatialGPT, an AI voice assistant for the Apple Vision Pro!

Piram Singh
17 min read · Feb 1, 2024

Apple just released the Vision Pro, and it's about to change the way we experience high-quality content. Think of immersive tours, games, educational experiences, and business use cases. The ideas are limitless with Vision Pro, and we're only getting started.

Why is the release of a VR headset making so much noise? Because it's Apple. Their commitment to high-quality user experiences is the first thing that comes to mind at every stage of their product development cycle. They were the first ones in the space to release a VR headset that has:

  • Eye tracking as the main cursor for selecting on-screen elements
  • Micro-OLED displays delivering 4K-class resolution
  • Hands and voice as the main input methods

Now, that being said, I've been thinking about a number of ways I can really innovate with the Vision Pro and push its boundaries. To illustrate this, I'm going to take you on a quick journey.

Let's say, hypothetically, my cousin was coming to the Bay Area to visit me and he wanted to see different American museums around the country. 👇

NYC MoMA, the Art Institute of Chicago, & the Smithsonian Air and Space Museum in DC

Now, building immersive experiences for my cousin to hop from museum to museum would be really cool! But I've also realized that simply being immersed can feel static. That's why I've been building in visionOS to integrate an interactive assistant using OpenAI's API.

Now back to the story…

My cousin's favorite piece of art is The Starry Night by Vincent van Gogh. It currently hangs at MoMA in New York, and he would really like to visit it and get more information about the painting, but he wants his answer in Hindi! Watch below ⬇️

spatial gpt quick demo

🤯 Just from asking that one simple question to ChatGPT in a virtual space, my cousin's mind was blown.

That was a quick demo of SpatialGPT. Let's dive deeper so I can show you how I built this app from the ground up!

We can split the project into four main sections:

(1) 🛠️ Project Set-Up

(2) 👨🏽‍💻 Back-End Algorithms

(3) 🎙️ Front-End Design

(4) 🤖 Integrating OpenAI's API

🛠️ Project Set-Up

After we click “new project” in Xcode, we want to set up four key areas for the project:

  • Connecting to OpenAI
  • Interoperability between iOS, macOS, and visionOS
  • Importing Siri Waveform UI
  • Screen Standardization

Connecting to OpenAI

connecting to the OpenAI API

To connect to OpenAI, click the project name in the navigator, then select the app target on the left-hand side (under the bolded "TARGETS" heading). Under Network, check "Outgoing Connections (Client)", and under Hardware, select "Audio Input".

privacy property

We also want to add a privacy property on the Info tab of this window (the microphone usage description). You can set its value to something like "Talk with AI Assistant".

Importing OpenAI API

To import the API, go to the PROJECT tab > Package Dependencies and add this link: https://github.com/alfianlosari/XCAOpenAIClient. You can also search for "XCAOpenAIClient".
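If you prefer managing dependencies in a Package.swift rather than through Xcode's UI, a minimal sketch could look like the following (the package name, platform versions, and product name here are assumptions for illustration; the article itself adds the dependency through Xcode):

// swift-tools-version:5.9
// Hypothetical Package.swift sketch; names and versions are illustrative.
import PackageDescription

let package = Package(
    name: "SpatialGPT",
    platforms: [.iOS(.v17), .macOS(.v14), .visionOS(.v1)],
    dependencies: [
        // Same repo we add via Xcode's Package Dependencies UI
        .package(url: "https://github.com/alfianlosari/XCAOpenAIClient", branch: "main")
    ],
    targets: [
        .target(
            name: "SpatialGPT",
            dependencies: [
                .product(name: "XCAOpenAIClient", package: "XCAOpenAIClient")
            ]
        )
    ]
)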

Interoperability between iOS, macOS, and visionOS

iOS, macOS, and visionOS

Go back to the "TARGETS" page > General > Supported Destinations. Make sure to add Apple Vision, iPhone, iPad, and Mac; running natively on the Mac makes testing easier than the simulator, which can be quite slow.

Some other things to remember here: for Vision Pro development, your Mac must be Apple silicon rather than Intel-based, and your Xcode version should be 15.2 or higher (this includes the current visionOS SDK!).

Importing Siri Waveform UI

importing siri waveform UI.

One of the biggest tasks on the front end will be controlling the speed of the waveform animations while the assistant is responding. To do this (and honestly make our lives easier), we are going to import the UI from another GitHub repo.
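For a sense of what this package gives us, here is a minimal usage sketch (the hard-coded power value is just for illustration; later we drive it from ViewModel's audioPower):

import SwiftUI
import SiriWaveView

// A tiny standalone preview of the waveform view.
struct WaveformDemo: View {
    var body: some View {
        SiriWaveView()
            .power(power: 0.5)   // normalized 0...1 audio power
            .frame(height: 256)
    }
}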

Use the same steps as for the OpenAI client package, but add this link instead: https://github.com/alfianlosari/SiriWaveView, or search for "SiriWaveView". With both packages installed in our Xcode project, we also want to set some screen standardization for the app in what developers call the "App file".

Screen Standardization

When we open Xcode, we'll see a file in the project navigator named after our app. Open that up, and we'll set some screen standardization based on the OS the app is running on!

import SwiftUI

@main
struct XCAAiAssistantApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
            #if os(macOS)
                .frame(width: 400, height: 400)
            #endif
        }
        #if os(macOS)
        .windowStyle(.hiddenTitleBar)
        .windowResizability(.contentSize)
        #elseif os(visionOS)
        .defaultSize(width: 0.4, height: 0.4, depth: 0.0, in: .meters)
        .windowResizability(.contentSize)
        #endif
    }
}

In this initial setup code, we pass in ContentView(), which will hold our front-end design, and use conditional compilation statements to configure certain things based on the OS!

The first one says that if the app is running on macOS, set the frame of the app to a width of 400 points and a height of 400 points. The second statement tells the program to hide the title bar and keep the window sized to .contentSize (an Apple standard in SwiftUI) when on macOS.

On visionOS, the window size is 0.4 m x 0.4 m x 0.0 m, with the same .contentSize standard for window resizability.

With our packages installed and operating systems configured, let's move into the first part of coding: building out the back-end algorithms!

👨🏽‍💻 Back-End Algorithms

back-end design

To set up the back end for the application, we'll create a file called ViewModel.swift. It will hold the variables and functions for our algorithms so the app can use TTS and communicate with OpenAI in a controlled way!

This being said, ViewModel.swift is split up into three main sections:

  • Imports & ViewModel Setup
  • Variables
  • Functions

Imports & ViewModel Setup

import AVFoundation
import Observation
import Foundation
import XCAOpenAIClient

To set up ViewModel, we need to import four different libraries: AVFoundation (the AV player/recorder framework; check out my earlier work on this), Observation (the framework that provides the @Observable macro), Foundation (Swift's base framework), and XCAOpenAIClient (the package we imported in the last section).

@Observable
class ViewModel: NSObject, AVAudioRecorderDelegate, AVAudioPlayerDelegate {

The next step is to apply the @Observable macro and create the ViewModel class with a few conformances: NSObject (serves as a bridge to the older Objective-C protocols), AVAudioRecorderDelegate (the protocol for responding to audio recording events), and AVAudioPlayerDelegate (the protocol for responding to audio playback events).

Variables

let client = OpenAIClient(apiKey: "ENTER API KEY HERE")
var audioPlayer: AVAudioPlayer!
var audioRecorder: AVAudioRecorder!

The first set of variables is client, audioPlayer, and audioRecorder. client takes your own API key (you get this from OpenAI's developer site; look here for a step-by-step tutorial by Pawan Yadav!), while the other two are an AVAudioPlayer and an AVAudioRecorder declared as implicitly unwrapped optionals, denoted by the !
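As a side note, hard-coding the key is fine for a demo, but a slightly safer sketch is to read it from the app bundle; the "OPENAI_API_KEY" Info.plist entry below is hypothetical and not part of the original tutorial:

import Foundation

extension Bundle {
    // Reads a hypothetical "OPENAI_API_KEY" entry from Info.plist,
    // falling back to an empty string if it isn't set.
    var openAIAPIKey: String {
        object(forInfoDictionaryKey: "OPENAI_API_KEY") as? String ?? ""
    }
}

// let client = OpenAIClient(apiKey: Bundle.main.openAIAPIKey)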

#if !os(macOS)
var recordingSession = AVAudioSession.sharedInstance()
#endif
var animationTimer: Timer?
var recordingTimer: Timer?
var audioPower = 0.0
var prevAudioPower: Double?
var processingSpeechTask: Task<Void, Never>?

The second segment of the code starts with a conditional compilation statement: if the OS is not macOS, then initialize a variable called recordingSession with AVAudioSession.sharedInstance(), which gives the app access to the device's shared audio session and its input source. For example, when using the visionOS simulator, the shared instance lets the app tap into my Mac's mics.

The next variables are animationTimer and recordingTimer, both optional Timers denoted by the ?. The others are audioPower, prevAudioPower, and processingSpeechTask. The first two hold audio power levels as a Double (and an optional Double), to be used by the functions that follow. The last one holds a Task, which refers to work done asynchronously by the app; the Void, Never parameters mean the task returns no value and can never throw.
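As a quick aside, here is a tiny standalone sketch (not from the app) of what a Task<Void, Never> looks like: it produces no value and can never throw, and keeping a reference to it is what lets us cancel it later, just like processingSpeechTask:

// A Task<Void, Never>: no return value, guaranteed not to throw.
let demoTask: Task<Void, Never> = Task {
    try? await Task.sleep(nanoseconds: 200_000_000) // simulated async work
    print("finished processing")
}

// Because we kept a handle to it, we can cancel it at any time.
demoTask.cancel()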

var selectedVoice = VoiceType.alloy

var captureURL: URL {
    FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask)
        .first!.appendingPathComponent("recording.m4a")
}

var state = VoiceChatState.idle {
    didSet { print(state) }
}

var isIdle: Bool {
    if case .idle = state {
        return true
    }
    return false
}

The third segment of variables is where it gets a little more tricky (but I've got your back!).

There are four variables in this set: selectedVoice, captureURL, state, & isIdle . Here’s what they do:

(1) selectedVoice holds a VoiceType with a default of .alloy (one of the voices offered by OpenAI's text-to-speech API)

(2) captureURL returns a URL in the caches directory where the sound recording of your query is saved as a file called recording.m4a

(3) state holds the app's current state, with VoiceChatState.idle as the default (and prints whenever it changes)

(4) isIdle is a boolean that checks whether the state of the app is idle or not

var siriWaveFormOpacity: CGFloat {
    switch state {
    case .recordingSpeech, .playingSpeech: return 1
    default: return 0
    }
}

The last variable of the back end is siriWaveFormOpacity, which gives us a visual cue for whether someone is speaking or not. If the state is .recordingSpeech or .playingSpeech, the opacity is 1; otherwise it defaults to 0.

With all these variables we can start building out our functions!

Functions

func resetValues() {
    audioPower = 0
    prevAudioPower = nil
    audioRecorder?.stop()
    audioRecorder = nil
    audioPlayer?.stop()
    audioPlayer = nil
    recordingTimer?.invalidate()
    recordingTimer = nil
    animationTimer?.invalidate()
    animationTimer = nil
}

The first function is resetValues(): it stops the recorder and player, invalidates both timers, and sets everything back to 0 or nil.

func audioRecorderDidFinishRecording(_ recorder: AVAudioRecorder, successfully flag: Bool) {
    if !flag {
        resetValues()
        state = .idle
    }
}

func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
    resetValues()
    state = .idle
}

The next two functions are delegate callbacks: audioRecorderDidFinishRecording resets and returns to idle if the user's recording did not finish successfully, and audioPlayerDidFinishPlaying resets and returns to idle once ChatGPT's answer has finished playing.

func cancelRecording() {
    resetValues()
    state = .idle
}

func cancelProcessingTask() {
    processingSpeechTask?.cancel()
    processingSpeechTask = nil
    resetValues()
    state = .idle
}

The third segment of functions cancels the recording and cancels the processing task once the query has been sent to ChatGPT. Both call resetValues() and set the state to idle. (We'll set up the processing task function in the last section of the blog post!)

func playAudio(data: Data) throws {
    self.state = .playingSpeech
    audioPlayer = try AVAudioPlayer(data: data)
    audioPlayer.isMeteringEnabled = true
    audioPlayer.play()

    animationTimer = Timer.scheduledTimer(withTimeInterval: 0.2, repeats: true, block: { [unowned self] _ in
        guard self.audioPlayer != nil else { return }
        self.audioPlayer.updateMeters()
        let power = min(1, max(0, 1 - abs(Double(self.audioPlayer.averagePower(forChannel: 0)) / 160)))
        self.audioPower = power
    })
}

The next function is the first of two "main functions" (the ones with the most math involved). playAudio first sets the state to .playingSpeech. It then sets up three lines of code for the audioPlayer object: it passes in the data coming back from ChatGPT, enables metering (which exposes the audio levels of the answer), and plays the audio.

The algorithm for the timer is as follows: every 0.2 seconds, the timer runs this block of code to:

  • check whether the audioPlayer is still running
  • update the audio's metering data from the incoming file
  • calculate a normalized value for the audio power level, clamped to the range 0–1, so the waveform can track it easily (see the worked example below)
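To make that formula concrete, here is a quick worked example with a hypothetical meter reading (the -80 dB value is illustrative, not from the app):

// averagePower(forChannel:) reports decibels in roughly -160...0 for playback,
// where 0 dB is full scale. Plugging in a hypothetical reading of -80 dB:
let averagePower = -80.0
let power = min(1, max(0, 1 - abs(averagePower / 160)))
print(power) // 0.5: a mid-level signal drives the waveform at half strength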
func startCaptureAudio() {
    resetValues()
    state = .recordingSpeech
    do {
        audioRecorder = try AVAudioRecorder(url: captureURL,
                                            settings: [
                                                AVFormatIDKey: Int(kAudioFormatMPEG4AAC),
                                                AVSampleRateKey: 12000,
                                                AVNumberOfChannelsKey: 1,
                                                AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue
                                            ])
        audioRecorder.isMeteringEnabled = true
        audioRecorder.delegate = self
        audioRecorder.record()

Here is the other main function: startCaptureAudio(). We start by resetting values, setting the state to .recordingSpeech, and configuring the audioRecorder variable: we pass in captureURL plus a settings dictionary (AAC format, a 12,000 Hz sample rate, one channel, and high encoder quality).

We then finish configuring the recorder with .isMeteringEnabled, .delegate, and .record().

        animationTimer = Timer.scheduledTimer(withTimeInterval: 0.2, repeats: true, block: { [unowned self] _ in
            guard self.audioRecorder != nil else { return }
            self.audioRecorder.updateMeters()
            let power = min(1, max(0, 1 - abs(Double(self.audioRecorder.averagePower(forChannel: 0)) / 50)))
            self.audioPower = power
        })

        recordingTimer = Timer.scheduledTimer(withTimeInterval: 1.6, repeats: true, block: { [unowned self] _ in
            guard self.audioRecorder != nil else { return }
            self.audioRecorder.updateMeters()
            let power = min(1, max(0, 1 - abs(Double(self.audioRecorder.averagePower(forChannel: 0)) / 50)))
            if self.prevAudioPower == nil {
                self.prevAudioPower = power
                return
            }
We also set up two different timers that follow the same pattern shown earlier in the article. The only differences to note between them:

  • The recording timer monitors the audio power of the incoming query, and its block runs every 1.6 seconds instead of every 0.2 seconds
  • The denominator in the normalized power formula is 50 instead of 160
  • We compare audio power across points in time, which is why the prevAudioPower variable exists

            if let prevAudioPower = self.prevAudioPower, prevAudioPower < 0.25 && power < 0.175 {
                self.finishCaptureAudio()
                return
            }
            self.prevAudioPower = power
        })
    } catch {
        resetValues()
        state = .error(error)
    }
}

The last part of this function before heading into finishCaptureAudio is a quick check: if prevAudioPower is below 0.25 and the current power is below 0.175, we treat that as silence and call finishCaptureAudio(). If anything in the do block throws, the catch resets the values and puts the app in the error state.

func finishCaptureAudio() {
    resetValues()
    do {
        let data = try Data(contentsOf: captureURL)
        processingSpeechTask = processSpeechTask(audioData: data)
    } catch {
        state = .error(error)
        resetValues()
    }
}

The last part of the capturing-audio phase is finishCaptureAudio(). It reads the captured audio file into a Data value and passes it into processSpeechTask, which sends the audio off to ChatGPT!

override init() {
    super.init()
    #if !os(macOS)
    do {
        #if os(iOS)
        try recordingSession.setCategory(.playAndRecord, mode: .voiceChat, options: [.defaultToSpeaker, .allowBluetooth])
        #else
        try recordingSession.setCategory(.playAndRecord, mode: .spokenAudio)
        #endif
        try recordingSession.setActive(true)

        recordingSession.requestRecordPermission { [unowned self] allowed in
            if !allowed {
                self.state = .error("Recording not allowed by the user")
            }
        }
    } catch {
        state = .error(error.localizedDescription)
    }
    #endif
}

The last function to cover before we design the front end of the application is the override init(), which configures the shared audio session and requests recording permission from the user.

If the OS is iOS, the session uses the .voiceChat mode, defaults to the speaker, and allows Bluetooth connections. On other non-macOS platforms (like visionOS), it defaults to .spokenAudio. On every platform except macOS, the app then asks for record permission.

With the main functions, variables, and packages in place, we can start implementing the user interface!

🎙️ Front-End Design

general progression of the app

To set up the front end, there are two Swift files we have to create: Models.swift and ContentView.swift.

Models.swift

This file feeds both ContentView and the back end. Writing these types out first is going to be a big time-saver when writing the back-end and front-end code.

Let’s get started below with this file 👇

import Foundation

enum VoiceType: String, Codable, Hashable, Sendable, CaseIterable {
    case alloy
    case echo
    case fable
    case onyx
    case nova
    case shimmer
}

First thing to note here: Models is a plain Swift file because we are not coding any user interface; we are just setting up some types and cases to refer back to.

In the first enum (enumeration), VoiceType, we define six cases for the six different voices (alloy, echo, fable, onyx, nova, and shimmer) offered by OpenAI's text-to-speech (TTS) API. (We imported the client for this in the project set-up phase, if you're confused.)

enum VoiceChatState {
    case idle
    case recordingSpeech
    case processingSpeech
    case playingSpeech
    case error(Error)
}

The second enum, VoiceChatState, defines the different states the application can be in. There are five: idle (resting state), recordingSpeech (recording the user's speech), processingSpeech (transcribing and sending the query to ChatGPT), playingSpeech (the answer is being read out), and an error state.

With this down, let’s jump into ContentView.swift

ContentView.swift

When we start this file, we want to import the following libraries as this is going to be the basis for the front-end design of the application:

import SwiftUI
import SiriWaveView

The next step is to initialize previews with the different states, starting from case error:

#Preview ("Error") {
let vm = ViewModel()
vm.state = .error("An error has occured")
return ContentView(vm: vm)
}

In the error case, we create a vm variable from ViewModel(), set vm.state = .error("An error has occured"), and return a ContentView that uses it.

case playingSpeech:

#Preview ("Playing Speech") {
let vm = ViewModel()
vm.state = .playingSpeech
vm.audioPower = 0.3
return ContentView(vm: ViewModel())
}

The playingSpeech preview follows the same pattern: initialize vm as a ViewModel(), set its state to .playingSpeech (from Models.swift), set audioPower to 0.3 (the variable that drives the waveform), and return a ContentView that uses vm.

case processingSpeech:

#Preview ("Processing Speech") {
let vm = ViewModel()
vm.state = .processingSpeech
return ContentView(vm: vm)
}

The processingSpeech preview has the same format as before; the only change is that we set the state to .processingSpeech, and no audioPower value needs to be set here.

case recordingSpeech:

#Preview ("Recording Speech") {
let vm = ViewModel()
vm.state = .recordingSpeech
vm.audioPower = 0.2
return ContentView(vm: vm)
}

The recordingSpeech preview has the same format as before; the only changes are the state and an audioPower level 0.1 lower (0.2 instead of 0.3).

case idle:

#Preview ("Idle") {
ContentView()
}

The idle preview just passes in ContentView() to show the app in the idle state.

With previews set for an easier dev process, let’s design the UI of the application!

struct ContentView: View {
    @State var vm = ViewModel()
    @State var isSymbolAnimating = false

We go back to the top of the file, after our import lines, and declare a struct called ContentView with two @State properties (marking them with @State lets SwiftUI track changes and refresh the view when they update): vm and isSymbolAnimating.

These two variables will come in handy: we need to reference ViewModel() heavily on the front end, and the isSymbolAnimating boolean lets us control the animation of the brain icon while the TTS file is sent to OpenAI for processing. Let's move on to the overall layout of the app!

var body: some View {
    VStack(spacing: 16) {
        Text("XCA AI Voice Assistant")
            .font(.title2)
        Spacer()
        SiriWaveView()
            .power(power: vm.audioPower)
            .opacity(vm.siriWaveFormOpacity)
            .frame(height: 256)
            .overlay { overlayView }

        Spacer()
In this part of the code, we set the title of the app and add the SiriWaveView(), giving both some modifiers. The most important ones to note: the .power modifier takes ViewModel's audioPower variable, and the .opacity modifier takes ViewModel's siriWaveFormOpacity variable.

        switch vm.state {
        case .recordingSpeech:
            cancelRecordingButton
        case .processingSpeech, .playingSpeech:
            cancelButton
        default:
            EmptyView()
        }

This segment inserts a switch statement that shows a different button depending on the app's state. The next step is to create the voice picker at the bottom of the home screen!

  Picker("Select Voice", selection: $vm.selectedVoice) {
ForEach(VoiceType.allCases,id: \.self){
Text($0.rawValue).id($0)
}
}
.pickerStyle(.segmented)
.disabled(!vm.isIdle)

if case let .error(error) = vm.state {
Text(error.localizedDescription)
.foregroundStyle(.red)
.font(.caption)
.lineLimit(2)
}
}
.padding()
}

This segment creates a picker at the bottom of the VStack and passes in all the VoiceType cases we set up in Models.swift. The code also uses an if case let check to display the error message in red if something goes wrong in the app!

With this settled, let’s get to creating our overlay buttons for different states!

cancelButton

var cancelButton: some View {
    Button(role: .destructive) {
        vm.cancelProcessingTask()
    } label: {
        Image(systemName: "stop.circle.fill")
            .symbolRenderingMode(.monochrome)
            .foregroundStyle(.red)
            .font(.system(size: 44))
    }
    .buttonStyle(.borderless)
}

The first button is cancelButton, which cancels the query anytime between the end of the recording and ChatGPT finishing your answer. The button uses ViewModel's cancelProcessingTask function to work properly end to end.

cancelRecordingButton

var cancelRecordingButton: some View {
    Button(role: .destructive) {
        vm.cancelRecording()
    } label: {
        Image(systemName: "xmark.circle.fill")
            .symbolRenderingMode(.multicolor)
            .font(.system(size: 44))
    }
    .buttonStyle(.borderless)
}

The second button is cancelRecordingButton, which cancels a recording anytime between tapping the mic and the file being sent to ChatGPT. The button uses ViewModel's cancelRecording function to work properly end to end.

startCaptureButton

var startCaptureButton: some View {
    Button {
        vm.startCaptureAudio()
    } label: {
        Image(systemName: "mic.circle")
            .symbolRenderingMode(.multicolor)
            .font(.system(size: 128))
    }
    .buttonStyle(.borderless)
}

The last button is startCaptureButton, which starts the voice query from the user. The button uses ViewModel's startCaptureAudio function to work end to end! With the last button made, we can animate the brain!

@ViewBuilder
var overlayView: some View {
    switch vm.state {
    case .idle, .error:
        startCaptureButton
    case .processingSpeech:
        Image(systemName: "brain")
            .symbolEffect(.bounce.up.byLayer, options: .repeating, value: isSymbolAnimating)
            .font(.system(size: 128))
            .onAppear { isSymbolAnimating = true }
            .onDisappear { isSymbolAnimating = false }
    default:
        EmptyView()
    }
}

The final part of the front-end design is to animate the brain to signal that the voice query is being processed by OpenAI! Place this piece of code above the buttons we made earlier (it helps to work backwards through the UI). We mark it with @ViewBuilder so multiple kinds of UI components can be returned from one view, then declare a view variable called overlayView with a switch statement so the brain animation only appears during .processingSpeech.

For the brain animation, the modifier to look out for is .symbolEffect, which takes the isSymbolAnimating boolean as its value parameter. We also add .onAppear and .onDisappear to flip that boolean accordingly.

With the back-end & front-end done, let’s go back to ViewModel to input the OpenAI Integrations!

🤖 Integrating OpenAI’s API

func processSpeechTask(audioData: Data) -> Task<Void, Never> {
    Task { @MainActor [unowned self] in
        do {
            self.state = .processingSpeech
            let prompt = try await client.generateAudioTransciptions(audioData: audioData)

            try Task.checkCancellation()
            let responseText = try await client.promptChatGPT(prompt: prompt)

            try Task.checkCancellation()
            let data = try await client.generateSpeechFrom(input: responseText, voice:
                .init(rawValue: selectedVoice.rawValue) ?? .alloy)

            try Task.checkCancellation()
            try self.playAudio(data: data)
        } catch {
            if Task.isCancelled { return }
            state = .error(error)
            resetValues()
        }
    }
}

If you made it all the way down here — 🙏🏽 thank you and we are almost done!

The last part of the project is to go back into our back-end file (ViewModel.swift) and create a function called processSpeechTask. processSpeechTask will do the following:

  • set the state to .processingSpeech
  • generate an audio transcription of the recording, prompt ChatGPT with it, and generate speech from the response, taking the query through speech-to-text, ChatGPT, and text-to-speech and back into SpatialGPT
  • use the selected voice when generating the spoken reply with OpenAI's text-to-speech API (the transcription step is handled by Whisper)

Now with the coding done, you can run the app, test, and debug any issues that might have come up!

That being said, I used a lot of different resources during this coding process; here are the best ones, which deserve credit:

(1) ⏯️ The YouTube video: I used this as the guide for the app and followed every single line of code he wrote in this video.

(2) 👨🏽‍💻 Swiftful Thinking: I used this YouTube channel to better understand SwiftUI this time around so I could get to coding faster.

(3) 🤖 SwiftGPT: I used a custom GPT called SwiftGPT from the GPT Store that was much more refined when it came to explaining concepts in Swift!

While building this, my own curiosity and ambition have led me to think of different ways I can really push the boundaries with visionOS. One of the best ideas I've had since this journey started sits at the intersection of music and VR. Those who know me personally know that Sikh music, or Gurmat Sangeet, has been a big part of my life...

harmonium AI coach 👀

So, what if I could create an AI coach integrated with visionOS that could teach you instruments like the harmonium, Guitar Hero style? The idea doesn't stop there: you could build out AI coaches plus mixed-reality interfaces for a whole host of skills, such as:

  • Learning how to build something from an IKEA booklet
  • Learning how to cook something from a YouTube video
  • Learning how to paint something from old Bob Ross videos

With all these ideas in my head and more amazing people I've met in the visionOS dev community, I've been really happy to land some amazing opportunities, such as f.inc's Vision Pro residency starting tomorrow and the first AI + Vision Pro hack in SF this weekend, hosted by Michael Raspuzzi! I'm super excited to keep building in this space, so stay tuned for what comes next 👀

👋 Hey, I'm Piram, and I'm an aspiring UI/UX designer exploring how AI is changing the design world. Connect with me on LinkedIn to follow me on this journey!
