Siri Gets a Barista Job: Adding Offline Voice AI to a SwiftUI App
Siri has access to the world’s knowledge, but what if all I want is my favorite cup of coffee? The Speech framework offers cloud-computed transcription, but it seems like overkill to send audio data over the internet to a server farm that understands the entirety of the English language when all I need it to understand is how I take my coffee. Let’s see how we can achieve this with the offline voice recognition platform from Picovoice.
1 — Create a Simple GUI
Before we dive into using the Picovoice platform, we need a UI to work with. SwiftUI has made it easier than ever to create visually appealing stateful UIs. In about a half-hour I was able to mock up a GUI with a coffee maker image, some text prompts, and a collection of stateful buttons. This is what I came up with for a Barista app:
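The skeleton of that view might look something like this; the layout, asset name, and drink list below are illustrative placeholders rather than the app's exact code:
import SwiftUI

struct BaristaView: View {
    // Placeholder state; step 4 moves this into an observable ViewModel
    @State private var orderText = ""

    var body: some View {
        VStack(spacing: 24) {
            Image("coffeeMaker") // illustrative asset name
                .resizable()
                .scaledToFit()
            Text("Say 'Hey Barista' to place an order")
                .font(.headline)
            Text(orderText)
            HStack {
                // One stateful button per drink
                ForEach(["espresso", "coffee", "latte", "cappuccino"], id: \.self) { drink in
                    Button(drink.capitalized) {
                        orderText = drink
                    }
                }
            }
        }
        .padding()
    }
}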
2 — Add the Picovoice Cocoapod
If you’re new to iOS development, you may be unfamiliar with CocoaPods. CocoaPods brings modern package management to iOS, allowing developers to add powerful libraries to their apps with minimal effort.
To install the Picovoice pod, add the following to your Podfile:
source 'https://cdn.cocoapods.org/'
# ...
pod 'Picovoice-iOS'
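With the Podfile updated, run pod install from the project directory and open the generated .xcworkspace (rather than the .xcodeproj) so Xcode picks up the dependency.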
3 — Initialize the Voice AI
The Picovoice Platform contains two speech recognition engines: the Porcupine Wake Word engine and the Rhino Speech-to-Intent engine. The combination of these two engines allows us to create voice interactions similar to Alexa, but without sending your audio off-device. For instance, we could say:
Hey Barista, could I have a medium coffee?
The phrase ‘Hey Barista’ will be detected by Porcupine; Rhino will interpret the rest of the request that follows. Rhino uses a bespoke context to decode the command, without transcribing it to text. When the engine has made an inference, it returns an instance of an Inference struct; for the above sample phrase, the struct will look like this:
IsUnderstood: true,
Intent: 'orderBeverage',
Slots: {
size: 'medium',
beverage: 'coffee'
}
In order to initialize the voice AI, we’ll need both Porcupine and Rhino model files. The wake word model file (.ppn) tells the Porcupine engine what phrase it is supposed to continuously listen for, while the context model file (.rhn) describes the grammar that the Rhino engine will use to understand natural voice commands specifically related to ordering coffee.
Picovoice has made several pre-trained Porcupine and Rhino models available on the Picovoice GitHub repositories[1][2]. For our Barista app, we’re going to use the trigger phrase Hey Barista and the Coffee Maker context, which understands a collection of basic coffee maker commands.
After downloading hey barista_ios.ppn and coffee_maker_ios.rhn, add them to the iOS project as bundled resources so that we can load them at runtime.
We’ll also need a Picovoice AccessKey, which can be obtained by signing up for a free account on the Picovoice Console.
Then we can initialize the Picovoice Platform:
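A minimal setup looks roughly like the following; the callback bodies are placeholders, and the exact parameter list (and which calls throw) can differ slightly between SDK versions, so check the Picovoice docs for your release:
import Picovoice

let accessKey = "${YOUR_ACCESS_KEY}" // from the Picovoice Console

// Locate the model files we bundled in the previous step
let keywordPath = Bundle.main.path(forResource: "hey barista_ios", ofType: "ppn")!
let contextPath = Bundle.main.path(forResource: "coffee_maker_ios", ofType: "rhn")!

let picovoiceManager = PicovoiceManager(
    accessKey: accessKey,
    keywordPath: keywordPath,
    onWakeWordDetection: {
        // 'Hey Barista' was detected -- e.g. show a listening indicator
    },
    contextPath: contextPath,
    onInference: { inference in
        // Rhino has finished decoding the follow-on command
        if inference.isUnderstood {
            // read inference.intent and inference.slots to update the order
        }
    })

do {
    try picovoiceManager.start() // begins audio capture
} catch {
    // handle initialization or audio errors
}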
The method picovoiceManager.start() starts audio capture and automatically passes incoming frames of audio to the voice recognition engines.
To capture microphone audio, we must add the permission request to the Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>To recognize voice commands</string>
4 — Integrate Voice Controls
The best way to control SwiftUI from code-behind is to create a ViewModel and have the UI observe it. Our UI controls are simple: we want 1) some indication that the wake word has been detected and 2) to display our drink order. Create a struct to represent each button state and state variables to show and hide text; the UI will then be bound to these properties because they are marked with the @Published property wrapper. After adding these, our ViewModel will look like this:
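Here is a sketch of what that ViewModel might contain; the type and property names are illustrative, not the exact code from the project:
import Foundation
import SwiftUI

// One entry per drink button on screen
struct OrderButton: Identifiable {
    let id = UUID()
    let title: String
    var isSelected = false
}

class ViewModel: ObservableObject {
    @Published var isListening = false   // shows that 'Hey Barista' was heard
    @Published var orderText = ""        // displays the decoded drink order
    @Published var buttons = ["espresso", "coffee", "latte", "cappuccino"].map { OrderButton(title: $0) }

    // Call this from Porcupine's wake word callback
    func onWakeWordDetected() {
        DispatchQueue.main.async { self.isListening = true }
    }

    // Call this from Rhino's inference callback
    func onOrder(size: String?, beverage: String?) {
        DispatchQueue.main.async {
            self.isListening = false
            self.orderText = "\(size ?? "regular") \(beverage ?? "coffee")"
        }
    }
}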
We can now issue voice commands that alter the UI automatically, making our app entirely hands-free. Finally, Siri knows how I like my coffee.
The full source code from this tutorial can be found here. For more information regarding Picovoice SDKs and products, visit the website and docs, or explore the GitHub repositories. If your project requires custom wake word or context models, sign up for the Picovoice Console.