iOS On-Device Speech Recognition

SFSpeechRecognizer has been updated in iOS 13 to allow recognition and analysis of speech on-device, offline, without sending any data to a server

Anupam Chugh
Nov 2 · 5 min read
Image by Becca Clark from Pixabay

Apple showcased its advancements in machine learning and artificial intelligence at WWDC 2019. One feature that sheds light on its ambitions is on-device speech recognition in iOS 13.


Scope

On-device speech recognition improves user privacy by keeping audio data off the cloud. Apple strives to give voice-based AI a major boost through this enhanced speech recognition.

The newly upgraded speech recognition API lets you do a variety of things, like tracking voice quality and speech patterns using the voice analytics metrics.

From providing automated feedback based on recordings to comparing the speech patterns of individuals, there’s so much you can do in the field of AI using on-device speech recognition.

Of course, there are certain trade-offs to consider with on-device speech recognition. There is no continuous learning like you get in the cloud, which can lead to lower accuracy on the device. Moreover, language support is currently limited to about ten languages.

Nonetheless, on-device support lets you run speech recognition for an unlimited duration, a big win over the one-minute-per-recording limit of server-based recognition.

SFSpeechRecognizer is the engine that drives speech recognition.

In iOS 13, SFSpeechRecognizer is smart enough to recognize punctuation in your speech.

Saying "dot" adds a full stop. Similarly, saying "comma", "dash", or "question mark" inserts the respective punctuation mark (, - ?) in the transcription.


Our Goal

We'll develop an on-device speech recognition iOS application that transcribes live audio. An illustration of what we'll achieve by the end of this article is given below:

Screengrab from our application.

Did you notice?

The above screengrab was taken in flight mode.

Without wasting any more time, let’s tap into the microphones and begin our journey toward building an on-device speech recognition application.

In the following sections, we'll skip the UI and aesthetics and dive straight into the speech and audio frameworks. Let's get started.


Adding Privacy Usage Description

For starters, you need to include the privacy usage descriptions for the microphone and speech recognition in your Info.plist, as shown below.

Leaving these out will lead to a runtime crash.
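Concretely, that means adding the NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription keys. Viewed as source code, the entries might look like this (the description strings are placeholders; word them for your own app):

<key>NSSpeechRecognitionUsageDescription</key>
<string>Speech recognition is used to transcribe your voice.</string>
<key>NSMicrophoneUsageDescription</key>
<string>The microphone is used to record your speech.</string>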

Next, import the Speech framework in your class to access the speech recognition APIs in your application.
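At the top of the view controller, that boils down to the following imports (AVFoundation is needed later for the audio engine):

import AVFoundation
import Speech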

Requesting permissions

We need to request authorization from the user in order to use speech recognition. The following code does that for you:

SFSpeechRecognizer.requestAuthorization { authStatus in
    // The callback may arrive on a background queue,
    // so hop back to the main queue before touching the UI.
    OperationQueue.main.addOperation {
        switch authStatus {
        case .authorized:
            print("Speech recognition authorized")
        case .restricted:
            print("Speech recognition restricted on this device")
        case .notDetermined:
            print("Speech recognition not yet authorized")
        case .denied:
            print("User denied access to speech recognition")
        @unknown default:
            break
        }
    }
}

The SFSpeechRecognizer is responsible for generating your transcriptions through a recognition task. For this to happen, we must first initialize our SFSpeechRecognizer:

var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))

In the above code, you pass the locale identifier of the language you want to recognize. It's English (India) in my case.
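Since this initializer is failable and a recognizer isn't always ready, it's worth checking availability before going further. A minimal guard, reusing the speechRecognizer created above, might look like this:

guard let recognizer = speechRecognizer, recognizer.isAvailable else {
    print("Speech recognition is not available right now")
    return
}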


Speech Recognition: Under the Hood

An illustration of how on-device speech recognition works is depicted below:

Speech Recognition Flow

As can be seen from the above illustration, there are four pillars on which any speech recognition application hangs:

  • AVAudioEngine, which captures the audio signals from the microphone
  • SFSpeechRecognizer, the engine that performs the recognition
  • SFSpeechAudioBufferRecognitionRequest, which feeds the audio buffers to the recognizer
  • SFSpeechRecognitionTask, which runs the request and returns the results

We’ll see the roles each of these plays in building our speech recognition application in the next sections.


Implementation

Setting up the audio engine

The AVAudioEngine is responsible for receiving the audio signals from the microphone. It provides the input for speech recognition.

let audioEngine = AVAudioEngine()

// Configure the shared audio session for recording.
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

let inputNode = audioEngine.inputNode

// Remove any existing tap before installing a new one,
// then forward each captured buffer to the recognition request.
inputNode.removeTap(onBus: 0)
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}

audioEngine.prepare()
try audioEngine.start()

The above code installs a tap on the inputNode and sets the buffer size of the output.

Once that buffer is filled (through the audio signals captured as you speak or record), it's sent to the recognitionRequest.

Now let's see how the SFSpeechRecognizer works with the SFSpeechAudioBufferRecognitionRequest and SFSpeechRecognitionTask in order to transcribe speech to text.

Enabling on-device speech recognition

The following code enables on-device speech recognition on a phone:

recognitionRequest.requiresOnDeviceRecognition = true

Setting requiresOnDeviceRecognition to false would use the Apple cloud for speech recognition instead.

Do note that on-device speech recognition works only on iOS 13, macOS Catalina, and later. It requires Apple's A9 or newer processor, which on iOS means iPhone 6s and above.
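Putting that flag in context, a minimal sketch of creating and configuring the recognition request might look like this (supportsOnDeviceRecognition is the recognizer's own capability check):

let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

// Report partial results so the transcription updates as the user speaks.
recognitionRequest.shouldReportPartialResults = true

// Only force on-device recognition when the recognizer supports it.
if speechRecognizer?.supportsOnDeviceRecognition == true {
    recognitionRequest.requiresOnDeviceRecognition = true
}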

Creating a speech recognition task

An SFSpeechRecognitionTask is used to run the recognition request with the SFSpeechRecognizer. In return, it provides a result instance from which we can access different speech properties.
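A minimal sketch of such a task, assuming the recognitionRequest configured above plus recognitionTask, audioEngine, and textView properties on the view controller (these names are assumptions for illustration):

// Cancel any in-flight task before starting a new one.
recognitionTask?.cancel()
recognitionTask = nil

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        // Show the most confident transcription so far.
        self.textView.text = result.bestTranscription.formattedString
    }

    if error != nil || result?.isFinal == true {
        // Stop capturing audio once recognition finishes or fails.
        self.audioEngine.stop()
        self.audioEngine.inputNode.removeTap(onBus: 0)
        self.recognitionTask = nil
    }
}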

In the above code, a lot is happening. Let's break it down into pieces.

  • Firstly, we cancel any previous recognition task before starting a new recording.
  • Next, we create the recognition task using the SFSpeechRecognizer and recognition request.
  • Setting shouldReportPartialResults to true allows access to intermediate results during each utterance.
  • bestTranscription returns the transcription with the highest confidence. Invoking the formattedString property on it gives the transcribed text.
  • We can access other properties of the transcription, such as speakingRate, averagePauseDuration, or segments.

SFVoiceAnalytics

SFVoiceAnalytics is a newly introduced class that contains a collection of voice metrics for tracking features such as pitch, shimmer, and jitter from the speech result.

These metrics can be accessed from the segments property of the transcription:

for segment in result.bestTranscription.segments {
    // Voice analytics may not be present for every segment.
    guard let voiceAnalytics = segment.voiceAnalytics else { continue }

    let pitch = voiceAnalytics.pitch.acousticFeatureValuePerFrame
    let voicing = voiceAnalytics.voicing.acousticFeatureValuePerFrame
    let jitter = voiceAnalytics.jitter.acousticFeatureValuePerFrame
    let shimmer = voiceAnalytics.shimmer.acousticFeatureValuePerFrame
}

Start recording and transcribing

Now that we've defined each of the four components, it's time to merge these pillars to start recording and display the live transcription on the screen. The following code snippet does that for you.
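Here is a condensed sketch of such a method, combining the pieces from the previous sections (property names like speechRecognizer, recognitionRequest, recognitionTask, audioEngine, and textView are assumptions carried over from this walkthrough, not the exact code from the repository):

func startRecording() throws {
    // 1. Tear down any previous task.
    recognitionTask?.cancel()
    recognitionTask = nil

    // 2. Configure the audio session for recording.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // 3. Create the recognition request and prefer on-device recognition.
    let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    recognitionRequest.shouldReportPartialResults = true
    if speechRecognizer?.supportsOnDeviceRecognition == true {
        recognitionRequest.requiresOnDeviceRecognition = true
    }
    self.recognitionRequest = recognitionRequest

    // 4. Feed microphone buffers into the request.
    let inputNode = audioEngine.inputNode
    inputNode.removeTap(onBus: 0)
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        recognitionRequest.append(buffer)
    }

    // 5. Start the task and display the transcription as it arrives.
    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { [weak self] result, error in
        guard let self = self else { return }
        if let result = result {
            self.textView.text = result.bestTranscription.formattedString
        }
        if error != nil || result?.isFinal == true {
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
        }
    }

    audioEngine.prepare()
    try audioEngine.start()
}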


Conclusion

The above implementation steps should give you an outcome similar to the screengrab at the start of this article. The full source code of the application is available in this GitHub repository.

That sums up on-device speech recognition in iOS 13 from my side. This new upgrade is handy when used in tandem with sound classifiers and natural language processing.

I hope you enjoyed reading this. Now start building your own voice-based AI applications using the new Speech framework capabilities.
