Using SFSpeechRecognizer offline, on device, and uninterrupted

Danny Bolella
Oct 2

This project is dedicated to my Dad, who overcomes his own hearing loss every day and inspires me to write code that can improve accessibility in tech.

Recently, I got pretty pumped about the changes Apple made to SFSpeechRecognizer this year. The game-changer is that it can now run locally on the device: no internet connection needed. This means that:

  • Users no longer need to worry about using data when mobile
  • Privacy FTW — everything is done locally instead of going back and forth with some server out there
  • There’s less of a transcription delay, especially in realtime
  • You’re no longer restricted to transcribing only one minute of audio a limited number of times a day. It’s unlimited (or until your voice gives out or your battery dies)

This is big news for speech-to-text, accessibility, and speech technology in general.


Use Case

Having grown up in a household that usually watched TV with closed captioning (CC) turned on, I figured it made perfect sense to make that the focus for testing SFSpeechRecognizer’s latest updates. As for how CC is usually done: recorded programs have text prepared before airing, while live events, like sports or the news, have someone transcribing on the fly.

The goal, then, would be to replace that system by making an app that transcribes video (in our case pre-recorded) in real-time. This has already been done on iOS, but this time it would be without any network delay, data usage, or restrictions on transcription length.

Also, as a final challenge, I wanted to do this in SwiftUI. It’s not a must for this project, but I’ve been using it for the past couple of months (check out my profile for all of my articles on SwiftUI) and wanted to experiment with working with AV.

Full disclosure: I ended up building on other people’s work. What’s awesome is that those projects turned out to be a chained evolution, each building off the one before it. I just became the latest iteration of that progress, and the results turned out fantastic!

But just to be fair and give credit where it’s due, I’ll reference them by name and with links throughout this piece.


How It Works

Here’s the source code for the project if you want to follow along on GitHub or in your IDE. Otherwise, I’ll include relevant gists along the way.

The first thing was to understand how SFSpeechRecognizer works and what it needs to do its thing. The docs tell us that for real-time recognition we need to create an SFSpeechAudioBufferRecognitionRequest that takes an audio buffer (either AVAudioPCMBuffer or CMSampleBuffer). We then pass our request to recognitionTask(with:) and set a completion handler where we should expect to get an SFSpeechRecognitionResult.

private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var recognitionTask: SFSpeechRecognitionTask?

private func setupRecognition() {
    let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()

    // Partial results let the captions update as words are recognized.
    recognitionRequest.shouldReportPartialResults = true

    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in
        guard let result = result else { return }
        // The best transcription so far becomes our running caption text.
        let caption = result.bestTranscription.formattedString
        // ...publish/display `caption` (see the ClosedCaptioning class later on)
    }

    self.recognitionRequest = recognitionRequest
}
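One thing worth calling out: to guarantee the request never touches the network, the Speech framework lets you ask for on-device recognition explicitly. It isn’t shown in the snippet above, but a minimal sketch (the helper name is mine) would look something like this:

private func configureOnDeviceRecognition(for request: SFSpeechAudioBufferRecognitionRequest) {
    // Not every locale or device supports local recognition, so check first.
    guard speechRecognizer.supportsOnDeviceRecognition else { return }

    // Keeps the audio and the transcription entirely on the device.
    request.requiresOnDeviceRecognition = true
}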

Simple enough so far. Now I just needed to find the audio buffer for the video, which, as it turns out, was not so simple.


Tapping the A from V in an AVAsset

There was quite a trail to follow in order to connect the dots/repos. Photo by William Iven on Unsplash

My journey figuring out how to access the buffer started with this repo by Sash Zats. It was the first project I found that used SFSpeechRecognizer on a video’s audio buffer. As I explored his work, I found a comment about how he got help from a few Apple engineers to modify an MTAudioProcessingTap sample by Apple (written in Obj-C) to get a CMSampleBuffer, which is then passed to a delegate to be consumed.

This seemed like overkill, especially since his solution uses an AVPlayer implementation. So I looked a bit further and found this repo by An Tran. He created a cleaned-up version of the Zats repo, but his notes mentioned he hoped to create a Swift version of the AudioTap and provided a link to a few possible solutions.

One of those links intrigued me. It was to a gist created by Omar Juarez. Skimming through, it looked like a Swifty version of the modified tap, called VideoMediaInput, that still provided a CMSampleBuffer. While I was hoping for a solution that would give me an AVAudioPCMBuffer, it felt like this was the best answer I was going to get.

Feeling like I had the pieces of the puzzle, I experimented by swapping out the Obj-C tap for the Swift one, which would be preferable if I were ultimately going to put this into SwiftUI. The results were a success and can be found here.

//*********went from this*********
let asset = AVURLAsset(url: url)
guard let audioTrack = asset.tracks(withMediaType: AVMediaType.audio).first else {
print("can't get audioTrack")
return
}
playerItem = AVPlayerItem(asset: asset)

tap = MYAudioTapProcessor(audioAssetTrack: audioTrack)
tap.delegate = self

player.insert(playerItem, after: nil)
player.currentItem?.audioMix = tap.audioMix
player.play()

// Player view
let playerView: UIView! = view
playerLayer.player = player

//*********to this*********
vmInput = VideoMediaInput(url: url, delegate: self)

// Player view
let playerView: UIView! = view
playerLayer.player = vmInput.player

Closing the Swift Loop

Now that I had the guts, it was time to put it into SwiftUI. After running across Chris Mash’s series on AVPlayer & SwiftUI, I followed his article and code in this repo to build out a video player with controls. After some extracting and moving around, VideoMediaInput was injected into the PlayerContainerView by replacing its AVPlayer with the one from the tap.
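For context, a container like that boils down to a UIViewRepresentable wrapping a UIView backed by an AVPlayerLayer, with the AVPlayer handed in from VideoMediaInput. Here’s a simplified sketch of the idea (not Chris’s exact code, and the type names are only illustrative):

import SwiftUI
import UIKit
import AVFoundation

struct PlayerContainerView: UIViewRepresentable {
    // The AVPlayer comes from VideoMediaInput, so the tap and the
    // on-screen video share the same playback.
    let player: AVPlayer

    func makeUIView(context: Context) -> PlayerUIView {
        PlayerUIView(player: player)
    }

    func updateUIView(_ uiView: PlayerUIView, context: Context) {}
}

class PlayerUIView: UIView {
    // Backing the view with an AVPlayerLayer lets it render the video.
    override class var layerClass: AnyClass { AVPlayerLayer.self }

    init(player: AVPlayer) {
        super.init(frame: .zero)
        (layer as? AVPlayerLayer)?.player = player
    }

    required init?(coder: NSCoder) { fatalError("init(coder:) has not been implemented") }
}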

The last piece of the puzzle was consuming the buffer from the delegate to be transcribed. By making a ClosedCaptioning class that conforms to the delegate protocol, I have each buffer appended directly to the recognition request for processing.

For binding, I also have ClosedCaptioning conform to the ObservableObject protocol. Inside is a captioning property marked @Published. That makes it bindable to my SwiftUI code simply by tagging the instance of the class with @ObservedObject. With my Text view set to that property, the binding updates it every time we get a new result, finally displaying our realtime closed captioning.
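Pieced together, the class ends up looking something like the sketch below. The class and captioning property names are the ones described above; the method names and the stripped-down view are simplified stand-ins for the real project code:

import SwiftUI
import Combine
import Speech
import AVFoundation

class ClosedCaptioning: ObservableObject {
    // Published so any observing SwiftUI view re-renders whenever
    // a new (partial) transcription comes in.
    @Published var captioning: String = ""

    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?

    func startRecognition() {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true

        recognitionTask = speechRecognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let result = result else { return }
            DispatchQueue.main.async {
                // Each new best transcription becomes the latest caption.
                self?.captioning = result.bestTranscription.formattedString
            }
        }
        recognitionRequest = request
    }

    // Called from VideoMediaInput's delegate callback for every tapped buffer.
    // (The method name is illustrative; wire it up to the protocol's actual callback.)
    func consume(_ sampleBuffer: CMSampleBuffer) {
        recognitionRequest?.appendAudioSampleBuffer(sampleBuffer)
    }
}

struct CaptionsView: View {
    // Observing the object keeps the Text in sync with the latest result.
    @ObservedObject var closedCaptioning: ClosedCaptioning

    var body: some View {
        Text(closedCaptioning.captioning)
    }
}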


Results

After all that research, learning, trialing, and work, I finally had my app. I plugged in an old Apple Ad featuring Jeff Goldblum from the 90s and tested it out. The results were… decent.

My man Jeff talking about how not to be left out of the e-mail party. 90s, amiright?

While Jeff has his signature, sometimes scattered, way of talking, SFSpeechRecognizer did a pretty good job of following along. When I put my phone in airplane mode, I got the same results, proving that we were, indeed, running locally. The last test was letting the video loop more than twice to make sure transcription wasn’t cut off after a minute, and it wasn’t.

Realtime Closed Captioning of Video: Complete


However…

The quality of the transcription was not 100% spot on. There were a few words wrong or missing, though not enough to lose the context or be unable to fill in the gaps mentally, especially while listening.

To simulate running the app as a deaf user, though, I muted my phone and read the transcription while watching. Having the text appear on top of the video did help a little with syncing to the timing and gesturing. Unfortunately, editing for contextual grammar and punctuation was sorely missed. Without sound, it was difficult to understand where Jeff was going with his words or how he was saying them, and everything turned into a thirty-second run-on sentence.

Without audio or grammatical assistance, transcriptions can get the words right but still lack contextual sense. Photo by Raphael Schaller on Unsplash

While having speech-to-text at all is a huge accessibility achievement, it doesn’t hold up to closed captioning just yet. Running the recognition on a server might have helped correct some of the words, but it would still fall short for users who can’t lean on the audio for grammar and context.


A New Hope

That’s where the new SFVoiceAnalytics and Sound Classification in Core ML/Create ML come in. These new capabilities can (and most certainly will) fill in the gaps.

For example, using those analytics in conjunction with a trained ML model, we could probably determine punctuation (periods, exclamation points, and question marks) in transcriptions. The context punctuation gives is incredibly valuable. But then imagine taking it further by processing and displaying inflections, tone, or emotion through font! The possibilities are staggering and a great path forward for speech recognition.
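As a taste of what’s already exposed, each transcription segment can carry voice analytics (pitch, jitter, shimmer, voicing) that a model could learn those kinds of cues from. Here’s a rough, purely illustrative sketch of pulling them out:

func inspectVoiceAnalytics(in result: SFSpeechRecognitionResult) {
    for segment in result.bestTranscription.segments {
        // voiceAnalytics is only populated for some results/segments.
        guard let analytics = segment.voiceAnalytics else { continue }

        // Acoustic features are per-frame values across the segment;
        // trends in pitch could hint at questions, emphasis, or emotion.
        let pitchSamples = analytics.pitch.acousticFeatureValuePerFrame
        print("\(segment.substring): \(pitchSamples.count) pitch samples")
    }
}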

My hope is to work some of those ideas into this project, but also to see other devs jump in on the action.

Great thanks and credit to Apple, Sash Zats, An Tran, Omar Juarez, and Chris Mash, whose works were instrumental in putting this passion project together.
