Speaking Pace

Klaus Matzka
Apple Developer Academy | Federico II
9 min read · Apr 11, 2020

Improve Your Talking Speed in Real Time with Speaking Pace.

An iOS application that monitors your talking speed in real time and helps you to become a better speaker.

Public speaking is an art that needs rehearsing. Photo by Roché Oosthuizen (Pixabay).

Do you know the best talking speed for speaking in front of an audience, be it with your family, in the office, at the university, or in front of a big audience like TED?

What we know for sure is: always speak slowly and clearly.

So only two questions remain to be answered:

  1. What does “slowly” mean?
  2. How fast are you actually talking?

The answer to question one is — according to our research — around 100 words per minute.

And to get an answer to the second question, here is a brand new tool for you: Speaking Pace!

Speaking Pace analyses your voice in real time while you speak, counts your words, and shows your talking speed in large numbers and on a nice instrument-style graph.

Never lose your audience again by speaking too fast in public!

THE TECHNOLOGY

Speaking Pace has been developed by a team of three at the Apple Developer Academy in Naples, Italy. It is built with the latest technologies available on Apple’s iOS, iPadOS and macOS platforms: Core Audio, on-device speech recognition, and the declarative SwiftUI and Combine frameworks.

Let us explain…

The Speech Engine — Core Audio & Speech Frameworks

We implemented live speech recognition that sends data to our Combine engine (see below). The process could be as simple as taking sound from the microphone and passing its buffer to the SFSpeechAudioBufferRecognitionRequest object used for the recognition process. That, however, turned out to be an illusion.

By default, Apple's server-based speech recognition works for only about one minute at a time. Luckily, iOS 13, iPadOS 13 and macOS Catalina support on-device recognition for several languages, with unlimited recognition time.
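For illustration, enabling on-device recognition looks roughly like this (a minimal sketch, not the exact project code):

```swift
import Speech

// Sketch: prefer on-device recognition when the current locale supports it.
let request = SFSpeechAudioBufferRecognitionRequest()

if let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
   recognizer.supportsOnDeviceRecognition {
    // No one-minute limit, and the audio never leaves the device.
    request.requiresOnDeviceRecognition = true
}
```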

After enabling it we discovered a major issue: when the user pauses for some time, the framework clears the whole result string without any warning. After digging deeper into the Speech framework we came up with a solution.

The Life of a Developer: “Once we got totally stuck and even published a question on StackOverflow (that still has no answer).”

The result object that holds the best transcription of the speech contains an array of segments, or words, in addition to the result string. Using this array, our algorithm adds every new word to our Combine publisher (see below for more on Combine) as soon as it appears for the first time in the segments array.

This might not be the most correct word or the best transcription yet, since the framework might decide on a better transcription once more speech context becomes available, but because we only need the word count, the resulting value is close enough to the final one for our application. We keep publishing as long as the framework reports more words than we have counted ourselves so far. When the recognition engine resets the result string value, we reset our previous word count to zero and start all over.
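The gist of that approach looks roughly like this (a simplified sketch; names such as newWordPublisher are illustrative, not the exact project code):

```swift
import Speech
import Combine

// Sketch: publish every newly appearing word exactly once.
let newWordPublisher = PassthroughSubject<String, Never>()
var publishedWordCount = 0

func handle(result: SFSpeechRecognitionResult) {
    let segments = result.bestTranscription.segments

    // The framework occasionally starts over with a fresh result string after a pause.
    if segments.count < publishedWordCount {
        publishedWordCount = 0
    }

    // Publish every word that appears in the segments array for the first time.
    while publishedWordCount < segments.count {
        newWordPublisher.send(segments[publishedWordCount].substring)
        publishedWordCount += 1
    }
}
```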

But that was not the end of the fight yet.

A Small Catalyst Challenge Thrown In For Free

The implementation worked well on the iPhone, but the macOS Catalyst version worked only on newer Macs. Internet research did not give any hints on how to solve our specific issue. After several discussions with audio experts we had to accept that Catalyst has issues with AVFoundation, which is the recommended framework to use with SFSpeechRecognizer.

We could not be sure that the combination of AVFoundation and Catalyst was the source of our macOS issues, but we decided to try another way to get audio into the Speech framework. We chose the low-level AudioToolbox framework, a part of Core Audio, and used one of its features called AudioQueue.

Converting from AudioQueue buffer to AVAudioPCMBuffer as input for the SpeechRecognizer.

Our needs were simple: take sound data from the microphone, convert it into a particular type of buffer and provide it to the SFSpeechAudioBufferRecognitionRequest.

While the first part was rather easy, as there were code examples of recording data into a file, the conversion into AVAudioPCMBuffer was a challenge. Using the documentation and the available code snippets, this function was rewritten three times. Once we got totally stuck and even published a question on StackOverflow (that still has no answer). After digging up long-stored knowledge about plain old C and pointers from the very back of our brains and reading deep into audio formats, we rewrote the function one more time, and finally it worked even on older Macs. Hooray!
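The heart of that conversion looks roughly like this (a minimal sketch that assumes the audio queue records in the same PCM format described by `format`; not a verbatim copy of the project code):

```swift
import AVFoundation
import AudioToolbox

// Sketch: copy the raw audio of an AudioQueue buffer into an AVAudioPCMBuffer
// that can be appended to an SFSpeechAudioBufferRecognitionRequest.
func pcmBuffer(from queueBuffer: AudioQueueBufferRef,
               format: AVAudioFormat) -> AVAudioPCMBuffer? {
    let source = queueBuffer.pointee
    let bytesPerFrame = format.streamDescription.pointee.mBytesPerFrame
    guard bytesPerFrame > 0 else { return nil }

    let frameCount = source.mAudioDataByteSize / bytesPerFrame
    guard let pcmBuffer = AVAudioPCMBuffer(pcmFormat: format,
                                           frameCapacity: frameCount) else { return nil }
    pcmBuffer.frameLength = frameCount

    // Copy the raw bytes from the AudioQueue buffer into the PCM buffer.
    let destination = pcmBuffer.mutableAudioBufferList.pointee.mBuffers
    memcpy(destination.mData, source.mAudioData, Int(source.mAudioDataByteSize))
    return pcmBuffer
}
```

The resulting buffer can then be handed to the recognizer with request.append(pcmBuffer).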

The UI / UX Building Blocks

Speaking Pace is implemented in SwiftUI, Apple’s new UI framework. SwiftUI’s declarative nature combined with real-time previewing of the user interface code within the Xcode development environment lends itself perfectly well to rapid prototyping of the user interface.

Speaking Pace: A simple SwiftUI View Hierarchy.

A Whole Lot of Different Ways to Animate Views

Even in this rather simple user interface we are employing half a dozen different manifestations of SwiftUI’s animation powers. Let me briefly touch on just one of them, transitions, and explain what we have learned about their behaviour.

Transitions determine how a view is inserted into and removed from the visible view hierarchy. If you want to create a custom view transition, you can do so by attaching the .transition() modifier to a view. Keep in mind, though, that transitions must be associated with an explicit animation to take effect.

Let’s have a look at the following code snippets to see what that means:

This transition is not being applied by the SwiftUI animation system when MyBeautifulView gets instantiated. Why? Because there is no animation associated with it:
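A minimal sketch of what that looks like (MyBeautifulView stands in for any view you want to insert and remove):

```swift
import SwiftUI

struct MyBeautifulView: View {
    var body: some View { Text("Hello, transition!") }
}

struct NoAnimationExample: View {
    @State private var show = false

    var body: some View {
        VStack {
            Button("Toggle") { self.show.toggle() }
            if show {
                MyBeautifulView()
                    .transition(.slide)   // ignored: no animation is associated with the state change
            }
        }
    }
}
```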

The following transition does not work either, although an animation is defined. As of Xcode 11.2 the animation needs to be explicit, but here we use an implicit animation:
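Roughly like this (a sketch building on the example above):

```swift
struct ImplicitAnimationExample: View {
    @State private var show = false

    var body: some View {
        VStack {
            Button("Toggle") { self.show.toggle() }
            if show {
                MyBeautifulView()
                    .transition(.slide)
                    .animation(.easeInOut)   // implicit animation: the transition is still not applied
            }
        }
    }
}
```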

The following two code snippets show ways to do it. You could either modify the transition with an explicit animation…
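For example (again a sketch, reusing MyBeautifulView from above):

```swift
struct ExplicitTransitionAnimationExample: View {
    @State private var show = false

    var body: some View {
        VStack {
            Button("Toggle") { self.show.toggle() }
            if show {
                MyBeautifulView()
                    .transition(AnyTransition.slide.animation(.easeInOut))
            }
        }
    }
}
```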

… or wrap the self.show.toggle() that changes the @State property into a withAnimation() explicit animation block:
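Like this (a sketch in the same style):

```swift
struct WithAnimationExample: View {
    @State private var show = false

    var body: some View {
        VStack {
            Button("Toggle") {
                withAnimation(.easeInOut) {
                    self.show.toggle()
                }
            }
            if show {
                MyBeautifulView()
                    .transition(.slide)
            }
        }
    }
}
```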

Implemented in either of these two ways, the transition is animated as expected.

Multi-Platform Out of The Box?

SwiftUI’s “Learn once, apply anywhere” paradigm allowed us to deploy our iPhone version of the app on the iPad and, through Apple’s Catalyst, on the Mac simultaneously, in our case without any platform-specific code modifications, using the same Xcode project and source code files for all three platforms.

But be aware of an important note Apple made when they announced SwiftUI:

SwiftUI is not a multi-platform framework, but instead a framework for creating apps on multiple platforms.

Isn’t this the same thing, just worded differently? Not really. The difference is that SwiftUI works great on different platforms but needs platform-specific code adaptations to build really great apps on each of them. For example, only watchOS has a Digital Crown, and earlier versions of iPadOS had no right-click, while macOS always has.

That said, we love Catalyst and look forward to using it for many more projects to come.

Combine

The main components of our app are the speech recognition engine, the data processing and the user interface.

We wanted to use Combine to connect these three main components in a loosely coupled way. So why loosely coupled? What is the advantage of a loosely coupled system?

Wikipedia states: “Components in a loosely coupled system can be replaced with alternative implementations that provide the same services.”

In addition to that, a loosely coupled architecture has important advantages for reusability and extensibility. Even testing components with unit tests becomes much easier, because you can put a single component into a completely different environment and validate its behaviour independently of the other components.

With a strongly coupled architecture, our speech recognition data provider component would simply call a function of our data processing component.

With this strong dependency it is not possible to exchange the implementation of the data processing component without making changes to the data provider.

In loosely coupled systems this is usually solved with protocols or interfaces in between the implementations.

The protocol defines which methods must be implemented by the data processing component and can be called by the data provider.

There is no direct dependency between the implementations anymore.
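In Swift, such a protocol-based setup could look roughly like this (the protocol and type names here are illustrative, not taken from the actual project):

```swift
// The protocol sits between the two components.
protocol WordReceiving {
    func receive(word: String)
}

// The data provider only knows about the protocol, not the concrete implementation.
final class SpeechDataProvider {
    private let receiver: WordReceiving

    init(receiver: WordReceiving) {
        self.receiver = receiver
    }

    func didRecognize(word: String) {
        receiver.receive(word: word)
    }
}

// The data processing component can be swapped for any other WordReceiving implementation.
final class PaceCalculator: WordReceiving {
    private(set) var wordCount = 0

    func receive(word: String) {
        wordCount += 1
    }
}
```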

With Combine we found an easy-to-use alternative that connects the main components of our app without any boilerplate code and without defining any protocols.

We simply used a Combine PassthroughSubject to decouple our components. The subject is a Combine publisher that accepts the data from our data provider.

Our data processing component is now a Combine subscriber which connects to the subject to receive the data.

Combine provides an easy way to create publishers and subscribers for sending and receiving specific data between different parts of the system in a loosely coupled and type safe manner.
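A minimal sketch of that setup (again with illustrative names):

```swift
import Combine

// The subject decouples the data provider from the data processing component.
let wordSubject = PassthroughSubject<String, Never>()

// Data processing side: subscribe to the subject and count the incoming words.
var wordCount = 0
let subscription = wordSubject.sink { word in
    wordCount += 1
    print("Received \"\(word)\", total words: \(wordCount)")
}

// Data provider side: publish each recognized word.
wordSubject.send("hello")
wordSubject.send("world")
```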

That alone is great, but wait, there is much more. Combine provides us with a huge zoo of operators which can be used to describe the processing and combination steps for our data in a compelling way. We managed to implement our whole data processing component with a combination of some of the predefined operators. We even found some repeated patterns and were able to extract them into new reusable Combine operators.

One example is the .active() operator. Its function is shown in the marble diagram:

The operator generates a downstream value of “true” if it receives some data from the upstream publisher. It generates a value of “false” to indicate that it did not receive upstream data for a while.
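One possible way to express such an operator with existing Combine building blocks looks like this (a sketch, not necessarily the exact implementation in the repository linked below; the timeout is configurable and the upstream is shared so both branches see the same values):

```swift
import Combine
import Foundation

extension Publisher {
    /// Emits `true` whenever the upstream delivers a value and `false`
    /// after `timeout` seconds without any new upstream values.
    func active(timeout: TimeInterval,
                scheduler: DispatchQueue = .main) -> AnyPublisher<Bool, Failure> {
        let shared = self.share()
        let activeValues = shared.map { _ in true }
        let inactiveValues = shared
            .debounce(for: .seconds(timeout), scheduler: scheduler)
            .map { _ in false }
        return activeValues
            .merge(with: inactiveValues)
            .removeDuplicates()
            .eraseToAnyPublisher()
    }
}
```

The debounce branch only fires after the upstream has been quiet for the given timeout, which is what flips the output back to false.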

Another example is the .slidingWindow() operator, which is shown in the next diagram:

This operator creates a new publisher which publishes an array of the last n received values from the upstream publisher.
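A compact sketch of such an operator, built on Combine's scan (again an illustration rather than the exact code in the repository):

```swift
import Combine

extension Publisher {
    /// Publishes an array of the last `size` values received from the upstream publisher.
    func slidingWindow(size: Int) -> AnyPublisher<[Output], Failure> {
        scan([Output]()) { window, value in
            Array((window + [value]).suffix(size))
        }
        .eraseToAnyPublisher()
    }
}

// Example: for upstream values 1, 2, 3, 4 and size 3, the operator
// publishes [1], [1, 2], [1, 2, 3], [2, 3, 4].
```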

Have a look at the source code of these and two additional reusable operators in our Gitlab repository. In addition we have created a graphical representation of all four operators and how they are integrated into our Combine-based app architecture on our Poster. See both links below.

TEST OUR APP PROTOTYPE NOW

Our Speaking Pace app for iPhone and iPad is on TestFlight now.

Become a better public speaker by regularly rehearsing your talking speed! Or even follow your live performances with Speaking Pace!

Try Speaking Pace on TestFlight Today!

Speaking Pace on TestFlight.

RESOURCES & FURTHER READING

Please find below links to articles, tutorials, and source code examples about Combine, the Speech framework, SwiftUI animations, and our Gitlab repository with some Combine code snippets:

Using Combine, by Joseph Heck

SwiftUI Advanced Animations and (a lot) more, by Javier@The SwiftUI Lab

Recognizing Speech in Live Audio, Apple Developer Sample Code

AudioToolBox Framework, Apple Developer Documentation

Combine-based App Architecture and Reusable Operators, A Graphical Representation

Our Gitlab repository, With our Collection of Reusable Combine Operators

We hope you enjoyed reading this article and that it will be helpful for your own iOS development learning experience. If you have any suggestions or improvements of any kind, please let us know! We would love to hear from you!

Thank you for your reading time!
Alexey, Klaus, Roland

Speaking Pace has been a project at the Apple Developer Academy in Naples, Italy.

Meet us at LinkedIn:

Alexey — https://www.linkedin.com/in/iamalexantonov
Klaus — https://linkedin.com/in/klausmatzka
Roland — https://www.linkedin.com/in/roland-schmitz-8683766

