
Speech Recognition with Swift & iOS 10

In this post I’ll explain how to implement Speech Recognition using Apple’s new Speech framework for Swift. In the fall of 2016 with the release of iOS 10, the Speech framework was made available and provided a way to implement continuous speech detection and transcription from anywhere within your app.

Prior to this release, some speech recognition functionality was already available; users had experienced it mainly through keyboard dictation or Siri. But thanks to this new API, the possibilities for integrating speech recognition creatively throughout your app have greatly increased.

Take note: The Speech API allows you to recognize live and pre-recorded audio speech. In this post I’ll explain how to recognize live speech. Recognizing pre-recorded audio requires managing the audio files by URL. I’ll explain this later.

Another important detail to note: There is a limit of about 1 minute of processed recognition per recognition task. So that means that you can get recognition in bursts of about 1 continuous minute.

Speech recognition relies on Apple’s servers to function. And as stated in the documentation: “In the case of speech recognition, … data is transmitted and temporarily stored on Apple’s servers to increase the accuracy of recognition.” So the amount of usage can be restricted if it requires heavy computation or storage.

Because speech is transmitted and uses Apple’s remote servers, security is a concern. For this reason your user must agree to have their speech detected by your app and must be made aware that what they say during recognition could be at risk. You can see complete information here: Security and Privacy Enhancements.

Just remember, you should provide clear ways of letting your user know when their speech is being detected so they can avoid speaking any sensitive or private information.

Keeping that in mind, let’s get started with the step-by-step tutorial. Our first step after the basic project set-up will be prompting user authorization.

Project Set Up

  1. Open a project where you would like to implement speech recognition.

For this demonstration I created a project that will do two things: 1. transcribe detected speech into text via a UILabel and 2. initiate a simple UI change when specific words are uttered.

Set up a view controller with a label to capture the text and/or some UI to respond to speech recognition.

Set up a view controller in storyboard that contains at minimum a UILabel that will show the spoken text. Optionally, you can also set up a UIButton to begin the speech detection, along with another UILabel that will let the user know how to interact with your app.

I set the height of the UILabel (that will show the spoken words) to 40% of the screen height and width to 90% of the screen. I also set the number of possible lines to 10 in order to show a fair amount of the detected text.

I’ve set up a UIView at the bottom of the view controller that will react when the names of colors are detected and change to the corresponding color.

Before moving to the next step, connect the labels, button and view to your view controller class using the Assistant Editor.

2. Next, go to your view controller. Import the Speech framework:

import Speech

There are some handy speech recognizer delegate methods that you may want to use later on, so make the view controller adhere to the SFSpeechRecognizerDelegate.

Your code should look like this so far:
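Here is a sketch of what that skeleton might look like, assuming the outlet names used in this demo (detectedTextLabel, colorView, and startButton — adjust to match your own storyboard connections):

```swift
import UIKit
import Speech

class ViewController: UIViewController, SFSpeechRecognizerDelegate {

    // Outlets connected via the Assistant Editor
    @IBOutlet weak var detectedTextLabel: UILabel!
    @IBOutlet weak var colorView: UIView!
    @IBOutlet weak var startButton: UIButton!

    override func viewDidLoad() {
        super.viewDidLoad()
    }
}
```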

User Permission Before Initiating Speech Recognition

Now that we have the basic components of the project set up, let’s set up the user permission prompt.

3. Go to your Info.plist file. Add two keys: NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription. These will generate alerts asking your user for permission to use speech recognition and for the app to access the microphone. As the String value for each, add a sentence explaining the purpose of the speech recognition to your user. For this demo it could be “Speak and watch your words become text or say a color to see the box change colors.” and “Your speech will be detected after tapping the start button.”

It will end up looking like this:
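In source form, the two entries might look like this (with the example description strings above):

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>Speak and watch your words become text or say a color to see the box change colors.</string>
<key>NSMicrophoneUsageDescription</key>
<string>Your speech will be detected after tapping the start button.</string>
```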

Note: the word “Privacy -” will appear automatically once you type in NSSpeechRecognitionUsageDescription & NSMicrophoneUsageDescription as new keys.

The speech recognition permission and the microphone permission prompts will now appear automatically when your app uses any API involving speech recognition and tries to access the microphone.

Because we are using the Speech framework in our main view controller, the prompts will appear as soon as the app opens and the start button is tapped.

Once the user has agreed to let the app access the microphone and recognize their speech, the prompts will not be shown again.

Speech Recognition Implementation

4. Next declare four variables in the view controller class:

First, an instance of the AVAudioEngine class. This will process the audio stream. It will give updates when the mic is receiving audio.

let audioEngine = AVAudioEngine()

Second, an instance of the speech recognizer. This will do the actual speech recognition. It can fail to recognize speech and return nil, so it’s best to make it an optional.

let speechRecognizer: SFSpeechRecognizer? = SFSpeechRecognizer()

Side Note: By default, the speech recognizer will detect the device’s locale and in response recognize the language appropriate to that geographical location. The default language can also be set by passing in a locale argument and identifier, like this: let speechRecognizer: SFSpeechRecognizer? = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

Third, a recognition request as SFSpeechAudioBufferRecognitionRequest. This buffers the speech audio as the user speaks in real time. If the audio were pre-recorded and accessed by URL, you would use an SFSpeechURLRecognitionRequest instead.

let request = SFSpeechAudioBufferRecognitionRequest()

Fourth, an instance of recognition task. This will be used to manage, cancel, or stop the current recognition task.

var recognitionTask: SFSpeechRecognitionTask?

The Recognizer Method

5. Now let’s write the method that will perform the speech recognition. It will record and process the speech as it comes in.

Make an empty function with no parameters or return value and call it recordAndRecognizeSpeech().

6. Inside the body add the set up for the audio engine and speech recognizer. This is how it looks:
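A sketch of that set-up, using the audioEngine and request properties declared above (note: in some SDK versions inputNode is optional, in which case you would unwrap it with a guard):

```swift
func recordAndRecognizeSpeech() {
    // The input node is a singleton for the device's incoming audio
    let node = audioEngine.inputNode
    let recordingFormat = node.outputFormat(forBus: 0)
    // Feed each audio buffer from the microphone into the recognition request
    node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.request.append(buffer)
    }
    // ... continued in the next steps
}
```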

For a deeper explanation of what is happening here, check out the documentation for audioEngine. To summarize, the audio engine uses what are called nodes to process bits of audio. Here .inputNode creates a singleton for the incoming audio. As stated by Apple: “Nodes have input and output busses, which can be thought of as connection points. For example, an effect typically has one input bus and one output bus. A mixer typically has multiple input busses and one output bus.” installTap configures the node and sets up the request instance with the proper buffer on the proper bus.

7. Next, prepare and start the recording using the audio engine. The “Do-catch” statement is useful for error checking/handling, but this can be done other ways.
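Continuing inside recordAndRecognizeSpeech(), one way to sketch this with a do-catch:

```swift
audioEngine.prepare()
do {
    try audioEngine.start()
} catch {
    // Handle the error however fits your app; here we just log it
    return print(error)
}
```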

8. Then, make a few more checks to make sure the recognizer is available for the device and for the locale, since the recognizer takes the device’s locale into account to determine the language.

You also want to fully handle the potential errors occurring with alert messages or other UI.
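A sketch of those checks (in a real app, replace the comments with alert messages or other UI):

```swift
guard let myRecognizer = SFSpeechRecognizer() else {
    // A recognizer is not supported for the current locale
    return
}
if !myRecognizer.isAvailable {
    // A recognizer is not available right now (e.g. no network connection)
    return
}
```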

9. Next, call the recognitionTask method on the recognizer. This is where the recognition happens. As I mentioned before, the audio is being sent to an Apple server then comes back as a result object with attributes.

10. Assign the result to a label or variable.

For this example, we’ll assign the incoming text to the detectedTextLabel. It will display the results of the recognition task as words.

11. Use: result.bestTranscription.formattedString to format the result as a string value.

This string value will show all of the words that have been said and recognized so far. This gives the appearance that new words are being appended to the main string, without needing to do any actual appending behind the scenes.
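Putting steps 9 through 11 together, the basic recognition task might look like this (assigning the formatted string to the detectedTextLabel from our storyboard set-up):

```swift
recognitionTask = speechRecognizer?.recognitionTask(with: request, resultHandler: { result, error in
    if let result = result {
        // The full transcription of everything recognized so far
        let bestString = result.bestTranscription.formattedString
        self.detectedTextLabel.text = bestString
    } else if let error = error {
        print(error)
    }
})
```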

12. Next in the startButtonTapped IBAction, call recordAndRecognizeSpeech().
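The action itself is a one-liner:

```swift
@IBAction func startButtonTapped(_ sender: UIButton) {
    self.recordAndRecognizeSpeech()
}
```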

The basic functionality should now be working! Test your code on a real device; the simulator does not provide access to the microphone.

Speak into the microphone and you should see the text appearing in the detectedTextLabel.

Congratulations, you just used the speech recognition framework to capture your speech!

Demo video for speech recognition.

Word Recognition to Make a UI Change

Let’s take it one step further. Let’s identify a specific word and trigger a UI change to indicate when that chosen word has been uttered.

For this example, I want to change the color of the UIView I set up in the beginning to the name of a color spoken. For brevity we’ll stick to the basic colors and use UIColor’s stock colors.

We want to isolate and check only the last spoken substring in the result (not the whole string), because if we checked the entire string for color names, several colors could match at once.

In the result block we’ll need a way to identify each word separately.

13. for segment in result.bestTranscription.segments { makes a loop to analyze each new segment/result string.

14. let indexTo = bestString.index(bestString.startIndex, offsetBy: segment.substringRange.location) creates an index at the beginning of the most recently spoken word.

15. lastString = bestString.substring(from: indexTo) makes a substring from that index to the end of the result.

16. self.checkForColorsSaid(resultString: lastString) Lastly, outside the loop, call the color-checking method (shown below), passing in the last string as the argument to check.

Your complete recognitionTask method call should look like this:
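A sketch of the complete call, combining the earlier transcription with the segment loop from steps 13–16:

```swift
recognitionTask = speechRecognizer?.recognitionTask(with: request, resultHandler: { result, error in
    if let result = result {
        let bestString = result.bestTranscription.formattedString
        self.detectedTextLabel.text = bestString

        // Isolate the most recently spoken word
        var lastString: String = ""
        for segment in result.bestTranscription.segments {
            let indexTo = bestString.index(bestString.startIndex, offsetBy: segment.substringRange.location)
            lastString = bestString.substring(from: indexTo)
        }
        self.checkForColorsSaid(resultString: lastString)
    } else if let error = error {
        print(error)
    }
})
```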

And your fully completed recordAndRecognizeSpeech() method should now look like this:
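Here is a sketch of the whole method, assuming the properties declared in step 4 (and, on some SDK versions, an optional inputNode you would need to unwrap):

```swift
func recordAndRecognizeSpeech() {
    // Tap the microphone input and feed buffers to the recognition request
    let node = audioEngine.inputNode
    let recordingFormat = node.outputFormat(forBus: 0)
    node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.request.append(buffer)
    }

    // Start processing the audio stream
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        return print(error)
    }

    // Make sure a recognizer exists for the locale and is currently available
    guard let myRecognizer = SFSpeechRecognizer() else { return }
    if !myRecognizer.isAvailable { return }

    // Perform the recognition and react to each partial result
    recognitionTask = speechRecognizer?.recognitionTask(with: request, resultHandler: { result, error in
        if let result = result {
            let bestString = result.bestTranscription.formattedString
            self.detectedTextLabel.text = bestString

            var lastString: String = ""
            for segment in result.bestTranscription.segments {
                let indexTo = bestString.index(bestString.startIndex, offsetBy: segment.substringRange.location)
                lastString = bestString.substring(from: indexTo)
            }
            self.checkForColorsSaid(resultString: lastString)
        } else if let error = error {
            print(error)
        }
    })
}
```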

17. Write the implementation of checkForColorsSaid method which changes the background color of the UIView when it receives a color String.

Use a simple switch statement, like this:
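A sketch using a handful of UIColor’s stock colors (add or remove cases as you like):

```swift
func checkForColorsSaid(resultString: String) {
    switch resultString {
    case "red":
        colorView.backgroundColor = .red
    case "orange":
        colorView.backgroundColor = .orange
    case "yellow":
        colorView.backgroundColor = .yellow
    case "green":
        colorView.backgroundColor = .green
    case "blue":
        colorView.backgroundColor = .blue
    case "purple":
        colorView.backgroundColor = .purple
    case "black":
        colorView.backgroundColor = .black
    default:
        break // Not a color name we handle
    }
}
```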

The background color of the colorView will be assigned to that color.

18. Now run your app. You should be able to change the UIView’s color by saying a color name from your switch statement.

Recognizing spoken color names.

Checking for Speech Recognition Authorization

Another useful functionality is to be able to check for speech recognition authorization any time from anywhere in your app.

If, for any reason, speech recognition is not available, you want to be able to check for it at will, and keep your user from accessing functionality that won’t make sense without it. It’s useful for enabling or disabling buttons or UI that relate to your purpose for using speech recognition.

To check for authorization, SFSpeechRecognizer has a requestAuthorization method which returns an authorization status. It will also trigger the speech recognition permission alerts, if they have not already been prompted via the plist. Note: any UI changes that relate to authorization status should be run on the main queue, so that your interface doesn’t freeze up.

The switch offers an opportunity to handle buttons or other UI according to the appropriate case: whether the user has authorized, denied, restricted, or not yet determined whether to use speech recognition.

The requestAuthorization implementation.
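A sketch of that implementation, assuming the startButton and detectedTextLabel outlets from this demo (the status messages are just examples):

```swift
func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { authStatus in
        // UI updates must happen on the main queue
        OperationQueue.main.addOperation {
            switch authStatus {
            case .authorized:
                self.startButton.isEnabled = true
            case .denied:
                self.startButton.isEnabled = false
                self.detectedTextLabel.text = "User denied access to speech recognition"
            case .restricted:
                self.startButton.isEnabled = false
                self.detectedTextLabel.text = "Speech recognition restricted on this device"
            case .notDetermined:
                self.startButton.isEnabled = false
                self.detectedTextLabel.text = "Speech recognition not yet authorized"
            }
        }
    }
}
```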

It’s a good idea to handle all cases fully and not let your app become unusable or confusing if the user denies use of speech recognition.

Lastly, call the requestSpeechAuthorization method in the viewDidLoad function.

The method call for requestAuthorization.
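For example:

```swift
override func viewDidLoad() {
    super.viewDidLoad()
    self.requestSpeechAuthorization()
}
```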

Helpful SFSpeechRecognitionTaskDelegate Methods

Apple has provided some helpful delegate methods along with speech recognition. I didn’t use these in this demo, but they are worth checking out.

One in particular, speechRecognitionTaskFinishedReadingAudio(SFSpeechRecognitionTask) tells the delegate when the task is no longer accepting new audio input, even if final processing is in progress. It might be nice to add UI letting the user know when audio is no longer being accepted. And remember there is a limit of about 1 minute where audio is accepted continuously.

Check out the additional delegate methods to further customize speech recognition. Lastly, don’t forget to add UI to let users know when and why their speech is being recognized.

I hope this has helped you implement speech recognition in your apps. If so, please comment below and share this post!

Check out the complete project files here on Github!

Thank you for reading!

Resources/Additional Reading:

https://developer.apple.com/reference/speech

https://developer.apple.com/reference/speech/sfspeechrecognizer

https://developer.apple.com/reference/speech/sfspeechrecognitiontaskdelegate

https://developer.apple.com/library/prerelease/content/releasenotes/General/WhatsNewIniOS/Articles/iOS10.html#//apple_ref/doc/uid/TP40017084-DontLinkElementID_3

https://github.com/jhuerkamp/SpeechRecognizerDemo

https://code.tutsplus.com/tutorials/using-the-speech-recognition-api-in-ios-10--cms-28032

http://www.appcoda.com/siri-speech-framework/