Click and listen!

Listen to text from images using Vision and AVFoundation frameworks in iOS.

Swamita Gupta

Published in CodeChef-VIT

4 min read · Mar 24, 2021

Listening often helps us understand faster than reading.

“Listening is a magnetic and strange thing, a creative force.” - Karl A. Menninger

We have all had moments when we are tired of straining our eyes and wish someone would read out some text, a book, or even our notes for us. Doing this in software involves identifying the letters and digits in a given image, stringing them together into words and sentences, and then converting the resulting text to speech. Implementing all of this from scratch can become quite tiresome.

Thankfully, this can be easily implemented using the Vision and AVFoundation frameworks that Apple provides. Let us explore how it works!

Here, we will build a simple app, ‘Read Aloud’, step by step. It reads aloud the text in a picture that you have just clicked.

Let us first set up the interface. In Main.storyboard, I have added a UIImageView to display the image, a UITextView to display the detected text, and two buttons: one to click a picture and one to read the text aloud. I have also added a label to make the app more informative.

You can drag these components into the view controller scene from the Object Library and give them appropriate constraints.

Let us now link these components in ViewController.swift, import the Vision and AVFoundation frameworks, and adopt the respective delegate protocols as shown below.
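A minimal sketch of what this view controller might look like, based on the walkthrough that follows. The outlet, action and property names match the ones used in this article, but the sender types and the exact layout are assumptions, and the code in the repository may differ.

```swift
import UIKit
import Vision        // text recognition (VNRecognizeTextRequest)
import AVFoundation  // text to speech (AVSpeechSynthesizer)

class ViewController: UIViewController,
                      AVSpeechSynthesizerDelegate,
                      UIImagePickerControllerDelegate,
                      UINavigationControllerDelegate {

    // Connected to the components in the storyboard scene.
    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var textView: UITextView!

    let imagePicker = UIImagePickerController()
    let synthesizer = AVSpeechSynthesizer()
    // Placeholder request; it is configured when text is recognised.
    var request = VNRecognizeTextRequest(completionHandler: nil)

    override func viewDidLoad() {
        super.viewDidLoad()
        imagePicker.delegate = self
        // .camera is only available on a physical device.
        imagePicker.sourceType = .camera
        synthesizer.delegate = self
    }

    // Assumption: both actions are wired to UIButtons.
    @IBAction func cameraTapped(_ sender: UIButton) {
        present(imagePicker, animated: true)
    }

    @IBAction func speakerTapped(_ sender: UIButton) {
        let utterance = AVSpeechUtterance(string: textView.text ?? "")
        utterance.rate = 0.5  // on a scale from 0.0 (slowest) to 1.0 (fastest)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        synthesizer.speak(utterance)
    }
}
```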

Feel intimidated by the numerous lines of code written above? Let us walk through them now.

First, we have imported the modules Vision (required to detect text in our image) and AVFoundation (to convert text to speech).
We have then linked our imageView and textView to IBOutlets, and both buttons to IBAction functions.
We have made the class conform to AVSpeechSynthesizerDelegate, UIImagePickerControllerDelegate and UINavigationControllerDelegate so that it can receive their callbacks, and declared imagePicker, synthesizer and request as shown above.
We have set the delegate of imagePicker and synthesizer to self (i.e., the current class) in the viewDidLoad method.
We have also specified the source type of imagePicker as .camera, as we will be clicking pictures of real books and texts using the camera for our app. You can set it to .photoLibrary instead if you want to import images from your gallery.

Note: If you choose .camera as the source type of imagePicker, you can only run your app on a physical device such as an iPhone, as simulators do not provide a camera.
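To keep the app usable on a simulator as well, one option is to check whether a camera is available before choosing the source type. A minimal sketch, where preferredSourceType is a hypothetical helper name that is not part of the original project:

```swift
import UIKit

/// Hypothetical helper: returns .camera when the device has one,
/// and falls back to the photo library otherwise (e.g. on a simulator).
func preferredSourceType() -> UIImagePickerController.SourceType {
    if UIImagePickerController.isSourceTypeAvailable(.camera) {
        return .camera
    }
    return .photoLibrary
}
```

In viewDidLoad, this could be used as imagePicker.sourceType = preferredSourceType(). Also keep in mind that using the camera on a device requires an NSCameraUsageDescription entry in Info.plist.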

In our function cameraTapped, we present the imagePicker so that the user can click an image.

In our function speakerTapped, we create an utterance (an AVSpeechUtterance) from the text displayed in textView, set its rate to 0.5 and its voice to English (US), and ask the synthesizer to speak it, which reads the text aloud.

Now that we have coded the backbone of our app, let us implement its core functionality: detecting text in images.

We will now write the imagePickerController(_:didFinishPickingMediaWithInfo:) delegate method, which imagePicker calls once a picture has been clicked.
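A minimal sketch of this delegate method, shown inside a trimmed-down view controller so that it stands on its own. Dismissing the picker and reading the .originalImage key are assumptions about how the original code handles the result, and recognizeText is left as a stub here.

```swift
import UIKit

// Trimmed to the parts relevant to handling the picked image.
class ViewController: UIViewController,
                      UIImagePickerControllerDelegate,
                      UINavigationControllerDelegate {

    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var textView: UITextView!

    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        // Assumption: close the camera UI as soon as a picture is available.
        picker.dismiss(animated: true)

        // Fetch the clicked image as a UIImage.
        guard let image = info[.originalImage] as? UIImage else { return }

        imageView.image = image       // display the image
        textView.text = ""            // clear any pre-existing text
        recognizeText(image: image)   // hand the image over to Vision
    }

    func recognizeText(image: UIImage) {
        // Stub: the Vision request is covered in the next step.
    }
}
```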

Handling the image obtained from imagePicker

Here, we:

Fetch the clicked image as a UIImage.
Display the image in our imageView.
Clear out the pre-existing text in our textView.
Call a method to recognise text from the image.

Let us now define the function recognizeText, which takes the clicked image as an argument and detects the text in it.
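A minimal sketch of such a function, following the steps described below. To keep it self-contained it is written as a free function with a hypothetical completion parameter; in the app it would more likely be a method of the view controller that sets textView.text directly. The customWords parameter and the choice to join lines with a newline are also assumptions.

```swift
import UIKit
import Vision

/// Recognises text in `image` and passes the result to `completion` on the main thread.
func recognizeText(image: UIImage,
                   customWords: [String] = [],
                   completion: @escaping (String) -> Void) {
    // Vision works on a CGImage. Note that this drops the UIImage's orientation.
    guard let cgImage = image.cgImage else {
        DispatchQueue.main.async { completion("") }
        return
    }

    let request = VNRecognizeTextRequest { finishedRequest, _ in
        let observations = finishedRequest.results as? [VNRecognizedTextObservation] ?? []

        var imageText = ""
        for observation in observations {
            // Keep only the highest-confidence candidate of each observation.
            guard let best = observation.topCandidates(1).first else { continue }
            imageText += best.string + "\n"
        }

        // UI updates must happen on the main thread.
        DispatchQueue.main.async { completion(imageText) }
    }
    request.recognitionLevel = .accurate
    request.recognitionLanguages = ["en-US"]
    request.usesLanguageCorrection = true
    request.customWords = customWords

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])

    // Text recognition can be slow, so run it off the main thread.
    DispatchQueue.global(qos: .userInitiated).async {
        do {
            try handler.perform([request])
        } catch {
            // Sketch-level error handling: just log the failure.
            print("Text recognition failed: \(error)")
        }
    }
}
```

From the view controller, this could be called as recognizeText(image: image) { [weak self] text in self?.textView.text = text }.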

Function to recognise text from the given image

VNRecognizeTextRequest from Apple’s Vision framework detects the text instances in an image and returns them as observations.

Each observation offers several candidate strings ranked by confidence. We take the top candidate of each observation and append it to the string imageText.

We then set the text shown in the textView to our fetched imageText on the main thread, since UI updates must happen there.

For better results, we have also set recognitionLevel to .accurate, set the recognition language to English, and enabled language correction and the detection of new, custom words.

Finally, on a global (background) queue, we perform the request on our image through an image request handler, which produces the observations.

When the recognizeText function is called, it updates the textView with the text visible in the image. Then, when we tap the speaker button, the app reads that text aloud, making reading books much easier for us!

Voilà! The app is done. Run it now and listen to any text you want with just one click.

Check out the GitHub repo for source code!
