Scene Text Recognition in iOS 11

Khurram Shehzad
11 min read · Sep 11, 2017


Note: This post originally appeared here.

With the release of iOS 11 this year, Apple shipped many new frameworks, and the Vision framework is one of them. Vision lets app developers implement computer vision tasks without detailed knowledge of the subject: facial analysis (smiles, frowns, etc.), barcode detection, scene image classification, object detection, tracking, and more.

In this post we will focus on scene text recognition. Scene text is text that appears anywhere in the environment: signs, store fronts, notice boards, and so on. The problem has two parts: scene text detection and scene text recognition. Detection is, as the name implies, finding whether any text is present in the image; recognition is working out what that text actually says. Our app should recognize text under varying conditions and should not depend on a particular font, text size, or color.

Project Setup

For this tutorial, we need Xcode 9.0 and an iPhone or iPad running iOS 11. As of writing, both Xcode 9.0 and iOS 11 are in beta.

Launch Xcode 9.0 and click Create a new project, select the iOS tab and choose Single View App. Hit Next, enter a product name (I have used SceneTextRecognitioniOS), select Swift as the language, and save the project in some directory.

Hit Command + R to run the app; it will compile and you will see a white screen.

Create a new Swift file, name it Preview.swift, and add the code below:

import UIKit
import AVFoundation

class PreviewView: UIView {

    var videoPreviewLayer: AVCaptureVideoPreviewLayer {
        guard let layer = layer as? AVCaptureVideoPreviewLayer else {
            fatalError("Expected `AVCaptureVideoPreviewLayer` type for layer. Check PreviewView.layerClass implementation.")
        }
        return layer
    }

    var session: AVCaptureSession? {
        get {
            return videoPreviewLayer.session
        }
        set {
            videoPreviewLayer.session = newValue
        }
    }

    // MARK: UIView

    override class var layerClass: AnyClass {
        return AVCaptureVideoPreviewLayer.self
    }
}

Note that this view uses AVCaptureVideoPreviewLayer as its backing layer.

Open Main.storyboard, click on the View Controller Scene, expand it, and select the View of the ViewController. In the utilities pane, open the Identity inspector and set the Class field to PreviewView (you should see auto-completion if everything is set up correctly).

Open the Info.plist file and add the Privacy - Camera Usage Description key with the value Recognize Scene Text. Also add the Vision framework to your project.

Below is the complete ViewController.swift file.

import AVFoundation
import UIKit
import Vision

class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()
        // Do any additional setup after loading the view, typically from a nib.
        if isAuthorized() {
            configureTextDetection()
            configureCamera()
        }
    }

    override func didReceiveMemoryWarning() {
        super.didReceiveMemoryWarning()
        // Dispose of any resources that can be recreated.
    }

    private func configureTextDetection() {
        textDetectionRequest = VNDetectTextRectanglesRequest(completionHandler: handleDetection)
        textDetectionRequest!.reportCharacterBoxes = true
    }

    private func configureCamera() {
        // Attach the capture session to the preview layer and pick the back camera.
        preview.session = session

        let cameraDevices = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: AVMediaType.video, position: .back)
        var cameraDevice: AVCaptureDevice?
        for device in cameraDevices.devices {
            if device.position == .back {
                cameraDevice = device
                break
            }
        }

        do {
            let captureDeviceInput = try AVCaptureDeviceInput(device: cameraDevice!)
            if session.canAddInput(captureDeviceInput) {
                session.addInput(captureDeviceInput)
            }
        }
        catch {
            print("Error occurred \(error)")
            return
        }

        session.sessionPreset = .high
        let videoDataOutput = AVCaptureVideoDataOutput()
        videoDataOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "Buffer Queue", qos: .userInteractive, attributes: .concurrent, autoreleaseFrequency: .inherit, target: nil))
        if session.canAddOutput(videoDataOutput) {
            session.addOutput(videoDataOutput)
        }

        preview.videoPreviewLayer.videoGravity = .resize
        session.startRunning()
    }

    private func handleDetection(request: VNRequest, error: Error?) {

        guard let detectionResults = request.results else {
            print("No detection results")
            return
        }
        let textResults = detectionResults.map() {
            return $0 as? VNTextObservation
        }
        if textResults.isEmpty {
            return
        }
        DispatchQueue.main.async {
            // Remove previously drawn boxes, then draw a red box around each detected text region.
            self.view.layer.sublayers?.removeSubrange(1...)
            let viewWidth = self.view.frame.size.width
            let viewHeight = self.view.frame.size.height
            for textResult in textResults {
                guard let rects = textResult?.characterBoxes else {
                    return
                }
                var xMin = CGFloat.greatestFiniteMagnitude
                var xMax: CGFloat = 0
                var yMin = CGFloat.greatestFiniteMagnitude
                var yMax: CGFloat = 0
                for rect in rects {
                    xMin = min(xMin, rect.bottomLeft.x)
                    xMax = max(xMax, rect.bottomRight.x)
                    yMin = min(yMin, rect.bottomRight.y)
                    yMax = max(yMax, rect.topRight.y)
                }
                let x = xMin * viewWidth
                let y = (1 - yMax) * viewHeight
                let width = (xMax - xMin) * viewWidth
                let height = (yMax - yMin) * viewHeight

                let layer = CALayer()
                layer.frame = CGRect(x: x, y: y, width: width, height: height)
                layer.borderWidth = 2
                layer.borderColor = UIColor.red.cgColor
                self.view.layer.addSublayer(layer)
            }
        }
    }

    private var preview: PreviewView {
        return view as! PreviewView
    }

    private func isAuthorized() -> Bool {
        let authorizationStatus = AVCaptureDevice.authorizationStatus(for: AVMediaType.video)
        switch authorizationStatus {
        case .notDetermined:
            AVCaptureDevice.requestAccess(for: AVMediaType.video,
                                          completionHandler: { (granted: Bool) -> Void in
                if granted {
                    DispatchQueue.main.async {
                        self.configureCamera()
                        self.configureTextDetection()
                    }
                }
            })
            return true
        case .authorized:
            return true
        case .denied, .restricted:
            return false
        }
    }

    private var textDetectionRequest: VNDetectTextRectanglesRequest?
    private let session = AVCaptureSession()
}

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    // MARK: - Camera Delegate and Setup

    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            return
        }
        var imageRequestOptions = [VNImageOption: Any]()
        if let cameraData = CMGetAttachment(sampleBuffer, kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, nil) {
            imageRequestOptions[.cameraIntrinsics] = cameraData
        }
        let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: CGImagePropertyOrientation(rawValue: 6)!, options: imageRequestOptions)
        do {
            try imageRequestHandler.perform([textDetectionRequest!])
        }
        catch {
            print("Error occurred \(error)")
        }
    }
}

In the configureTextDetection method we create a text detection request. In configureCamera we assign the current session to our preview layer, iterate through the available cameras, pick the back-facing one that can capture video, and add it to the session as an input. Next we create a video data output, set its delegate and dispatch queue, and add it to the session as well. Finally, we start the session.
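
As an aside, if you only need the default back wide-angle camera, AVCaptureDevice.default(_:for:position:) (available since iOS 10) is a shorter alternative to iterating a discovery session. A minimal sketch; backCamera() is just an illustrative helper name, not part of the original project:

import AVFoundation

// A shorter way to obtain the back wide-angle camera (iOS 10+).
// Returns nil instead of force-unwrapping if no such device exists.
func backCamera() -> AVCaptureDevice? {
    return AVCaptureDevice.default(.builtInWideAngleCamera,
                                   for: .video,
                                   position: .back)
}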

Another important method is handleDetection. Here we deal with the detection results coming from the Vision framework: we extract the detected text regions and draw a bounding box around each of them on screen.
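
Vision reports bounding boxes in normalized coordinates with the origin at the bottom-left, while UIKit's origin is top-left, which is why the code flips the y axis with (1 - yMax). Below is a hedged sketch of that conversion pulled out into a standalone helper (viewRect(forNormalizedRect:in:) is a name introduced here for illustration; it assumes the preview fills the whole view):

import UIKit

// Converts a normalized, bottom-left-origin rect (as Vision reports it)
// into a UIKit rect in points with a top-left origin.
func viewRect(forNormalizedRect normalizedRect: CGRect, in viewSize: CGSize) -> CGRect {
    let x = normalizedRect.origin.x * viewSize.width
    let width = normalizedRect.size.width * viewSize.width
    let height = normalizedRect.size.height * viewSize.height
    // Flip the y axis: Vision's origin is bottom-left, UIKit's is top-left.
    let y = (1 - normalizedRect.origin.y) * viewSize.height - height
    return CGRect(x: x, y: y, width: width, height: height)
}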

captureOutput(_:didOutput:from:) is the method where we receive the camera stream and forward each frame to our text detection request.
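
A small readability note: the handler is created with CGImagePropertyOrientation(rawValue: 6)!, and raw value 6 corresponds to the EXIF "right" orientation, so the named case can be used instead of a magic number. A sketch under that assumption (makeTextDetectionHandler is only an illustrative wrapper, and .right matches the portrait, back-camera setup used in this post):

import Vision
import ImageIO
import CoreVideo

// Builds the request handler with a named orientation instead of rawValue 6.
func makeTextDetectionHandler(for pixelBuffer: CVPixelBuffer,
                              options: [VNImageOption: Any]) -> VNImageRequestHandler {
    return VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                 orientation: .right, // same as rawValue 6
                                 options: options)
}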

Now run this code on your iOS device, point the camera at any nearby text, and you will see rectangles drawn around the detected text.

Text Recognition

Now let's head over to text recognition. For this we will use Tesseract, an open source and widely used OCR (Optical Character Recognition) engine found in many research and industrial products. For iOS we will use Tesseract OCR iOS. Download the zip file from here (taken from the raywenderlich.com Tesseract OCR Tutorial, which can be found here). Extract the zip; it contains TesseractOCR.framework and a tessdata folder. Drag and drop TesseractOCR.framework from Finder into your Xcode project. In Xcode, click on the project file, select your app target, open Build Phases, expand Link Binary With Libraries, click + and add the TesseractOCR.framework we just added to the project. Also add CoreImage.framework (part of the iOS SDK), plus libstdc++.6.0.9.dylib, or libstdc++.6.0.9.tbd if the .dylib is not available.

Click Build Settings, find Other Linker Flags and append -lstdc++.

Find Enable Bitcode and set its value to No. (Bitcode is part of Apple's app thinning process: it generates an intermediate representation of the compiled code, and if an app is uploaded to the App Store with bitcode enabled, Apple can re-optimize it for the target device at install time. More info on bitcode here.)

Tesseract needs some additional files in order to run properly. The zip contains a folder named tessdata, which must also be included in your project. Unfortunately this cannot be done in a straightforward way in Xcode 9, because folder references are not handled automatically. Below are the steps needed to make tessdata available in the project (these are taken from GitHub); a quick runtime sanity check follows the list.

  • Create a tessdata folder inside the project folder (open the project folder in Finder and create the tessdata folder at the same level as the AppDelegate.swift file). Do not add this folder to the Xcode project.
  • Add your Tesseract trained data files to the tessdata folder.
  • Right-click on YourProject.xcodeproj and click Show Package Contents.
  • Double-click on project.pbxproj (open it with Xcode).
  • Add the following into the /* Begin PBXBuildFile section */:

64E9ADE81F2A2436007BAC6F /* tessdata in Resources */ = {isa = PBXBuildFile; fileRef = 64E9ADE71F2A2436007BAC6F /* tessdata */; };

  • Add the following into the /* Begin PBXFileReference section */:

64E9ADE71F2A2436007BAC6F /* tessdata */ = {isa = PBXFileReference; lastKnownFileType = folder; path = tessdata; sourceTree = "<group>"; };

  • Add the following into the /* Begin PBXGroup section */, at the same level as AppDelegate.swift:

64E9ADE71F2A2436007BAC6F /* tessdata */,

  • Add the following into the /* Begin PBXResourcesBuildPhase section */, at the same level as Main.storyboard in Resources:

64E9ADE81F2A2436007BAC6F /* tessdata in Resources */,

  • Save the file
  • Open the project in Xcode 9
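
After these manual project-file edits, it is worth verifying at runtime that the folder really ended up in the app bundle as a folder reference. A minimal sketch, where tessdataIsBundled() is just an illustrative helper and "eng.traineddata" is assumed to be the trained data file you added:

import Foundation

// Quick sanity check that the tessdata folder reference was copied into
// the app bundle and contains the English trained data file.
func tessdataIsBundled() -> Bool {
    return Bundle.main.path(forResource: "eng",
                            ofType: "traineddata",
                            inDirectory: "tessdata") != nil
}

// Example usage, e.g. in viewDidLoad:
// assert(tessdataIsBundled(), "tessdata folder is missing from the app bundle")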

Since Tesseract for iOS is written in Objective-C, we need an Objective-C bridging header to use it from our Swift app. The easiest way to add a bridging header with all the associated project settings is to add an Objective-C file to the project. Create a new file, name it Dummy, and remember to set the language to Objective-C. A prompt will appear asking "Would you like to configure an Objective-C bridging header?"; select Yes.

Add below line to bridging header:

#import <TesseractOCR/TesseractOCR.h>

Add the following two instance variables to the ViewController class:

private var textObservations = [VNTextObservation]()
private var tesseract = G8Tesseract(language: "eng", engineMode: .tesseractOnly)

In the first line we create a VNTextObservation array to cache the results from detection. Next we create a Tesseract object, which will be used for text recognition. In the handleDetection method, add textObservations = textResults as! [VNTextObservation] just above the DispatchQueue.main.async line. Below is the updated handleDetection method:

private func handleDetection(request: VNRequest, error: Error?) {

    guard let detectionResults = request.results else {
        print("No detection results")
        return
    }
    let textResults = detectionResults.map() {
        return $0 as? VNTextObservation
    }
    if textResults.isEmpty {
        return
    }
    textObservations = textResults as! [VNTextObservation]
    DispatchQueue.main.async {

        guard let sublayers = self.view.layer.sublayers else {
            return
        }
        for layer in sublayers[1...] {
            if (layer as? CATextLayer) == nil {
                layer.removeFromSuperlayer()
            }
        }
        let viewWidth = self.view.frame.size.width
        let viewHeight = self.view.frame.size.height
        for result in textResults {
            if let textResult = result {

                let layer = CALayer()
                var rect = textResult.boundingBox
                rect.origin.x *= viewWidth
                rect.size.height *= viewHeight
                rect.origin.y = ((1 - rect.origin.y) * viewHeight) - rect.size.height
                rect.size.width *= viewWidth
                layer.frame = rect
                layer.borderWidth = 2
                layer.borderColor = UIColor.red.cgColor
                self.view.layer.addSublayer(layer)
            }
        }
    }
}

One thing to note in the method above: previously we removed all sublayers, whereas now we only remove the layers that are not CATextLayer instances. The reason is that we will be adding CATextLayer objects once we get results from text recognition.

Now let's get to the actual work of recognition. Open the captureOutput method; at the end of it we will perform recognition with Tesseract.

The idea is that we crop the image at each position where text was detected, hand the cropped image to Tesseract for recognition, and show the result we get back on screen. Note that a single frame may contain many regions where text is detected.
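
One implementation detail worth noting before the full listing: the version below creates a new CIContext for every detected region, and a CIContext is relatively expensive to build. A hedged sketch of a variant that creates the context once and reuses it (RegionCropper is just an illustrative name; in the real project the context could simply live as an instance property on the view controller):

import CoreImage

// Reuses a single CIContext instead of allocating one per region per frame.
final class RegionCropper {
    private let context = CIContext(options: nil)

    // Crops the given rect (in image pixels) out of a CIImage.
    func crop(_ image: CIImage, to rect: CGRect) -> CGImage? {
        return context.createCGImage(image, from: rect)
    }
}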

Below is the complete implementation of captureOutput method:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
        return
    }
    var imageRequestOptions = [VNImageOption: Any]()
    if let cameraData = CMGetAttachment(sampleBuffer, kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix, nil) {
        imageRequestOptions[.cameraIntrinsics] = cameraData
    }
    let imageRequestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: CGImagePropertyOrientation(rawValue: 6)!, options: imageRequestOptions)
    do {
        try imageRequestHandler.perform([textDetectionRequest!])
    }
    catch {
        print("Error occurred \(error)")
    }

    var ciImage = CIImage(cvPixelBuffer: pixelBuffer)
    let transform = ciImage.orientationTransform(for: CGImagePropertyOrientation(rawValue: 6)!)
    ciImage = ciImage.transformed(by: transform)
    let size = ciImage.extent.size

    var recognizedTextPositionTuples = [(rect: CGRect, text: String)]()
    for textObservation in textObservations {
        guard let rects = textObservation.characterBoxes else {
            continue
        }
        var xMin = CGFloat.greatestFiniteMagnitude
        var xMax: CGFloat = 0
        var yMin = CGFloat.greatestFiniteMagnitude
        var yMax: CGFloat = 0
        for rect in rects {
            xMin = min(xMin, rect.bottomLeft.x)
            xMax = max(xMax, rect.bottomRight.x)
            yMin = min(yMin, rect.bottomRight.y)
            yMax = max(yMax, rect.topRight.y)
        }
        let imageRect = CGRect(x: xMin * size.width, y: yMin * size.height, width: (xMax - xMin) * size.width, height: (yMax - yMin) * size.height)
        let context = CIContext(options: nil)
        guard let cgImage = context.createCGImage(ciImage, from: imageRect) else {
            continue
        }
        let uiImage = UIImage(cgImage: cgImage)
        tesseract?.image = uiImage
        tesseract?.recognize()
        guard var text = tesseract?.recognizedText else {
            continue
        }
        text = text.trimmingCharacters(in: CharacterSet.newlines)
        if !text.isEmpty {
            let x = xMin
            let y = 1 - yMax
            let width = xMax - xMin
            let height = yMax - yMin
            recognizedTextPositionTuples.append((rect: CGRect(x: x, y: y, width: width, height: height), text: text))
        }
    }
    textObservations.removeAll()

    DispatchQueue.main.async {
        let viewWidth = self.view.frame.size.width
        let viewHeight = self.view.frame.size.height
        guard let sublayers = self.view.layer.sublayers else {
            return
        }
        for layer in sublayers[1...] {
            if let _ = layer as? CATextLayer {
                layer.removeFromSuperlayer()
            }
        }
        for tuple in recognizedTextPositionTuples {
            let textLayer = CATextLayer()
            textLayer.backgroundColor = UIColor.clear.cgColor
            var rect = tuple.rect
            rect.origin.x *= viewWidth
            rect.size.width *= viewWidth
            rect.origin.y *= viewHeight
            rect.size.height *= viewHeight
            textLayer.frame = rect
            textLayer.string = tuple.text
            textLayer.foregroundColor = UIColor.green.cgColor
            self.view.layer.addSublayer(textLayer)
        }
    }
}

The call try imageRequestHandler.perform([textDetectionRequest!]) is synchronous: it performs the text detection and invokes our handleDetection completion handler, which stores the detection observations in the textObservations instance variable. We then create a CIImage from the current pixel buffer, compute the appropriate orientation transform, and apply it. Next, we iterate through textObservations; for each observation we compute the bounding rectangle of its character boxes and crop that region out of the full image. The cropped image is given to Tesseract via its recognize method. Tesseract sometimes appends newline characters to the recognized text, so we strip those; if any text remains, we store its normalized position together with the recognized string in a rect/text tuple array. We then empty textObservations so stale observations are not reused if no text is detected in the next frame. Finally, on the main queue we remove the existing text layers and add new CATextLayer objects for the newly recognized text.
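
If recognition quality or speed is not what you expect, the G8Tesseract object exposes a few settings worth experimenting with. The property names below come from the TesseractOCRiOS (G8Tesseract) headers and may vary between framework versions, so treat this as a sketch rather than a guaranteed API; configureTesseract is just an illustrative helper:

// Optional tuning of the Tesseract object created earlier.
func configureTesseract(_ tesseract: G8Tesseract) {
    // Treat each cropped region as a single block of text rather than a full page.
    tesseract.pageSegmentationMode = .singleBlock
    // Restrict output to characters you expect, if you know them in advance.
    tesseract.charWhitelist = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    // Give up on a frame instead of blocking the capture queue for too long.
    tesseract.maximumRecognitionTime = 1.0
}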

The whole running project can be downloaded from here. Enjoy text recognition in iOS 11, and let me know if you have any questions.

Let's connect on LinkedIn and Twitter.
