How to create text recognition with Vision Framework in Swift

Mobile@Exxeta
Dec 21, 2022


For our patient management app at Exxeta, we brainstormed new feature ideas. One of them was to add an ID card scanner to the patient creation process. As a first step, we focused on the German ID card, but the approach should be easy to extend. Apple recently shipped some cool features, like text recognition in the photo gallery, and after a little research we figured out that the Vision framework had everything we needed.


Apple Documentation about the Vision framework:

The Vision framework performs face and face landmark detection, text detection, barcode recognition, image registration, and general feature tracking. Vision also allows the use of custom Core ML models for tasks like classification or object detection.

We implemented our text detection with the Vision framework, but it is not the only framework we need. We also used AVFoundation, because we wanted to show the user a live camera stream while scanning and we needed a photo output to recognize the text from. Let’s get started with the implementation.

Set up the capture session

import UIKit
import AVFoundation
import Vision

/// Camera view displays the UI for taking front and back photos of the ID scan
final class IDScanCameraView: UIView {

    private var captureSession: AVCaptureSession?
    private var previewLayer: AVCaptureVideoPreviewLayer?
    private var photoOutput = AVCapturePhotoOutput()
    private var photoSettings: AVCapturePhotoSettings?

}

In the first step, we declare our IDScanCameraView, import AVFoundation, and define the properties needed for a capture session.

/// Setup the captureSession input which is the camera feed
private func setupInput() {
    var backCamera: AVCaptureDevice?

    /// Get back camera
    if let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back) {
        backCamera = device
    } else {
        fatalError("Back camera could not be found")
    }

    /// Enable continuous auto focus
    do {
        try backCamera?.lockForConfiguration()
        backCamera?.focusMode = .continuousAutoFocus
        backCamera?.unlockForConfiguration()
    } catch {
        fatalError("Camera lockConfiguration failed")
    }

    /// Create input from our device
    guard let backCamera = backCamera, let backCameraInput = try? AVCaptureDeviceInput(device: backCamera) else {
        fatalError("Could not create device input from back camera")
    }

    if let captureSession = captureSession, !captureSession.canAddInput(backCameraInput) {
        fatalError("could not add back camera input to capture session")
    }

    captureSession?.addInput(backCameraInput)
}

In this function, we access the back camera of the phone. We decided against tap-to-focus on an area of the input stream and enabled continuous autofocus instead. Finally, we add the camera as the capture session’s input.

/// Setup the captureSession output which is responsible for the generated pictures
private func setupOutput() {
    /// Use HEVC as codec if available to save file space and maintain quality
    if self.photoOutput.availablePhotoCodecTypes.contains(.hevc) {
        photoSettings = AVCapturePhotoSettings(format: [AVVideoCodecKey: AVVideoCodecType.hevc])
    } else {
        photoSettings = AVCapturePhotoSettings()
    }

    if let captureSession = captureSession, captureSession.canAddOutput(photoOutput) {
        captureSession.addOutput(photoOutput)
    }
}

Here we are configuring our photo settings and setting the AVCapturePhotoOutput as the capture session output.
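What we do not show here is how a capture is actually triggered. Assuming the view conforms to AVCapturePhotoCaptureDelegate, a minimal sketch could look like this (the names are only illustrative):

/// Sketch: trigger a capture with the configured settings
func capturePhoto() {
    /// AVCapturePhotoSettings instances must not be reused, so create a copy per capture
    let settings = photoSettings.map { AVCapturePhotoSettings(from: $0) } ?? AVCapturePhotoSettings()
    photoOutput.capturePhoto(with: settings, delegate: self)
}

/// Sketch: AVCapturePhotoCaptureDelegate callback that hands the photo to the Vision pipeline shown later
func photoOutput(_ output: AVCapturePhotoOutput, didFinishProcessingPhoto photo: AVCapturePhoto, error: Error?) {
    guard error == nil,
          let data = photo.fileDataRepresentation(),
          let ciImage = CIImage(data: data) else { return }
    cropDocumentOut(from: ciImage)
}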

/// Setup and start the captureSession
func setupCaptureSession() {
    captureSession = AVCaptureSession()
    captureSession?.beginConfiguration()

    if let captureSession = captureSession, captureSession.canSetSessionPreset(.photo) {
        captureSession.sessionPreset = .photo
    }

    setupInput()
    setupOutput()
    setupPreviewLayer()

    captureSession?.commitConfiguration()

    /// Start of the capture session must be executed in the background thread
    /// by our extension function so the UI is not blocked in the main thread
    DispatchQueue.background(background: { [weak self] in
        self?.captureSession?.startRunning()
    })
}

This is our main function for setting up the capture session. It calls the functions we created earlier, applies the remaining configuration, and then starts the flow of data from the capture session’s inputs to its outputs.
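Two helpers used above are not shown in this article: setupPreviewLayer() and the DispatchQueue.background extension. A minimal sketch of what they could look like (the exact implementation may differ):

/// Sketch: attach an AVCaptureVideoPreviewLayer to the view so the camera feed is visible
private func setupPreviewLayer() {
    guard let captureSession = captureSession else { return }
    let layer = AVCaptureVideoPreviewLayer(session: captureSession)
    layer.videoGravity = .resizeAspectFill
    layer.frame = bounds
    self.layer.addSublayer(layer)
    previewLayer = layer
}

/// Sketch: a small DispatchQueue helper that runs work on a background queue
extension DispatchQueue {
    static func background(background: @escaping () -> Void) {
        DispatchQueue.global(qos: .userInitiated).async {
            background()
        }
    }
}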

Vision Framework Setup

Before diving into the text detection of our ID card, we had to think about the most effective strategy. We decided to detect only the text on the back of the ID card, since every piece of information we need is printed there, including the first name, last name, birthday, address and country.

In the next steps, we want to use the Vision framework for detecting the document (our ID card) before we start the text detection. We also want to correct the perspective so we always have the perfect angle to detect the text from the ID card.

func perspectiveCorrectedImage(from inputImage: CIImage, rectangleObservation: VNRectangleObservation) -> CIImage? {
    let imageSize = inputImage.extent.size

    /// Verify detected rectangle is valid
    let boundingBox = rectangleObservation.boundingBox.scaled(to: imageSize)
    guard inputImage.extent.contains(boundingBox)
    else { print("invalid detected rectangle"); return nil }

    /// Rectify the detected rectangle: scale the corner points to the image size
    /// and apply a perspective correction filter
    let topLeft = rectangleObservation.topLeft.scaled(to: imageSize)
    let topRight = rectangleObservation.topRight.scaled(to: imageSize)
    let bottomLeft = rectangleObservation.bottomLeft.scaled(to: imageSize)
    let bottomRight = rectangleObservation.bottomRight.scaled(to: imageSize)
    let correctedImage = inputImage
        .cropped(to: boundingBox)
        .applyingFilter("CIPerspectiveCorrection", parameters: [
            "inputTopLeft": CIVector(cgPoint: topLeft),
            "inputTopRight": CIVector(cgPoint: topRight),
            "inputBottomLeft": CIVector(cgPoint: bottomLeft),
            "inputBottomRight": CIVector(cgPoint: bottomRight)
        ])
    return correctedImage
}

Since the shooting angle will not always be ideal and the Vision framework can otherwise struggle to recognize the text, we assist the user in achieving good results instead of forcing them to take perfect pictures.
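The scaled(to:) calls above are small helper extensions that are not shown here. Vision delivers normalized coordinates between 0 and 1, so they have to be mapped to the pixel size of the image. A minimal sketch could look like this:

extension CGRect {
    /// Sketch: scale a normalized rect (0...1) to the pixel size of the image
    func scaled(to size: CGSize) -> CGRect {
        CGRect(x: origin.x * size.width,
               y: origin.y * size.height,
               width: width * size.width,
               height: height * size.height)
    }
}

extension CGPoint {
    /// Sketch: scale a normalized point (0...1) to the pixel size of the image
    func scaled(to size: CGSize) -> CGPoint {
        CGPoint(x: x * size.width, y: y * size.height)
    }
}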

/// Detects the document in the image and cuts out the background
func cropDocumentOut(from image: CIImage) {
    let requestHandler = VNImageRequestHandler(ciImage: image)
    let documentDetectionRequest = VNDetectDocumentSegmentationRequest()

    do {
        try requestHandler.perform([documentDetectionRequest])
    } catch {
        fatalError("Error while performing documentDetectionRequest")
    }

    guard let document = documentDetectionRequest.results?.first,
          let documentImage = perspectiveCorrectedImage(from: image, rectangleObservation: document)?.convertToCGImage() else {
        fatalError("Unable to get document image")
    }

    /// Save our captured photo of the id
    idBackImage = UIImage(cgImage: documentImage)
}

Here we obtain our final image: the Vision framework detects the document in the picture, and our perspectiveCorrectedImage() function corrects its angle.
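convertToCGImage() is another small helper that is not shown here; a minimal sketch that simply renders the CIImage through a CIContext could look like this:

extension CIImage {
    /// Sketch: render the CIImage into a CGImage using a CIContext
    func convertToCGImage() -> CGImage? {
        CIContext().createCGImage(self, from: extent)
    }
}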

extension UIImage {

    /// Returns recognized text from image in the region of interest
    /// For the machineReadableZone it is important to set the VNRequestTextRecognitionLevel
    /// to .fast because otherwise it will try to correct the found string and this can lead to wrong results
    func getRecognizedText(for scanItem: ScanItem,
                           with imageSize: CGSize,
                           recognitionLevel: VNRequestTextRecognitionLevel,
                           minimumTextHeight: Float = 0.03125) -> [String] {
        var recognizedTexts = [String]()

        guard let imageCGImage = self.cgImage else { return recognizedTexts }
        let requestHandler = VNImageRequestHandler(cgImage: imageCGImage, options: [:])

        let request = VNRecognizeTextRequest { (request, error) in
            guard let observations = request.results as? [VNRecognizedTextObservation] else { return }

            for currentObservation in observations {
                /// The 1 in topCandidates(1) indicates that we only want one candidate.
                /// After that we take our one and only candidate with the most confidence out of the array.
                let topCandidate = currentObservation.topCandidates(1).first

                if let scannedText = topCandidate {
                    let convertedRegionOfInterest = scanItem.boundingBox.getFrame(by: imageSize, subtractY: false)
                    let convertedObservationBoundingBox = currentObservation.boundingBox.getFrame(by: imageSize, subtractY: false)

                    if convertedRegionOfInterest.intersects(convertedObservationBoundingBox) {
                        recognizedTexts.append(scannedText.string)
                    }
                }
            }
        }

        request.recognitionLevel = recognitionLevel
        request.minimumTextHeight = minimumTextHeight

        /// Turn off language correction because otherwise this could lead to wrong results in the machineReadableZone
        request.usesLanguageCorrection = false

        try? requestHandler.perform([request])

        return recognizedTexts
    }

}

This is our final function for performing text detection via the Vision framework. We declared it in a UIImage extension so that text recognition is easy to access on any image. As you can see, we start a VNRecognizeTextRequest and iterate over the best results Vision delivers to us. The results are added to our recognizedTexts array, which is then processed and mapped to the corresponding ID card fields.
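The ScanItem type and the getFrame(by:subtractY:) helper are not shown here; ScanItem essentially describes a field on the ID card together with its region of interest. A hypothetical usage example (the type and the bounding box values are made up for illustration) could look like this:

/// Hypothetical field descriptor; the real ScanItem likely looks similar
struct ScanItem {
    let field: String
    let boundingBox: CGRect   /// region of interest in normalized coordinates
}

let lastNameItem = ScanItem(field: "lastName",
                            boundingBox: CGRect(x: 0.05, y: 0.65, width: 0.6, height: 0.1))

if let image = idBackImage {
    let lastName = image.getRecognizedText(for: lastNameItem,
                                           with: image.size,
                                           recognitionLevel: .accurate)
    print(lastName)
}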

Conclusion

After using Vision in production for the first time, we can say it is capable of precise and reliable text detection and almost never fails in our app. Recognition can still fail when photos are taken from an impossible angle or under other poor conditions. Overall, we think it is a powerful framework for document detection, text detection and much more.

We are looking forward to finding other cool use cases for it. (by Alper Kocaatli)

