Vision in iOS: Text detection and Tesseract recognition

Let me tell you a story.
Two weeks ago I joined the Stupid Hackathon in Oslo, where people came up with stupid ideas and hacked on them together. I had just watched the "Big big numbers" counting clip by Donald Trump, so I thought it might be a good/stupid idea to make a fun iOS app that can recognise a number and tell whether it is big enough, all in Trump’s voice.

Previously, I would probably have needed a library like OpenCV to solve this text tracking challenge. Now, with the introduction of Vision in iOS 11, I have everything I need. The implementation doesn’t take long; it is like playing with Lego.

In this guide, I will show you the technical details of working with Vision in iOS, as well as what I learned along the way.

Here is the final project on GitHub: BigBigNumbers. You can use it for reference when reading this guide. The project uses Swift 4.1 with iOS 11. It uses view controller containment and multiple service classes to break down responsibilities, so it is easy to follow along.

Ah, and OCR stands for Optical Character Recognition, which is the process of converting images of text into machine-readable text. We will use this abbreviation along the way. Now let’s get to the code!

Camera session

First, we need to set up a camera session, as we need to capture the picture for text recognition. The camera logic and its preview layer are encapsulated in a custom view controller, CameraController.

Here we set up a default capture device for the back camera. Remember to set videoGravity to resizeAspectFill to get a full-screen preview layer. To receive the captured buffers from the camera, our view controller needs to conform to AVCaptureVideoDataOutputSampleBufferDelegate.

Every captured frame is delivered as a sample buffer through the delegate method captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection).
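The setup and the delegate callback described above can be sketched roughly as follows. This is a minimal sketch rather than the project's actual CameraController: the onFrame callback and the queue label are assumptions made for illustration, and the app also needs an NSCameraUsageDescription entry in Info.plist before the camera can be used.

```swift
import AVFoundation
import UIKit

final class CameraController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private lazy var previewLayer = AVCaptureVideoPreviewLayer(session: session)

    /// Hypothetical hook: called for every captured frame.
    var onFrame: ((CMSampleBuffer) -> Void)?

    override func viewDidLoad() {
        super.viewDidLoad()
        setupSession()
    }

    override func viewDidLayoutSubviews() {
        super.viewDidLayoutSubviews()
        // Keep the preview full screen when the view is resized.
        previewLayer.frame = view.bounds
    }

    private func setupSession() {
        // The default video device is the back camera on iPhone.
        guard let device = AVCaptureDevice.default(for: .video),
              let input = try? AVCaptureDeviceInput(device: device),
              session.canAddInput(input) else { return }
        session.addInput(input)

        // Frames are delivered to the delegate on a background queue.
        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera.frames"))
        guard session.canAddOutput(output) else { return }
        session.addOutput(output)

        previewLayer.videoGravity = .resizeAspectFill
        view.layer.addSublayer(previewLayer)

        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        onFrame?(sampleBuffer)
    }
}
```

Note that startRunning() blocks until the session has started, so a production app would usually call it off the main thread.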

Now that we have a CMSampleBuffer, let’s work with Vision in our VisionService class.

Vision

Vision was introduced at WWDC 2017 together with Core ML. It provides easy-to-use computer vision APIs with many interesting features such as face detection, facial landmarks, object tracking and text detection. Looking at the documentation, there is VNDetectTextRectanglesRequest, which fits our task in this guide perfectly.

An image analysis request that finds regions of visible text in an image

We need to make a request to Vision to get the detected rectangles within the captured frame. VNImageRequestHandler accepts a CVPixelBuffer, a CGImage or image Data. Here we convert from CMSampleBuffer to CGImage via CVImageBuffer.

Vision exposes very high-level APIs, so working with it is as easy as passing a VNDetectTextRectanglesRequest to the request handler.

Here the orientation parameter of VNImageRequestHandler is important, as explained in Prepare an Input Image for Vision:

Vision handles still image-based requests using a VNImageRequestHandler and assumes that images are oriented upright, so pass your image with orientation in mind. CGImage, CIImage, and CVPixelBuffer objects don’t carry orientation, so provide it as part of the initializer.

We need to convert from UIImageOrientation to CGImagePropertyOrientation for Vision to work properly. Apple’s sample code does this with a small extension.
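A sketch of that conversion, modelled on the extension in Apple's sample, is shown below. It assumes Swift 5 naming, where the type is spelled UIImage.Orientation; in the Swift 4.1 toolchain this project uses, the same type is called UIImageOrientation and the @unknown default case is not available.

```swift
import ImageIO
import UIKit

extension CGImagePropertyOrientation {
    /// Maps a UIKit image orientation to the equivalent EXIF-style orientation.
    /// The two enums use different raw values, so a cast is not enough.
    init(_ uiOrientation: UIImage.Orientation) {
        switch uiOrientation {
        case .up: self = .up
        case .upMirrored: self = .upMirrored
        case .down: self = .down
        case .downMirrored: self = .downMirrored
        case .left: self = .left
        case .leftMirrored: self = .leftMirrored
        case .right: self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up
        }
    }
}
```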

The result should be an array of VNTextObservation, which contains region information for where the text is located within the image. For this demo, I only select results with a high enough confidence.

What you get is what you should see. Let’s draw the region in BoxService on the main queue.
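Putting the request, the orientation and the confidence filter together might look like the sketch below. The TextDetector name, the completion callback and the 0.3 threshold are assumptions for illustration, not the project's actual VisionService.

```swift
import CoreImage
import CoreMedia
import ImageIO
import Vision

final class TextDetector {
    private let context = CIContext()

    /// Hypothetical threshold; tune it for your own input.
    var minimumConfidence: VNConfidence = 0.3

    func detectText(in sampleBuffer: CMSampleBuffer,
                    orientation: CGImagePropertyOrientation,
                    completion: @escaping ([VNTextObservation]) -> Void) {
        // CMSampleBuffer -> CVImageBuffer -> CIImage -> CGImage
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
        guard let cgImage = context.createCGImage(ciImage, from: ciImage.extent) else { return }

        let threshold = minimumConfidence
        let request = VNDetectTextRectanglesRequest { request, _ in
            let observations = (request.results as? [VNTextObservation]) ?? []
            // Keep only the regions Vision is reasonably sure about.
            completion(observations.filter { $0.confidence > threshold })
        }

        let handler = VNImageRequestHandler(cgImage: cgImage, orientation: orientation, options: [:])
        do {
            try handler.perform([request])
        } catch {
            print("Vision request failed: \(error)")
        }
    }
}
```

The completion closure runs on whichever thread performs the request, so any UI work inside it still has to be dispatched to the main queue.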

Even when Vision calls its completion handlers on a background thread, always dispatch UI calls like the path-drawing code to the main thread. Access to UIKit, AppKit & resources must be serialized, so changes that affect the app’s immediate appearance belong on the main thread.

Drawing the detected text region

We can draw using drawRect, but a CALayer with a custom border should be easier.

Keep in mind that VNTextObservation has an array of characterBoxes of type VNRectangleObservation. These contain the bounding boxes of the individual characters found within the observation’s boundingBox. This is for fine-grained control; in our app we just need the whole bounding box. As VNTextObservation inherits from VNDetectedObjectObservation, we have access to the whole boundingBox.

The coordinates are normalized to the dimensions of the processed image, with the origin at the image’s lower-left corner.

Now we can use layerRectConverted(fromMetadataOutputRect:) from AVCaptureVideoPreviewLayer to convert from boundingBox to a rect in the view. More advanced calculations may be needed to place the rectangle exactly, but for now this simple function works.

Converts a rectangle in the coordinate system used for metadata outputs to one in the preview layer’s coordinate system.
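A rough sketch of the drawing step, using a bordered CALayer per observation, could look like this. The BoxDrawer name and the styling values are assumptions. It passes boundingBox straight to layerRectConverted(fromMetadataOutputRect:), as described above; since Vision's origin is the lower-left corner while metadata output rectangles use the top-left, you may need to flip or rotate the rectangle first if the boxes appear mirrored.

```swift
import AVFoundation
import UIKit
import Vision

final class BoxDrawer {
    private var layers: [CALayer] = []

    /// Call this on the main queue: it touches the layer tree.
    func draw(_ observations: [VNTextObservation], on previewLayer: AVCaptureVideoPreviewLayer) {
        // Remove the boxes from the previous frame.
        layers.forEach { $0.removeFromSuperlayer() }

        layers = observations.map { observation in
            let rect = previewLayer.layerRectConverted(fromMetadataOutputRect: observation.boundingBox)
            let layer = CALayer()
            layer.frame = rect
            layer.borderWidth = 2
            layer.borderColor = UIColor.green.cgColor
            previewLayer.addSublayer(layer)
            return layer
        }
    }
}
```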

If you simply want to draw the rectangle onto the captured image, you can follow Apple’s sample, which uses a boundingBox helper function.

Cropping the detected text region

Still within the BoxService, we should crop the image to the detected rectangle for OCR. We compute in the coordinate system of the captured image and expand the rectangle a bit, to take a slightly bigger image that accommodates the top and bottom edges. The code is adapted from "Convert Vision boundingBox from VNFaceObservation to rect to draw on image".

The croppedImage should contain the text; you can use Quick Look in Xcode to check.

Now that we have an image that’s ready for text recognition, let’s work with OCRService.
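The coordinate math behind the crop can be sketched as a pure function. This is not the project's code: the cropRect name and the padding parameter are assumptions, and the sketch assumes the image is upright, so that only the vertical axis has to be flipped.

```swift
import Foundation

/// Converts a normalized Vision bounding box (origin at the lower-left corner)
/// into a pixel rectangle in image coordinates (origin at the top-left corner),
/// grown by `padding` (a fraction of the box size) and clamped to the image.
func cropRect(forNormalized box: CGRect, imageSize: CGSize, padding: CGFloat = 0.1) -> CGRect {
    // Scale the normalized box up to pixels.
    let width = box.width * imageSize.width
    let height = box.height * imageSize.height
    let x = box.minX * imageSize.width
    // Flip the vertical axis: the top edge of the box is at (1 - maxY).
    let y = (1 - box.maxY) * imageSize.height

    // Grow the rectangle on every side, without leaving the image.
    let dx = width * padding
    let dy = height * padding
    let minX = max(0, x - dx)
    let minY = max(0, y - dy)
    let maxX = min(imageSize.width, x + width + dx)
    let maxY = min(imageSize.height, y + height + dy)

    return CGRect(x: minX, y: minY, width: maxX - minX, height: maxY - minY)
}
```

On iOS, the resulting rectangle can then be passed to cropping(to:) on the image's cgImage to obtain the cropped image.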

Text recognition

I personally like pure Swift solutions, so SwiftOCR looked like a perfect choice; it is said to perform better than Tesseract. So I gave it a try. The API can’t be simpler.

For some reason I don’t understand, this did not work well. It might be because of the Lato font I use in Sketch (this is how I quickly test the text detection). I read that SwiftOCR allows custom training for new fonts, but because I was lazy, I tried Tesseract.
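For reference, basic usage following the SwiftOCR README looks roughly like this. The wrapper function is an assumption for illustration, and the exact signature may differ between SwiftOCR versions.

```swift
import SwiftOCR
import UIKit

let swiftOCR = SwiftOCR()

/// Hypothetical wrapper: hands the cropped image to SwiftOCR and
/// reports the recognised string through the completion closure.
func recognizeWithSwiftOCR(_ image: UIImage, completion: @escaping (String) -> Void) {
    swiftOCR.recognize(image) { recognizedString in
        completion(recognizedString)
    }
}
```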

Image from https://cybermirror.deviantart.com/art/Loki-and-the-Tesseract-718096717

Tesseract is “an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006”. The iOS port is open source on GitHub and has CocoaPods support, so simply put pod 'TesseractOCRiOS' in your Podfile and you’re good to go.

As described in the README and in the TestsProject, tessdata is needed; it contains the language data that Tesseract needs to work. Without tessdata, the TesseractOCR framework will yell with some warnings about a missing TESSDATA_PREFIX.

Strict requirement on language files existing in a referenced “tessdata” folder.

Download the tessdata from here, and add it as a reference to your Xcode project. The blue folder color indicates that the folder is added as a reference.

You may also need to add libstdc++.dylib and CoreImage.framework to your target.

Tesseract

Using Tesseract is easy. Remember to import TesseractOCR, not TesseractOCRiOS.

g8_blackAndWhite is a convenient filter that increases the contrast of the image for easier detection. For pageSegmentationMode I use singleBlock, as our number should be in a uniform block of text; you can also try singleLine mode. Lastly, we set engineMode to tesseractCubeCombined, which is the most accurate but can take some time. You can set it to tesseractOnly or cubeOnly to trade accuracy for speed.
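With those settings, a minimal sketch of the OCR step might look like the following. The shape of the OCRService class here is an assumption, and "eng" assumes that the English language files are present in the tessdata folder.

```swift
import TesseractOCR
import UIKit

final class OCRService {
    // The initializer can fail (for example when tessdata is missing),
    // so the instance is kept as an optional.
    private let tesseract: G8Tesseract? = {
        let tesseract: G8Tesseract? = G8Tesseract(language: "eng")
        tesseract?.engineMode = .tesseractCubeCombined
        tesseract?.pageSegmentationMode = .singleBlock
        return tesseract
    }()

    /// Runs recognition synchronously; call it off the main thread.
    func recognize(_ image: UIImage) -> String? {
        guard let tesseract = tesseract else { return nil }
        // Increase the contrast before recognition.
        tesseract.image = image.g8_blackAndWhite()
        tesseract.recognize()
        return tesseract.recognizedText
    }
}
```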

It does not recognise the text correctly all the time, but it’s good enough. I hope a proper OCR model for use with Core ML will become available.

Once we have the recognised text, converting it to a number is a tweaking game that depends on your needs. I use a simple function for it.
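One simple way to do the conversion, shown here as a sketch rather than the project's own function, is to keep only the ASCII digits and parse what is left. The extractNumber name is an assumption.

```swift
import Foundation

/// Pulls the ASCII digits out of the OCR output and parses them as an integer.
/// Returns nil when there are no digits or the value does not fit in an Int.
func extractNumber(from recognizedText: String) -> Int? {
    let asciiDigits = Set("0123456789")
    let digits = recognizedText.filter { asciiDigits.contains($0) }
    return Int(digits)
}
```

This drops separators, whitespace and stray letters, so "1,000,000" parses as one million. It also means that a decimal point is ignored, which is fine for this app but may not suit other uses.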

Let’s make some noise

With the number detected, let’s play some sound depending on how big the number is.

We can download the video and extract its audio using youtube-dl. Here is the command I use:

youtube-dl --extract-audio --audio-format mp3 https://www.youtube.com/watch\?v\=a9jWco4xw-U

Now we need to trim this audio into multiple parts, one for each range of numbers. Initially I wanted to use ffmpeg to automate the trimming, but it is not millisecond-precise:

ffmpeg -i trump.mp3 -ss 00:00:03.700 -t 00:00:04.250 -c copy 0.mp3

So I used the QuickTime Player app on macOS, which has a trimming feature. It is manual work, but I managed to do it 😅

Now we can use AVPlayer to play the sound in our MusicService class. Build and run the app, point your camera at some numbers and tap the screen; the app should detect the text, recognise the number and play a clip of Trump speaking.
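With one clip per range of numbers, choosing the clip is a small lookup. The sketch below is purely illustrative: the thresholds and the clip names other than "0" are hypothetical, so adjust them to however you trimmed the audio.

```swift
import Foundation

/// Maps a number to the name of the audio clip (without extension) to play.
/// The ranges here are made up for illustration.
func clipName(for number: Int) -> String {
    switch number {
    case ..<1_000:
        return "0"   // small numbers
    case ..<1_000_000:
        return "1"   // getting bigger
    default:
        return "2"   // big, big numbers
    }
}
```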
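Playing a bundled clip with AVPlayer can be sketched as follows. The shape of this MusicService class is an assumption, and it assumes the trimmed mp3 files have been added to the app bundle.

```swift
import AVFoundation

final class MusicService {
    // Keep a strong reference, otherwise the player is deallocated
    // before it has a chance to play anything.
    private var player: AVPlayer?

    /// Plays "<fileName>.mp3" from the main bundle, if it exists.
    func play(fileName: String) {
        guard let url = Bundle.main.url(forResource: fileName, withExtension: "mp3") else {
            return
        }
        let player = AVPlayer(url: url)
        self.player = player
        player.play()
    }
}
```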

This app may not be very useful, but it can be tweaked for more practical usage, like tracking room booking, phone number analysis or simply text scanning.

Where to go from here

I hope you learned something. Here are some more links to help you get started with your text detection journey on iOS: