Vision in iOS: Text detection and Tesseract recognition
Let me tell you a story.
Two weeks ago I joined the Stupid Hackathon in Oslo, where people came up with stupid ideas and hacked on them together. As I had just watched the Big big numbers counting clip by Donald Trump, I thought it might be a good/stupid idea to make a fun iOS app that can recognise a number and tell if it is big enough, all in Trump’s voice.
Before, I would probably have needed a library like OpenCV to solve this text tracking challenge. Now, with the introduction of Vision in iOS 11, I have everything I need. The implementation doesn’t take long; it is like playing with Lego.
In this guide, I will show you the technical details of working with Vision in iOS, as well as what I learned along the way.
The final project, BigBigNumbers, is on GitHub. You can use it for reference when reading this guide. The project uses Swift 4.1 with iOS 11. It uses view controller containment and multiple service classes to break down responsibilities, so we can easily follow along.
Ah, and OCR stands for Optical Character Recognition, which is the process of converting images of text into machine-readable text. We will use this abbreviation along the way. Now let’s get to the code!
First, we need to set up a camera session, since we need to capture a picture for text recognition. The camera logic and its preview layer are encapsulated in a custom view controller.
Here we set up the default capture device for the back camera. Remember to set the preview layer's video gravity to resizeAspectFill to get a full-screen preview. To receive captured buffers from the camera, our view controller needs to conform to the video data output's sample buffer delegate protocol.
Every captured frame is reported as a sample buffer through the delegate function
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection).
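The setup described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the class name CameraViewController and the queue label are assumptions made for illustration, and error handling is reduced to early returns.

```swift
import AVFoundation
import UIKit

// Hypothetical controller name; the class in the original project may differ.
final class CameraViewController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    let previewLayer = AVCaptureVideoPreviewLayer()
    // Assumed label; frames are delivered on this background queue.
    private let sampleQueue = DispatchQueue(label: "camera.sample-buffer")

    override func viewDidLoad() {
        super.viewDidLoad()

        // Fill the whole screen, cropping the video if the aspect ratios differ.
        previewLayer.session = session
        previewLayer.videoGravity = .resizeAspectFill
        view.layer.addSublayer(previewLayer)

        // Default back camera; there is none on the simulator, so we bail out there.
        guard
            let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
            let input = try? AVCaptureDeviceInput(device: device),
            session.canAddInput(input)
        else { return }
        session.addInput(input)

        let output = AVCaptureVideoDataOutput()
        output.setSampleBufferDelegate(self, queue: sampleQueue)
        guard session.canAddOutput(output) else { return }
        session.addOutput(output)

        // startRunning() blocks; a real app may prefer to call it off the main thread.
        session.startRunning()
    }

    override func viewDidLayoutSubviews() {
        super.viewDidLayoutSubviews()
        previewLayer.frame = view.bounds
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Called for every captured frame; hand the buffer to the text detection step here.
    }
}
```

Note that the app also needs an NSCameraUsageDescription entry in Info.plist, otherwise iOS terminates it when the camera is first accessed.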
Now that we have a CMSampleBuffer, let’s work with Vision.
Vision was introduced at WWDC 2017 together with Core ML. It provides easy-to-use computer vision APIs with many interesting features such as face detection, facial landmarks, object tracking and text detection. Looking at the documentation, there is VNDetectTextRectanglesRequest, which fits our task in this guide:
An image analysis request that finds regions of visible text in an image
We need to make a request to Vision to get the detected rectangles within the captured frame. The request handler accepts several kinds of input, including CGImage and image Data. Here we convert the captured CMSampleBuffer into a form the handler accepts. Vision exposes very high-level APIs, so working with it is as easy as passing the request to the handler.
The orientation parameter to VNImageRequestHandler is important, as you can read in Prepare an Input Image for Vision:
Vision handles still image-based requests using a VNImageRequestHandler and assumes that images are oriented upright, so pass your image with orientation in mind. CVPixelBuffer objects don’t carry orientation, so provide it as part of the initializer.
We need to convert the image's orientation to CGImagePropertyOrientation for Vision to work properly. Apple's sample code includes a conversion for this.
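One way to write such a conversion is a small initializer on CGImagePropertyOrientation. This is a sketch modeled on the mapping commonly shown in Apple's samples, assuming the source orientation is UIKit's image orientation. It uses the Swift 5 spelling UIImage.Orientation; in Swift 4.1 the type is called UIImageOrientation and the @unknown default case would be dropped.

```swift
import ImageIO
import UIKit

extension CGImagePropertyOrientation {
    /// Maps a UIKit image orientation to the EXIF-style orientation Vision expects.
    /// A direct cast is not possible because the two enums use different raw values.
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up: self = .up
        case .upMirrored: self = .upMirrored
        case .down: self = .down
        case .downMirrored: self = .downMirrored
        case .left: self = .left
        case .leftMirrored: self = .leftMirrored
        case .right: self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up // assumption: treat future cases as upright
        }
    }
}
```

With this in place, CGImagePropertyOrientation(image.imageOrientation) can be passed straight to the request handler's initializer.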
The result should be an array of VNTextObservation, which contains region information for where the text is located within the image. For this demo, I only keep the results that pass a minimum threshold.
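The request and the filtering step can be sketched as below. Several things here are assumptions rather than the project's code: the function name, filtering on the observation's confidence, and the 0.3 threshold are all made up for illustration. The sketch also passes the pixel buffer to the handler directly instead of converting it to another image type first.

```swift
import CoreVideo
import ImageIO
import Vision

/// Hypothetical helper: returns the text regions Vision finds in one frame.
/// `minimumConfidence` is an assumed filter criterion with an assumed default.
func detectTextRegions(in pixelBuffer: CVPixelBuffer,
                       orientation: CGImagePropertyOrientation,
                       minimumConfidence: VNConfidence = 0.3) -> [VNTextObservation] {
    let request = VNDetectTextRectanglesRequest()
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: orientation,
                                        options: [:])
    do {
        // perform(_:) is synchronous; results are available once it returns.
        try handler.perform([request])
    } catch {
        return []
    }

    let observations = (request.results ?? []).compactMap { $0 as? VNTextObservation }
    return observations.filter { $0.confidence >= minimumConfidence }
}
```

Inside the capture delegate, the pixel buffer can be obtained with CMSampleBufferGetImageBuffer(sampleBuffer). Because perform(_:) blocks, it is best called on a background queue such as the one that delivers the frames.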
What you get is what you should see. Let’s draw the region in BoxService on the main queue.
Even when Vision calls its completion handlers on a background thread, always dispatch UI calls like the path-drawing code to the main thread. Access to UIKit, AppKit & resources must be serialized, so changes that affect the app’s immediate appearance belong on the main thread.
Drawing the detected text region
We can draw using drawRect, but a CALayer with a custom border should be easier.
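A bordered layer of this kind might look like the sketch below. The helper name drawBox, the border width and the default color are assumptions chosen for illustration.

```swift
import UIKit

/// Hypothetical helper: outlines `rect` (in the view's coordinate space)
/// with a transparent layer that has a colored border.
@discardableResult
func drawBox(in view: UIView, rect: CGRect, color: UIColor = .green) -> CALayer {
    let layer = CALayer()
    layer.frame = rect
    layer.borderWidth = 2
    layer.borderColor = color.cgColor
    layer.backgroundColor = UIColor.clear.cgColor
    view.layer.addSublayer(layer)
    return layer
}

// Detection results may arrive on a background thread, so hop to the
// main queue before touching layers:
// DispatchQueue.main.async { drawBox(in: overlayView, rect: rect) }
```

Keeping a reference to the returned layer makes it easy to call removeFromSuperlayer() on it before drawing the box for the next frame.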
Keep in mind that VNTextObservation has an array of characterBoxes of type VNRectangleObservation. Those contain the bounding boxes of the individual characters found within the observation’s boundingBox. This is for fine-grained control; in our app, however, we just need the whole bounding box. As VNTextObservation inherits from VNDetectedObjectObservation, we have access to the whole boundingBox.
The coordinates are normalized to the dimensions of the processed image, with the origin at the image’s lower-left corner.
Now we can use AVCaptureVideoPreviewLayer to convert from boundingBox to a view rect. There may be some more advanced calculations that make the rectangle show up exactly in place, but for now this simple function works.
Converts a rectangle in the coordinate system used for metadata outputs to one in the preview layer’s coordinate system.
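A sketch of the conversion, under the assumption that the only adjustment needed is flipping the y-axis (Vision's origin is at the lower-left, while the metadata output rect has its origin at the upper-left). The function names are hypothetical, and as noted above this simple approach may not line up perfectly in every orientation.

```swift
import Foundation
#if os(iOS)
import AVFoundation
#endif

/// Flips a normalized Vision bounding box (origin at lower-left)
/// into a normalized rect with its origin at the upper-left.
func flippedVertically(_ boundingBox: CGRect) -> CGRect {
    return CGRect(x: boundingBox.origin.x,
                  y: 1 - boundingBox.origin.y - boundingBox.size.height,
                  width: boundingBox.size.width,
                  height: boundingBox.size.height)
}

#if os(iOS)
/// Converts a Vision bounding box into the preview layer's coordinate system.
func viewRect(for boundingBox: CGRect, in previewLayer: AVCaptureVideoPreviewLayer) -> CGRect {
    return previewLayer.layerRectConverted(fromMetadataOutputRect: flippedVertically(boundingBox))
}
#endif
```

For example, a box at (0.25, 0.5) with size 0.5 × 0.25 ends up at (0.25, 0.25) after the flip, because its top edge was a quarter of the way down from the top of the image.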
If you simply want to draw the rectangle onto the captured image, you can follow Apple’s sample, which uses a helper function for this.
Cropping the detected text region
Still within the BoxService, we should crop the image to the detected rectangle for OCR (Optical Character Recognition). We compute the rectangle in the coordinates of the captured image and add a bit of margin, taking a slightly bigger image to accommodate the top and bottom edges. The code is tweaked from Convert Vision boundingBox from VNFaceObservation to rect to draw on image.
The resulting croppedImage should contain the text; you can use Quick Look in Xcode to check.
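The rectangle arithmetic behind the crop can be sketched as a pure function. The name cropRect and the fixed pixel padding are assumptions for illustration; the input is expected to be a normalized rect with its origin at the upper-left, that is, already flipped from Vision's lower-left origin.

```swift
import Foundation

/// Converts a normalized, upper-left-origin rect into pixel coordinates of an
/// image of `imageSize`, grows it by `padding` pixels on every side, and
/// clamps the result to the image bounds.
func cropRect(forNormalizedRect rect: CGRect, imageSize: CGSize, padding: CGFloat) -> CGRect {
    let scaled = CGRect(x: rect.origin.x * imageSize.width,
                        y: rect.origin.y * imageSize.height,
                        width: rect.size.width * imageSize.width,
                        height: rect.size.height * imageSize.height)
    // A negative inset makes the rectangle bigger.
    let padded = scaled.insetBy(dx: -padding, dy: -padding)
    return padded.intersection(CGRect(origin: .zero, size: imageSize))
}
```

The result can be passed to CGImage's cropping(to:), which works in pixels, so imageSize should come from the CGImage's width and height rather than from UIImage's point-based size.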
Now that we have an image that’s ready for text recognition, let’s move on to OCR.
I personally like pure Swift solutions, so SwiftOCR looked like a perfect choice; it is said to perform better than Tesseract. So I gave it a try. The API can’t be simpler.
For some reason I don’t know, it did not work well. It might be because of the Lato font I use in Sketch (this is how I quickly test the text detection). I read that SwiftOCR allows custom training for new fonts, but because I was lazy, I tried Tesseract.
Tesseract is “an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006”. The iOS port is open source on GitHub and has CocoaPods support. So simply put pod 'TesseractOCRiOS' in your Podfile and you’re good to go.
As noted in the README and in the TestsProject, tessdata is needed; it contains the language information Tesseract needs to work. Without tessdata, the TesseractOCR framework will yell with some warnings about the missing data:
Strict requirement on language files existing in a referenced “tessdata” folder.
Download tessdata from here and add it as a folder reference to your Xcode project. The blue color indicates that the folder is added as a reference.
You may also need to add CoreImage.framework to your target.
Using Tesseract is easy. Remember to import TesseractOCR. g8_blackAndWhite is a convenient filter that increases the contrast in the image for easier detection. For pageSegmentationMode I use singleBlock, as our number should be in a uniform block of text; you can also try singleLine mode. Lastly, we set the engine mode to tesseractCubeCombined, which is the most accurate but can take some time. You can switch to cubeOnly to trade some accuracy for speed.
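Put together, the settings above might be used like this. It is a sketch, not the project's code: the function name is hypothetical, and it assumes the English ("eng") language files are present in the bundled tessdata folder.

```swift
import TesseractOCR
import UIKit

/// Hypothetical wrapper around G8Tesseract. Assumes "eng" language files
/// exist in the tessdata folder that ships with the app.
func recognizeText(in image: UIImage) -> String? {
    guard let tesseract = G8Tesseract(language: "eng") else { return nil }
    tesseract.engineMode = .tesseractCubeCombined   // most accurate, slowest
    tesseract.pageSegmentationMode = .singleBlock   // treat the crop as one block of text
    tesseract.image = image.g8_blackAndWhite()      // boost contrast before recognition
    guard tesseract.recognize() else { return nil } // blocking call
    return tesseract.recognizedText
}
```

Since recognize() blocks until it finishes, it is worth calling this off the main thread and dispatching back once the text is available.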
It does not detect correctly all the time, but it’s good enough. I hope there will be a proper OCR model to use with Core ML.
Now that we have the recognised text, converting it to a number is a tweaking game that depends on your needs. I use a simple function for this.
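A minimal sketch of such a conversion, assuming we only care about non-negative whole numbers: it keeps the ASCII digits and drops everything else (whitespace, thousands separators, stray characters from OCR). The function name is hypothetical and this is not the project's exact function. One known limitation is that a decimal such as "3.5" would be read as 35.

```swift
import Foundation

/// Hypothetical helper: extracts a whole number from OCR output by keeping
/// only the ASCII digits. Returns nil when there are no digits or the value
/// does not fit in an Int.
func extractNumber(from recognizedText: String) -> Int? {
    let asciiDigits: ClosedRange<Character> = "0"..."9"
    let digits = String(recognizedText.filter { asciiDigits.contains($0) })
    return Int(digits)
}
```

For example, extractNumber(from: " 1,000\n") gives 1000, while a string without any digits gives nil.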
Let’s make some noise
With the number detected, let’s play some sound depending on how big the number is. First, download the audio of the clip with youtube-dl:
youtube-dl --extract-audio --audio-format mp3 https://www.youtube.com/watch\?v\=a9jWco4xw-U
Now we need to trim this audio into multiple parts, one for each range of numbers. Initially I wanted to use ffmpeg to automate the trimming, but its cuts were not millisecond-precise.
ffmpeg -i trump.mp3 -ss 00:00:03.700 -t 00:00:04.250 -c copy 0.mp3
So I used the QuickTime Player app on macOS, which has a trimming feature. It is manual labor, but I managed to do it 😅
Now we can use AVPlayer to play sound in our MusicService class. Build and run the app, point your camera at some numbers, tap the screen, and the app should detect the text, recognise the number and play a clip in Trump’s voice.
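Choosing and playing a clip could be sketched as follows. Everything named here is an assumption: the ClipPlayer class, the clipIndex function and the thresholds are placeholders, and the clips are assumed to be bundled as 0.mp3, 1.mp3 and so on, following the output name in the ffmpeg command above.

```swift
import Foundation
#if canImport(AVFoundation)
import AVFoundation
#endif

/// Hypothetical mapping from a number to a clip index: the index is the count
/// of thresholds the number reaches. The default thresholds are placeholders.
func clipIndex(for number: Int, thresholds: [Int] = [10, 1_000, 1_000_000]) -> Int {
    return thresholds.filter { number >= $0 }.count
}

#if canImport(AVFoundation)
final class ClipPlayer {
    // Keep a strong reference, otherwise the player is released before it plays.
    private var player: AVPlayer?

    /// Plays "<index>.mp3" from the main bundle, if that file exists.
    func play(clipIndex index: Int) {
        guard let url = Bundle.main.url(forResource: String(index), withExtension: "mp3") else { return }
        let player = AVPlayer(url: url)
        self.player = player
        player.play()
    }
}
#endif
```

With the default thresholds, 5 maps to clip 0, 10 maps to clip 1, and 5,000,000 maps to clip 3.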
This app may not be very useful, but it can be tweaked for more practical usage, like tracking room booking, phone number analysis or simply text scanning.
Where to go from here
I hope you learned something. Here are some more links to help you get started with your text detection journey on iOS:
- Tesseract OCR Tutorial for iOS: Learn how to use the Tesseract framework in iOS, dealing with some issues that you may encounter when using it.
- Utilizing Machine Learning in the Palm of Your Hand With iOS’s ML Frameworks: How to use Vision with SwiftOCR.
- Object Tracking in Vision: Interesting changes coming to Vision in iOS at WWDC 2018. There are a lot of improvements in object tracking and custom model training.
- Detecting Objects in Still Images: Official Apple sample code to locate and demarcate rectangles, faces, barcodes, and text in images using the Vision framework.
- Integrating Google ML Kit in iOS for Face Detection, Text Recognition and Many More: Google introduced ML Kit at Google IO this year, and it’s also good at text recognition. The framework supports both iOS and Android.