Let me tell you a story.
Two weeks ago I joined the Stupid Hackathon in Oslo, where people come up with stupid ideas and hack them together. Having just watched the "Big big numbers" counting clip of Donald Trump, I thought it might be a good/stupid idea to make a fun iOS app that can recognise a number and tell you whether it is big enough, all in Trump's voice.
Previously, I would probably have needed a library like OpenCV to solve this text-tracking challenge. With the introduction of Vision in iOS 11, I have everything I need, so the implementation doesn't take long; it's like playing with Lego.
In this guide, I will show you the technical details of working with Vision on iOS, as well as the lessons I learned.
Here is the final project on GitHub: BigBigNumbers. You can use it as a reference while reading this guide. The project uses Swift 4.1 with iOS 11. It uses view controller containment and multiple service classes to break down responsibilities, so it's easy to follow along.
Ah, and OCR stands for Optical Character Recognition, the process of converting images into readable text. We will use this abbreviation along the way. Now let's get to the code!
Camera session
Firstly, we need to set up a camera session, as we need to capture pictures for text recognition. The camera logic and its preview layer are encapsulated in a custom view controller, CameraController.

Here we set up a default capture device for the back camera. Remember to set videoGravity to resizeAspectFill to get a full-screen preview layer. To get the captured buffer from the camera, our view controller needs to conform to AVCaptureVideoDataOutputSampleBufferDelegate:
Every captured frame reports its buffer through the delegate function func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection).
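Putting those pieces together, a minimal sketch of such a CameraController might look like this (the buffer queue name and the bufferHandler forwarding closure are my assumptions, not the project's exact code):

```swift
import AVFoundation
import UIKit

final class CameraController: UIViewController, AVCaptureVideoDataOutputSampleBufferDelegate {
  // Hypothetical hook so another object can receive captured frames
  var bufferHandler: ((CMSampleBuffer) -> Void)?

  private let session = AVCaptureSession()
  private let output = AVCaptureVideoDataOutput()
  private lazy var previewLayer = AVCaptureVideoPreviewLayer(session: session)

  override func viewDidLoad() {
    super.viewDidLoad()

    // Default capture device for the back camera
    guard
      let device = AVCaptureDevice.default(for: .video),
      let input = try? AVCaptureDeviceInput(device: device)
    else { return }

    session.addInput(input)
    output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera.buffer.queue"))
    session.addOutput(output)

    // resizeAspectFill gives a full-screen preview
    previewLayer.videoGravity = .resizeAspectFill
    previewLayer.frame = view.bounds
    view.layer.insertSublayer(previewLayer, at: 0)

    session.startRunning()
  }

  // Called once per captured frame, on the buffer queue
  func captureOutput(_ output: AVCaptureOutput,
                     didOutput sampleBuffer: CMSampleBuffer,
                     from connection: AVCaptureConnection) {
    bufferHandler?(sampleBuffer)
  }
}
```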
Now that we have a CMSampleBuffer, let's work with Vision in our VisionService class.
Vision
Vision was introduced at WWDC 2017 together with Core ML. It provides easy-to-use computer vision APIs with many interesting features like face detection, facial landmarks, object tracking and text tracking. Looking at the documentation, there is VNDetectTextRectanglesRequest, which is perfect for our task in this guide:
An image analysis request that finds regions of visible text in an image
We need to make a request to Vision to get the detected rectangles within the captured frame. VNImageRequestHandler accepts a CVPixelBuffer, a CGImage or image Data. Here we convert from CMSampleBuffer to CGImage via CVImageBuffer.
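A sketch of that conversion, assuming a reusable CIContext:

```swift
import AVFoundation
import CoreImage

let context = CIContext()

// Extract the CVImageBuffer from the sample buffer, wrap it in a CIImage,
// then render it into a CGImage that Vision can consume.
func cgImage(from sampleBuffer: CMSampleBuffer) -> CGImage? {
  guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
  let ciImage = CIImage(cvImageBuffer: imageBuffer)
  return context.createCGImage(ciImage, from: ciImage.extent)
}
```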
Vision exposes a very high-level API, so working with it is as easy as passing a VNDetectTextRectanglesRequest to the handler.
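Here is a minimal sketch of such a request (the completion handling is an assumption; the real project hands the results to other services):

```swift
import Vision

func detectText(in cgImage: CGImage, orientation: CGImagePropertyOrientation) {
  let request = VNDetectTextRectanglesRequest { request, error in
    guard let observations = request.results as? [VNTextObservation] else { return }
    // Do something with the detected text regions,
    // e.g. draw their bounding boxes and crop for OCR.
    print(observations.count)
  }
  // Also report individual character boxes inside each region
  request.reportCharacterBoxes = true

  let handler = VNImageRequestHandler(cgImage: cgImage, orientation: orientation)
  try? handler.perform([request])
}
```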
The orientation parameter to VNImageRequestHandler is important, as you can read in Prepare an Input Image for Vision:
Vision handles still image-based requests using a VNImageRequestHandler and assumes that images are oriented upright, so pass your image with orientation in mind. CGImage, CIImage, and CVPixelBuffer objects don't carry orientation, so provide it as part of the initializer.
We need to convert from UIImageOrientation to CGImagePropertyOrientation for Vision to work properly. Here is the code from Apple's sample:
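The mapping is a straightforward switch; a Swift 4-era version looks roughly like this:

```swift
import UIKit
import ImageIO

extension CGImagePropertyOrientation {
  // Map each UIKit orientation case onto its Core Graphics counterpart
  init(_ uiOrientation: UIImageOrientation) {
    switch uiOrientation {
    case .up: self = .up
    case .upMirrored: self = .upMirrored
    case .down: self = .down
    case .downMirrored: self = .downMirrored
    case .left: self = .left
    case .leftMirrored: self = .leftMirrored
    case .right: self = .right
    case .rightMirrored: self = .rightMirrored
    }
  }
}
```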
The result is an array of VNTextObservation, which contains region information for where the text is located within the image. For this demo, I only keep results with a high enough confidence.
What you get is what you should see. Let's draw the regions in BoxService, on the main queue.
Vision calls its completion handlers on a background thread, so always dispatch UI calls like the path-drawing code to the main thread. Access to UIKit and AppKit resources must be serialized, so changes that affect the app's immediate appearance belong on the main thread.
Drawing the detected text region
We could draw using drawRect, but a CALayer with a custom border is easier.
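A sketch of what BoxService might do per detected region (the function and parameter names are assumptions):

```swift
import UIKit

// Add a bordered CALayer on top of the preview for one detected text region.
func drawBox(on view: UIView, frame: CGRect) {
  let boxLayer = CALayer()
  boxLayer.frame = frame
  boxLayer.borderColor = UIColor.red.cgColor
  boxLayer.borderWidth = 2
  view.layer.addSublayer(boxLayer)
}
```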
Keep in mind that VNTextObservation has an array of characterBoxes of type VNRectangleObservation. Those contain information about the individual character bounding boxes found within the observation's boundingBox. This allows fine-grained control; however, in our app we just need the whole bounding box. As VNTextObservation subclasses VNDetectedObjectObservation, we have access to that boundingBox.
The coordinates are normalized to the dimensions of the processed image, with the origin at the image’s lower-left corner.
Now we can use layerRectConverted from AVCaptureVideoPreviewLayer to convert the boundingBox to a view rect. There may be more advanced calculations to make the rectangle show up exactly in place, but for now this simple function works.
Converts a rectangle in the coordinate system used for metadata outputs to one in the preview layer’s coordinate system.
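A hedged sketch of that conversion: Vision's boundingBox is normalized with the origin at the lower left, so flip the y-axis before handing it to layerRectConverted (depending on device orientation you may need further adjustments):

```swift
import AVFoundation

func viewRect(for boundingBox: CGRect,
              in previewLayer: AVCaptureVideoPreviewLayer) -> CGRect {
  // Flip from Vision's lower-left origin to a top-left origin
  let flipped = CGRect(x: boundingBox.minX,
                       y: 1 - boundingBox.maxY,
                       width: boundingBox.width,
                       height: boundingBox.height)
  return previewLayer.layerRectConverted(fromMetadataRect: flipped)
}
```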
If you simply want to draw the rectangle onto the captured image, you can follow Apple's sample using the helper boundingBox function:
Cropping the detected text region
Still within BoxService, we crop the image to the detected rectangle for OCR (Optical Character Recognition). We compute in the coordinate space of the captured image and inset a bit to take a slightly bigger image, to accommodate the top and bottom edges. The code is tweaked from Convert Vision boundingBox from VNFaceObservation to rect to draw on image:
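A sketch of the cropping along those lines (the inset factor is an assumption):

```swift
import UIKit

// Scale the normalized boundingBox into pixel coordinates of the captured
// image (flipping the y-axis) and crop, with a small vertical inset so the
// crop is slightly taller than the detected box.
func crop(image: UIImage, boundingBox: CGRect) -> UIImage? {
  guard let cgImage = image.cgImage else { return nil }
  let width = CGFloat(cgImage.width)
  let height = CGFloat(cgImage.height)
  let inset = 0.1 * boundingBox.height * height // assumed padding

  let rect = CGRect(x: boundingBox.minX * width,
                    y: (1 - boundingBox.maxY) * height - inset,
                    width: boundingBox.width * width,
                    height: boundingBox.height * height + 2 * inset)

  guard let cropped = cgImage.cropping(to: rect) else { return nil }
  return UIImage(cgImage: cropped)
}
```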
The croppedImage should contain text; you can use Quick Look in Xcode to check.

Now that we have an image ready for text recognition, let's work with OCRService.
Text recognition
I personally like pure Swift solutions, and SwiftOCR is a perfect choice; it is said to perform better than Tesseract, so I gave it a try. The API couldn't be simpler.
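Using it looks roughly like this sketch, assuming the croppedImage from the previous step:

```swift
import SwiftOCR

let swiftOCR = SwiftOCR()

// Recognition runs asynchronously; the recognized string
// is delivered in the completion closure.
swiftOCR.recognize(croppedImage) { recognizedString in
  print(recognizedString)
}
```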
For reasons I don't know, this did not work well. It might be because of the font Lato that I use in Sketch (this is how I quickly test the text detection). I read that SwiftOCR allows training for custom fonts, but because I was lazy, I tried Tesseract instead.
Tesseract "is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006". The iOS port is open source on GitHub and has CocoaPods support, so simply put pod 'TesseractOCRiOS' in your Podfile and you're good to go.
As described in the README and the TestsProject, tessdata is required; it contains the language data Tesseract needs to work. Without tessdata, the TesseractOCR framework will complain with warnings about a missing TESSDATA_PREFIX:
Strict requirement on language files existing in a referenced “tessdata” folder.
Download tessdata from here and add it as a reference to your Xcode project. The blue color indicates that the folder is added as a reference.
You may also need to add libstdc++.dylib and CoreImage.framework to your target:
Tesseract
Using Tesseract is easy. Remember to import TesseractOCR, not TesseractOCRiOS:
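A sketch of the recognition code in OCRService (the wrapping function is an assumption):

```swift
import UIKit
import TesseractOCR

func recognize(image: UIImage) -> String? {
  // "eng" requires eng.traineddata inside the referenced tessdata folder
  guard let tesseract = G8Tesseract(language: "eng") else { return nil }
  tesseract.engineMode = .tesseractCubeCombined
  tesseract.pageSegmentationMode = .singleBlock
  tesseract.image = image.g8_blackAndWhite()
  tesseract.recognize()
  return tesseract.recognizedText
}
```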
g8_blackAndWhite is a convenient filter that increases the contrast of the image for easier detection. For pageSegmentationMode I use singleBlock, as our number should appear in a uniform block of text; you can also try singleLine mode. Lastly, we set engineMode to tesseractCubeCombined, which is the most accurate but can take some time; you can set it to tesseractOnly or cubeOnly to trade accuracy for speed.
It does not detect correctly all the time, but it's good enough. I hope a proper OCR model to use with Core ML will appear.
Now that we have the recognised text, converting it to a number is a tweaking game depending on your needs. Here is my simple function:
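It boils down to something like this sketch: keep only the digits and parse them.

```swift
// Strip everything that is not a digit (OCR often inserts spaces,
// commas or stray characters), then parse the remainder.
func number(from recognizedText: String) -> Int? {
  let digits = recognizedText.filter { "0123456789".contains($0) }
  return Int(digits)
}
```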
Let’s make some noise
With the number detected, let’s play some sound depending on how big the number is.
We can download the video and extract the audio from it using youtube-dl; here is the command I use:

youtube-dl --extract-audio --audio-format mp3 https://www.youtube.com/watch\?v\=a9jWco4xw-U
Now we need to trim this audio into multiple parts, one for each range of numbers. Initially I wanted to use ffmpeg to automate the trimming, but it is not very precise at the millisecond level:
ffmpeg -i trump.mp3 -ss 00:00:03.700 -t 00:00:04.250 -c copy 0.mp3
So I used the QuickTime app on macOS, which has a trimming feature. It is manual labor, but I managed to do it 😅
Now we can use AVPlayer to play sounds in our MusicService class. Build and run the app, point your camera at some numbers, tap the screen, and the app should detect the text, recognise the number and play the matching Trump sound.
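A sketch of what MusicService might look like (the clip-naming scheme is an assumption):

```swift
import AVFoundation

final class MusicService {
  private var player: AVPlayer?

  // Assumed scheme: one trimmed mp3 per number range,
  // named "0.mp3", "1.mp3", and so on, bundled with the app.
  func play(clipNamed name: String) {
    guard let url = Bundle.main.url(forResource: name, withExtension: "mp3") else { return }
    player = AVPlayer(url: url)
    player?.play()
  }
}
```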
This app may not be very useful, but the technique can be tweaked for more practical usage, like tracking room bookings, phone number analysis or simple text scanning.
Where to go from here
I hope you learned something. Here are some more links to help you get started on your text-detection journey on iOS:
- Tesseract OCR Tutorial for iOS: Learn how to use the Tesseract framework on iOS, detailing some issues you may encounter when using it.
- Utilizing Machine Learning in the Palm of Your Hand With iOS’s ML Frameworks: How to use Vision with SwiftOCR.
- tesseract.js: Tesseract implemented in JavaScript. It is not related to iOS, but it shows the relevance of Tesseract on other platforms.
- Object Tracking in Vision: Interesting changes coming to Vision in iOS at WWDC 2018. There are lots of improvements in object tracking and custom model training.
- Detecting Objects in Still Images: Official Apple sample code to locate and demarcate rectangles, faces, barcodes, and text in images using the Vision framework.
- Integrating Google ML Kit in iOS for Face Detection, Text Recognition and Many More: Google introduced ML Kit at Google I/O this year, and it's also good at text recognition. The framework supports both iOS and Android.