I used this blog post by Ben Dodson to get started with iOS 13’s OCR functionality, and you should too; it’s a very good introduction with a good example of how you could use it. Be aware that there are a couple of minor errors in the code, though; maybe Ben wrote it during the iOS 13 beta and things have changed since then.
I had problems
I copy-pasted Ben’s code and got rather poor results, even though he was getting spot-on results for much smaller and less clear text.
There was a difference in how we were each getting our images: I was using a snapshot of my app’s UI rather than an image from the camera, like Ben. My experience with UIImage and CGImage led me to the root of the issue: the image orientation was wrong.
You have to tell VNImageRequestHandler what orientation your image is in, because it deals with a CGImage rather than a UIImage.
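As a minimal sketch of what that looks like (the function name is mine, and the assumption that a UI snapshot is oriented .up is explained further below):

```swift
import UIKit
import Vision

// Run text recognition on an image and return the top candidate strings.
func recognizedStrings(in image: UIImage) -> [String] {
    guard let cgImage = image.cgImage else { return [] }

    var strings: [String] = []
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        for observation in observations {
            if let candidate = observation.topCandidates(1).first {
                print("Found: \(candidate.string) (\(candidate.confidence))")
                strings.append(candidate.string)
            }
        }
    }

    // The crucial part: tell the handler which way up the CGImage data is.
    // For a snapshot of your own UI this is almost always .up.
    let handler = VNImageRequestHandler(cgImage: cgImage, orientation: .up, options: [:])
    try? handler.perform([request])
    return strings // perform(_:) is synchronous, so the results are in by now
}
```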
To start with, I flipped the value that Ben had specified (.right) and my results improved drastically, though there were still some minor imperfections. When I changed the orientation once more I ended up with some really solid results.
Below is an example of what I was seeing with the different orientations when scanning the following text (I write weird stuff sometimes…)
I can OCR, me
Int that kewl?
With orientation .right
Found: HOO (0.3)
Found: 8 (0.3)
Found: as (0.3)
Found: uno no uR (0.3)
Found: ‘dOO (0.3)
Not very confident (the numbers in brackets are the confidence ratings of each result) and nothing even remotely close to the text on screen!
With orientation .left
Found: g (0.3)
Found: OCR, (0.5)
Found: that kewl? (0.5)
Found: 8 F (0.3)
Found: OCR (0.5)
Found: 8 (0.3)
OK, this is better: some slightly improved confidence on words that are actually on the screen, but still pretty poor…
With orientation .up
Found: I can OCR, me (1.0)
Found: Int that kewl? (1.0)
Found: Do OCR (0.5)
Pretty much perfect! Exactly the right text and 100% confidence on most of it.
So if you’re not getting good results from VNImageRequestHandler, take a look at the image orientation you’re giving it, as that may be the problem!
If you’re using a snapshot of your UI then the orientation of the image is likely to always be .up. But if you’re loading a photo then it could be almost anything. Apple’s docs provide example code for converting between UIImageOrientation and CGImagePropertyOrientation, though it’s strange they don’t provide that implementation to us through their APIs. Interestingly, their example code isn’t even correct for the UIImageOrientation extension, so maybe it’s for the best that it’s not part of the APIs!
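For reference, here’s a sketch of the conversion in the direction Vision needs. The case names map one-to-one between the two types, which is also a handy way to sanity-check Apple’s sample code:

```swift
import UIKit
import ImageIO

extension CGImagePropertyOrientation {
    // Map UIKit's orientation onto Core Graphics' orientation.
    // The cases correspond by name, so the mapping is direct.
    init(_ uiOrientation: UIImage.Orientation) {
        switch uiOrientation {
        case .up:            self = .up
        case .down:          self = .down
        case .left:          self = .left
        case .right:         self = .right
        case .upMirrored:    self = .upMirrored
        case .downMirrored:  self = .downMirrored
        case .leftMirrored:  self = .leftMirrored
        case .rightMirrored: self = .rightMirrored
        @unknown default:    self = .up
        }
    }
}
```

With that in place you can pass `CGImagePropertyOrientation(image.imageOrientation)` straight into `VNImageRequestHandler`.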
More things you can do
VNRecognizeTextRequest has some useful properties that you should definitely read up on. Ben went over some of them in his tutorial but there are more.
VNRecognizeTextRequest extends VNImageBasedRequest and VNRequestProgressProviding. The latter should be fairly obvious: it gives you the ability to get progress callbacks as the request is carried out. Considering OCR requests can take a second or two to process, it’s a good idea to give your users some feedback that things are moving forward. Interestingly, Apple don’t mention VNRequestProgressProviding in their online docs; I found it by inspecting the docs in the code directly.
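Hooking into that progress looks something like this (the percentage formatting is just my choice; in a real app you’d drive a progress view on the main queue):

```swift
import Vision

let request = VNRecognizeTextRequest { request, error in
    // handle the recognised text here
}

// VNRequestProgressProviding gives us a callback while Vision works;
// progress runs from 0.0 to 1.0.
request.progressHandler = { _, progress, _ in
    print("OCR progress: \(Int(progress * 100))%")
}
```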
Performance of OCR can be crucial. You might want quick results so you don’t hold up your app or your user, so giving the framework a helping hand is often a good idea!
First off, VNRecognizeTextRequest has a property called usesLanguageCorrection. If you disable this you get a massive speed boost, but the accuracy of the OCR tanks, so it might only be useful in certain situations.
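Configuring that is a one-liner, and while you’re there the related recognitionLevel property offers a similar speed/accuracy trade-off:

```swift
import Vision

let request = VNRecognizeTextRequest()

// Trade accuracy for speed: skip the language-correction pass entirely.
request.usesLanguageCorrection = false

// .fast uses a quicker (less accurate) recognition path than the default .accurate.
request.recognitionLevel = .fast
```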
If you know anything about where the text is likely to appear in your image then you can crop it down to the area of interest. Taking Ben’s example of scanning a card, you could encourage your user to frame the card on screen with a guide; then you know which parts of the screen to focus on for certain bits of text (you can use VNImageBasedRequest’s .regionOfInterest for this). You could even run multiple OCR requests on individually cropped regions rather than one on the image as a whole, which just means passing multiple VNRecognizeTextRequests to a single VNImageRequestHandler for the same image.
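A sketch of setting a region (the exact rect here is made up for illustration; note that regionOfInterest uses normalised coordinates with the origin at the bottom-left):

```swift
import Vision
import CoreGraphics

let request = VNRecognizeTextRequest { request, error in
    // handle the recognised text here
}

// Only look at the middle third of the image, vertically.
// Coordinates are normalised (0...1) with the origin at the BOTTOM-left.
request.regionOfInterest = CGRect(x: 0.0, y: 1.0 / 3.0, width: 1.0, height: 1.0 / 3.0)
```

Several requests, each with its own regionOfInterest, can then be passed together in one `handler.perform([...])` call.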
Similarly to cropping, you can get a performance boost by scaling down your image. If the text is likely to be quite large in your image then scaling it down will likely retain sufficient clarity to still extract the text, and Apple’s algorithm should run quicker. The accuracy does reduce, though, so be careful with this!
There may also be some filtering that you can do to your image to improve the clarity of the text: tweaking brightness, contrast and more may pick out the text better.
So words are good, but what about letters?
I found that detecting individual letters wasn’t as good as whole words. I’m sure we’ve all seen those word game apps where you have to form words from a pool of letters. They come in various formats but generally have the letters laid out in a circle near the bottom of the screen:
I figured if I ran OCR on this image I’d be able to do a search on a dictionary of words in the English language and get all the words to solve the puzzle! Haha, tech for the win! Or, equally, how to ruin a game with cheating…
Generally, the OCR picks out most of the letters OK, but it would rarely get them all. I think a lot of the accuracy with words and sentences comes from the fact that Apple doesn’t simply treat each letter in a word as an individual item to be recognised; rather, they make a list of guesses for each letter and then see which of those guesses is most likely based on the other letters around it.
Applications in testing
The reason I was looking into OCR in the first place was that I wanted to see if I could create some automated tests that would validate what text was on screen at various points in time.
If you know the view hierarchy of your app, you can potentially just find the view that contains your text and read it from the view’s text property. But just because you can find the view doesn’t necessarily mean it’s visible to the user. Or maybe the text is rendered in such a way that you can’t pull it out of a property. In these situations, OCR might be a good way of confirming, in an automated manner, what text is actually visible to the user.
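To give a flavour of the idea, here’s a hypothetical XCUITest that OCRs a screenshot to assert on visible text. The test class, the asserted string and the `ocrStrings(in:)` helper are all my inventions (the helper just wraps the VNRecognizeTextRequest approach described earlier; it isn’t part of any Apple API):

```swift
import XCTest
import Vision
import UIKit

// Hypothetical helper: run Vision OCR over an image and collect the top candidates.
func ocrStrings(in image: UIImage) -> [String] {
    guard let cgImage = image.cgImage else { return [] }
    var strings: [String] = []
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        strings = observations.compactMap { $0.topCandidates(1).first?.string }
    }
    // A screenshot of our own UI should always be oriented .up.
    let handler = VNImageRequestHandler(cgImage: cgImage, orientation: .up, options: [:])
    try? handler.perform([request])
    return strings
}

final class VisibleTextTests: XCTestCase {
    func testWelcomeMessageIsVisible() {
        let app = XCUIApplication()
        app.launch()

        // XCUIScreenshot.image is a UIImage of what's actually on screen.
        let screenshot = app.screenshot().image
        let found = ocrStrings(in: screenshot)

        XCTAssertTrue(found.contains(where: { $0.contains("Welcome") }))
    }
}
```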