Using cloud OCR to parse links and emails from photos

Photo of a Shortlist magazine ad, taken by the author.

TL;DR I created an app which allows you to take a photo of a link or an email and have it available to use on your iPhone. To get it, scroll to the bottom of the article.

EDIT: You can download the app here: Link As You Go

I went to this year’s iOSDevUK conference and it was as inspiring as always (@iosdevuk, thanks guys!). The new CoreML framework Apple introduced earlier this year was one of the topics. That reminded me of an idea I had a while back — to make an app which can “read” links from the photos you take at conferences. You all know these kinds of photos — full of links to useful resources, samples and further reading. You usually see them for a few seconds, not nearly enough time to write them down. The best you can do is take a photo and do the whole dance of transcribing it later — a hard job if you only have your phone with you.

Case in point — taken at iOSDevUK’17 by the author.

Still at the conference, I started looking into using CoreML to analyse the photos on the device. It turned out not to be that easy. There weren’t any out-of-the-box solutions for Swift, and my knowledge of CoreML and of ML in general was non-existent. I tried searching for a pre-trained model, or some easy way of integrating something someone else had done, but couldn’t find anything. If you happen to know how to do that, I’d welcome the help!

After abandoning CoreML, there were still other options for doing what I wanted on the device. Searching for a solution led me to two possible libraries — Tesseract-OCR-iOS and SwiftOCR. Tesseract-OCR has been helping developers do OCR for years, and having Google as a supporter does give it credibility. Unfortunately, the iOS implementation is Objective-C only, and although it’s a framework, which means you don’t have to deal with the Objective-C code yourself, you still have to add a bridging header to your project and link against some C libraries. Moreover, the last commit to the Tesseract-OCR-iOS project was 10 months ago, named Revert “Migrated to Xcode8.1 / swift3”, which doesn’t inspire much confidence. The other candidate, SwiftOCR, turned out to be a niche project, targeting the recognition of short, single-line alphanumeric codes. I gave it a try, but it seemed to recognise only text in ALL CAPS, no longer than a dozen characters.

After that initial research I decided I needed to change my approach and search elsewhere. I focused my investigation on finding a solution in the cloud. Two companies seemed to offer good APIs for working with OCR — Google and Microsoft. I did a few test runs on both Google Cloud Vision API and Microsoft Computer Vision API and for my needs Microsoft’s was performing better. Here’s a direct comparison using a photo I took at Google Campus London last week:

Google Vision API
Microsoft Computer Vision API

As you can see, Microsoft’s engine was able to parse the link correctly while Google got one character wrong — it recognised l instead of /. This is a pretty common mistake with OCR engines — they expect (or, at least, prefer) the text in the photo to be level and shot straight on, and the one in the example above is not. Interestingly enough, it turns out the engines cope better with images rotated by 90° or even upside-down than with ones rotated by, say, 15°.
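For reference, here is roughly what a request to that OCR endpoint looks like from Swift. This is a minimal sketch rather than the app’s actual code: the region in the URL and the subscription key are placeholders, and it assumes the v1.0 /ocr endpoint.

```swift
import Foundation

// Minimal sketch of an OCR request to the Computer Vision API.
// The region and key below are placeholders, not the app's real values.
let endpoint = URL(string: "https://westeurope.api.cognitive.microsoft.com/vision/v1.0/ocr?detectOrientation=true")!
let subscriptionKey = "<your-subscription-key>"

func recognizeText(in imageData: Data, completion: @escaping (Data?, Error?) -> Void) {
    var request = URLRequest(url: endpoint)
    request.httpMethod = "POST"
    request.setValue("application/octet-stream", forHTTPHeaderField: "Content-Type")
    request.setValue(subscriptionKey, forHTTPHeaderField: "Ocp-Apim-Subscription-Key")
    request.httpBody = imageData

    // The response is JSON grouped as regions -> lines -> words,
    // which is why the link parsing later on works line by line.
    URLSession.shared.dataTask(with: request) { data, _, error in
        completion(data, error)
    }.resume()
}
```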

Before deciding on a solution I had one more important criterion — price. Both platforms offer some kind of trial period and some “free” credits to use during it. Google is more generous with $300, which you can spend over the next 12 months; Microsoft offers £120 for 30 days. After you spend all this, you pay for what you use, which sounds reasonable. Both platforms also give you a certain number of free requests each month, and you pay for anything your app makes above that limit. For Google that amount is 1000 requests a month, and the price afterwards (for OCR) is $3.50 per 1000 requests. For Microsoft it is 5000 requests per month, and afterwards you pay £1.118 per 1000 requests (for OCR, tier S2). I figured 5000 requests a month would be enough for the initial proof-of-concept and maybe for a bit longer, until I figure out a way to make money. Or if no one likes the app and I end up being the only user.

I ignored all the advice I had received in my career so far and skipped prototyping and initial user testing. I do believe these are important, so this is not a post convincing you they’re not. I had a basic idea of how the app would look and behave, and I started coding. The first issue I hit was the request size limit Microsoft had imposed — the API rejects any photo over 4 MB. With my iPhone 7 (now replaced with an even worse offender — the iPhone 8+) my photos were quite big. I was worried shrinking them would lower the recognition rate, but I had no choice. I set the JPEG image quality to 70%, which gave me some breathing space, and the recognition rate seemed to stay the same. Hopefully Microsoft will start supporting HEIF soon.
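Here is a rough sketch of that compression step. The app itself just uses a fixed 70% quality; the back-off loop below is an embellishment to keep the payload under the 4 MB limit whatever the input.

```swift
import UIKit

// Sketch: compress a captured photo so the upload stays under the API's 4 MB limit.
func jpegDataUnderLimit(for image: UIImage, limit: Int = 4 * 1024 * 1024) -> Data? {
    var quality: CGFloat = 0.7
    var data = image.jpegData(compressionQuality: quality)
    // Back off the quality if the photo is still too large.
    while let bytes = data?.count, bytes > limit, quality > 0.1 {
        quality -= 0.1
        data = image.jpegData(compressionQuality: quality)
    }
    return data
}
```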

The second hurdle was sending the same photo over and over — I needed some way of recognising that the user was sending the same photo and skipping the upload altogether. I opted to save an MD5 hash of the image alongside the previous response from the API — that way I could just parse the response again (in case my parsing algorithm had improved since the last time) and present the result to the user.
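The caching boils down to something like the sketch below. It uses CryptoKit (which only arrived later, with iOS 13) purely for brevity, and the responseCache dictionary stands in for whatever persistence the app actually uses.

```swift
import Foundation
import CryptoKit

// Sketch: key the stored API response by an MD5 of the image bytes,
// so the same photo never gets uploaded twice.
func md5Hex(of data: Data) -> String {
    Insecure.MD5.hash(data: data)
        .map { String(format: "%02hhx", $0) }
        .joined()
}

// Stand-in for the app's real persistence layer.
var responseCache: [String: Data] = [:]

func ocrResponse(for imageData: Data, upload: (Data) -> Data) -> Data {
    let key = md5Hex(of: imageData)
    if let cached = responseCache[key] {
        // Re-parse the old response instead of hitting the API again.
        return cached
    }
    let fresh = upload(imageData)
    responseCache[key] = fresh
    return fresh
}
```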

The third problem came after I attended a meetup in a hall with a smaller screen. The presentations looked the same, but there was an important difference — most of the links were spanning two lines. That was something my app couldn’t handle — the Computer Vision API gives you the results line by line, so I had to account for that scenario. And since I was using the built-in link detector in iOS, solving the problem was not that easy. I ended up looking for lines ending in either / or -, which seemed to cover the majority of cases, merging each such line with the next one and parsing the newly formed line as a whole. That seemed to work, but produced false positives in some cases — e.g. when the link was on one line but happened to end with a /. Guess I had to choose the lesser evil.
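In code, the heuristic amounts to something like this sketch, where NSDataDetector is the built-in link detector mentioned above and the merging rule is the trailing slash or hyphen one, false positives included.

```swift
import Foundation

// Sketch: stitch wrapped lines back together, then run the built-in link detector.
func detectLinks(in lines: [String]) -> [URL] {
    var merged: [String] = []
    var index = 0
    while index < lines.count {
        var line = lines[index]
        // Heuristic: a line ending in "/" or "-" probably continues on the next one.
        if (line.hasSuffix("/") || line.hasSuffix("-")), index + 1 < lines.count {
            line += lines[index + 1]
            index += 1
        }
        merged.append(line)
        index += 1
    }

    guard let detector = try? NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue) else {
        return []
    }
    let text = merged.joined(separator: "\n")
    let range = NSRange(text.startIndex..., in: text)
    return detector.matches(in: text, options: [], range: range).compactMap { $0.url }
}
```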

There are other issues which at the moment are unsolvable, like the above-mentioned l instead of / case, or when the engine thinks a g is actually a q (e.g. qithib.com). To solve these, I’d need either a very sophisticated text-to-link algorithm or an on-device OCR engine which I can train.

After building the initial version, I showed it to a colleague at work. She was impressed and pushed me to show it to the whole company. I wasn’t sure how they would react, so I reached out to Emily — the company’s founder — to ask for her opinion/blessing. She liked the idea and helped me present it to the rest of the company. I was a bit nervous as I hadn’t presented anything before. The presentation went great; people loved the idea. Some of them offered to help, others were eager to get their hands on the app. It was a great experience. Presenting to a friendly crowd goes a long way in building your confidence. I am thankful to the wonderful people at Seenit!

Having an app that only works at conferences was a risk. People don’t get attached to apps they only need twice a year. And I was definitely not going into the scanning-business-cards business. I was thinking about other potential use cases on my way home one day when it hit me. I was spending my time on the tube reading the free magazines you find at the stations — my favourite was Shortlist, but I enjoyed reading ES as well. They both have one thing in common — trying to make you buy things. They use similar approaches — models showcase clothes and accessories, and you get a small paragraph alongside a link to the website where you can buy the goods. Bingo — links! I tried the app on a page from a recent Shortlist issue and the result was good (as you can see). The app manages to recognise most of the links, especially if you manage to keep the magazine from curving. And because it’s your magazine and you usually have way more time on your hands compared to a busy conference, you can easily retake the photo if the app fails to scan the links. Brilliant.

Boots, anyone?

To summarise:

  • When you see a problem around you, try and fix it.
  • Do go to conferences; they will get your creative juices flowing.
  • Surround yourself with positive people that support you (like at Seenit).
  • Don’t be shy — talk to people about your ideas!

Wanna try the app?