Optical Character Recognition (OCR): A Branch of Computer Vision

Henry Dinhofer
4 min read · Jul 25, 2016


Optical Character Recognition (OCR) is the technology that converts a scanned document or photo into machine-readable text. It is widely used as a form of data entry from printed paper records, whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, or any other suitable printouts of data. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.

Courtesy of bpolat’s Swift-OCR-Demo-with-IDOL-OnDemand [1]

To give an overview of the process: we first manipulate the image, preprocessing it to eliminate as many “confusing” parts as possible.

Possible steps include cropping out a multi-colored background; de-skewing a document that was not aligned properly when scanned, by rotating it a few degrees clockwise or counterclockwise; and converting a color or greyscale photo to plain black and white, which reduces “blurred” text and better separates the text from its background.
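To make the black-and-white step concrete, here is a minimal sketch in plain Python. It assumes the image is already loaded as a grid of RGB tuples, and it picks the cutoff with Otsu's method, which is my choice for illustration and not necessarily what any particular OCR product does. The function names and the tiny sample image are made up for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 0-255 luminance using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the grey level that maximizes between-class variance."""
    histogram = [0] * 256
    for row in gray:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Map pixels at or below the Otsu threshold to 1 (ink), the rest to 0."""
    threshold = otsu_threshold(gray)
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


# Hypothetical 3x3 photo: a dark vertical stroke on paper with two grey smudges.
INK, PAPER, SMUDGE = (30, 30, 30), (240, 240, 235), (200, 200, 200)
sample_image = [
    [PAPER, INK, PAPER],
    [SMUDGE, INK, PAPER],
    [PAPER, INK, SMUDGE],
]

gray = to_grayscale(sample_image)
binary = binarize(gray)
for row in binary:
    print(row)
```

In this toy image the smudges end up on the background side of the threshold, so only the stroke survives as ink. A real pipeline would typically use an image library for this step and would also handle cropping and de-skewing, which are left out here.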

The preprocessed image is then parsed with a classifier such as k-nearest neighbors, and the recognized characters are output as digitized text.
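As a rough illustration of the k-nearest neighbors idea, the sketch below classifies a tiny binary glyph by majority vote among its closest labeled examples. Everything here is an assumption for the sake of the example: the 3x3 bitmaps, the labels, the Hamming distance, and k=3. A production OCR engine works with far richer features and much more training data.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Glyph = Tuple[int, ...]  # flattened 3x3 binary bitmap, 1 = ink


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equally sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(sample: Glyph, training: List[Tuple[Glyph, str]], k: int = 3) -> str:
    """Label a glyph by majority vote among its k nearest training glyphs."""
    neighbors = sorted(
        training, key=lambda item: hamming_distance(sample, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two hand-drawn variants each of "I", "L" and "T".
TRAINING: List[Tuple[Glyph, str]] = [
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "I"),
    ((0, 1, 0, 0, 1, 0, 0, 0, 0), "I"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 0), "L"),
    ((1, 1, 1, 0, 1, 0, 0, 1, 0), "T"),
    ((1, 1, 1, 0, 1, 0, 0, 0, 0), "T"),
]

# An "L" with one pixel missing from its bottom stroke.
noisy_l: Glyph = (1, 0, 0, 1, 0, 0, 1, 0, 1)
print(knn_classify(noisy_l, TRAINING))
```

The noisy glyph is one pixel away from the first "L" example and two pixels away from the second, so two of its three nearest neighbors vote "L" even though no training glyph matches it exactly.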

History

Early optical character recognition has its roots in telegraphy and in reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Around the same time, a handheld scanner was developed that, when dragged across a printed page, emitted tones corresponding to specific characters.

In the 1920s and 1930s, Emanuel Goldberg developed a “Statistical Machine” for searching microfilm archives using an optical code recognition system. In 1931 he was granted a US patent, which was later acquired by IBM. Goldberg is credited with many inventions, including the microdot.

In 1974 Ray Kurzweil started a company to continue developing omni-font OCR, which can recognize text printed in virtually any font. Kurzweil is often credited with inventing omni-font OCR, although it was already in use by some companies in the late 1960s and the 1970s. Kurzweil saw that the best use for his technology would be a reading machine for the blind, in which a computer reads text aloud to a person. Building it required two more inventions: the CCD flatbed scanner and a text-to-speech synthesizer. His company would later be acquired by Xerox.

Contemporary OCR

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co. in Greeley, Colorado, between 1985 and 1994, with some more changes made in 1996 to port it to Windows and some C++izing in 1998. In 2005 Tesseract was open sourced by HP, and since 2006 it has been developed by Google [2]. Tesseract can output the recognized text as plain text, PDF, or HTML (hOCR).
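If you want to try those output formats yourself, the sketch below shells out to the `tesseract` command-line tool, assuming it is installed and on your PATH. The helper names and the `receipt.png` file are hypothetical. The `pdf` and `hocr` arguments are the output config names that ship with Tesseract, and the output file extension can vary between Tesseract versions, so treat the returned path as a best guess.

```python
import shutil
import subprocess
from typing import List

# Plain text is Tesseract's default output; "pdf" and "hocr" are config names
# passed after the output base to request a searchable PDF or hOCR (HTML).
FORMAT_CONFIGS = {"txt": [], "pdf": ["pdf"], "hocr": ["hocr"]}


def build_tesseract_command(
    image_path: str, output_base: str, output_format: str = "txt"
) -> List[str]:
    """Build the argument list: tesseract <image> <output base> [config]."""
    if output_format not in FORMAT_CONFIGS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["tesseract", image_path, output_base, *FORMAT_CONFIGS[output_format]]


def run_ocr(image_path: str, output_base: str, output_format: str = "txt") -> str:
    """Run Tesseract and return the path it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, output_format)
    subprocess.run(command, check=True)
    # Assumption: the output extension matches the format name.
    return f"{output_base}.{output_format}"


# Show the commands without running them; receipt.png is a placeholder file.
for fmt in ("txt", "pdf", "hocr"):
    print(" ".join(build_tesseract_command("receipt.png", "receipt", fmt)))
```

Calling `run_ocr("receipt.png", "receipt", "pdf")` would then invoke Tesseract for real and leave the result next to the image.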

HP’s OCR tool was once offered under HP IDOL OnDemand, but at some point in the past few years it was transitioned to management under Hewlett Packard Enterprise (HPE). It’s now accessible via HPE’s Haven OnDemand [3]. It is a paid tool for developers: if you plan on publishing an app, you will need to pay for their service. Under the freemium model you can make up to 50,000 OCR calls. It also comes with 5,000 monthly face detection/recognition calls if you want to play around with that. The highest tier, the “Entrepreneur tier”, costs $315 per month and comes with 120k API Units and 35 Resource Units.

OpenCV [4] is the open source library and community for computer vision (CV). Any analysis of images by a computer falls within the field of CV. Continuously in development, the OpenCV project is used in many niche situations that require image recognition. This is an area of computer science and mathematics that PhD candidates are actively exploring.

Results:

I tested out HPE’s OCR Document tool. It’s actually super easy to create an account, and before I knew it I had access to an API key. Their website comes equipped with a built-in image uploader, so all I needed to do was take a photo and save it to my computer. Then, using my laptop’s browser, I uploaded and tested the photo.

The photo & digitized text:

Photo was taken perpendicular to surface, with uniform red background and without flash
Result is formatted as a JSON object

The digitized text comes only from the left side of the receipt. This is likely because I need to change a setting within HPE’s OCR so that it recognizes multiple columns within the photo. On the upside, there were no “junk” characters within the JSON text. Super cool.

References:

[1] GIF from bpolat’s Swift OCR demo: https://github.com/bpolat/Swift-OCR-Demo-with-IDOL-OnDemand

[2] Tesseract OCR: https://github.com/tesseract-ocr/tesseract Another resource, a Tesseract framework for iOS 7+: https://github.com/gali8/Tesseract-OCR-iOS

[3] HPE Haven OnDemand: https://www.havenondemand.com/

[4] OpenCV (Open Computer Vision): http://opencv.org/
