Fundamentals of optical character recognition (OCR)

Md Razaul Karim
12 min read · Jan 23, 2024

What is Optical Character Recognition (OCR)?

Optical character recognition, or OCR for short, is a technology that recognizes text inside images, photos, and scanned documents. This technology can convert almost any image containing text (printed, typed, or handwritten) into machine-readable text.

OCR technology gained popularity in the early 1990s, particularly when digitizing historical newspapers. Since then, this technology has undergone major improvements. Modern solutions now have the capability to achieve near-perfect OCR accuracy.

SHORT HISTORY OF OCR:

In 1974, Ray Kurzweil founded Kurzweil Computer Products, Inc. Their omni-font optical character recognition product could recognize text printed in almost any font. Kurzweil believed that the most valuable use of this technology would be to help the blind, so he developed a reading machine that could read text aloud in a text-to-speech format.

In 1980, he sold his company to Xerox, which saw the potential to advance paper-to-computer text conversion and wanted to commercialize the technology further.

Before OCR technology was available, manual retyping was the only way to convert documents into digital format. This process was time-consuming and led to inaccuracies and typing errors. Today, Optical Character Recognition technology is used in several industries for various purposes.

A basic theoretical overview of the working of an Optical Character Recognition system.

The most well-known use case for OCR is converting printed paper documents into machine-readable text documents. Once a scanned paper document goes through OCR processing, the text of the document can be edited with word processors like:

  • Microsoft Word
  • Google Docs

OCR is often used as a “hidden” technology, powering many well-known systems and services in our daily lives. Lesser-known, but equally important, use cases for OCR technology include:

  • Passport recognition for airports
  • Traffic sign recognition
  • Extracting contact information from documents or business cards
  • Converting handwritten notes to machine-readable text
  • Defeating CAPTCHA anti-bot systems
  • Making electronic documents searchable like Google Books or PDFs
  • Data entry for business documents (bank statements, invoices, receipts)
  • Aids for the blind

OCR technology has proven immensely useful in digitizing historic newspapers and texts: converting them into fully searchable formats has made accessing those earlier texts easier and faster.

OCR PROCESS

How to Determine If a Collection Needs OCR

How does a collector know if their materials need OCR at all? Most collectors who want their digital collection to include any semblance of full-text search will require OCR at some point in the digitization process. Without OCR, collections are reduced to images without content. This might be all some collections need, but most projects will benefit from the added layer of text.

Full-text search depends on OCR quality and enhances a user’s search accuracy. For instance, someone using the final product might search for “Tolkien” to look for the author in the collection. Without the text data overlay, there would be no information in the images to search. But if the document has been properly OCRed, a user could sort individual documents in the collection by author name (quickly leading them to any works by Tolkien in the collection) or search the entire collection for all mentions of the author within a document.
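
To make this concrete, here is a minimal sketch of full-text search over an OCR text layer. The page filenames and texts below are made-up stand-ins for real OCR output:

```python
# Sketch: once OCR has produced a text layer per scanned page,
# full-text search becomes a simple lookup over that layer.
ocr_layer = {
    "page_001.jpg": "The Hobbit, by J. R. R. Tolkien, first edition...",
    "page_002.jpg": "An essay on medieval poetry...",
    "page_003.jpg": "Tolkien's notes on Beowulf...",
}

def search(collection, query):
    """Return the pages whose OCR text mentions the query (case-insensitive)."""
    q = query.lower()
    return [page for page, text in collection.items() if q in text.lower()]

print(search(ocr_layer, "Tolkien"))  # finds both pages mentioning the author
```

Without the `ocr_layer` dictionary, the images alone would give the search function nothing to look through.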

Let’s go…

The image below shows the different phases in the workflow of an OCR system.

OCR WORKFLOW

Let’s briefly discuss each phase shown in the image above:

Applications of OCR

  • Mobile banking applications: Mobile banking apps use OCR to capture and recover data from cheques for deposit.
  • Healthcare records: OCR is used in healthcare to manage patient information so that records can be accessed quickly.
  • Business card scanners: Apps with OCR capabilities can scan business cards and save contacts right into the consumer’s address book.
  • License plate recognition: OCR technology is used by parking lots and law enforcement to read and recognize license plates for safety and parking management.
  • Mail: Postal services use OCR to automatically sort and process mail according to addresses and ZIP codes.

Advantages of OCR

  • Text digitization: OCR transforms printed or handwritten text into a digital format that is machine readable and can be easily stored, edited and manipulated.
  • Searchability: It enables users to rapidly find specific information inside big document collections by making scanned documents and images searchable.
  • Time and Money Savings: By automating OCR, businesses may cut back on the time and money spent on manually entering information and document management.
  • Archiving and document management: It makes it easier to digitize paper records and archives, which lowers the need for physical storage and enhances document retrieval.
  • Multilingual Support: Numerous OCR systems include multilingual support, making them useful for multinational corporations and organizations.

Disadvantages of OCR

  • Proofreading: Errors may still be present in the output even with high-quality OCR, so manual proofreading is frequently required, especially for important texts.
  • Handwriting recognition: OCR technology can, in some cases, identify handwritten text, but it typically performs worse on handwriting than on printed text.
  • Security concerns: OCR technology can be exploited maliciously to extract private information from documents.
  • Cost: Premium OCR software can be expensive, and commercial solutions may require ongoing licensing or subscription fees.
  • Limited context understanding: These systems frequently lack context awareness, which can cause them to misread words or phrases whose meanings depend on the surrounding text.

Convert image to text using OCR

Image-to-text conversion is the most common OCR use case. It refers to any web or mobile application transforming a picture containing text into plain machine-readable text. The main goal of those tools is to help the user transcribe all the text captured in a photo in a few seconds, instead of manually typing it out. The resulting text can then be copied to the user’s clipboard and used on their device.

Digitize documents with OCR

Also known as dematerialization, digitizing a document consists of creating a machine-readable text copy of a document to store in document management software. It can be either a plain text format or an editable copy with the same layout. In the second scenario, an OCR capable of detecting the position of each word along with its text content is required to recover the layout of the documents.

Most digitization companies are providing their clients with the appropriate hardware (scanners) to handle the conversion from paper documents to digital data.

Document digitization process using OCR, paper, scan, OCR, and document management software

Make PDFs searchable and indexable with OCR

Any archive of unstructured documents can be transformed into machine-readable text in order to make each document searchable using a natural language query. Using only an OCR, you can simulate a CTRL/CMD+F search within a scanned document on the text it contains. For more advanced OCR use cases, it’s likely that you need to build a search engine to look up different semantic information written in your document. Adding key information extraction features on top of your OCR might be required before indexing the extracted data in a search engine.

Optical character recognition for document recognition

Also known as document classification, this task is about automatically classifying a new document by assigning it a type from a predefined set of document classes. The role of the OCR, in this case, is to extract all the words and characters from a document and use them as features for a classifier down the road. This classifier can be based on simple keyword detection rules (e.g. taxes for invoices or receipts, passport numbers for passport documents) as well as machine learning (ML) algorithms for more complex classification. Using an ML model is a real advantage for this task, as it doesn’t require an extremely robust OCR to achieve very high performance.
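
A minimal sketch of the keyword-rule variant described above. The document classes and keyword lists here are illustrative assumptions, not a standard taxonomy:

```python
# Keyword-rule document classifier over OCR output.
RULES = {
    "invoice": ["invoice", "vat", "taxes", "total due"],
    "receipt": ["receipt", "cash", "change"],
    "passport": ["passport", "nationality", "date of birth"],
}

def classify(ocr_text):
    """Assign a document class based on keyword hits in the OCR text."""
    text = ocr_text.lower()
    scores = {label: sum(kw in text for kw in kws) for label, kws in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("INVOICE No. 42 - total due: 99.00 EUR incl. taxes"))  # "invoice"
```

An ML classifier would replace the hand-written rules with learned weights over the same extracted words.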

Image feature extraction

Most deep learning models process visual information using several layers of convolutions. Convolutions are mathematical operations, used in neural networks, that are typically applied to recognize patterns present in images. Each layer extracts information from its predecessor by identifying local spatial patterns. Stacking such layers thus extracts increasingly wide and complex spatial patterns.
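
To illustrate, here is a minimal, unoptimized convolution in plain NumPy. Real networks use learned kernels and heavily optimized implementations, but the sliding-window mechanics are the same:

```python
import numpy as np

# Slide a small kernel over the image and compute a weighted sum at each
# location. Stacking such layers lets a network detect larger patterns.
def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left-to-right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])
print(conv2d(image, edge_kernel))  # fires only at the 0 -> 1 boundary
```

The output is high only where the local pattern (a vertical edge) is present, which is exactly the "local spatial pattern" detection described above.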

Feature pooling

Each object is represented by a set of coordinates: the bounding box that encloses the object we are looking for (words, in this case). For each object candidate, let’s imagine that the extracted features are a set of (N, N) matrices (N being generally much smaller than the input image’s size), where the top left corner of each corresponds to the spatial patterns extracted from the top left of your input image. We now need to handle a wide range of object sizes in the image. To do this, we will set some priors:

  • a set of expected aspect ratios
  • a set of expected scales
  • a set of expected object centers

More specifically, at a given location/object center, we will only consider objects that are close to any combination of our predefined aspect ratios and scales.
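
The priors above can be sketched as a simple anchor generator. The centers, scales, and aspect ratios below are illustrative values, not ones taken from any particular detector:

```python
from itertools import product

# One candidate box per (center, scale, aspect ratio) combination.
centers = [(16, 16), (16, 48), (48, 16), (48, 48)]  # illustrative grid
scales = [8, 16, 32]             # box area ~ scale ** 2
aspect_ratios = [0.5, 1.0, 2.0]  # width / height

def make_anchors(centers, scales, ratios):
    anchors = []
    for (cx, cy), s, r in product(centers, scales, ratios):
        w = s * (r ** 0.5)  # width grows with sqrt(ratio)...
        h = s / (r ** 0.5)  # ...height shrinks, keeping area ~ s ** 2
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(centers, scales, aspect_ratios)
print(len(anchors))  # 4 centers * 3 scales * 3 ratios = 36
```

At prediction time, the detector only has to decide which of these prior boxes contain an object and how to nudge them, rather than regressing boxes from scratch.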

Now that we have our extracted features and the approximate estimated localization (bounding box) of an object, we will extract the features from each location and resize them using an operation called region pooling (RoI pooling in Faster-RCNN, RoI Align in Mask-RCNN ) illustrated below.

Feature pooling computer vision

Using this, if we had M expected object centers, 3 possible aspect ratios, and 3 possible scales, we would get a set of 9 * M pooled features (each corresponding to one combination of center, aspect ratio, and scale). Please note that this is true irrespective of the size of our extracted features.
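
A toy version of region pooling can make this concrete: crop the candidate box from the feature map, then max-pool the crop down to a fixed output size, whatever the crop's dimensions. This is a simplification of RoI pooling (real implementations also handle sub-pixel alignment, as RoI Align does):

```python
import numpy as np

# Crop a box from the feature map and max-pool it to a fixed (2, 2) output.
def roi_pool(features, box, output_size=2):
    x0, y0, x1, y1 = box
    crop = features[y0:y1, x0:x1]
    h_bins = np.array_split(np.arange(crop.shape[0]), output_size)
    w_bins = np.array_split(np.arange(crop.shape[1]), output_size)
    out = np.zeros((output_size, output_size))
    for i, rows in enumerate(h_bins):
        for j, cols in enumerate(w_bins):
            out[i, j] = crop[np.ix_(rows, cols)].max()
    return out

features = np.arange(64, dtype=float).reshape(8, 8)  # stand-in (N, N) features
pooled = roi_pool(features, box=(1, 1, 7, 5))  # 4-row by 6-column crop -> 2x2
print(pooled.shape)  # (2, 2)
```

Because every crop ends up the same size, the downstream classifier can process all candidates with one fixed-size head, regardless of each box's scale or aspect ratio.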

Box classification and regression

For each of our object candidates, we can now perform different tasks. In object detection, we will refine the bounding box (starting from our corresponding prior of center + scale + aspect ratio), and classify the object (e.g. is it a dog or a cat?). In our context, the classification will simply be binary (is this a word or not?).

A second approach to text detection: image segmentation

Popular computer vision model architectures: U-Net, DeepLab

Image segmentation for text detection

Using the same extracted features as in object detection, the set of (N, N) spatial features, we now need to upscale those back to the image dimensions (H, W). Remember that N is smaller than H or W.

Feature upscaling

Especially if our (N, N) matrices are much smaller than the image, basic upsampling would bring little value. Rather than just interpolating our matrices to (H, W), architectures use different tricks to learn these upscaling operations, such as transposed convolutions. We then obtain fine-grained features for each pixel of the image.
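
To make the contrast concrete, here is a sketch comparing fixed nearest-neighbor upsampling with a toy transposed convolution. The kernel here is hand-picked; in a real network it would be learned:

```python
import numpy as np

# 1) Fixed interpolation (nearest-neighbor): adds no new information.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0]])
upsampled = features.repeat(2, axis=0).repeat(2, axis=1)  # (2, 2) -> (4, 4)

# 2) Transposed convolution: each input value "stamps" a kernel onto the
#    output, so the upscaling operation itself becomes trainable.
def transposed_conv2d(x, kernel, stride=2):
    kh, kw = kernel.shape
    oh = (x.shape[0] - 1) * stride + kh
    ow = (x.shape[1] - 1) * stride + kw
    out = np.zeros((oh, ow))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out

kernel = np.ones((2, 2))  # a learned kernel in practice
print(transposed_conv2d(features, kernel).shape)  # (4, 4)
```

With an all-ones kernel and stride 2, the transposed convolution reproduces nearest-neighbor upsampling exactly; a learned kernel lets the network do better than that.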

Binary classification

Using the many features of these pixels, a few operations are performed to determine their category. In our case, we will only determine whether the pixel belongs to a word, which produces a result similar to the “segmentation” illustration from the previous figure.

Bounding box conversion

Finally, the binary segmentation map needs to be converted into a set of bounding boxes. This sounds like a much easier task than producing the segmentation map, but two words close to each other may merge into a single blob in the map. Some post-processing is required to produce relevant object localization.
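
A minimal sketch of this post-processing step, assuming the segmentation map is a small binary mask: group "word" pixels into 4-connected components and take each component's extent as a box. Real pipelines would typically use OpenCV's connectedComponents or findContours instead:

```python
from collections import deque

def boxes_from_mask(mask):
    """Return one (x0, y0, x1, y1) box per 4-connected component of 1-pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                x0 = x1 = x; y0 = y1 = y
                queue = deque([(y, x)]); seen[y][x] = True
                while queue:  # BFS over the component, growing the box
                    cy, cx = queue.popleft()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

mask = [[1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0]]
print(boxes_from_mask(mask))  # two separate blobs -> two boxes
```

Note that if the two blobs touched, this would return a single merged box, which is exactly the failure mode described above.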

OCR Text Recognition

The text recognition block in an OCR pipeline is responsible for transcribing the character sequence of an image:

  • Input: image with a single character sequence
  • Output: the value of the word

The assumption that there is a single character sequence allows us to consider two approaches.

A first approach to text recognition: rolling character classification

Rolling Character Text Recognition

Split words into character crops

You can find numerous computer vision techniques to horizontally localize each character, such as horizontal color histograms. Generally, this operation does not require extensive computation power, thanks to our assumption that the image contains a single character sequence.
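
A minimal sketch of the histogram idea, assuming a tiny binarized word image where 1 means ink: empty columns mark the gaps between characters.

```python
# Split a word image into character crops using a per-column ink histogram.
def split_characters(binary_image):
    """binary_image: rows of 0/1 pixels, 1 = ink. Returns (start, end) column spans."""
    n_cols = len(binary_image[0])
    histogram = [sum(row[c] for row in binary_image) for c in range(n_cols)]
    spans, start = [], None
    for c, ink in enumerate(histogram):
        if ink and start is None:
            start = c                     # entering a character
        elif not ink and start is not None:
            spans.append((start, c - 1))  # leaving a character
            start = None
    if start is not None:
        spans.append((start, n_cols - 1))
    return spans

word = [[1, 1, 0, 1, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 1]]
print(split_characters(word))  # three character spans
```

Each returned span can then be cropped out and passed to the character classifier described next.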

Character classification

Instead of a character sequence to identify, this now becomes a simple image classification problem: rather than predicting whether the image shows a dog or a cat (or whether a region is a word or not), we directly identify the character in our image.

Fortunately for us, this does not require large input images or very deep architectures, as Yann LeCun proved with his LeNet-5 model trained to classify handwritten digits back in 1998. Concisely, the model extracts spatial features, aggregates them, and classifies the whole set of features to predict the result. We then only need to assemble our character predictions into word predictions.

A second approach for text recognition: sequence modeling

OCR Sequence modeling character recognition

Image feature extraction

Our word image crops can now be fed to a model to extract spatial features, similar to what we did in object detection (at a much lower resolution this time). For simplification purposes, let’s imagine this yields a set of (1, N) spatial features.

Aligning and classifying

Now it is very important to note that the extracted features will always have the same output size irrespective of the length of the word, meaning that in our (1, N) spatial features, there are rarely exactly N characters. You might have understood it: we have an alignment problem.

This alignment problem is common to all sequence modeling problems such as transcribing audio into text, image into a character sequence, etc.

One of the modern approaches to this involves Connectionist Temporal Classification (CTC), which proceeds in two steps:

  • alignment: going from our N = 6 features (x1, …, x6) to N aligned characters (“c”, “c”, “a”, “a”, “a”, “t”)
  • collapsing: merging the N aligned characters into a clean output (“c”, “a”, “t”)
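
The collapsing step can be sketched in a few lines. The blank symbol ("-" here) is the CTC separator that preserves genuine double letters:

```python
# CTC collapse: merge repeated characters, then drop the blank symbol.
def ctc_collapse(aligned, blank="-"):
    out, prev = [], None
    for ch in aligned:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse(["c", "c", "a", "a", "a", "t"]))  # "cat"
print(ctc_collapse(["h", "e", "l", "-", "l", "o"]))  # "hello": blank keeps the double "l"
```

Without the blank, the two "l"s in "hello" would merge into one, which is why the alignment alphabet includes that extra symbol.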

OCR resources

Depending on your constraints and internal resources, you can leverage OCR in multiple ways: from implementing your own OCR and handling its production-ready deployment to delegating the whole process to a third-party commercial service.

Open-source OCR libraries

While open-source resources usually come without any direct financial expense, you need internal resources to orchestrate your service’s deployment. However, this usually leaves room to customize your prediction block (re-training, optimization, interfacing), since you have access to the entire codebase.

Some of the open-source OCR tools are Tesseract and OCRopus.

In this article, we will focus on Tesseract OCR, and to read the images we need OpenCV.

Installation of Tesseract OCR:

Download the latest installer for Windows from “https://github.com/UB-Mannheim/tesseract/wiki”. Execute the .exe file once it is downloaded.

The typical installation path on Windows systems is C:\Program Files.

So, in my case, it is “C:\Program Files\Tesseract-OCR\tesseract.exe”.

Next, to install the Python wrapper for Tesseract, open the command prompt and execute the command “pip install pytesseract”.

OpenCV

OpenCV(Open Source Computer Vision) is an open-source library for computer vision, machine learning, and image processing applications.

OpenCV-Python is the Python API for OpenCV.

To install it, open the command prompt and execute the command “pip install opencv-python”.

Build sample OCR Script

1. Reading a sample Image

import cv2

Read the image using cv2.imread() method and store it in a variable “img”.

img = cv2.imread("image.jpg")

If needed, resize the image using cv2.resize() method

img = cv2.resize(img, (400, 400))

Display the image using cv2.imshow() method

cv2.imshow("Image", img)

Wait indefinitely for a key press (this keeps the window open)

cv2.waitKey(0)

Close all open windows

cv2.destroyAllWindows()

2. Converting Image to String

import pytesseract

Set the tesseract path in the code

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

To convert an image to string use pytesseract.image_to_string(img) and store it in a variable “text”

text = pytesseract.image_to_string(img)

print the result

print(text)

Complete code:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread("image.jpg")
img = cv2.resize(img, (400, 400))
text = pytesseract.image_to_string(img)
print(text)
cv2.imshow("Image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

End Notes

By the end of this article, we have understood the concept of Optical Character Recognition (OCR) and become familiar with reading images using OpenCV and grabbing the text from images using pytesseract.
