As organizations everywhere look to digitize their operations, transforming physical documents into digital formats is common low-hanging fruit. This is usually done with Optical Character Recognition (OCR), where images of text (the scanned physical document) are converted into machine-encoded text via one of several well-developed text-recognition algorithms. Document OCR performs best on printed text against a clean background, with consistent paragraphing and font size.
In practice, this scenario is far from the norm. Invoices, forms and even identity documents have information scattered throughout the document space, making the task of digitally extracting relevant data somewhat more complicated.
In this article, we will explore a simple method using Python to define areas in the document image for OCR. We will use an example of a document with information scattered throughout the document space — a passport. The following sample passport is placed within a white background, simulating a photocopied passport copy.
From this passport image, we want to obtain the following fields:
- First/Given Name
- First/Given Name in Chinese Language Script
- Last/Surname in Chinese Language Script
- Passport Number
To begin, we will import all required packages. The most important are OpenCV for computer vision operations and PyTesseract, a Python wrapper for the powerful Tesseract OCR engine. The respective documentation pages provide excellent instructions for installing and configuring these libraries.
import cv2
import math
import pytesseract
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Next, we will read our passport image using cv2.imread. Our first task is to extract the actual passport document area from this pseudo-scanned page. We will achieve this by detecting the edges of our passport and cropping it out from the image.
img = cv2.imread('images/Passport.png', 0)  # 0 loads the image in grayscale
img_copy = img.copy()
img_canny = cv2.Canny(img_copy, 50, 100, apertureSize = 3)
The Canny algorithm contained in the OpenCV library uses a multi-stage process to detect edges in our image. The last three parameters are the lower and upper hysteresis thresholds (minVal and maxVal respectively) and the aperture size of the Sobel operator used internally.
Running the Canny algorithm produces the following output. Note that the thresholds control how much edge detail is retained: lowering them keeps more (and noisier) edges, while raising them keeps only the strongest ones.
img_hough = cv2.HoughLinesP(img_canny, 1, math.pi / 180, 100, minLineLength = 100, maxLineGap = 10)
We next apply another algorithm, the Hough Transform, to our edge-detected image to map out the shape of our passport area by detecting lines. The minLineLength parameter sets the minimum length (in pixels) a segment must have to be considered a 'line', and the maxLineGap parameter sets the maximum allowable gap between points that can still be linked into the same line.
# Bound the passport area by the extremes of the detected line endpoints
mins = np.amin(img_hough, axis=0)
maxs = np.amax(img_hough, axis=0)
(x, y, w, h) = (mins[0, 0], mins[0, 1], maxs[0, 0] - mins[0, 0], maxs[0, 1] - mins[0, 1])
img_roi = img_copy[y:y+h, x:x+w]
Our passport is bordered on all sides by straight lines — the edges of the document. Thus, having our line information, we can choose to crop our passport area by the outer edges of detected lines:
Finally, we can do some good ol’ OCR!
After rotating our passport upright, we set about selecting the region within our image where we want to capture data. Almost all international passports conform to ICAO standards, which outline specifications for the design and layout of passport pages. One of these specifications is the Machine-Readable Zone (MRZ), those funny two lines at the bottom of your passport document. Most of the key information in the Visual Inspection Zone (VIZ) of your document is also contained in the MRZ, which can be read by a machine. In our exercise, that machine is our trusty Tesseract engine.
img_roi = cv2.rotate(img_roi, cv2.ROTATE_90_COUNTERCLOCKWISE)
(height, width) = img_roi.shape
img_roi_copy = img_roi.copy()
dim_mrz = (x, y, w, h) = (1, round(height*0.9), width-3, round(height-(height*0.9))-2)
img_roi_copy = cv2.rectangle(img_roi_copy, (x, y), (x + w, y + h), (0, 0, 0), 2)
Let us define the MRZ region in our passport image using four dimensions: horizontal offset (from left), vertical offset (from top), width and height. For the MRZ, we will assume that it is contained within the bottom 10% of our passport. Thus, using OpenCV’s rectangle function, we can draw a box around the region to verify our dimension selection.
img_mrz = img_roi[y:y+h, x:x+w]
img_mrz = cv2.GaussianBlur(img_mrz, (3, 3), 0)
ret, img_mrz = cv2.threshold(img_mrz,127,255,cv2.THRESH_TOZERO)
mrz = pytesseract.image_to_string(img_mrz, config = '--psm 12')
We are now ready to apply OCR. In our image_to_string call, we configure the page segmentation mode (--psm 12), 'Sparse text with orientation and script detection (OSD)'. This aims to capture all available text in our image.
Comparing the PyTesseract output to our original passport image, we can observe some errors in reading special characters. For a more accurate readout, this can be tuned with Tesseract's character-whitelist configuration; for our purposes, however, the accuracy of the current readout is sufficient.
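As a minimal sketch of that whitelist option (the variable names here are illustrative, and the commented-out call assumes the Tesseract binary is installed), we can restrict recognition to the MRZ alphabet of uppercase letters, digits and the '<' filler:

```python
import string

# The MRZ uses only uppercase letters, digits and the '<' filler,
# so we can tell Tesseract to consider nothing else
mrz_charset = string.ascii_uppercase + string.digits + '<'
mrz_config = '--psm 12 -c tessedit_char_whitelist=' + mrz_charset

# Hypothetical usage (requires the Tesseract binary and pytesseract):
# mrz = pytesseract.image_to_string(img_mrz, config=mrz_config)
```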
mrz = [line for line in mrz.split('\n') if len(line) > 10]
if mrz[0][0:2] == 'P<':
    # Line 1 reads P<CCCSURNAME<<GIVEN<NAMES<<<..., where CCC is the issuing country
    lastname = mrz[0][5:].split('<')[0]
    firstname = ' '.join([i for i in mrz[0].split('<<')[1].split('<')
                          if len(i) > 0 and not i.isspace()])
# Line 2 begins with the nine-character passport number
pp_no = mrz[1][0:9]
Applying some string manipulation based on ICAO’s guidelines on MRZ code structure, we can extract our passport holder’s Last Name, First Name and Passport number:
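The MRZ also carries built-in error detection: under ICAO Doc 9303, key fields are followed by a check digit computed with repeating weights of 7, 3 and 1. A small sketch of this scheme (the function name is my own) can be used to sanity-check the OCR readout; the sample value below is the specimen passport number from ICAO Doc 9303:

```python
def mrz_check_digit(field):
    """ICAO 9303 check digit: weights 7, 3, 1 repeating; '<' counts as 0."""
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch.isalpha():
            value = ord(ch.upper()) - ord('A') + 10  # A=10 ... Z=35
        else:
            value = 0  # '<' filler
        total += value * (7, 3, 1)[i % 3]
    return total % 10

# Specimen passport number 'L898902C3' from ICAO Doc 9303 carries check digit 6
assert mrz_check_digit('L898902C3') == 6
```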
What about text that is not in English? Not a problem — the Tesseract engine has trained models for over 100 languages (though the robustness of OCR performance differs for each supported language).
img_roi_copy = img_roi.copy()
dim_lastname_chi = (x, y, w, h) = (455, 1210, 120, 70)
img_lastname_chi = img_roi[y:y+h, x:x+w]
img_lastname_chi = cv2.GaussianBlur(img_lastname_chi, (3,3), 0)
ret, img_lastname_chi = cv2.threshold(img_lastname_chi, 127, 255, cv2.THRESH_TOZERO)
dim_firstname_chi = (x, y, w, h) = (455, 1300, 120, 70)
img_firstname_chi = img_roi[y:y+h, x:x+w]
img_firstname_chi = cv2.GaussianBlur(img_firstname_chi, (3,3), 0)
ret, img_firstname_chi = cv2.threshold(img_firstname_chi,127,255,cv2.THRESH_TOZERO)
Using the same method of region selection, we again define dimensions (x, y, w, h) for our target data fields, and apply blurring and thresholding to the cropped image extract.
lastname_chi = pytesseract.image_to_string(img_lastname_chi, lang='chi_sim', config='--psm 7')
firstname_chi = pytesseract.image_to_string(img_firstname_chi, lang='chi_sim', config='--psm 7')
Now, in our image_to_string parameters, we add lang='chi_sim' to tell Tesseract the input text's script, Simplified Chinese.
To complete the exercise, pass all collected fields to a dictionary and output to a table for practical use.
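As a sketch of that final step (the field names and values below are placeholders, not data read from the sample passport), the collected variables can be gathered into a dictionary and passed to a pandas DataFrame:

```python
import pandas as pd

# Placeholder values standing in for the OCR results collected above
fields = {
    'Passport No.': ['E1234567<'],
    'Last Name': ['SPECIMEN'],
    'First Name': ['SAMPLE'],
    'Last Name (Chinese)': ['姓'],
    'First Name (Chinese)': ['名'],
}
df = pd.DataFrame(fields)
print(df.to_string(index=False))
```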
Explicitly defining regions of interest is just one of many ways of obtaining your target data with OCR. Depending on your use case, other approaches such as contour analysis or object detection may be more efficient.
As demonstrated in our passport exercise, proper pre-processing of the image prior to applying OCR is key. When working with real-world documents of varying (and sometimes questionable) image quality, it pays off to experiment with different pre-processing techniques to find a combination that works best for your document type. Have fun!
The full Python script is available on GitHub.