DOCUMENT EXTRACTION

Daisy
Google Cloud - Community
Jun 2, 2023

Document extraction and classification are major use cases in almost every industry, particularly where a large part of operations still takes place on physical documents. These documents are often highly unstructured or handwritten and require manual effort to extract information or update a process, increasing an organization's effort and operating cost. Information extraction can be a significant manual overhead in such cases. An optical character recognition (OCR) engine can play a major role in understanding this set of documents and mining valuable information from them. Adding a machine learning intelligence layer on top of the OCR-extracted text can further transform this ordinary text into valuable data.

Tesseract is the most basic model used for solving most OCR-related use cases. Although Tesseract is well established in the field of OCR for extracting textual information from images, it struggles with extracting data from tables and handwritten content. Extraction of custom entities also becomes a tedious and time-consuming task when documents lack uniformity in their structural content, and a large amount of custom logic has to be developed for named-entity mapping in such documents. This makes data extraction highly complex, increasing development effort and time. Over time, more sophisticated models have evolved in this domain: EasyOCR, Keras-OCR and PaddleOCR are a few of the advanced open-source options that combine deep-learning-based recognition with an extraction model for text extraction. An even more efficient way of solving such use cases is to use one of the available cloud services. Google's Document AI, Amazon Textract and Azure Form Recognizer are a few of the services that can be adapted for document and custom entity extraction, and they play a major role in industry for building models that process large volumes of documents.
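
As a quick illustration of this open-source baseline, a minimal Tesseract call through the pytesseract wrapper might look like the sketch below; the input file name is a placeholder.

```python
# Minimal open-source OCR sketch using Tesseract via pytesseract.
# Assumes Tesseract is installed locally; "invoice.png" is a placeholder input.
from PIL import Image
import pytesseract

image = Image.open("invoice.png")

# Plain text extraction; works well for clean, printed text,
# but struggles with tables and handwriting as noted above.
text = pytesseract.image_to_string(image)
print(text)
```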

A few advantages of using cloud services (a minimal Document AI call is sketched after this list):

i) Sensitive document information often cannot be uploaded to third-party online tools for generating ground truth; cloud services come with a built-in annotation service.

ii) They provide a human-in-the-loop (HITL) service, where wrongly extracted information can be corrected and fed back to retrain the model so it gives the correct output.

iii) They can be used in both synchronous and asynchronous modes of extraction.

iv) They require less development time and effort.

v) They are capable of processing high volumes of data.

vi) They are capable of processing multilingual documents.

vii) They are capable of processing tables and handwritten information.

viii) Custom, use-case-specific models can be built on top of the provided models with a small amount of training data.
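
As one concrete example of such a cloud service, a minimal synchronous Document AI request in Python might look roughly like the sketch below. The project, location and processor IDs and the file name are placeholders, and the exact import path and client surface may differ by library version.

```python
# Rough sketch of a synchronous Document AI request (all IDs are placeholders).
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()

# Full resource name of an existing processor (e.g. a form parser).
name = "projects/my-project/locations/us/processors/my-processor-id"

with open("form.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# The response carries the full text plus the entities/tables detected by the processor.
print(result.document.text[:500])
```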

A few use cases:

i) Document summarization - summarizing the textual content of a document.

ii) Document classification - classifying a document into its subtypes.

iii) Content mining - extracting fields from documents.

iv) Complaint routing - routing complaints to the appropriate CRM based on the risk of the complaint.

v) KYC verification - extracting data from KYC documents.

vi) Document verification and matching - validating form data against supporting document information.

vii) Signature validation.

HOW TO BUILD AN E2E DOCUMENT EXTRACTION SOLUTION

The general process flow of a document extraction solution consists of the steps below (a minimal end-to-end sketch follows the list):

i) Document ingestion.

ii) Conversion of the file to an image.

iii) Individual page segmentation - segment the document into individual pages, as extracting the text per page helps eliminate duplicate entity pairs.

iv) Image orientation detection and correction - detect the orientation of the input image and rotate it accordingly.

v) Text extraction - use any OCR engine to extract the text from the document. Text can be extracted from the document as a line, a paragraph or a block.

vi) Field-entity value mapping - field-entity values can be mapped by developing custom logic or by training a named entity recognition model.

vii) Post-extraction correction - OCR engines are prone to misclassifying characters, such as A as 4, 5 as S or I as 1; wherever possible, this should be handled in a correction script after extraction.
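
To make the flow concrete, here is a minimal sketch of steps i) to v) and vii) using pdf2image and Tesseract via pytesseract; the input file name and the correction mapping are placeholders, and a real solution would plug in orientation handling (step iv) and entity mapping (step vi) as described above.

```python
# Minimal sketch of the pipeline above: ingest a PDF, split it into pages,
# OCR each page, and apply a simple post-extraction correction.
# Assumes poppler and Tesseract are installed; "input.pdf" is a placeholder.
from pdf2image import convert_from_path
import pytesseract

# Example-only mapping for common OCR confusions handled after extraction.
CORRECTIONS = str.maketrans({"|": "I"})

def extract_pdf(path: str) -> list[str]:
    # ii) Convert the file to images, one per page, at 300 dpi.
    pages = convert_from_path(path, dpi=300)
    texts = []
    for page in pages:  # iii) process each page separately
        # v) OCR the page image.
        raw = pytesseract.image_to_string(page)
        # vii) naive post-extraction cleanup.
        texts.append(raw.translate(CORRECTIONS))
    return texts

if __name__ == "__main__":
    for i, text in enumerate(extract_pdf("input.pdf"), start=1):
        print(f"--- page {i} ---")
        print(text)
```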

HOCR EXTRACTION FOR COMPLEX DOCUMENTS

hOCR is an HTML-based OCR output format that is mostly used for extracting meta information from a PDF file. Parsing an hOCR file helps locate text in a document using reference entities, and custom logic can be developed to extract entities using the bounding boxes of the text surrounding an entity. The hOCR file format can be parsed with any HTML parser.

Below is a sample of an hOCR file.
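
(Illustrative snippet in the format Tesseract emits; the words, coordinates and confidences are made up.)

```html
<div class='ocr_page' id='page_1' title='image "sample.png"; bbox 0 0 2480 3508; ppageno 0'>
  <p class='ocr_par' lang='eng' title='bbox 210 230 1020 290'>
    <span class='ocr_line' title='bbox 210 230 1020 290; baseline 0 -4'>
      <span class='ocrx_word' title='bbox 210 230 460 290; x_wconf 96'>Invoice</span>
      <span class='ocrx_word' title='bbox 480 230 740 290; x_wconf 93'>Number:</span>
      <span class='ocrx_word' title='bbox 760 230 1020 290; x_wconf 91'>INV-0042</span>
    </span>
  </p>
</div>
```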

The hOCR text is represented as an HTML file, with lines and words distributed across span tags with classes such as ocr_line and ocrx_word. Each extracted word of the document sits inside its own span tag.

It also gives other useful information, such as:

i) The confidence of each extracted word, given as x_wconf.

ii) Geometrical information about the extracted word, with the coordinates of the bounding box of the extracted text represented in the bbox attribute.

iii) Structural information about the document layout.

iv) The detected language of the extracted document.

hOCR is especially helpful for selecting a region of interest (ROI); passing the ROI instead of the whole image generally gives better extraction output.
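
As a rough sketch of this idea, the snippet below parses an hOCR file with BeautifulSoup, collects each word with its bounding box and confidence, and keeps only the confident words inside a chosen ROI; the file name and ROI coordinates are placeholders.

```python
# Sketch: parse an hOCR file with an HTML parser and collect word boxes.
# "page.hocr" is a placeholder; the class names follow the hOCR spec.
import re
from bs4 import BeautifulSoup

with open("page.hocr", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

words = []
for span in soup.find_all("span", class_="ocrx_word"):
    title = span.get("title", "")
    bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    conf = re.search(r"x_wconf (\d+)", title)
    words.append({
        "text": span.get_text(strip=True),
        "bbox": tuple(map(int, bbox.groups())) if bbox else None,
        "conf": int(conf.group(1)) if conf else None,
    })

# Example: keep only confident words that fall inside a region of interest.
roi = (200, 200, 1100, 400)  # x1, y1, x2, y2 of the ROI (placeholder values)
in_roi = [w for w in words
          if w["bbox"] and w["conf"] and w["conf"] > 80
          and roi[0] <= w["bbox"][0] and w["bbox"][2] <= roi[2]
          and roi[1] <= w["bbox"][1] and w["bbox"][3] <= roi[3]]
print(in_roi)
```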

DOCUMENT ORIENTATION

Imagine that the batch of ingested documents does not follow a uniform orientation, or that the scanned documents are not properly oriented. In such scenarios, an appropriate page segmentation and orientation method has to be implemented. Most OCR models include orientation and page segmentation methods as support functions for detecting the orientation angle of the document. Additionally, an image registration model can be used as a pre-processing step before text extraction to handle image rotation.
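
For example, Tesseract exposes an orientation and script detection (OSD) mode; a minimal sketch of detecting and correcting rotation with pytesseract and Pillow might look like this (the input file name is a placeholder, and the rotation sign convention is worth verifying against your Tesseract version).

```python
# Sketch: detect page rotation with Tesseract OSD and correct it before OCR.
# "scan.png" is a placeholder input image.
from PIL import Image
import pytesseract

image = Image.open("scan.png")

# OSD reports the clockwise rotation (in degrees) needed to upright the page.
osd = pytesseract.image_to_osd(image, output_type=pytesseract.Output.DICT)
angle = osd.get("rotate", 0)

if angle:
    # PIL rotates counter-clockwise for positive angles, so -angle applies a
    # clockwise correction; verify this convention for your setup.
    image = image.rotate(-angle, expand=True)

text = pytesseract.image_to_string(image)
print(text)
```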

Some points to remember

· The converted image should be at least 300 dpi.

· Adjust the width to height ratio of the image.

· Applying heavy image processing as a pre-processing step might degrade the quality of the extracted text, since every OCR model has built-in image pre-processing.

· Always handle orientation and page segmentation of the document.

· Input a particular ROI of the image to the OCR model instead of sending the whole image as an input to the model.

Let’s build your first OCR solution! Happy learning!
