Learnings from a Documents AI Project: Part 2

Jaydeep Hardikar
5 min read · Jun 22, 2024


In Part 1 of this blog, we discussed how a Documents AI project differs from other IT projects and why Document Analysis is important. Part 2 focuses on Solution Options and Execution. Based on the analysis of the documents to be digitized, we now have to evaluate the solution options. All major cloud service providers offer a wide variety of OCR solutions. Below are some important learnings related to analyzing and implementing these solutions.

Analyze all available Solution Options

Document processing projects can involve a variety of use cases, types of documents and infrastructure requirements. Depending on the requirements we can evaluate various solution options based on the following aspects.

  1. Synchronous vs Asynchronous user experience — A synchronous experience involves the user clicking a button and getting the digitized response in near real time (20–25 seconds). An asynchronous experience involves digitizing the documents in batch mode and then having users verify the digitized documents. The choice between the two depends on the nature of the workflow and user requirements. Generally, a synchronous experience suits low-volume workflows involving complex documents, while asynchronous options suit high-volume workflows involving relatively simple documents.
  2. Custom vs Pretrained models — Major public cloud service providers (AWS, Azure and GCP) provide pretrained models to read text from standard documents such as a Driving Licence, IRS forms or a US Passport. However, most use cases involve reading custom Forms, which require custom OCR models to be trained. Many use cases combine custom and pretrained models for different pages of a single Form. The effort and timelines differ significantly based on the number of custom models required, because custom models need to be trained with organisation-specific Forms.
  3. Template based vs Model based Training — For custom models, training can be template based or model based, though this may vary by service provider. Template based training suits Forms that are not expected to change often and have no layout variations; its advantage is a low training-document requirement: typically 10–15 documents are enough. Model based training suits Forms with variations in layout; its advantage is that it can read Forms with minor variations in format, but the downside is that it needs more training data (80–100 documents). Different pages may need different training approaches. The training approach can be decided based on the Document Analysis, and training effort and project timelines vary accordingly.
  4. Need of a Classification Model — A classification model classifies each page of a document into one of a set of predefined classes (typically page numbers). It is required when a document contains multiple pages and each page is to be read by a separate OCR model. In that scenario, each page of the document is sent to the classification model, and depending on its output the appropriate OCR model is invoked to digitize the page.
  5. Need of Orchestration — Depending on the use case, the project may involve developing an orchestration layer that pre-processes the document and invokes the appropriate OCR model based on the output of the classification model. Here, pre-processing refers to splitting the document into individual pages and sending each page to the classification model. If the use case involves only a single-page Form and a single model, the orchestration layer can be avoided; however, most Forms involve multiple pages, which need a classification model and multiple extractor models. In such scenarios an orchestration layer is the recommended approach.
  6. Choice of Service Provider — Organizations tend to choose their incumbent service provider, but it is recommended to evaluate the offerings of the three major providers (Azure, GCP and AWS) and also to consider niche players such as Rossum.ai and extracta.ai with respect to the project requirements.
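The classification-plus-orchestration flow from points 4 and 5 can be sketched in a few lines. This is a minimal sketch: `classify_page` and the extractor functions below are hypothetical stand-ins for calls to a cloud OCR service, not any provider's actual API.

```python
# Minimal orchestration sketch: classify each page, then dispatch it to the
# OCR model that should digitize it. All functions here are placeholders.

def classify_page(page_text: str) -> str:
    """Hypothetical classifier: returns a page class label."""
    return "application_form" if "Account" in page_text else "id_proof"

def extract_application_form(page_text: str) -> dict:
    """Stand-in for a custom-trained extractor model."""
    return {"model": "custom_application_form", "text": page_text}

def extract_id_proof(page_text: str) -> dict:
    """Stand-in for a pretrained identity-document model."""
    return {"model": "pretrained_id", "text": page_text}

# Map each page class to the extractor that should handle it.
EXTRACTORS = {
    "application_form": extract_application_form,
    "id_proof": extract_id_proof,
}

def orchestrate(pages: list[str]) -> list[dict]:
    """Pre-processing has already split the document into pages."""
    results = []
    for page in pages:
        page_class = classify_page(page)
        results.append(EXTRACTORS[page_class](page))
    return results
```

The dictionary dispatch keeps the routing logic in one place, so adding a new page class means adding one classifier label and one entry to `EXTRACTORS`.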

Training Data Preparation is key to Success

If the project needs custom OCR models, training data preparation takes significant effort. The documents used for training should be similar to the actual documents expected in production, so it is always a good idea to train the model with actual production documents. Using production documents for training may require a separate automated training flow, which is needed when only some pages of a production document are relevant for training. For example, a PDF sent for bank account opening may contain a bank account application Form plus supporting identity proofs such as a driving licence or passport. If the model is to be trained only to digitize the application Form, then an automated workflow has to be established to fetch the complete document, separate the bank application Form, split its pages and store the pages in image format in storage. These images are then used for training the custom models.
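The page-selection step of that training flow can be sketched as pure routing logic. This sketch assumes page types have already been detected upstream and that PDF-to-image conversion happens elsewhere; the path layout is illustrative, not a real bucket structure.

```python
# Sketch of the automated training-data flow: given the page types detected
# in an incoming document, keep only the application-form pages and generate
# storage paths for their page images. Detection and image conversion are
# assumed to happen in earlier/later steps of the pipeline.

def select_training_pages(doc_id: str, page_types: list[str],
                          wanted: str = "application_form") -> list[str]:
    """Return storage paths for the pages that should feed model training."""
    paths = []
    for i, page_type in enumerate(page_types):
        if page_type == wanted:
            # e.g. training/<doc_id>/page_<n>.png in object storage
            paths.append(f"training/{doc_id}/page_{i}.png")
    return paths
```

Keeping this step separate from classification makes it easy to rerun training-data collection when a new custom model (say, for a different Form section) is added.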

Measure the Performance

Digitizing documents is often the first step of a workflow, and the accuracy and turnaround time of OCR impact the overall workflow. Therefore, measuring the performance of the OCR is important. Some of the key performance indicators that should be recorded are listed below:

  1. Response Time — Time taken by the OCR to return the digitized text.
  2. Completion Time — Time required for the user to verify and correct the errors in digitized text.
  3. Confidence Score — Confidence score reported by the OCR.
  4. Fields frequently corrected by users — This provides good insights into model retraining requirements.
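The four indicators above can be aggregated from per-document records. This is a sketch under assumed field names (`response_s`, `completion_s`, `confidence`, `corrected_fields`); a real project would pull these from workflow logs.

```python
from collections import Counter
from statistics import mean

# Sketch of KPI aggregation for digitized documents. Each record holds the
# indicators listed above for one document; field names are illustrative.

def summarize_kpis(records: list[dict]) -> dict:
    """Aggregate per-document OCR performance records into overall KPIs."""
    corrected = Counter()
    for r in records:
        corrected.update(r["corrected_fields"])   # fields users had to fix
    return {
        "avg_response_s": mean(r["response_s"] for r in records),
        "avg_completion_s": mean(r["completion_s"] for r in records),
        "avg_confidence": mean(r["confidence"] for r in records),
        # the most frequently corrected fields hint at retraining needs
        "top_corrected": corrected.most_common(3),
    }
```

Tracking `top_corrected` over time is what turns KPI 4 into a concrete retraining signal: a field that keeps topping the list is a candidate for more training documents.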

Conventional Testing approach does not work for Documents AI

Testing digitized Forms differs significantly from testing in traditional projects. The traditional QA approach is to write test cases and run them against known data; the success of a test case is well defined because it depends only on the correctness of the business logic and UI implemented in the application. In a Documents AI project, on the other hand, the result of a test case depends on image quality and the legibility of handwriting. Therefore each test case should be run against a variety of documents, and the results should be quantified as an accuracy percentage rather than a simple pass or fail. Test cases can be written to cover the fields belonging to certain sections, based on inputs from business or domain SMEs. Separate test cases should be written for fields populated from the database (see Direct and Conditional Field Population in Part 1). The test strategy should also document guidelines for preparing test documents so that testing is performed on documents similar to those in production.
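The accuracy-percentage idea above can be made concrete with a small helper that scores each field across many test documents. This is a sketch; the exact-match comparison is an assumption, and a real suite might use fuzzy matching for free-text fields.

```python
# Sketch of accuracy-based testing: compare extracted field values against
# expected values across many test documents and report per-field accuracy
# rather than a single pass/fail verdict.

def field_accuracy(expected: list[dict], extracted: list[dict]) -> dict:
    """Percentage of test documents in which each field was read correctly."""
    totals: dict = {}
    correct: dict = {}
    for exp, got in zip(expected, extracted):
        for field, value in exp.items():
            totals[field] = totals.get(field, 0) + 1
            if got.get(field) == value:   # exact match; could be fuzzy
                correct[field] = correct.get(field, 0) + 1
    return {f: round(100.0 * correct.get(f, 0) / totals[f], 1) for f in totals}
```

Reporting per-field percentages also feeds back into KPI 4 above: a field that scores poorly in testing will usually be the one users correct most often in production.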

Don't Forget Performance Testing

Performance Testing is important, especially when the user expects a synchronous response. If the implementation includes an orchestration layer, it introduces additional complexity. Below are some tips for avoiding performance issues.

Set realistic expectations of response time with the business. Generally, a response time under 30 seconds should be acceptable, but it depends on the complexity of the Form and the number of fields.

Strike a balance between the number of models and Form-specific accuracy. Avoid creating a separate model for minor differences between Form pages. A larger number of models makes the orchestration more complicated, which increases the response time.

Understand the peak demand in terms of concurrent digitization requests and document size, and test the system with sufficient buffer capacity.

If an orchestration layer is used, test it separately for performance. If it involves heavy document processing, provision sufficient memory to avoid out-of-memory issues.
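A peak-demand test like the one suggested above can be sketched with a thread pool. Here `digitize` is a hypothetical stand-in for the real OCR endpoint call; in a real test it would issue an HTTP request to the digitization service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Minimal load-test sketch: fire N digitization requests with a given level
# of concurrency and record per-request response times.

def digitize(doc_id: int) -> float:
    """Placeholder for one OCR round trip; returns its elapsed time."""
    start = time.perf_counter()
    time.sleep(0.01)            # simulate the service call
    return time.perf_counter() - start

def load_test(concurrency: int, requests: int) -> dict:
    """Run the requests through a bounded thread pool and summarize timings."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(digitize, range(requests)))
    return {"max_s": max(times), "avg_s": sum(times) / len(times)}
```

Running this at and above the expected peak concurrency (the buffer capacity mentioned earlier) shows whether the worst-case `max_s` still fits within the response-time expectation agreed with the business.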



Jaydeep Hardikar

Consultant with experience in implementing projects in AI and Salesforce. 5x Salesforce, 2x GCP certified. Experience in building and managing large teams.