Learnings from a Documents AI Project — Part 1
Document processing is an integral part of customer-facing business workflows. A typical use case is a customer filling out a paper Form and sending it via Fax, where it lands in the CRM. The major challenge is digitizing the Form. With the advancements in AI/ML technologies, the major Public Cloud service providers now offer comprehensive solutions for digitizing and processing documents. I led one such project successfully, and in this blog I want to share my learnings from it. Please note that I am not going to share technical or architectural details, considering the confidentiality requirements of the project. Instead, I will discuss some high-level learnings that I feel are important, especially for Program Managers who are managing such a project for the first time. This is Part 1 of the blog and focuses mostly on planning and analysis; Part 2 focuses on execution.
Use Case
For the purpose of this blog, I have considered the use case of digitizing a paper-based Form containing customer demographic details, income information, and insurance information. The Form is sent via Fax and rendered as a PDF in the CRM. The user tags the pages and clicks the Digitize button, which invokes the back-end AI models to read the Form and send the digitized text back to the CRM. Users are then expected to verify the digitized text and make corrections where required.
Learnings
Do not consider a Document Digitization project as just another IT project
Traditional IT projects such as the implementation of workflows, user experiences, and system integrations have well-defined success criteria, best practices, and project implementation frameworks. Document digitization projects, however, differ from traditional IT projects in many respects. They also differ from typical AI/ML projects: an AI/ML project involving tabular or image data can rely on sophisticated tools such as pre-built training pipeline components, AutoML, and visualization tools such as TensorBoard, whereas Documents AI involves a lot of manual analysis of documents and of model performance. Some of the key aspects that differentiate Document AI projects from traditional projects and other AI projects are discussed below.
Data Preparation — Training data preparation is a very important aspect of a Document digitization project. Depending on the project requirements, we may need to implement an automated flow that pulls documents from the Production environment, splits them into pages, and uses those pages for training the model (a minimal sketch of the page-splitting step is shown below). If production documents are not available, then training documents need to be prepared manually. The training documents should be as similar as possible to the documents actually expected in production. All of these efforts should be accounted for in the project plan.
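For illustration, here is a minimal Python sketch of that page-splitting step, assuming the production PDFs have already been exported to a local folder; it uses the open-source pypdf library, and the folder names are placeholders for your own setup.

```python
# Minimal sketch: split exported production PDFs into single-page files
# so individual pages can be reviewed, labeled, and used for model training.
# Assumes the PDFs have already been exported to a local folder.
from pathlib import Path

from pypdf import PdfReader, PdfWriter

SOURCE_DIR = Path("exported_documents")  # placeholder export location
OUTPUT_DIR = Path("training_pages")      # placeholder output location
OUTPUT_DIR.mkdir(exist_ok=True)

for pdf_path in SOURCE_DIR.glob("*.pdf"):
    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        out_path = OUTPUT_DIR / f"{pdf_path.stem}_page{page_number:03d}.pdf"
        with out_path.open("wb") as out_file:
            writer.write(out_file)
```

In a real project this step would typically read from and write to the cloud storage bucket used by the training pipeline rather than local folders, but the splitting logic stays the same.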
Model Training — Unlike other AI/ML projects, model training in Documents AI projects involves significant manual effort in configuring the models and annotating the fields. The effort depends on the number of models that need to be configured, which ultimately depends on the variety of documents involved in the workflow. The important point to note is that training effort is required only for custom OCR models. Most service providers offer out-of-the-box models that need little or no training. The choice of models depends on the fields on the Forms to be digitized.
Don’t underestimate the importance of Document Analysis
As in any AI/ML project, data analysis is a critical part of the project. However, unlike typical AI/ML projects, the documents have to be analyzed manually. Analyzing tabular data is much easier with AutoML platforms or libraries such as scikit-learn and TensorFlow, but documents need to be reviewed by hand. The following are the important points to consider in document analysis:
Variety of Documents — The variety of documents has a direct impact on the design of the back-end AI models. Typically the structure of Forms changes based on product, customer type, region, etc. If there are too many variations in the fields on the Forms, separate models have to be configured and trained for specific Forms, and the training effort also includes training data preparation. Training a custom extractor model for a reasonably complex Form requires approximately 100 training Forms; the effort for preparing this data further depends on the complexity of the Form and whether the Forms are expected in typed or handwritten format. Analyzing and documenting the variety of Forms early in the project helps in estimating the project correctly.
Typed vs Handwritten Documents — The accuracy of digitization depends on image quality and the legibility of the field values. Typed Forms result in better accuracy than handwritten Forms. It is also easier to prepare training documents for typed Forms; the effort for preparing training data increases significantly for handwritten Forms.
Do the Documents have a watermark? — Watermarked text may get superimposed on a field value, resulting in the OCR model reading the field value incorrectly. If watermarked documents are expected, it is better to obtain the non-watermarked version as part of pre-processing and use that document for digitization. The non-watermarked versions are also required for training the models; do not train the model with watermarked documents. If it is not possible to get the non-watermarked versions, then try to change the position of the watermark so that it does not get superimposed on any field value.
How frequently are the Documents expected to change? — The scope of OCR projects should be limited to a specific version of the document. It is important to understand how frequently new versions of the document are released and how significant the changes in newer versions are. If a new version is significantly different from the earlier one, it will require significant effort to retrain the model or configure new models. It should also be analyzed whether the production environment will contain documents belonging to different versions of the same Form. The OCR performance (i.e. accuracy) goals should be set considering the range of document versions expected in production.
Document Fields — The types of fields in the documents also impact the complexity, the training effort, and the type of models to be used. For example, if the majority of the fields on the Form are structured fields such as email, phone, tables, and checkboxes, then Form Parsers can be a better option because Form Parsers do not need training. On the other hand, if the Form contains more free-text fields, custom OCR models are required. The choice between Form Parser and custom extractor models may differ depending on the service provider's (AWS, GCP, Azure) offerings; some providers, like Google, also offer custom processors with Generative AI capabilities that need little or no training. Therefore, effort estimation and timelines depend on the service provider's offerings and on the types of fields on the documents (a Form Parser sketch is shown below).
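To give a feel for how little code a pre-built processor needs, here is a hedged Python sketch of sending a PDF to a Google Cloud Document AI Form Parser and printing the key/value pairs it detects. The PROJECT_ID, LOCATION, and PROCESSOR_ID values are placeholders for your own setup, and the layout_text helper is my own convenience function; the equivalent call pattern on other providers will differ.

```python
# Minimal sketch: send a PDF to a pre-built Form Parser processor in
# Google Cloud Document AI and print the detected key/value pairs.
from google.cloud import documentai

PROJECT_ID = "my-project"        # placeholder GCP project
LOCATION = "us"                  # processor region; non-US regions may need
                                 # a regional endpoint via client_options
PROCESSOR_ID = "form-parser-id"  # placeholder Form Parser processor ID


def layout_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """Resolve a layout's text anchor back into the document's full text."""
    return "".join(
        text[int(seg.start_index): int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


def parse_form(pdf_bytes: bytes) -> None:
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(
            content=pdf_bytes, mime_type="application/pdf"
        ),
    )
    document = client.process_document(request=request).document
    for page in document.pages:
        for field in page.form_fields:
            key = layout_text(field.field_name, document.text).strip()
            value = layout_text(field.field_value, document.text).strip()
            print(f"{key}: {value}")
```

If the Form later turns out to need a custom extractor instead, the call pattern stays essentially the same; it is mainly the processor being referenced (and its training effort) that changes.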
Document Composition — Document composition refers to the following two aspects:
- Does the document contain any pages which we don’t have to digitize? — The document (PDF) may contain a few pages that are supposed to be digitized while other pages do not need digitization. In such a scenario, additional logic has to be written to identify the pages to be digitized and separate them from the document.
- Does the document contain information about a single customer or multiple customers? — This does not impact the digitization effort, but it does impact post-digitization effort, because the system has to associate the digitized information with more than one customer if the document covers multiple customers. It is recommended to process each customer's information in a separate digitization API call, as in the sketch after this list.
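The sketch below illustrates both aspects, assuming the user-tagged page numbers for each customer are available from the CRM. The extract_pages and process_fax helpers are hypothetical, and digitize is only a placeholder for the actual back-end digitization call.

```python
# Illustrative sketch: given the pages a user has tagged for each customer,
# build a separate single-customer PDF and digitize each one independently.
import io

from pypdf import PdfReader, PdfWriter


def extract_pages(pdf_bytes: bytes, page_numbers: list[int]) -> bytes:
    """Return a new PDF containing only the requested 1-based pages."""
    reader = PdfReader(io.BytesIO(pdf_bytes))
    writer = PdfWriter()
    for number in page_numbers:
        writer.add_page(reader.pages[number - 1])
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


def digitize(pdf_bytes: bytes) -> dict:
    """Placeholder for the actual back-end digitization API call."""
    raise NotImplementedError


def process_fax(pdf_bytes: bytes,
                tagged_pages_by_customer: dict[str, list[int]]) -> dict:
    results = {}
    for customer_id, pages in tagged_pages_by_customer.items():
        customer_pdf = extract_pages(pdf_bytes, pages)
        results[customer_id] = digitize(customer_pdf)  # one call per customer
    return results
```

Keeping one call per customer also makes it easier to route each result to the right customer record and to retry a single customer's pages if digitization fails.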
Direct and Conditional Field Population — Do not assume that all fields will be populated directly from the Form. Some fields should instead be fetched from the CRM database based on certain values read from the Form. For example, the Form may contain the Bank Name, Branch, and Branch Code of the customer’s bank. Instead of populating all bank fields directly from the Form, it is better to read only the Branch Code and fetch the other details by searching the database by Branch Code, which ensures consistency and accuracy. Such fields should be documented along with the criteria for fetching them from the database.
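Here is a minimal sketch of that lookup, using sqlite3 purely as a stand-in for the CRM data store; the table and column names are illustrative and would follow your CRM's actual schema.

```python
# Minimal sketch: populate bank details from the CRM database using the
# Branch Code read from the Form, instead of trusting every OCR'd bank field.
import sqlite3
from typing import Optional


def lookup_bank_details(conn: sqlite3.Connection,
                        branch_code: str) -> Optional[dict]:
    row = conn.execute(
        "SELECT bank_name, branch_name, branch_code FROM bank_branches "
        "WHERE branch_code = ?",
        (branch_code,),
    ).fetchone()
    if row is None:
        # OCR'd Branch Code not found: flag the record for manual review
        return None
    bank_name, branch_name, branch_code = row
    return {
        "bank_name": bank_name,
        "branch_name": branch_name,
        "branch_code": branch_code,
    }
```

A failed lookup is also a useful signal that the OCR read the Branch Code incorrectly, so it doubles as a validation check during the user verification step.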
Assessing the overall Complexity of the Documents — We have discussed many aspects of assessing complexity. Here is a quick summary from an overall program management perspective.