Using the power of Custom Document Extractor of Document AI

Vikas Pandey
Google Cloud - Community
10 min read · Apr 3, 2023

Introduction to Document AI

Document AI is a cloud-based platform that uses machine learning to extract structured data from unstructured documents. This data can then be analyzed and consumed more easily, making it a valuable tool for businesses of all sizes.

Here are some of the benefits of using Document AI:

  • Improved accuracy: Document AI uses machine learning to extract data with greater accuracy than traditional methods.
  • Reduced time and costs: Document AI can automate document processing tasks, which can save businesses time and money.
  • Improved compliance: Document AI can help businesses comply with regulations by ensuring that data is extracted accurately and consistently.
  • Increased productivity: Document AI can free up employees to focus on other tasks, which can lead to increased productivity.

If you are looking for a way to improve your document processing, Document AI is a great option. It is a powerful tool that can help you save time and money and improve compliance.

Introduction to Custom Document Extractor

A Custom Document Extractor (CDE) is a machine learning model that extracts the labels it was trained on from similar kinds of documents. CDEs are trained on a dataset of labeled documents, and they can then extract those labels from new documents with high accuracy.

CDEs are a powerful tool for automating document extraction tasks.

Document AI currently supports the following formats for training.

Languages supported by Document Extractor Processor currently

Use Case

In this blog, you will see an example of how you can utilize the capabilities of Document AI’s Custom Document Extractor to extract an interview candidate’s profile information and skills from a resume.

Consider a situation where a Human Resource Professional (HR) has to shortlist candidates and send them invitations from a large collection of resumes stored in a cloud storage bucket.

Without automation, the HR professional would need to access the cloud storage bucket and review the resumes to identify the candidates who have the skills that match the desired skills of the job.

This process can be time-consuming and error-prone. It can be automated with the help of a Custom Document Extractor. This can free up time for other tasks and can help to reduce the number of errors that are made in the process.
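The matching step described above can be sketched in a few lines of Python once the skills have been extracted. This is only an illustration under assumptions: the function name `shortlist`, the dictionary keys, and the sample candidates are hypothetical, and the extracted fields are assumed to already be available as plain Python values.

```python
def shortlist(candidates, required_skills, min_matches=1):
    """Return candidates whose extracted skills overlap the required skills.

    `candidates` is a list of dicts with hypothetical keys
    "name", "email" and "skills" (a list of strings).
    """
    required = {s.strip().lower() for s in required_skills}
    selected = []
    for cand in candidates:
        skills = {s.strip().lower() for s in cand.get("skills", [])}
        matched = sorted(required & skills)
        if len(matched) >= min_matches:
            selected.append(
                {"name": cand["name"], "email": cand["email"], "matched": matched}
            )
    # Candidates with the most matching skills come first.
    selected.sort(key=lambda c: len(c["matched"]), reverse=True)
    return selected


# Hypothetical sample data, not taken from the dataset in this blog.
candidates = [
    {"name": "A. Rao", "email": "a.rao@example.com", "skills": ["Python", "SQL", "BigQuery"]},
    {"name": "B. Lee", "email": "b.lee@example.com", "skills": ["Java"]},
]
result = shortlist(candidates, ["python", "bigquery"], min_matches=2)
```

The comparison is case-insensitive on purpose, because resumes rarely spell skills the same way.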

Think about your project requirements and the documents that you need to extract structured data from. For example, insurance claims, banking and finance statements, health diagnostic images, and travel tickets are all types of documents for which structured data extraction can be automated.

Workflow

Create Custom Document Extractor Processor

Follow the standard product documentation to complete the steps below:

  1. Create a Processor
  2. Create a Cloud Storage bucket for the dataset
  3. Import your documents into a dataset

Click to view the standard product documentation.

Imported Dataset Information

To create the dataset for this resume example, I downloaded more than 250 sample modern resumes in different formats from multiple websites to add variety to the training data. I uploaded the documents to the Cloud Storage bucket and then imported them into the Document AI dataset.

Formats used in this dataset: PDF, JPEG, PNG, and WEBP

Note : The model can only be trained on documents with text, so narrow your dataset down to text only.

Sample Dataset Document

You will see how to train the model to extract the labels that are highlighted below.

Creating the Labels

Once you complete the import, you can create labels that will be searched for in the document. You can do this step at any time before you start labeling the documents.

Click on the [EDIT SCHEMA] button on the left pane of the [TRAIN] tab.

Create all the desired labels with the desired data type and occurrence, and then click [SAVE].


Labeling the newly imported document

In this use case, you will see how to label documents manually. Modern resumes are the documents used in this dataset.

To start labeling, click on one of the imported documents under [Unlabeled/Auto-labeled].

For documents under the [Unlabeled] pool, which is populated after the first import of documents into the dataset, you have to draw the bounding boxes manually around the desired data on each document and map your labels to them.

For documents under the [Auto-Labeled] pool, you will see the predicted labels' bounding boxes already drawn on the document. These are created when you import the data via the [Auto-labeling] option. You have to confirm or correct the auto-labeled data and save it. This process saves time and speeds up labeling.

Note: Auto-labeling will not work unless you have at least one trained and deployed version of your model. This option can be used after the first successful training on manually labeled documents. I will explain auto-labeling in a later section.

In the example below, I’ve created a bounding box around the skills data and will map the label named skills to it. After labeling all the data on the document, click the MARK AS LABELED button to save. The document then moves from the Unlabeled/Auto-labeled pool to the Labeled pool.

Note: In case your documents have multiple pages, consider labeling a variety of document lengths, including one-page documents, two-page documents, and up to 15-page documents. This will help your model understand any number of pages of input for prediction.

Assigning documents to Training & Test set

After completing the labeling process, you have to distribute the documents between the Training and Test sets. Consider assigning at least 50 documents, with all labels present in each document, to each of the Training and Test sets.

At least 50 labeled documents in each set is the minimum criterion for training here, i.e., you need at least 100 labeled documents for pilot training. Select all the desired documents under the Unassigned pool in the Data Split section and assign them to the Training and Test sets.

Note: While assigning documents, consider assigning each kind/format of document to both sets. This strategy helps the model perform well on every format.
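If you want to plan the split before assigning documents in the console, the idea in the note above (every format present in both sets) can be sketched as follows. This is a planning aid under assumptions, not a Document AI feature: the function name `split_by_format`, the `test_fraction` parameter, and the file names are hypothetical.

```python
from collections import defaultdict


def split_by_format(documents, test_fraction=0.5):
    """Plan a Training/Test split in which each format appears in both sets.

    `documents` is a list of dicts with hypothetical keys "name" and "format".
    A format with a single document cannot be split, so it goes to training.
    """
    by_format = defaultdict(list)
    for doc in documents:
        by_format[doc["format"]].append(doc["name"])

    training, test = [], []
    for fmt in sorted(by_format):
        names = sorted(by_format[fmt])
        n = len(names)
        if n < 2:
            n_test = 0
        else:
            # Keep at least one document of this format in each set.
            n_test = min(n - 1, max(1, round(n * test_fraction)))
        test.extend(names[:n_test])
        training.extend(names[n_test:])
    return training, test


# Hypothetical file names for illustration only.
documents = [
    {"name": "resume_01.pdf", "format": "pdf"},
    {"name": "resume_02.pdf", "format": "pdf"},
    {"name": "resume_03.pdf", "format": "pdf"},
    {"name": "resume_04.pdf", "format": "pdf"},
    {"name": "resume_05.png", "format": "png"},
    {"name": "resume_06.png", "format": "png"},
    {"name": "resume_07.webp", "format": "webp"},
]
training, test = split_by_format(documents)
```

The split is deterministic (sorted by name) so that the same input always gives the same plan. Shuffle the names first if you prefer a random assignment.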

Training the processor

Once your data distribution is completed and meets the minimum criteria, click on the [TRAIN NEW VERSION] button to start training under the [TRAIN] tab.

Before starting the training, you can view the distribution of labels between the Training and Test sets by clicking the [VIEW LABEL STATS] button shown in the previous screenshot. You can see below a green tick against the training guidelines, which is required for a good training run.

Once the training requirements are satisfied, click the [START TRAINING] button. Training takes a while, so be patient after starting it.

Training time depends on the complexity and quantity of the documents present in the dataset.

Note: Always try to train with as many documents as possible in the training and test sets. This saves you the time and resources of training the model multiple times.

Deploying the processor

Deployment of the version is required to test the processor.

You can select the desired version to deploy under the [MANAGE VERSIONS] tab, as shown below:

For our sample use case, I trained the model 5 times, with more resumes each time, to increase the accuracy.

You can also see in the above screenshot how data quality improves with good labeling. Continued training further increases the F1 score and efficiency of the model.

Deployment will take a few minutes. After this, you can test the sample document.

Note : Deploy only those trained models that will be used for prediction via the API. Every deployed model incurs a hosting charge. For more details, including detailed pricing of the Custom Document Extractor, please refer to the pricing page: https://cloud.google.com/document-ai/pricing

Evaluating the processor

Once deployment completes, you can select the version to evaluate under the [EVALUATE & TEST] tab.

For a good model, the F1 score, precision, and recall should all be on the higher side, close to 1.

Confusion Matrix understanding for evaluation

This is a standard summary of confusion matrix evaluation metrics. For a more detailed conceptual understanding, you can refer to the links I’ve added in the References section.
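As a quick refresher, precision, recall, and F1 can be computed from the confusion matrix counts with the standard formulas. The sketch below uses made-up counts for a single label; they are not taken from the evaluation shown in this blog.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 for one label from its counts."""
    predicted = true_positives + false_positives  # everything the model predicted
    actual = true_positives + false_negatives     # everything that was labeled
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct predictions, 2 wrong ones, 2 missed labels.
precision, recall, f1 = precision_recall_f1(
    true_positives=8, false_positives=2, false_negatives=2
)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a convenient single number to watch per label.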

Note : If these metrics fall below your expectations, consider adding more documents with quality labeling to the dataset and training a new version to increase the efficiency of the model.

Testing the processor

In the [EVALUATE & TEST] tab you can upload a sample document of your dataset and check the predictions of the trained labels.

In our use case, two sample resumes were uploaded to evaluate.

Best Prediction Case: All predictions are correct in the screenshot below.

Partial Prediction Case: The label named [skills] was not detected in the screenshot below. The F1 score for this label is 0.655, which is low, so the model is less likely to predict this label correctly. All other labels, such as name, mobile, and email, are predicted correctly because their F1 scores are close to 1. (Please refer to the screenshot in the “Evaluating the processor” section above.)

Note : Keep re-training or up-training the model with more documents, and with different varieties of documents, to bring the F1 score of each label closer to 1. This will improve the predictions for your labels and increase efficiency.

Auto Labeling

Now that you have a trained model, you can utilize the Auto-Labeling option to reduce the manual effort of labeling in the future.

When you next import a batch of documents, you have the option to enable Import with Auto-Labeling. Select the version to use for auto-labeling from the deployed versions available in the drop-down menu.

At the same time, you can select one of the Data Split options to assign the documents to a set.

You can also auto-label documents that are already imported with the [Auto-Label] button, which becomes active in the Auto-labeled pool when you select a document. As shown below, there are multiple options:

Consuming the trained version via APIs

You can refer to the detailed code at https://cloud.google.com/document-ai/docs/send-request, which is available for REST and for client libraries in languages such as Java, C#, Python, and Node.js.
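To give a feel for what such a request looks like, here is a minimal sketch that only builds the URL and JSON body for an online processing call to a specific processor version. It reflects my understanding of the REST endpoint and the `rawDocument` request field, so verify both against the linked page. The function name and all identifier values are placeholders, and authentication and sending the request are deliberately left out.

```python
import base64


def build_process_request(project_id, location, processor_id, version_id,
                          content, mime_type):
    """Build the URL and JSON body for an online process request.

    Assumption: endpoint and body shape follow the Document AI v1 REST API.
    `content` is the raw bytes of the document.
    """
    url = (
        f"https://{location}-documentai.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/"
        f"processors/{processor_id}/processorVersions/{version_id}:process"
    )
    body = {
        "rawDocument": {
            # The REST API expects the file content as a base64 string.
            "content": base64.b64encode(content).decode("ascii"),
            "mimeType": mime_type,
        }
    }
    return url, body


# Placeholder identifiers and a dummy payload, for illustration only.
url, body = build_process_request(
    project_id="my-project",
    location="us",
    processor_id="my-processor-id",
    version_id="my-version-id",
    content=b"%PDF-1.4 dummy bytes",
    mime_type="application/pdf",
)
```

In practice you would POST `body` to `url` with an OAuth 2.0 access token, or simply use one of the client libraries from the link above, which handle this for you.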

You can create your own APIs or Cloud Functions on top of the Document AI API to consume the trained model. Then, you can store the output in any database, such as BigQuery, and make this data available for analytics.
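Before storing the output, you will usually flatten the extracted entities into rows. The sketch below assumes the JSON form of the response, where each entity carries `type`, `mentionText`, and `confidence`. The function name, the row column names, the bucket path, and the sample values are hypothetical.

```python
def entities_to_rows(response, source_uri):
    """Flatten the entities of a process response (JSON as a dict) into rows.

    Each row is a plain dict that could be inserted into a table
    with columns source_uri, label, value and confidence.
    """
    document = response.get("document", {})
    rows = []
    for entity in document.get("entities", []):
        rows.append(
            {
                "source_uri": source_uri,
                "label": entity.get("type"),
                "value": entity.get("mentionText", "").strip(),
                "confidence": entity.get("confidence"),
            }
        )
    return rows


# Hypothetical response fragment, trimmed to the fields used here.
sample_response = {
    "document": {
        "entities": [
            {"type": "name", "mentionText": "A. Rao ", "confidence": 0.98},
            {"type": "email", "mentionText": "a.rao@example.com", "confidence": 0.95},
            {"type": "skills", "mentionText": "Python, SQL", "confidence": 0.71},
        ]
    }
}
rows = entities_to_rows(sample_response, "gs://my-bucket/resumes/resume_01.pdf")
```

Keeping one row per entity, rather than one wide row per resume, makes it easy to add new labels later without changing the table layout.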

Human-in-the-Loop (HITL)

This feature lets you introduce a human to verify and correct predictions before they are used. It is useful for business-critical applications that cannot afford predictions that fall below expectations.

Note : For more information, please see the link https://cloud.google.com/document-ai/docs/hitl
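The underlying idea can be illustrated with a simple confidence check in your own code. To be clear, this is not the Document AI HITL configuration itself, only a sketch of the routing logic. The function name, the threshold of 0.8, and the sample entities are assumptions.

```python
def review_reasons(entities, expected_labels, threshold=0.8):
    """Return the reasons a document should go to a human reviewer.

    An empty list means every expected label was found with a
    confidence at or above the threshold.
    """
    reasons = []
    found = {entity["type"] for entity in entities}
    for label in expected_labels:
        if label not in found:
            reasons.append(f"missing label: {label}")
    for entity in entities:
        if entity["confidence"] < threshold:
            reasons.append(
                f"low confidence for {entity['type']}: {entity['confidence']:.2f}"
            )
    return reasons


# Hypothetical predictions for one resume.
entities = [
    {"type": "name", "confidence": 0.98},
    {"type": "email", "confidence": 0.95},
    {"type": "mobile", "confidence": 0.55},
]
reasons = review_reasons(entities, ["name", "email", "mobile", "skills"])
```

A document with a non-empty list of reasons would be queued for a person to check, while the rest can flow straight through.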

Conclusion

Document AI’s Custom Document Extractor Processor is a powerful tool that can help businesses automate manual data extraction tasks, improve data accuracy, enhance the customer experience, and improve compliance.

Consider which of your documents can be automated for data extraction and how this could help you modernize your business.

References

Document AI : https://cloud.google.com/document-ai

Document AI pricing : https://cloud.google.com/document-ai/pricing

ML Concepts : https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

Feel free to follow me on LinkedIn and Medium and send me a message if you have any questions, would like to learn more, or have a thought to share. I’ll be in touch.
