Using the efficacy of Custom Document Classifier of Document AI

Vikas Pandey
Google Cloud - Community
Apr 12, 2023

Introduction to Document AI

Document AI is a cloud-based platform that uses machine learning to extract structured data from unstructured documents. This data can then be analyzed and consumed more easily, making it a valuable tool for businesses of all sizes.

Here are some of the benefits of using Document AI:

  • Improved accuracy: Document AI uses machine learning to extract data with greater accuracy than traditional methods.
  • Reduced time and costs: Document AI can automate document processing tasks, which can save businesses time and money.
  • Improved compliance: Document AI can help businesses comply with regulations by ensuring that data is extracted accurately and consistently.
  • Increased productivity: Document AI can free up employees to focus on other tasks, which can lead to increased productivity.

If you are looking for a way to improve your document processing, Document AI is a great option. It is a powerful tool that can help you save time and money and improve compliance.

What is Classification in Machine Learning?

Classification is a type of supervised machine learning that predicts the category of a new data point based on its features. The model is trained on a dataset of labeled data points and then evaluated on a separate, held-out dataset whose labels are hidden from the model and used only to check its predictions. If the model performs well on the evaluation dataset, it can be used to predict the category of new data points.
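To make the train-then-evaluate idea concrete, here is a minimal sketch in plain Python. It is not how Document AI works internally: the word-count "model", the sample texts, and the label names are all made up for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def train(labeled_docs):
    """Build one word-count profile per label from (text, label) pairs."""
    profiles = {}
    for text, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(tokenize(text))
    return profiles

def predict(profiles, text):
    """Return the label whose profile shares the most words with the text."""
    words = tokenize(text)
    scores = {label: sum(counts[w] for w in words) for label, counts in profiles.items()}
    return max(scores, key=scores.get)

def accuracy(profiles, labeled_docs):
    """Fraction of held-out documents whose predicted label matches the true label."""
    correct = sum(predict(profiles, text) == label for text, label in labeled_docs)
    return correct / len(labeled_docs)

# Illustrative data only: the model sees the training set, never the test set.
training_set = [
    ("account balance deposit withdrawal statement", "bank_statement"),
    ("opening balance closing balance transaction", "bank_statement"),
    ("work experience skills education summary", "resume"),
    ("skills projects education references", "resume"),
]
test_set = [
    ("statement of account transaction balance", "bank_statement"),
    ("education and work experience", "resume"),
]

profiles = train(training_set)
print(accuracy(profiles, test_set))
```

The test set carries labels too, but they are used only to score the predictions, never to train.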

Here are some examples of classification tasks:

  • Categorizing emails as spam or not spam
  • Identifying objects in images
  • Predicting whether a customer will churn

Classification is a powerful tool that can be used to solve a wide variety of problems. If you have a dataset of labeled data points, you can use classification to build a model that can predict the category of new data points.

Introduction to Custom Document Classifier

A Custom Document Classifier (CDC) is a machine learning model that can be used to classify documents into different categories. CDCs are trained on a dataset of labeled documents, and they can be used to classify new documents with high accuracy.

CDCs are a powerful tool that can be used to automate document processing tasks. They can be used to classify invoices, contracts, medical records, legal documents, financial statements, and other types of documents.

Languages currently supported by the Document Classifier Processor

Note: For details on the supported languages, see https://cloud.google.com/document-ai/docs/languages

Currently Supported Formats

In the following architecture, you can see a Custom Document Classifier that accepts all supported formats for training and predicts the desired trained labels, such as bank statements, birth certificates, and resumes.

Use Case

In this blog, you will see how to create a Custom Document Classifier Processor that identifies the type (label) of a document, as shown in the previous architecture.

Think about your project requirements, such as which groups of documents you want to classify, e.g., insurance documents, banking and finance statements, health diagnostic images, or travel tickets.

Workflow

The workflow is the same for all custom document processors, as I have shown in my previous blog on Custom Document Extractor (link in references).

Create Custom Document Classifier Processor

Follow the standard product documentation to complete the steps below:

  1. Create a Processor
  2. Create a Cloud Storage bucket for the dataset
  3. Import your documents into a dataset

Click to view the standard product documentation.

Imported Dataset Information

To create an exhaustive dataset for this blog, sample documents of modern resumes, birth certificates, and bank statements were downloaded from multiple websites, uploaded to the cloud storage bucket, and then imported into the Document AI dataset.

Formats used in this dataset: JPEG, PNG, and WEBP

Note: The model can only be trained on documents that contain text, so keep only such documents in your dataset.
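If your sample files sit on disk in one folder per label, a small helper can plan the upload before anything goes to the bucket. This is only a sketch: the folder-per-label convention, the `dataset` prefix, and the function name are assumptions of mine, and the extension list covers only the formats used in this blog.

```python
from pathlib import PurePosixPath

# Formats used in this blog's dataset (.jpg is treated as JPEG).
DATASET_EXTENSIONS = {".jpeg", ".jpg", ".png", ".webp"}

def plan_uploads(local_paths, bucket_prefix="dataset"):
    """Map local files such as 'resume/cv1.png' to object names grouped by label.

    Assumes a hypothetical convention where the parent folder name is the label.
    """
    plan = {}
    for raw in local_paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() not in DATASET_EXTENSIONS:
            continue  # skip files outside the formats used here
        label = path.parent.name
        plan.setdefault(label, []).append(f"{bucket_prefix}/{label}/{path.name}")
    return plan

plan = plan_uploads(["resume/cv1.png", "bank_statement/jan.JPEG", "resume/notes.txt"])
```

Keeping one folder per label in the bucket makes it easy to import each group of documents separately later.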

Sample Dataset Documents for each label

Bank Statement

Birth Certificate

Resume

Creating the Labels

Once you have completed the import, you must create labels that will be used to classify documents. You are free to do this step at any time before you start labeling documents.

Click the [EDIT SCHEMA] button on the left pane of the [TRAIN] tab.

Create all the desired labels for the document, and then click [Save].

Note: Choose appropriate label names at the start of the process. Once you train the model with the labels, they cannot be edited and can only be enabled or disabled.

Labeling new documents / Auto-labeling documents during import

At the start of the project, labeling documents is a manual activity. However, labeling for Custom Document Classifier is super easy.

Click on the imported documents under [Unlabeled/Auto-labeled] to start labeling.

Documents land in the [Unlabeled] pool only if you import them with the [NO-LABEL] option. For each of these documents, select the desired label from the drop-down as shown below and click [MARK AS LABELLED].

If you import documents with the [AUTO-LABEL] option, they land in the [Auto-labeled] pool with a predicted label mapped to each document. You can confirm or correct the predicted label and then save it. This saves time and speeds up labeling.

Note: Auto-labeling works only if you have at least one trained and deployed version of your model, so you can use it only after a first round of manual labeling and training.

You have three options for labeling while importing documents into the dataset:

  • CHOOSE LABEL: Import all documents that belong to one selected label at once.
  • AUTO-LABEL: Auto-label documents using any trained and deployed version.
  • NO-LABEL: Import all documents without labeling.

Assigning documents to Training & Test set

After completing the labeling process, you have to distribute the documents between the Training and Test sets.

Plan for at least 10 labeled documents in the Training set and at least 2 labeled documents in the Test set. You can assign as many additional documents as you like to both sets.

Select all the desired documents under the Unassigned pool in the Data Split section and assign them to the Training and Test sets accordingly.
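The console handles the split for you, but if you want to decide the assignment up front, the logic can be sketched in a few lines. The minimums of 10 and 2 come from this post; the 20% test fraction, the fixed seed, and the function name are my own assumptions.

```python
import random

MIN_TRAIN, MIN_TEST = 10, 2  # minimums quoted in this post

def split_documents(doc_ids, test_fraction=0.2, seed=0):
    """Shuffle document IDs and return (training_ids, test_ids).

    test_fraction=0.2 is an assumed default, not a Document AI requirement.
    """
    docs = list(doc_ids)
    if len(docs) < MIN_TRAIN + MIN_TEST:
        raise ValueError(f"need at least {MIN_TRAIN + MIN_TEST} labeled documents")
    random.Random(seed).shuffle(docs)
    n_test = max(MIN_TEST, round(len(docs) * test_fraction))
    n_test = min(n_test, len(docs) - MIN_TRAIN)  # always leave MIN_TRAIN for training
    return docs[n_test:], docs[:n_test]

train_ids, test_ids = split_documents([f"doc-{i}" for i in range(30)])
print(len(train_ids), len(test_ids))  # 24 6
```

Shuffling before splitting keeps both sets from being dominated by whichever documents happened to be imported first.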

Training the processor

Once your data split meets the minimum training criteria, click the [TRAIN NEW VERSION] button under the [TRAIN] tab to start training.

Before starting the training, you can view the distribution of labels between the Training and Test sets by clicking the [VIEW LABEL STATS] button in the previous screenshot. In the following screenshot, the green tick in the training guidelines shows that the requirements for training are met.

You can click the [START TRAINING] button once you have satisfied the training requirements. Training takes a while, so please be patient after you start it.

Training time depends on the complexity and quantity of the documents present in the dataset.

Note: Always try to train with as many documents as possible in the Training and Test sets. This saves the time and resources that repeated training runs would otherwise consume.

Deploying the processor

To test the processor, you must deploy the version. You can select the desired version to deploy under the [MANAGE VERSIONS] tab, as shown below.

Deployment will take a few minutes. After that, you can test the sample document.

Note: Deploy only those trained models that will be utilized for prediction via API. Every deployed model incurs a hosting charge after deployment. For more details, please refer to the pricing link https://cloud.google.com/document-ai/pricing. You can also explore the detailed pricing of the Custom Document Classifier.

Evaluating the processor

Once deployment is complete, you can select the version of the model to evaluate under the [EVALUATE & TEST] tab.

As you can see, the F1 score, Precision, and Recall are all high in our example. This is what we want to see in a good model.

Understanding the Confusion Matrix for evaluation

Before we dive into the confusion matrix, let’s understand the basic terms used for evaluation: True positive (TP), False positive (FP), False negative (FN), and True negative (TN).

Now that you have an understanding of the basic terms, you can refer to the standard confusion matrix below for a quick overview.
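For a classifier with several labels, these terms are usually counted one label at a time. For the label "resume", a TP is a resume predicted as a resume, an FP is another document predicted as a resume, an FN is a resume predicted as something else, and a TN is another document correctly not predicted as a resume. Precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below applies these standard formulas to a handful of made-up (actual, predicted) pairs; it is not output from the processor.

```python
def counts_for_label(pairs, label):
    """Count TP, FP, FN, TN for one label from (actual, predicted) pairs."""
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    tn = sum(a != label and p != label for a, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each metric falls back to 0.0 when its denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative predictions only.
pairs = [
    ("resume", "resume"),
    ("resume", "resume"),
    ("resume", "bank_statement"),
    ("bank_statement", "bank_statement"),
    ("birth_certificate", "birth_certificate"),
    ("bank_statement", "resume"),
]

tp, fp, fn, tn = counts_for_label(pairs, "resume")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Here the "resume" label has 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so precision, recall, and F1 all come out at 2/3.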

Testing the processor

In the [EVALUATE & TEST] tab, you can upload a sample document from your dataset by clicking the [UPLOAD TEST DOCUMENT] button. The predictions of the trained labels will be displayed.

Resume Document Testing: The resume was predicted correctly with a confidence score of 1, which is as expected.

Birth Certificate Document Testing: The birth certificate was predicted correctly with a confidence score of 0.997, which is as expected.

Bank Statement Document Testing: The bank statement was predicted correctly with a confidence score of 0.998, which is as expected.

The outputs show that our model is correctly trained with all labels.

Consuming the trained version via APIs

For detailed sample code, see https://cloud.google.com/document-ai/docs/send-request. Samples are available for REST and for client libraries in languages such as Java, C#, Python, and Node.js.

You can create your own APIs or Cloud Functions on top of the Document AI API to consume the trained model. Then, you can store the output in any database, such as BigQuery, and make this data available for analytics.
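As a rough idea of what such a call can look like in Python, here is a minimal sketch based on the google-cloud-documentai client library. It assumes the package is installed and credentials are configured; every ID and path you pass in is a placeholder of your own, and `top_label` is a helper name I made up. Please verify the details against the official samples linked above.

```python
def classify_document(project_id, location, processor_id, version_id, file_path, mime_type):
    """Send one file to a deployed classifier version and return the predicted entities."""
    # Imported inside the function so top_label() below works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
    )
    # Full resource name of the specific trained and deployed processor version.
    name = client.processor_version_path(project_id, location, processor_id, version_id)

    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)
    # For a classifier, each entity carries a label (type_) and a confidence score.
    return result.document.entities

def top_label(entities):
    """Return (label, confidence) for the entity with the highest confidence."""
    best = max(entities, key=lambda e: e.confidence)
    return best.type_, best.confidence
```

A typical use would be `label, score = top_label(classify_document(...))`, after which the label and score can be written to the database of your choice.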

Human-in-the-Loop (HITL)

With this feature, you can bring in a human to verify and correct predictions before they are used. This is useful for business-critical applications that cannot afford predictions that fall below expectations.

Note: For more information, see https://cloud.google.com/document-ai/docs/hitl
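HITL itself is configured in the product, so the snippet below is not its API. It only illustrates the underlying idea of sending low-confidence predictions to a reviewer; the 0.9 threshold, the tuple layout, and the function name are hypothetical.

```python
REVIEW_THRESHOLD = 0.9  # hypothetical cut-off; tune it for your own use case

def triage(predictions, threshold=REVIEW_THRESHOLD):
    """Split (doc_id, label, confidence) tuples into auto-accepted and human-review queues."""
    accepted, review = [], []
    for doc_id, label, confidence in predictions:
        queue = review if confidence < threshold else accepted
        queue.append((doc_id, label, confidence))
    return accepted, review

accepted, review = triage([
    ("doc-1", "resume", 0.99),
    ("doc-2", "bank_statement", 0.62),
    ("doc-3", "birth_certificate", 0.9),
])
```

With these sample values, doc-2 goes to the review queue and the other two are accepted automatically.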

Document Classifier Extended Architecture Example

Before concluding, please take a look at an example of how you can use Custom Document Extractor along with Custom Document Classifier to cover more complex use cases.

In the above example, you can see that after classifying the documents via the Custom Document Classifier API, you can individually call the Custom Document Extractor API for each type of document if extraction is required at this level. You can then store the data in separate tables for each label.
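The classify-then-extract flow can be sketched as a small routing function. Everything named here is hypothetical: the extractor IDs, the table naming scheme, and the `classify`, `extract`, and `store` callables, which stand in for your own wrappers around the classifier API, the extractor API, and your database.

```python
# Hypothetical mapping from classifier label to the extractor that handles it.
EXTRACTOR_BY_LABEL = {
    "bank_statement": "extractor-bank-statement",
    "birth_certificate": "extractor-birth-certificate",
    "resume": "extractor-resume",
}

def route(label, extractor_by_label=EXTRACTOR_BY_LABEL):
    """Return (extractor_id, table_name) for a label, or None if no extractor is set up."""
    extractor_id = extractor_by_label.get(label)
    if extractor_id is None:
        return None
    return extractor_id, f"documents_{label}"

def process(doc_path, classify, extract, store):
    """Classify a document, run the matching extractor, and store the result per label."""
    label = classify(doc_path)
    target = route(label)
    if target is None:
        return None  # unknown label: nothing extracted, nothing stored
    extractor_id, table = target
    store(table, extract(doc_path, extractor_id))
    return table
```

Passing the three steps in as functions keeps the routing logic easy to test without calling any cloud service.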

Conclusion

Document AI’s Custom Document Classifier Processor is a valuable tool for businesses that can help them automate manual data classification tasks, improve data accuracy, enhance the customer experience, and improve compliance.

Consider which of your document workflows could benefit from automated classification and how it can help you modernize your business.

References

Document AI : https://cloud.google.com/document-ai

Document AI pricing : https://cloud.google.com/document-ai/pricing

Custom Document Extractor : https://medium.com/google-cloud/utilizing-the-power-of-custom-document-extractor-of-document-ai-2a6b89898a30

Feel free to connect with me on LinkedIn and Medium and send me a message if you have any questions, would like to learn more, or have a thought to share. I’ll be in touch.
